Lesson 3. Extract raster values using vector boundaries in Python


Learning Objectives

After completing this tutorial, you will be able to:

  • Summarize tabular data using pandas in Python
  • Create a scatter plot with a one to one line in Python using Matplotlib
  • Merge two data frames in Python

What You Need

You will need a computer with internet access to complete this lesson. You will also need the data you downloaded for last week of this class: spatial-vector-lidar data subset.

Download Spatial Lidar Teaching Data Subset data

or using the earthpy package:

et.data.get_data("spatial-vector-lidar")

In this lesson series your overall goal is to compare tree height measurements taken by humans in the field to height values extracted from a lidar remote sensing canopy height model. You are working with two different types of data:

  1. Raster - the lidar canopy height model and
  2. Vector point location data

In the previous lesson, you learned how to extract raster values from an area derived by create a buffer region around each point in a shapefile. In this lesson you will summarize the human made measurements and then compare them to lidar.

To begin, load all of the required libraries.

import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import rasterio as rio
from rasterio.plot import plotting_extent
import geopandas as gpd
import rasterstats as rs
from earthpy import spatial as es
import earthpy as et

os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

For this lesson you will work with the Lidar Canopy Height Model created by NEON located here:

data/spatial-vector-lidar/california/neon-sjer-site/2013/lidar/SJER_lidarCHM.tif

# Load data
with rio.open('data/spatial-vector-lidar/california/neon-sjer-site/2013/lidar/SJER_lidarCHM.tif') as sjer_lidar_chm_src:
    SJER_chm_data = sjer_lidar_chm_src.read(1, masked=True)
    sjer_chm_meta = sjer_lidar_chm_src.profile
    sjer_chm_plt = plotting_extent(sjer_lidar_chm_src)

plot_buffer_path = 'data/spatial-vector-lidar/outputs/plot_buffer.shp'

# Extract zonal stats & create geodataframe
sjer_tree_heights = rs.zonal_stats(plot_buffer_path,
                                   SJER_chm_data,
                                   affine=sjer_chm_meta['transform'],
                                   geojson_out=True,
                                   copy_properties=True,
                                   nodata=0,
                                   stats="count mean max")

SJER_lidar_height_df = gpd.GeoDataFrame.from_features(sjer_tree_heights)

Is Lidar Derived Tree Height the Same As Human Measured Tree Height?

You have now done the following:

  1. You’ve opened and cleaned up some lidar canopy height model data
  2. You’ve extracted height values for the field plot locations where humans measured trees.

Next, you need to summarize the in situ collected tree height data, measured within circular plots across our study area. You will then compare the maximum measured tree height value to the maximum LiDAR derived height value for each circular plot.

For this lesson, you will use the a .csv (comma separate value) file, located in SJER/2013/insitu/veg_structure/D17_2013_SJER_vegStr.csv.

First determine the number many plots are in the tree height data. Note that the tree height data is stored in .csv format.

# Import & view insitu (field measured) data
path_insitu = 'data/spatial-vector-lidar/california/neon-sjer-site/2013/insitu/veg_structure/D17_2013_SJER_vegStr.csv'
SJER_insitu_all = pd.read_csv(path_insitu)

# View columns in data
SJER_insitu_all.columns
Index(['siteid', 'sitename', 'plotid', 'easting', 'northing', 'taxonid',
       'scientificname', 'indvidual_id', 'pointid', 'individualdistance',
       'individualazimuth', 'dbh', 'dbhheight', 'basalcanopydiam',
       'basalcanopydiam_90deg', 'maxcanopydiam', 'canopydiam_90deg',
       'stemheight', 'stemremarks', 'stemstatus', 'canopyform', 'livingcanopy',
       'inplotcanopy', 'materialsampleid', 'dbhqf', 'stemmapqf', 'plant_group',
       'common_name', 'aop_plot', 'unique_id'],
      dtype='object')

Before you go any further, you may want to select just the columns that you will need in your analysis. This will make your data a bit cleaner. In some cases you will not want to drop columns. However for this lesson, there is no reason to keep the extra data as you won’t use it in this analysis!

SJER_insitu = SJER_insitu_all[[
    "siteid", "sitename", "plotid", "stemheight", "scientificname"]]
SJER_insitu.head()
siteidsitenameplotidstemheightscientificname
0SJERSan JoaquinSJER12818.2Pinus sabiniana
1SJERSan JoaquinSJER27963.3Arctostaphylos viscida
2SJERSan JoaquinSJER2721.7Arctostaphylos viscida
3SJERSan JoaquinSJER1122.1Arctostaphylos viscida
4SJERSan JoaquinSJER2723.0Arctostaphylos viscida

Summarize Tree Height Data Using Pandas

You want to calculate a summary value of max tree height (the tallest tree measured) in each plot. You have a unique id for each plot - plotid that can be used to group the data. The tree height values themselves are located in the stemheight column.

You can calculate this by using the .groupy() method in pandas.

The steps are

  1. .groupby() - group the data by the plotid column - your unique identifier for each plot.
  2. .agg() - provide the summary statistics that you want to return for each plot. in this case max and mean.
  3. below ['stemheight]` is the name of the column that you want to summarize.

SJER_insitu.groupby('plotid').agg(['mean', 'min', 'max'])['stemheight']

insitu_stem_ht = SJER_insitu.groupby('plotid').agg(
    ['mean', 'max'])['stemheight']
insitu_stem_ht.head()
meanmax
plotid
SJER10683.86666719.3
SJER1128.22142923.9
SJER1168.21875016.0
SJER1176.51250011.0
SJER1207.6000008.8

You are almost done sumarizing your data. For expressive and reproducible reasons, add the word insitu to each column header so it’s very clear which data columns are human measured. This is important given you will MERGE this data frame with the data frame containing lidar mean, min and max values.

Notice that below you use a pythonic approach to creating for loops. Rather than looping through each column and appending the word “insitu”, you create a pythonic for loop which populates a list. You then reassign that list fo the column names for the insitu_stem_ht dataframe.

['insitu_' + col for col in insitu_stem_ht.columns]
['insitu_plotid', 'insitu_insitu_mean', 'insitu_insitu_max']

Rename each column - appending “insitu”.

# Add insitu to each column name to make your data more expressive
insitu_stem_ht.columns = ['insitu_' + col for col in insitu_stem_ht.columns]

# Reset the index (plotid)
insitu_stem_ht = insitu_stem_ht.reset_index()
insitu_stem_ht.head()
plotidinsitu_meaninsitu_max
0SJER10683.86666719.3
1SJER1128.22142923.9
2SJER1168.21875016.0
3SJER1176.51250011.0
4SJER1207.6000008.8

Merge InSitu Data With Spatial data.frame

Once you have our summarized insitu data, you can merge it into the centroids data frame. Merge requires two data.frames and the names of the columns containing the unique ID that we will merge the data on. In this case, you will merge the data on the plot_id column. Notice that it’s spelled slightly differently in both data.frames so we’ll need to tell Python what it’s called in each dataframe.

Note that if you want to merge two GeoDataFrames together, you cannot use the standard Pandas merge function. This will turn the GeoDataFrame into a regular DataFrame. Instead, you need to use the merge method of a GeoDataFrame object, like so:

# Rename columns so that we know which columns represent lidar values
SJER_lidar_height_df = SJER_lidar_height_df.rename(
    columns={'max': 'lidar_max', 'mean': 'lidar_mean', 'min': 'lidar_min'})

# Join lidar and human measured tree height data
SJER_final_height = SJER_lidar_height_df.merge(insitu_stem_ht,
                                               left_on='Plot_ID',
                                               right_on='plotid')
SJER_final_height.head()
Plot_IDPointcounteastinggeometrylidar_maxlidar_meannorthingplot_typeplotidinsitu_meaninsitu_max
0SJER1068center161255852.376POLYGON ((255872.376 4111567.818, 255872.27969...19.04999911.5443484111567.818treesSJER10683.86666719.3
1SJER112center443257406.967POLYGON ((257426.967 4111298.971, 257426.87069...24.01999910.3692774111298.971treesSJER1128.22142923.9
2SJER116center643256838.760POLYGON ((256858.76 4110819.876, 256858.663694...16.0700007.5183984110819.876grassSJER1168.21875016.0
3SJER117center245256176.947POLYGON ((256196.947 4108752.026, 256196.85069...11.0599997.6753474108752.026treesSJER1176.51250011.0
4SJER120center17255968.372POLYGON ((255988.372 4110476.079, 255988.27569...5.7400004.5911764110476.079grassSJER1207.6000008.8

Column Names Matter

Take note that while you you don’t have to rename the columns as you did above in order to successfully computer your final merged dataframe, it helps if you do because

  1. Now anyone looking at your data knows what each column represents.
  2. If you export the data to a text file, your columns are named expressively
  3. If you return to this analysis in 6 months, you will still be able to quickly understand what data are in each column if they are well named!

Plot Data (CHM vs Measured)

You’ve now merged the two dataframes together. Your are ready to create your first scatterplot of the data. You can use the pandas .plot() to create a scatterplot (or you can use matplotlib directly). The example below uses pandas plotting.

# Convert to a dataframe so you can use standard pandas plotting
SJER_final_height_df = pd.DataFrame(SJER_final_height)

fig, ax = plt.subplots(figsize=(10, 10))

SJER_final_height_df.plot('lidar_max',
                          'insitu_max',
                          kind='scatter',
                          fontsize=14, s=60, 
                          ax=ax)

ax.set(xlabel="Lidar derived max tree height",
       ylabel="Measured tree height (m)")
ax.set_title("Lidar vs Measured Max Tree Height \n NEON SJER Field Site",
             fontsize=30)
plt.show()
Scatterplot showing the relationship between lidar and measured tree height without a 1:1 line.
Scatterplot showing the relationship between lidar and measured tree height without a 1:1 line.

Next, let’s fix the plot adding a 1:1 line and making the x and y axis the same .

fig, ax = plt.subplots(figsize=(10, 10))

SJER_final_height_df.plot('lidar_max',
                          'insitu_max',
                          kind='scatter',
                          fontsize=14,
                          s=60, ax=ax)

ax.set(xlabel="Lidar Derived Max Tree Height",
       ylabel="Measured Tree Height (m)")
ax.set_title("Lidar vs Measured Max Tree Height \n NEON SJER Field Site",
             fontsize=30)

# Add 1:1 line
ax.plot((0, 1), (0, 1), 
        transform=ax.transAxes, ls='--', c='k')

# Adjust x and y axis limits
ax.set(xlim=[0, 30], ylim=[0, 30])
plt.show()
Scatterplot showing the relationship between lidar and measured tree height with a 1:1 line.
Scatterplot showing the relationship between lidar and measured tree height with a 1:1 line.

OPTIONAL - Export Results as a .csv File

You may want to export your final analysis file as a .csv file. You can use the Pandas .to_csv() method to export a dataframe. .to_csv requires a directory that exists on your computer and a file name like this:

dataframe_name.to_csv("data/spatial-vector-lidar/outputs/sjer-lidar-insitu-merge.csv")

# Export the final data frame as a csv file.
SJER_final_height_df.to_csv(
    "data/spatial-vector-lidar/outputs/sjer-lidar-insitu-merge.csv")

Create Map of Plot Locations Sized by Tree Height

Finally, you may want to create a map where points are sized according to tree height. To do that you

  1. Create or use a point geometry for each site location. In the case below we are using the data.frame that had the buffered points, and updating the geometry so that it is a point rather than the buffered polygon geometry.
  2. Then you set the point markersize using an attribute in your geodataframe. In the example below you use insitu_maxht
# Convert the geometry column to contain points
SJER_final_height['geometry'] = SJER_final_height.centroid
SJER_final_height.head()

SJER_final_height['insitu_max']
0     19.3
1     23.9
2     16.0
3     11.0
4      8.8
5     18.2
6     13.7
7     12.4
8      9.4
9     17.9
10     9.2
11    11.8
12    11.5
13    10.8
14     5.2
15    26.5
16    18.4
17     7.7
Name: insitu_max, dtype: float64

Plot the points by tree height.

fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(SJER_chm_data,
          cmap='Greys',
          extent=sjer_chm_plt)

# Plot centroids of each geometry as points so that you can control their size
SJER_final_height.centroid.plot(ax=ax,
                                marker='o',
                                markersize=SJER_final_height['insitu_max'] * 80,
                                c='purple')
plt.show()
Map showing plot locations with points sized by the height of vegetation in each plot overlayed on top of a canopy height model.
Map showing plot locations with points sized by the height of vegetation in each plot overlayed on top of a canopy height model.

Optional - Create Difference Bar Plot: Lidar vs Measured

The last comparison that you may wish to explore is the plot by plot difference between lidar and measured tree height data. This is often helpful when you are trying to troubleshoot outlier values in your data. For instance you may notice that a few plots have very large differences between lidar and measured tree height.

You may decide to either:

  1. Visit the sites if you are close to the field site or
  2. Explore imagery for the sites to see if you can figure out a good reason for why the results may be so different.

Below you do the following

  1. You first subtract field measured tree height from lidar estimates
  2. Then you create a barplot of that value
# Calculate difference
SJER_final_height["lidar_measured"] = SJER_final_height["lidar_max"] - \
    SJER_final_height["insitu_max"]

# creat bar plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(SJER_final_height['plotid'],
       SJER_final_height['lidar_measured'])

ax.set_title("Difference Between lidar and Measured Tree Height (m)")
ax.set(xlabel='Plot ID',
       ylabel='Difference (Lidar - Measured) meters')
plt.setp(ax.get_xticklabels(),
         rotation=45, horizontalalignment='right')
plt.show()
Barplot showing the difference between lidar and measured tree height for each plot.
Barplot showing the difference between lidar and measured tree height for each plot.

Leave a Comment