
Lesson 2. Data Workflow Best Practices - Things to Consider When Processing Data

Learning Objectives

After completing this tutorial, you will be able to:

  • Approach a coding task with a modular, systematic approach.
  • Write pseudocode.
  • Identify the aspects of a workflow that can be modularized (i.e., are well suited to functions).

What You Need

You will need a computer with internet access to complete this lesson and the data for week 10 of the course.

Download Landsat Automation Data (~600 MB)

Or download the data with earthpy: `et.data.get_data('ndvi-automation')`

Design an Efficient Data Workflow

You have now identified your challenge - to plot average NDVI for every Landsat scene individually across one year for two study locations:

  1. San Joaquin Experimental Range (SJER) in Southern California, United States
  2. Harvard Forest (HARV) in the Eastern United States

You have been given Landsat 8 data and a study area boundary for each site. Your next step is to write out the steps that you need to follow to get to your end goal plot.

You know you will need to do the following:

  1. Open the data
  2. Calculate NDVI
  3. Save the NDVI values to a pandas dataframe with an associated date and site name for plotting.

Begin by pseudocoding - writing out the steps in plain English - everything that you need to do in order to reach your end goal. The steps below will walk you through beginning this process of writing pseudocode.

Write Pseudocode

  1. Open the data and crop it. Landsat 8 data are provided as a series of GeoTIFF files - one for each band, plus one containing the quality assurance information (e.g., cloud cover and shadows).
# Steps required to process one Landsat scene for a site
1. Get a list of files in the directory
2. Subset the list to just the files required to calculate NDVI
3. Open and crop each band
4. Calculate average NDVI for that scene
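As a first sketch of steps 3-4, the NDVI math itself can live in a small helper. The example below uses tiny synthetic numpy arrays in place of real Landsat bands, and the name `mean_ndvi` is an assumption for illustration, not part of the course code:

```python
import numpy as np

def mean_ndvi(red, nir):
    """Return the mean NDVI of a scene given its red and NIR bands.

    NDVI = (NIR - Red) / (NIR + Red), averaged across all pixels.
    """
    red = red.astype("float64")
    nir = nir.astype("float64")
    ndvi = (nir - red) / (nir + red)
    # nanmean ignores pixels masked as NaN (e.g., clouds)
    return np.nanmean(ndvi)

# A tiny synthetic "scene" standing in for cropped Landsat bands
red = np.array([[0.1, 0.2], [0.1, 0.2]])
nir = np.array([[0.5, 0.6], [0.5, 0.6]])
print(mean_ndvi(red, nir))
```

In the real workflow the `red` and `nir` arrays would come from the opened and cropped GeoTIFF bands.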

Great! You have now identified the steps required to process a single Landsat scene. Those steps now need to be repeated across all of the scenes for each site for a year.

# Steps required to process all Landsat scenes for a site
1. Get a list of all directories containing Landsat scenes for the site
2. "Open" each directory and calculate NDVI for the data in that directory (the steps you created above cover this!)
3. Save the NDVI value and the date for that directory - which represents a day when Landsat data were collected - to an object that contains the average NDVI for each day at this site (there are some sub steps here that you still need to flesh out)

OK - now you are ready to put together the workflow for a single site. Remember, there are some sub steps that have not been fleshed out just yet, but start with the basics and build from there.

# Steps required to process all Landsat scenes for a site
1. Get a list of all directories containing Landsat scenes for the site. Each directory contains one scene of data collected on a particular date.
2. Use the data in each directory to calculate NDVI, and be sure to grab the date associated with the NDVI calculation (the steps you created above cover this!)
   # Steps required to process one Landsat scene
    1. Get a list of files in the directory
    2. Subset the list to just the files required to calculate NDVI
    3. Open and crop each band
    4. Calculate NDVI for that scene
3. Save the NDVI value and the date for that directory - which represents a day when Landsat data were collected - to an object that contains the NDVI for each day at this site (there are some sub steps here that you still need to flesh out)
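The outer loop might be sketched as follows. This is a minimal illustration using throwaway directories: it assumes each scene directory name ends with the acquisition date in YYYYMMDD form (real Landsat directory names embed the date differently), and a placeholder value stands in for the actual NDVI calculation:

```python
import os
import tempfile
from glob import glob

# Build a throwaway site directory with two fake scene folders to loop over
site_dir = tempfile.mkdtemp()
for name in ["scene_20170315", "scene_20170416"]:
    os.makedirs(os.path.join(site_dir, name))

all_ndvi = []
# 1. Get a sorted list of all scene directories for the site
for scene_dir in sorted(glob(os.path.join(site_dir, "*"))):
    # 2. Grab the acquisition date from the directory name
    #    (here it is simply the last 8 characters - a simplification)
    date = os.path.basename(scene_dir)[-8:]
    # Open/crop the bands and calculate mean NDVI for this scene;
    # a placeholder value stands in for the real calculation here
    ndvi = 0.5
    # 3. Save the date and NDVI value for this scene
    all_ndvi.append((date, ndvi))

print(all_ndvi)
```

The inner "open, crop, calculate" steps would replace the placeholder once they are fleshed out.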

Add Several Sites Worth of Data to your Workflow

Above, you began to think about the steps associated with creating a workflow for:

  1. A single Landsat scene, for which you calculate NDVI values
  2. A set of Landsat scenes for a particular site

But you want to do this for two or more sites. You are not sure whether you will ever have more than two sites, but in this case you want to design a workflow that allows for two or more. Add an additional layer to your pseudocode.

# Modular workflow for many sites
1. Get a list of all directories and associated site names
2. Open each site directory, get the Landsat scenes, and calculate mean NDVI for each scene at that site
3. Create a summary NDVI-by-site-by-date data frame for all sites
4. Export the NDVI values to a CSV file (optional "data product" output)

# Produce your plot!
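Steps 3 and 4 above might look roughly like this in pandas. The site names, dates, and NDVI values below are placeholders standing in for the per-scene results you would calculate:

```python
import os
import tempfile
import pandas as pd

# Placeholder results standing in for the per-scene NDVI calculations
results = [
    ("HARV", "2017-03-15", 0.42),
    ("HARV", "2017-04-16", 0.55),
    ("SJER", "2017-03-12", 0.38),
]

# 3. Create a summary NDVI-by-site-by-date data frame for all sites
ndvi_df = pd.DataFrame(results, columns=["site", "date", "mean_ndvi"])
ndvi_df["date"] = pd.to_datetime(ndvi_df["date"])

# 4. Export the NDVI values to CSV (optional "data product" output);
#    a temporary directory stands in for your outputs directory
out_path = os.path.join(tempfile.mkdtemp(), "mean-ndvi-all-sites.csv")
ndvi_df.to_csv(out_path, index=False)
print(ndvi_df)
```

With the data frame in hand, the final plot is a matter of grouping by site and plotting mean NDVI against date.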

In this example, an assumption is made that your data are nicely organized into a single directory for each site.
You’ve now designed a workflow to process several sites’ worth of Landsat data. But of course, the work is just beginning. For each of the steps above, you need to flesh out how to accomplish each task. Before you design your workflow in more detail, consider the data workflow best practices discussed below.

Data Workflow Best Practices

There are many things to keep in mind when creating a workflow. A few include:

1. Do not modify the raw data

Typically, you do not want to modify the raw data, so that you can always reproduce your output from the same inputs. In some cases this matters less - for example, when you are not tasked with curating the raw data and could simply redownload it, as with the Landsat data provided in this class.

In other cases, this is extremely important - you wouldn’t want to re-collect a summer’s worth of field data because you accidentally modified the original spreadsheet.

A good rule of thumb is to create an “outputs” directory where you store outputs. An outputs directory is helpful because:

  1. You can always delete and recreate the outputs without worrying about deleting your original data.
  2. An outputs directory is expressive and reminds anyone looking at your project (including you in the future - i.e., your future self!) where the output products are - they don’t have to dig for the data.
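A minimal sketch of setting up such a directory using Python's standard library (the directory names here are illustrative, and a temporary directory stands in for your project directory):

```python
import os
import tempfile

# A throwaway project directory for this sketch
project_dir = tempfile.mkdtemp()

# Create an "outputs" directory alongside the raw data;
# exist_ok=True makes the call safe to re-run without errors
output_dir = os.path.join(project_dir, "outputs")
os.makedirs(output_dir, exist_ok=True)

# Deleting and recreating outputs never touches the raw data
print(os.path.isdir(output_dir))
```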

2. Create Expressive Intermediate File Names and Outputs

You want to carefully consider how you name your intermediate and final outputs so that the names reflect your intention and their content. For instance, in this case you are creating an NDVI dataframe that will be exported to a CSV file.

Here are some naming options (the example file names below are illustrative):

# Less than ideal name
data.csv

# OK name
ndvi.csv

# Expressive name
mean-ndvi-harv-sjer-2017.csv

3. Whenever Possible Create Modular Code

As you are working on your pseudocode, consider the parts of your analysis that are well suited to functions. Ideally, a Python function will have 1-3 inputs and produce 1-2 outputs.

Keep functions small and focused. A function should do one thing well. You can then combine many small functions into a wrapper function that automates your full workflow and makes for a nicely crafted program.
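A toy sketch of this pattern - two small, single-purpose functions combined by a wrapper. All names here are illustrative, not from the course code:

```python
def normalized_diff(b1, b2):
    """Compute a normalized difference of two values (e.g., NDVI from NIR and red)."""
    return (b1 - b2) / (b1 + b2)

def summarize(values):
    """Average a list of values."""
    return sum(values) / len(values)

def mean_normalized_diff(band1_pixels, band2_pixels):
    """Wrapper: mean normalized difference across paired pixels.

    Combines the two small functions above into one step of a workflow.
    """
    diffs = [normalized_diff(b1, b2)
             for b1, b2 in zip(band1_pixels, band2_pixels)]
    return summarize(diffs)

print(mean_normalized_diff([0.5, 0.6], [0.1, 0.2]))
```

Because each piece does one thing, each can be tested on its own and reused in other workflows.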
