Lesson 2. Subset Time Series By Dates Python Using Pandas


In this lesson, you learn how to subset time series data into Python. You will also test the skills that you learned in the previous lessons to handle dates and missing data in Python.

Learning Objectives

After completing this tutorial, you will be able to:

  • Subset time series data by dates

What You Need

You need Python 3.x and Jupyter notebooks to complete this tutorial. Also you should have an earth-analytics directory setup on your computer with a /data directory with it.

Download Colorado Flood Teaching Data Subset data

About The NOAA Precipitation Data Used In This Lesson

To complete this lesson, you will use a slightly modified version of precipitation data downloaded through the National Centers for Environmental Information (formerly National Climate Data Center) Cooperative Observer Network (COOP) station 050843 in Boulder, CO. The data were collected : 1 January 2003 through 31 December 2013.

Your instructor has modified these data as follows for this lesson:

  • aggregated the data to represent daily sum values
  • added some noData values to allow you to practice handing missing data
  • added several columns to this data that would not usually be there if you downloaded it directly.

The added columns include:

  • Year
  • Julian day

How Is Precipitation Measured?

Precipitation can be measured using different types of gages. Some gages are manually read and emptied. Others gages automatically record the amount of precipitation collected. If the precipitation is in a frozen form (snow, hail, freezing rain) then the contents of the gage are melted to get the water equivalency for measurement. Rainfall is generally reported as the total amount of rain (millimeters, centimeters, or inches) over a given period of time.

Data Tip: Precipitation is the moisture that falls from clouds including rain, hail and snow.

Boulder, Colorado lays on the eastern edge of the Rocky Mountains where they meet the high plains. The average annual precipitation is near 20”. However, the precipitation comes in many forms including winter snow, intense summer thunderstorms, and intermittent storms throughout the year.

Use Precipitation Time Series Data in Python

You can use precipitation data to understand events like the 2013 floods that occurred in Colorado. However to work with these data in Python, you need to know how to do a few things:

  1. Open a .csv file in Python
  2. Ensure dates are read as a date/time format in python
  3. Handle missing data values appropriately

It’s also useful to know how to subset the data by time periods when analyzing and plotting.

You’ve already learned how to open a .csv and how to handle dates. In this lesson you will use these skills and learn how to subset the data by time.

Get Started With Time Series Data

Get started by loading the required python libraries into your Jupyter notebook.

# load python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
plt.ion()

import earthpy as et 
# set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# set standard plot parameters for uniform plotting
plt.rcParams['figure.figsize'] = (8, 8)

# set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# set standard plot parameters for uniform plotting
plt.rcParams['figure.figsize'] = (10, 6)
# prettier plotting with seaborn
import seaborn as sns; 
sns.set(font_scale=1.5)
sns.set_style("whitegrid")
# read the data into python
boulder_daily_precip = pd.read_csv('data/colorado-flood/precipitation/805325-precip-dailysum-2003-2013.csv', 
                                   parse_dates=['DATE'],
                                  na_values = ['999.99'])
# view first 5 rows
boulder_daily_precip.head()
DATEDAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
02003-01-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.2811120031
12003-01-05NaNCOOP:050843BOULDER 2 CO US1650.540.03389-105.2811120035
22003-02-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200332
32003-02-02NaNCOOP:050843BOULDER 2 CO US1650.540.03389-105.28111200333
42003-02-030.4COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200334

It’s always a good idea to explore the data before working with it.

# view structure of data
boulder_daily_precip.dtypes
DATE            datetime64[ns]
DAILY_PRECIP           float64
STATION                 object
STATION_NAME            object
ELEVATION              float64
LATITUDE               float64
LONGITUDE              float64
YEAR                     int64
JULIAN                   int64
dtype: object
# view data summary statistics for all columns
boulder_daily_precip.describe()
DAILY_PRECIPELEVATIONLATITUDELONGITUDEYEARJULIAN
count788.000000792.0792.000000792.000000792.000000792.000000
mean0.2478431650.540.033850-105.2811062007.967172175.541667
std0.4625580.00.0000450.0000053.14928798.536373
min0.0000001650.540.033800-105.2811102003.0000001.000000
25%0.1000001650.540.033800-105.2811102005.00000096.000000
50%0.1000001650.540.033890-105.2811102008.000000167.000000
75%0.3000001650.540.033890-105.2811002011.000000255.250000
max9.8000001650.540.033890-105.2811002013.000000365.000000

About the Data

Viewing the structure of these data, you can see that different types of data are included in this file.

  • STATION and STATION_NAME: Identification of the COOP station.
  • ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.
  • DATE: The date when the data were collected in the format: YYYYMMDD. Notice that DATE is a datetime64 because you used the parse_date function on the date column when the csv was first read in.
  • DAILY_PRECIP: The total precipitation in inches. Important: the meta data notes that the value 999.99 indicates missing data. Also important,hours with no precipitation are not recorded.
  • YEAR: the year the data were collected
  • JULIAN: the JULIAN DAY the data were collected.

Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf. The metadata tell us that the noData value for these data is 999.99. IMPORTANT: your instructor has modified these data a bit for ease of teaching and learning. Specifically, she aggregated the data to represent daily sum values and added some noData values to ensure you learn how to clean them.

Data Tip You can download the original complete data subset with additional documentation here.

Subset Data Temporally in Pandas

There are many ways to subset the data temporally in Python. The easiest way to do this is to use pandas. Pandas natively understands time operations if

  1. you tell it what column contains your time stamps and
  2. you set the date column of your data to be the index of the dataframe

Both of these steps can be included in your read_csv import by adding the arguments:

parse_dates and index_col

as done below. Note that for each argument, the column that contains the time stamps is specified.

# read the data into python setting date as an index
boulder_daily_precip = pd.read_csv('data/colorado-flood/precipitation/805325-precip-dailysum-2003-2013.csv', 
                                   parse_dates=['DATE'],
                                  na_values = ['999.99'],
                                  index_col = 'DATE')

Now the magic happens! Once you have specified an index column, your data frame looks like the one below when printed. Notice that the DATE column is lower visually then the other column names. It’s also on the LEFT hand side of the dataframe. This is because it is now an index.

boulder_daily_precip.head()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2003-01-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.2811120031
2003-01-05NaNCOOP:050843BOULDER 2 CO US1650.540.03389-105.2811120035
2003-02-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200332
2003-02-02NaNCOOP:050843BOULDER 2 CO US1650.540.03389-105.28111200333
2003-02-030.4COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200334

Also notice if you look at the structure of your dataframe, that the DATE column isn’t rendered. But if you call boulder_daily_precip.index you will find it.

# your date column that is the index now is not listed below
boulder_daily_precip.dtypes
DAILY_PRECIP    float64
STATION          object
STATION_NAME     object
ELEVATION       float64
LATITUDE        float64
LONGITUDE       float64
YEAR              int64
JULIAN            int64
dtype: object
# access the index of your dataframe
boulder_daily_precip.index
DatetimeIndex(['2003-01-01', '2003-01-05', '2003-02-01', '2003-02-02',
               '2003-02-03', '2003-02-05', '2003-02-06', '2003-02-07',
               '2003-02-10', '2003-02-18',
               ...
               '2013-11-01', '2013-11-09', '2013-11-21', '2013-11-27',
               '2013-12-01', '2013-12-04', '2013-12-22', '2013-12-23',
               '2013-12-29', '2013-12-31'],
              dtype='datetime64[ns]', name='DATE', length=792, freq=None)

Once you have a dataframe setup with an index, you can begin to subset your data using the syntax:

data-frame-name[‘index -period-here’]

# below you subset all of the data for 2013 - the first 5 rows are shown
boulder_daily_precip['2013'].head()

DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2013-01-010.0COOP:050843BOULDER 2 CO US1650.540.0338-105.281120131
2013-01-280.1COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201328
2013-01-290.1COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201329
2013-02-010.0COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201332
2013-02-140.1COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201345

You can subset within a date range using the syntax below.

# you can subset this way
boulder_daily_precip['2013-05-01':'2013-06-06']
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2013-05-011.4COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013121
2013-05-020.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013122
2013-05-080.4COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013128
2013-05-090.5COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013129
2013-05-200.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013140
2013-05-230.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013143
2013-05-290.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013149
2013-06-010.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013152
2013-06-050.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013156

For this example, subset the data to the time period: January 1 2003 - October 31 2003.

# subset the data to a date range
precip_boulder_AugOct = boulder_daily_precip['2003-01-01':'2003-10-31']
# did it work? 
print(precip_boulder_AugOct.index.min())
print(precip_boulder_AugOct.index.max())
2003-01-01 00:00:00
2003-10-31 00:00:00

Plot Subsetted Data

Once you’ve subsetted the data, you can plot the data to focus in on the new time period. Note that you need to call .index to access the date column which is now an index rather than a regular index.

You can see that in the plot call below:

ax.scatter(precip_boulder_AugOct.index.values, 
       precip_boulder_AugOct['DAILY_PRECIP'].values, 
       color = 'purple')

You are now ready to plot your data.

# plot the data
fig, ax = plt.subplots(figsize = (8,8))
ax.scatter(precip_boulder_AugOct.index.values, 
       precip_boulder_AugOct['DAILY_PRECIP'].values, 
       color = 'purple')

# add titles and format 
ax.set_title('Daily Total Precipitation \nAug - Oct 2013 for Boulder Creek')
ax.set_xlabel('Date')
ax.set_ylabel('Precipitation (Inches)');
Scatterplot showing daily total precipitation for Boulder Creek.
Scatterplot showing daily total precipitation for Boulder Creek.

The plot above appears to be subsetted correctly in time. However the plot itself looks off. Any ideas what is going on? Complete the challenge below to test your knowledge.

Optional challenge

Using everything you’ve learned in the previous lessons:

  • Import the dataset: data/week2/precipitation/805325-precip-dailysum-2003-2013.csv
  • Clean the data by assigning noData values to nan
  • Make sure the date column is a date class
  • Plot your data

Some notes to help you along:

  • Date: be sure to take of the date format when you import the data.
  • NoData Values: You know that the no data value = 999.99. You can account for this when you read in the data. Remember how?

Your final plot should look something like the plot below.

Scatterplot of hourly precipitation for Boulder subsetted to 2003-2013.
Scatterplot of hourly precipitation for Boulder subsetted to 2003-2013.

Optional Challenge

Create plots for the following time subsets in the data before and after the flood:

Time period A: 2012-08-01 to 2012-11-01 Time period B: 2013-08-01 to 2013-11-01

When you create your plot, be sure to set the y limits to be the same for both plots so they are visually comparable..

How different was the rainfall in 2012 compared to 2013?

Comparison of precipitation data in Boulder, CO from 2012 and 2013.
Comparison of precipitation data in Boulder, CO from 2012 and 2013.

Leave a Comment