Lesson 3. Subset Time Series By Dates Python Using Pandas


Learning Objectives

  • Subset time series data using different options for time frames, including by year, month, and with a specified begin and end date.

Temporally Subset Data Using Pandas Dataframes

Sometimes a dataset contains a much larger timeframe than you need for your analysis or plot, and it can helpful to select, or subset, the data to the needed timeframe.

There are many ways to subset the data temporally in Python; one easy way to do this is to use pandas.

Pandas natively understands time operations if:

  1. you tell it what column contains your time stamps (using the parameter parse_dates) and
  2. you set the date column to be the index of the dataframe (using the parameter index_col).

On the previous page of this chapter, you already learned how to complete these steps during the read_csv() import into the pandas dataframe. On this page, you will learn how to use the datetime index to subset data from a pandas dataframe.

Import Packages and Get Data

You will use a slightly modified version of precipitation data (inches) downloaded from the National Centers for Environmental Information (formerly National Climate Data Center) Cooperative Observer Network (COOP) station 050843 in Boulder, CO. The data were collected from January 1, 2003 through December 31, 2013.

Your instructor has modified these data as follows:

  • aggregated the data to represent daily sum values.
  • added some no data values to allow you to practice handling missing data.
  • added new columns to this data that would not usually be there if you downloaded it directly:
    • Year
    • Julian day (i.e. the calendar day number)

To begin, import the necessary packages to work with pandas dataframe and download data.

You will continue to work with modules from pandas and matplotlib to plot dates more efficiently and with seaborn to make more attractive plots.

# Import necessary packages
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import earthpy as et

# Handle date time conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Use white grid plot background from seaborn
sns.set(font_scale=1.5, style="whitegrid")

Download Curated Dataset From EarthPy

You will also download data from earthpy by specifiying a data key for the dataset called colorado-flood. This dataset has been curated by the earthpy team to include the precipitation data collected for 2003-2013 for Boulder, CO.

# Download the data
data = et.data.get_data('colorado-flood')

Note that when you download data using a data key in earthpy, the data are automatically downloaded to a new directory in the data directory under earth-analytics. The name of the directory will be the same as the name of the dataset, for example, colorado-flood.

For this dataset, there is a precipitation subdirectory within colorado-flood for the precipitation data.

# Set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# Define relative path to file with daily precip total
file_path = os.path.join("data", "colorado-flood",
                         "precipitation",
                         "805325-precip-dailysum-2003-2013.csv")

Now that you have downloaded the dataset, you can import the file for the measurement station for Boulder, CO, and specify the:

  1. no data values using the parameter na_values
  2. date column using the parameter parse_dates
  3. datetime index using the parameter index_col
# Import data using datetime and no data value
boulder_precip_2003_2013 = pd.read_csv(file_path,
                                       parse_dates=['DATE'],
                                       index_col= ['DATE'],
                                       na_values=['999.99'])

# View first few rows
boulder_precip_2003_2013.head()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2003-01-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.2811120031
2003-01-05NaNCOOP:050843BOULDER 2 CO US1650.540.03389-105.2811120035
2003-02-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200332
2003-02-02NaNCOOP:050843BOULDER 2 CO US1650.540.03389-105.28111200333
2003-02-030.4COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200334

About the Precipitation Data

Viewing the structure of these data, you can see that different types of data are included in this file.

  • STATION and STATION_NAME: Identification of the COOP station.
  • ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.
  • DAILY_PRECIP: The total precipitation in inches. The metadata for this dataset notes that the value 999.99 indicates missing data. Also important, days with no precipitation are not included in the data.
  • YEAR: the year the data were collected
  • JULIAN: the JULIAN DAY the data were collected.

DATE is the date when the data were collected in the format: YYYY-MM-DD.

Notice that DATE is now the index value because you used the parse_date and index_col parameters when you imported the CSV file into a pandas dataframe.

Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf.

The metadata tell us that the no data value for these data is 999.99. IMPORTANT: your instructor has modified these data a bit for ease of teaching and learning. Specifically, data have been aggregated to represent daily sum values and some no data values have been added.

Data Tip You can download the original complete data subset with additional documentation here.

Even after reading documentation, it is always a good idea to explore data before working with them such as:

  1. checking out the data types
  2. calculating the summary statistics to get a sense of the data values (and make sure that “no data” values have been identified)
  3. checking out the values in the datetime index.
# View dataframe info
boulder_precip_2003_2013.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 792 entries, 2003-01-01 to 2013-12-31
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   DAILY_PRECIP  788 non-null    float64
 1   STATION       792 non-null    object 
 2   STATION_NAME  792 non-null    object 
 3   ELEVATION     792 non-null    float64
 4   LATITUDE      792 non-null    float64
 5   LONGITUDE     792 non-null    float64
 6   YEAR          792 non-null    int64  
 7   JULIAN        792 non-null    int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 55.7+ KB
# View summary statistics
boulder_precip_2003_2013.describe()
DAILY_PRECIPELEVATIONLATITUDELONGITUDEYEARJULIAN
count788.000000792.0792.000000792.000000792.000000792.000000
mean0.2478431650.540.033850-105.2811062007.967172175.541667
std0.4625580.00.0000450.0000053.14928798.536373
min0.0000001650.540.033800-105.2811102003.0000001.000000
25%0.1000001650.540.033800-105.2811102005.00000096.000000
50%0.1000001650.540.033890-105.2811102008.000000167.000000
75%0.3000001650.540.033890-105.2811002011.000000255.250000
max9.8000001650.540.033890-105.2811002013.000000365.000000
# View index values of dataframe
boulder_precip_2003_2013.index
DatetimeIndex(['2003-01-01', '2003-01-05', '2003-02-01', '2003-02-02',
               '2003-02-03', '2003-02-05', '2003-02-06', '2003-02-07',
               '2003-02-10', '2003-02-18',
               ...
               '2013-11-01', '2013-11-09', '2013-11-21', '2013-11-27',
               '2013-12-01', '2013-12-04', '2013-12-22', '2013-12-23',
               '2013-12-29', '2013-12-31'],
              dtype='datetime64[ns]', name='DATE', length=792, freq=None)

Subset Pandas Dataframe By Year

Because you have a dataframe set up with an index, you can start to easily subset your data using the syntax:

df["index_date"]

Note that the datetime index value is accessed using quotation marks "" similar to how you query for text strings.

Using this syntax, you can select all of the data for the year 2013 by specifying the value that you want to select from the datetime index:

# Select 2013 data - view first few records
boulder_precip_2003_2013['2013'].head()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2013-01-010.0COOP:050843BOULDER 2 CO US1650.540.0338-105.281120131
2013-01-280.1COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201328
2013-01-290.1COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201329
2013-02-010.0COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201332
2013-02-140.1COOP:050843BOULDER 2 CO US1650.540.0338-105.2811201345
# Select 2013 data - view last few records
boulder_precip_2003_2013['2013'].tail()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2013-12-040.4COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013338
2013-12-220.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013356
2013-12-230.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013357
2013-12-290.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013363
2013-12-310.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013365

Note that in the previous example, you are querying the datetime index directly, not querying the values from the Year column.

Subset Pandas Dataframe By Month

Using a datetime index with pandas makes it really easy to continue to select data using additional attributes of the index such as month.

This attribute of the datetime index can be accessed as:

df.index.month == value

where the month values are numeric values ranging from 1 to 12, representing January through December.

With this attribute, you can now employ the pandas syntax to filter values in a pandas dataframe using the syntax:

df[df.index.month == value]

# Select all December data - view first few rows
boulder_precip_2003_2013[boulder_precip_2003_2013.index.month == 12].head()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2003-12-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.281112003335
2004-12-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.281112004336
2004-12-220.2COOP:050843BOULDER 2 CO US1650.540.03389-105.281112004357
2004-12-240.1COOP:050843BOULDER 2 CO US1650.540.03389-105.281112004359
2005-12-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.281112005335
# Select all December data - view last few rows
boulder_precip_2003_2013[boulder_precip_2003_2013.index.month == 12].tail()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2013-12-040.4COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013338
2013-12-220.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013356
2013-12-230.1COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013357
2013-12-290.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013363
2013-12-310.0COOP:050843BOULDER 2 CO US1650.540.0338-105.28112013365

Notice that head() displays December records in 2003, while tail() displays December records in 2013.

Subset Pandas Dataframe By Day of Month

Similarly, you can the attribute day of the index to select all records for a specific day of the month as follows:

df.index.month == value

where the month values are numeric values ranging from 1 to 31, representing possible days of the month.

# Select data for 1st of month - view first rows
boulder_precip_2003_2013[boulder_precip_2003_2013.index.day == 1]
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2003-01-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.2811120031
2003-02-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200332
2003-03-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200360
2003-04-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.28111200391
2003-05-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.281112003121
...........................
2013-08-010.1COOP:050843BOULDER 2 CO US1650.540.03380-105.281102013213
2013-09-010.0COOP:050843BOULDER 2 CO US1650.540.03380-105.281102013244
2013-10-010.0COOP:050843BOULDER 2 CO US1650.540.03380-105.281102013274
2013-11-010.0COOP:050843BOULDER 2 CO US1650.540.03380-105.281102013305
2013-12-010.0COOP:050843BOULDER 2 CO US1650.540.03380-105.281102013335

132 rows × 8 columns

Subset Pandas Dataframe Using Range of Dates

You can also subset the data using a specific date range using the syntax:

df["begin_index_date" : "end_index_date]

For example, you can subset the data to a desired time period such as May 1, 2005 - August 31 2005, and then save it to a new dataframe.

# Subset data to May-Aug 2005
precip_may_aug_2005 = boulder_precip_2003_2013['2005-05-01':'2005-08-31']

precip_may_aug_2005.head()
DAILY_PRECIPSTATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEYEARJULIAN
DATE
2005-05-010.1COOP:050843BOULDER 2 CO US1650.540.03389-105.281112005121
2005-05-111.2COOP:050843BOULDER 2 CO US1650.540.03389-105.281112005131
2005-05-300.5COOP:050843BOULDER 2 CO US1650.540.03389-105.281112005150
2005-05-310.1COOP:050843BOULDER 2 CO US1650.540.03389-105.281112005151
2005-06-010.0COOP:050843BOULDER 2 CO US1650.540.03389-105.281112005152

Check Minimum and Maximum Values of Index

Rather than just checking the results of head() and tail(), you can actually query the min and max values of the index as follows:

# Check min value of index 
print(precip_may_aug_2005.index.min())

# Check max value of index 
print(precip_may_aug_2005.index.max())
2005-05-01 00:00:00
2005-08-23 00:00:00

Plot Temporal Subsets From Pandas Dataframe

Once you have subsetted the data and saved it, you can plot the data from the new dataframe to focus in on the desired time period.

Once again, you will use .index.values to access the datetime index values for the plot.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.bar(precip_may_aug_2005.index.values,
       precip_may_aug_2005['DAILY_PRECIP'],
       color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Total Precipitation\nMay - Aug 2005 for Boulder Creek")

# Rotate tick marks on x-axis
plt.setp(ax.get_xticklabels(), rotation=45)

plt.show()
Bar plot showing daily total precipitation for Boulder Creek between May and Aug 2005.
Bar plot showing daily total precipitation for Boulder Creek between May and Aug 2005.

Think of New Applications and Uses of Subsetting

Given what you have learned about using df.index.month and df.index.day to select data by the month or day of the month value:

  • What would you replace month or day with, in order to select data by year or even a specific week of the year?

Leave a Comment