Lesson 4. Resample or Summarize Time Series Data in Python With Pandas - Hourly to Daily Summary


Learning Objectives

  • Resample time series data from hourly to daily, monthly, or yearly using pandas.

Resample Time Series Data Using Pandas Dataframes

Often you need to summarize or aggregate time series data by a new time period. For instance, you may want to summarize hourly data to provide a daily maximum value.

This process of changing the time period that data are summarized for is often called resampling.

Lucky for you, there is a nice resample() method for pandas dataframes that have a datetime index.

On this page, you will learn how to use this resample() method to aggregate time series data by a new time period (e.g. daily to monthly).

Import Packages and Get Data

You will use the precipitation data from the National Centers for Environmental Information (formerly National Climate Data Center) Cooperative Observer Network (COOP) that you used previously in this chapter.

This time, however, you will use the hourly data that was not aggregated to a daily sum:

805333-precip-daily-1948-2013.csv

This dataset contains the precipitation values collected hourly from the COOP station 050843 in Boulder, CO for January 1, 1948 through December 31, 2013. This means that there are sometimes multiple values collected for each day if it happened to rain throughout the day.

Before using the data, consider a few things about how it was collected:

  1. The data were collected over several decades, and the data were not always collected consistently.
  2. The data are not cleaned. You may find heading names that are not meaningful, and other issues with the data that need to be explored.

To begin, import the necessary packages to work with pandas dataframe and download data.

You will continue to work with modules from pandas and matplotlib to plot dates more efficiently and with seaborn to make more attractive plots.

# Import necessary packages
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import earthpy as et

# Handle date time conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Use white grid plot background from seaborn
sns.set(font_scale=1.5, style="whitegrid")
# Download the data
data = et.data.get_data('colorado-flood')
# Set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# Define relative path to file with hourly precip
file_path = os.path.join("data", "colorado-flood",
                         "precipitation",
                         "805325-precip-daily-2003-2013.csv")

Just as before, when you import the file to a pandas dataframe, be sure to specify the:

  • no data values using the parameter na_values
  • date column using the parameter parse_dates
  • datetime index using the parameter index_col
# Import data using datetime and no data value
precip_2003_2013_hourly = pd.read_csv(file_path,
                                      parse_dates=['DATE'],
                                      index_col=['DATE'],
                                      na_values=['999.99'])

# View first few rows
precip_2003_2013_hourly.head()
STATIONSTATION_NAMEELEVATIONLATITUDELONGITUDEHPCPMeasurement FlagQuality Flag
DATE
2003-01-01 01:00:00COOP:050843BOULDER 2 CO US1650.540.03389-105.281110.0g
2003-02-01 01:00:00COOP:050843BOULDER 2 CO US1650.540.03389-105.281110.0g
2003-02-02 19:00:00COOP:050843BOULDER 2 CO US1650.540.03389-105.281110.2
2003-02-02 22:00:00COOP:050843BOULDER 2 CO US1650.540.03389-105.281110.1
2003-02-03 02:00:00COOP:050843BOULDER 2 CO US1650.540.03389-105.281110.1

About the Precipitation Data

The structure of the data is similar to what you saw in previous lessons. The HPCP column contains the total precipitation given in inches, recorded for the hour ending at the time specified by DATE.

There is a designated missing data value of 999.99. Note that if there is no precipitation recorded in a particular hour, then no value is recorded.

Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf.

Note, as of Sept. 2016, there is a mismatch in the data downloaded and the documentation. The differences are in the units and corresponding no data value: 999.99 for inches or 25399.75 for millimeters.

Once again, explore the data before you begin to work with it.

# View dataframe info
precip_2003_2013_hourly.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1840 entries, 2003-01-01 01:00:00 to 2013-12-31 00:00:00
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   STATION           1840 non-null   object 
 1   STATION_NAME      1840 non-null   object 
 2   ELEVATION         1840 non-null   float64
 3   LATITUDE          1840 non-null   float64
 4   LONGITUDE         1840 non-null   float64
 5   HPCP              1746 non-null   float64
 6   Measurement Flag  1840 non-null   object 
 7   Quality Flag      1840 non-null   object 
dtypes: float64(4), object(4)
memory usage: 129.4+ KB
# View summary statistics
precip_2003_2013_hourly.describe()
ELEVATIONLATITUDELONGITUDEHPCP
count1840.01840.0000001840.0000001746.000000
mean1650.540.033851-105.2811060.111856
std0.00.0000450.0000050.093222
min1650.540.033800-105.2811100.000000
25%1650.540.033800-105.2811100.100000
50%1650.540.033890-105.2811100.100000
75%1650.540.033890-105.2811000.100000
max1650.540.033890-105.2811002.200000
# View index values of dataframe
precip_2003_2013_hourly.index
DatetimeIndex(['2003-01-01 01:00:00', '2003-02-01 01:00:00',
               '2003-02-02 19:00:00', '2003-02-02 22:00:00',
               '2003-02-03 02:00:00', '2003-02-05 02:00:00',
               '2003-02-05 08:00:00', '2003-02-06 00:00:00',
               '2003-02-07 12:00:00', '2003-02-10 13:00:00',
               ...
               '2013-12-01 01:00:00', '2013-12-03 20:00:00',
               '2013-12-04 03:00:00', '2013-12-04 06:00:00',
               '2013-12-04 09:00:00', '2013-12-22 01:00:00',
               '2013-12-23 00:00:00', '2013-12-23 02:00:00',
               '2013-12-29 01:00:00', '2013-12-31 00:00:00'],
              dtype='datetime64[ns]', name='DATE', length=1840, freq=None)

Plot Hourly Precipitation Data

Plot the hourly data and notice that there are often multiple records for a single day.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(precip_2003_2013_hourly.index.values,
           precip_2003_2013_hourly['HPCP'],
           color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Hourly Precipitation - Boulder Station\n 2003-2013")

plt.show()
Scatter plot showing hourly precipitation for Boulder, CO from 2003 to 2013.
Scatter plot showing hourly precipitation for Boulder, CO from 2003 to 2013.

Also, notice that the plot is not displaying each individual hourly timestamp, but rather, has aggregated the x-axis labels to the year. (On the next page, you will learn how to customize these labels!)

Resample Hourly Data to Daily Data

To simplify your plot which has a lot of data points due to the hourly records, you can aggregate the data for each day using the .resample() method.

To aggregate or temporal resample the data for a time period, you can take all of the values for each day and summarize them.

In this case, you want total daily rainfall, so you will use the resample() method together with .sum().

As previously mentioned, resample() is a method of pandas dataframes that can be used to summarize data by date or time. The .sum() method will add up all values for each resampling period (e.g. for each day) to provide a summary output value for that period.

As you have already set the DATE column as the index, pandas already knows what to use for the date index.

df.resample('D').sum()

The 'D' specifies that you want to aggregate, or resample, by day.

# Resample to daily precip sum and save as new dataframe
precip_2003_2013_daily = precip_2003_2013_hourly.resample('D').sum()

precip_2003_2013_daily
ELEVATIONLATITUDELONGITUDEHPCP
DATE
2003-01-011650.540.03389-105.281110.0
2003-01-020.00.000000.000000.0
2003-01-030.00.000000.000000.0
2003-01-040.00.000000.000000.0
2003-01-050.00.000000.000000.0
...............
2013-12-270.00.000000.000000.0
2013-12-280.00.000000.000000.0
2013-12-291650.540.03380-105.281100.0
2013-12-300.00.000000.000000.0
2013-12-311650.540.03380-105.281100.0

4018 rows × 4 columns

Now that you have resampled the data, each HPCP value now represents a daily total or sum of all precipitation measured that day. Also notice that your DATE index no longer contains hourly time stamps, as you now have only one summary value or row per day.

Data Tip: You can also resample using the syntax below if you have not already set the DATE column as an index during the import process.

# Set date column as index 
precip_hourly_index = precip_hourly.set_index('DATE')

# Resample to daily sum of precip
precip_daily = precip_hourly_index.resample('D').sum()

Plot Daily Precipitation Data

Plot the aggregated dataframe for daily total precipitation and notice that the y axis has increased in range and that there is only one data point for each day (though there are still quite a lot of points!).

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(precip_2003_2013_daily.index.values,
           precip_2003_2013_daily['HPCP'],
           color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Precipitation - Boulder Station\n 2003-2013")

plt.show()
Scatter plot of daily total precipitation subsetted 2003-2013.
Scatter plot of daily total precipitation subsetted 2003-2013.

Resample Daily Data to Monthly Data

You can use the same syntax to resample the data again, this time from daily to monthly using:

df.resample('M').sum()

with 'M' specifying that you want to aggregate, or resample, by month.

# Resample to monthly precip sum and save as new dataframe
precip_2003_2013_monthly = precip_2003_2013_daily.resample('M').sum()

precip_2003_2013_monthly
ELEVATIONLATITUDELONGITUDEHPCP
DATE
2003-01-311650.540.03389-105.281110.0
2003-02-2826408.0640.54224-1684.497761.4
2003-03-3174272.51801.52505-4737.649955.2
2003-04-3028058.5680.57613-1789.778871.6
2003-05-3134660.5840.71169-2210.903313.3
...............
2013-08-3114854.5360.30420-947.529901.0
2013-09-30118836.02882.43360-7580.2392017.7
2013-10-3131359.5760.64220-2000.340902.0
2013-11-308252.5200.16900-526.405500.4
2013-12-3116505.0400.33800-1052.811000.5

132 rows × 4 columns

Once again, notice that now that you have resampled the data, each HPCP value now represents a monthly total and that you have only one summary value for each month.

Plot Monthly Precipitation Data

Plot the aggregated dataframe for monthly total precipitation and notice that the y axis has again increased in range and that there is only one data point for each month.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(precip_2003_2013_monthly.index.values,
           precip_2003_2013_monthly['HPCP'],
           color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Monthly Precipitation - Boulder Station\n 2003-2013")

plt.show()
Scatter plot of monthly total precipitation subsetted 2003-2013.
Scatter plot of monthly total precipitation subsetted 2003-2013.

Resample Monthly Data to Yearly Data

You can use the same syntax to resample the data one last time, this time from monthly to yearly using:

df.resample('Y').sum()

with 'Y' specifying that you want to aggregate, or resample, by year.

# Resample to monthly precip sum and save as new dataframe
precip_2003_2013_yearly = precip_2003_2013_monthly.resample('Y').sum()

precip_2003_2013_yearly
ELEVATIONLATITUDELONGITUDEHPCP
DATE
2003-12-31255827.56205.25295-16318.5720517.6
2004-12-31349906.08487.18468-22319.5953222.6
2005-12-31292138.57085.99853-18634.7564716.7
2006-12-31278934.56765.72741-17792.5075916.8
2007-12-31259128.56285.32073-16529.1342715.0
2008-12-31239322.55804.91405-15265.7609514.0
2009-12-31250876.06085.13949-16002.7274114.7
2010-12-31272332.56605.57700-17371.3815017.6
2011-12-31300391.07286.15160-19161.1602017.5
2012-12-31153496.53723.14340-9791.142309.5
2013-12-31384566.59327.87540-24530.4963033.3

After the resample, each HPCP value now represents a yearly total, and there is now only one summary value for each year.

Notice that the dates have also been updated in the dataframe as the last day of each year (e.g. 2013-12-31). This is important to note for the plot, in which the values will appear along the x axis with one value at the end of each year.

Note that you can also resample the hourly data to a yearly timestep, without first resampling the data to a daily or monthly timestep:

precip_2003_2013_yearly = precip_2003_2013_hourly.resample('Y').sum()

This helps to improve the efficiency of your code if you do not need the intermediate resampled timesteps (e.g. daily, monthly) for a different purpose.

Plot Yearly Precipitation Data

To minimize your code further, you can use precip_2003_2013_hourly.resample('Y').sum() directly in the plot code, rather than precip_2003_2013_yearly, as shown below:

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(precip_2003_2013_hourly.resample('Y').sum().index.values,
           precip_2003_2013_hourly.resample('Y').sum()['HPCP'],
           color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Yearly Precipitation - Boulder Station\n 2003-2013")

plt.show()
Scatter plot of yearly total precipitation subsetted 2003-2013.
Scatter plot of yearly total precipitation subsetted 2003-2013.

Think of New Applications and Uses of Resampling

Given what you have learned about resampling, how would change the code df.resample('D').sum() to resample the data to a weekly interval?

How about changing the code df.resample('D').sum() calculate a mean, minimum or maximum value, rather than a sum?

Leave a Comment