- Resample time series data from hourly to daily, monthly, or yearly using pandas.
Resample Time Series Data Using Pandas Dataframes
Often you need to summarize or aggregate time series data by a new time period. For instance, you may want to summarize hourly data to provide a daily maximum value.
This process of changing the time period that data are summarized for is often called resampling.
Lucky for you, there is a nice
resample() method for pandas dataframes that have a
On this page, you will learn how to use this
resample() method to aggregate time series data by a new time period (e.g. daily to monthly).
Import Packages and Get Data
You will use the precipitation data from the National Centers for Environmental Information (formerly National Climate Data Center) Cooperative Observer Network (COOP) that you used previously in this chapter.
This time, however, you will use the hourly data that was not aggregated to a daily sum:
This dataset contains the precipitation values collected hourly from the COOP station 050843 in Boulder, CO for January 1, 1948 through December 31, 2013. This means that there are sometimes multiple values collected for each day if it happened to rain throughout the day.
Before using the data, consider a few things about how it was collected:
- The data were collected over several decades, and the data were not always collected consistently.
- The data are not cleaned. You may find heading names that are not meaningful, and other issues with the data that need to be explored.
To begin, import the necessary packages to work with pandas dataframe and download data.
You will continue to work with modules from pandas and matplotlib to plot dates more efficiently and with seaborn to make more attractive plots.
# Import necessary packages import os import matplotlib.pyplot as plt import seaborn as sns import pandas as pd import earthpy as et # Handle date time conversions between pandas and matplotlib from pandas.plotting import register_matplotlib_converters register_matplotlib_converters() # Use white grid plot background from seaborn sns.set(font_scale=1.5, style="whitegrid")
# Download the data data = et.data.get_data('colorado-flood')
# Set working directory os.chdir(os.path.join(et.io.HOME, 'earth-analytics')) # Define relative path to file with hourly precip file_path = os.path.join("data", "colorado-flood", "precipitation", "805325-precip-daily-2003-2013.csv")
Just as before, when you import the file to a pandas dataframe, be sure to specify the:
- no data values using the parameter
- date column using the parameter
- datetime index using the parameter
# Import data using datetime and no data value precip_2003_2013_hourly = pd.read_csv(file_path, parse_dates=['DATE'], index_col=['DATE'], na_values=['999.99']) # View first few rows precip_2003_2013_hourly.head()
|STATION||STATION_NAME||ELEVATION||LATITUDE||LONGITUDE||HPCP||Measurement Flag||Quality Flag|
|2003-01-01 01:00:00||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||0.0||g|
|2003-02-01 01:00:00||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||0.0||g|
|2003-02-02 19:00:00||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||0.2|
|2003-02-02 22:00:00||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||0.1|
|2003-02-03 02:00:00||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||0.1|
About the Precipitation Data
The structure of the data is similar to what you saw in previous lessons. The
HPCP column contains the total precipitation given in inches, recorded for the hour ending at the time specified by
There is a designated missing data value of
999.99. Note that if there is no precipitation recorded in a particular hour, then no value is recorded.
Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf.
Note, as of Sept. 2016, there is a mismatch in the data downloaded and the documentation. The differences are in the units and corresponding no data value: 999.99 for inches or 25399.75 for millimeters.
Once again, explore the data before you begin to work with it.
# View dataframe info precip_2003_2013_hourly.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 1840 entries, 2003-01-01 01:00:00 to 2013-12-31 00:00:00 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 STATION 1840 non-null object 1 STATION_NAME 1840 non-null object 2 ELEVATION 1840 non-null float64 3 LATITUDE 1840 non-null float64 4 LONGITUDE 1840 non-null float64 5 HPCP 1746 non-null float64 6 Measurement Flag 1840 non-null object 7 Quality Flag 1840 non-null object dtypes: float64(4), object(4) memory usage: 129.4+ KB
# View summary statistics precip_2003_2013_hourly.describe()
# View index values of dataframe precip_2003_2013_hourly.index
DatetimeIndex(['2003-01-01 01:00:00', '2003-02-01 01:00:00', '2003-02-02 19:00:00', '2003-02-02 22:00:00', '2003-02-03 02:00:00', '2003-02-05 02:00:00', '2003-02-05 08:00:00', '2003-02-06 00:00:00', '2003-02-07 12:00:00', '2003-02-10 13:00:00', ... '2013-12-01 01:00:00', '2013-12-03 20:00:00', '2013-12-04 03:00:00', '2013-12-04 06:00:00', '2013-12-04 09:00:00', '2013-12-22 01:00:00', '2013-12-23 00:00:00', '2013-12-23 02:00:00', '2013-12-29 01:00:00', '2013-12-31 00:00:00'], dtype='datetime64[ns]', name='DATE', length=1840, freq=None)
Plot Hourly Precipitation Data
Plot the hourly data and notice that there are often multiple records for a single day.
# Create figure and plot space fig, ax = plt.subplots(figsize=(10, 10)) # Add x-axis and y-axis ax.scatter(precip_2003_2013_hourly.index.values, precip_2003_2013_hourly['HPCP'], color='purple') # Set title and labels for axes ax.set(xlabel="Date", ylabel="Precipitation (inches)", title="Hourly Precipitation - Boulder Station\n 2003-2013") plt.show()
Also, notice that the plot is not displaying each individual hourly timestamp, but rather, has aggregated the x-axis labels to the year. (On the next page, you will learn how to customize these labels!)
Resample Hourly Data to Daily Data
To simplify your plot which has a lot of data points due to the hourly records, you can aggregate the data for each day using the
To aggregate or temporal resample the data for a time period, you can take all of the values for each day and summarize them.
In this case, you want total daily rainfall, so you will use the
resample() method together with
As previously mentioned,
resample() is a method of pandas dataframes that can be used to summarize data by date or time. The
.sum() method will add up all values for each resampling period (e.g. for each day) to provide a summary output value for that period.
As you have already set the
DATE column as the index, pandas already knows what to use for the date index.
'D' specifies that you want to aggregate, or resample, by day.
# Resample to daily precip sum and save as new dataframe precip_2003_2013_daily = precip_2003_2013_hourly.resample('D').sum() precip_2003_2013_daily
4018 rows × 4 columns
Now that you have resampled the data, each HPCP value now represents a daily total or sum of all precipitation measured that day. Also notice that your
DATE index no longer contains hourly time stamps, as you now have only one summary value or row per day.
Data Tip: You can also resample using the syntax below if you have not already set the DATE column as an index during the import process.
# Set date column as index precip_hourly_index = precip_hourly.set_index('DATE') # Resample to daily sum of precip precip_daily = precip_hourly_index.resample('D').sum()
Plot Daily Precipitation Data
Plot the aggregated dataframe for daily total precipitation and notice that the y axis has increased in range and that there is only one data point for each day (though there are still quite a lot of points!).
# Create figure and plot space fig, ax = plt.subplots(figsize=(10, 10)) # Add x-axis and y-axis ax.scatter(precip_2003_2013_daily.index.values, precip_2003_2013_daily['HPCP'], color='purple') # Set title and labels for axes ax.set(xlabel="Date", ylabel="Precipitation (inches)", title="Daily Precipitation - Boulder Station\n 2003-2013") plt.show()
Resample Daily Data to Monthly Data
You can use the same syntax to resample the data again, this time from daily to monthly using:
'M' specifying that you want to aggregate, or resample, by month.
# Resample to monthly precip sum and save as new dataframe precip_2003_2013_monthly = precip_2003_2013_daily.resample('M').sum() precip_2003_2013_monthly
132 rows × 4 columns
Once again, notice that now that you have resampled the data, each HPCP value now represents a monthly total and that you have only one summary value for each month.
Plot Monthly Precipitation Data
Plot the aggregated dataframe for monthly total precipitation and notice that the y axis has again increased in range and that there is only one data point for each month.
# Create figure and plot space fig, ax = plt.subplots(figsize=(10, 10)) # Add x-axis and y-axis ax.scatter(precip_2003_2013_monthly.index.values, precip_2003_2013_monthly['HPCP'], color='purple') # Set title and labels for axes ax.set(xlabel="Date", ylabel="Precipitation (inches)", title="Monthly Precipitation - Boulder Station\n 2003-2013") plt.show()
Resample Monthly Data to Yearly Data
You can use the same syntax to resample the data one last time, this time from monthly to yearly using:
'Y' specifying that you want to aggregate, or resample, by year.
# Resample to monthly precip sum and save as new dataframe precip_2003_2013_yearly = precip_2003_2013_monthly.resample('Y').sum() precip_2003_2013_yearly
After the resample, each HPCP value now represents a yearly total, and there is now only one summary value for each year.
Notice that the dates have also been updated in the dataframe as the last day of each year (e.g. 2013-12-31). This is important to note for the plot, in which the values will appear along the x axis with one value at the end of each year.
Note that you can also resample the hourly data to a yearly timestep, without first resampling the data to a daily or monthly timestep:
precip_2003_2013_yearly = precip_2003_2013_hourly.resample('Y').sum()
This helps to improve the efficiency of your code if you do not need the intermediate resampled timesteps (e.g. daily, monthly) for a different purpose.
Plot Yearly Precipitation Data
To minimize your code further, you can use
precip_2003_2013_hourly.resample('Y').sum() directly in the plot code, rather than
precip_2003_2013_yearly, as shown below:
# Create figure and plot space fig, ax = plt.subplots(figsize=(10, 10)) # Add x-axis and y-axis ax.scatter(precip_2003_2013_hourly.resample('Y').sum().index.values, precip_2003_2013_hourly.resample('Y').sum()['HPCP'], color='purple') # Set title and labels for axes ax.set(xlabel="Date", ylabel="Precipitation (inches)", title="Yearly Precipitation - Boulder Station\n 2003-2013") plt.show()
Think of New Applications and Uses of Resampling
Given what you have learned about resampling, how would change the code
df.resample('D').sum() to resample the data to a weekly interval?
How about changing the code
df.resample('D').sum() calculate a mean, minimum or maximum value, rather than a sum?