In this lesson, you learn how to subset time series data into Python. You will also test the skills that you learned in the previous lessons to handle dates and missing data in
After completing this tutorial, you will be able to:
- Subset time series data by dates
What You Need
Python 3.x and
Jupyter notebooks to complete this tutorial. Also you should have an
earth-analytics directory setup on your computer with a
/data directory with it.
About The NOAA Precipitation Data Used In This Lesson
To complete this lesson, you will use a slightly modified version of precipitation data downloaded through the National Centers for Environmental Information (formerly National Climate Data Center) Cooperative Observer Network (COOP) station 050843 in Boulder, CO. The data were collected : 1 January 2003 through 31 December 2013.
Your instructor has modified these data as follows for this lesson:
- aggregated the data to represent daily sum values
- added some
noDatavalues to allow you to practice handing missing data
- added several columns to this data that would not usually be there if you downloaded it directly.
The added columns include:
- Julian day
How Is Precipitation Measured?
Precipitation can be measured using different types of gages. Some gages are manually read and emptied. Others gages automatically record the amount of precipitation collected. If the precipitation is in a frozen form (snow, hail, freezing rain) then the contents of the gage are melted to get the water equivalency for measurement. Rainfall is generally reported as the total amount of rain (millimeters, centimeters, or inches) over a given period of time.
Data Tip: Precipitation is the moisture that falls from clouds including rain, hail and snow.
Boulder, Colorado lays on the eastern edge of the Rocky Mountains where they meet the high plains. The average annual precipitation is near 20”. However, the precipitation comes in many forms including winter snow, intense summer thunderstorms, and intermittent storms throughout the year.
Use Precipitation Time Series Data in Python
You can use precipitation data to understand events like the 2013 floods that occurred in Colorado. However to work with these data in Python, you need to know how to do a few things:
- Open a
.csvfile in Python
- Ensure dates are read as a date/time format in python
- Handle missing data values appropriately
It’s also useful to know how to subset the data by time periods when analyzing and plotting.
You’ve already learned how to open a
.csv and how to handle dates. In this lesson you will use these skills and learn how to subset the data by time.
Get Started With Time Series Data
Get started by loading the required python libraries into your Jupyter notebook.
# load python libraries import numpy as np import pandas as pd import matplotlib.pyplot as plt import os plt.ion() import earthpy as et # set working directory os.chdir(os.path.join(et.io.HOME, 'earth-analytics')) # set standard plot parameters for uniform plotting plt.rcParams['figure.figsize'] = (8, 8) # set working directory os.chdir(os.path.join(et.io.HOME, 'earth-analytics')) # set standard plot parameters for uniform plotting plt.rcParams['figure.figsize'] = (10, 6) # prettier plotting with seaborn import seaborn as sns; sns.set(font_scale=1.5) sns.set_style("whitegrid")
# read the data into python boulder_daily_precip = pd.read_csv('data/colorado-flood/precipitation/805325-precip-dailysum-2003-2013.csv', parse_dates=['DATE'], na_values = ['999.99']) # view first 5 rows boulder_daily_precip.head()
|0||2003-01-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||1|
|1||2003-01-05||NaN||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||5|
|2||2003-02-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||32|
|3||2003-02-02||NaN||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||33|
|4||2003-02-03||0.4||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||34|
It’s always a good idea to explore the data before working with it.
# view structure of data boulder_daily_precip.dtypes
DATE datetime64[ns] DAILY_PRECIP float64 STATION object STATION_NAME object ELEVATION float64 LATITUDE float64 LONGITUDE float64 YEAR int64 JULIAN int64 dtype: object
# view data summary statistics for all columns boulder_daily_precip.describe()
About the Data
Viewing the structure of these data, you can see that different types of data are included in this file.
- STATION and STATION_NAME: Identification of the COOP station.
- ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.
- DATE: The date when the data were collected in the format: YYYYMMDD. Notice that
datetime64because you used the
parse_datefunction on the date column when the csv was first read in.
- DAILY_PRECIP: The total precipitation in inches. Important: the meta data notes that the value 999.99 indicates missing data. Also important,hours with no precipitation are not recorded.
- YEAR: the year the data were collected
- JULIAN: the JULIAN DAY the data were collected.
Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf. The metadata tell us that the noData value for these data is 999.99. IMPORTANT: your instructor has modified these data a bit for ease of teaching and learning. Specifically, she aggregated the data to represent daily sum values and added some noData values to ensure you learn how to clean them.
Data Tip You can download the original complete data subset with additional documentation here.
Subset Data Temporally in Pandas
There are many ways to subset the data temporally in Python. The easiest way to do this is to use pandas. Pandas natively understands time operations if
- you tell it what column contains your time stamps and
- you set the date column of your data to be the index of the dataframe
Both of these steps can be included in your read_csv import by adding the arguments:
as done below. Note that for each argument, the column that contains the time stamps is specified.
# read the data into python setting date as an index boulder_daily_precip = pd.read_csv('data/colorado-flood/precipitation/805325-precip-dailysum-2003-2013.csv', parse_dates=['DATE'], na_values = ['999.99'], index_col = 'DATE')
Now the magic happens! Once you have specified an index column, your data frame looks like the one below when printed. Notice that the DATE column is lower visually then the other column names. It’s also on the LEFT hand side of the dataframe. This is because it is now an index.
|2003-01-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||1|
|2003-01-05||NaN||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||5|
|2003-02-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||32|
|2003-02-02||NaN||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||33|
|2003-02-03||0.4||COOP:050843||BOULDER 2 CO US||1650.5||40.03389||-105.28111||2003||34|
Also notice if you look at the structure of your dataframe, that the DATE column isn’t rendered. But if you call
boulder_daily_precip.index you will find it.
# your date column that is the index now is not listed below boulder_daily_precip.dtypes
DAILY_PRECIP float64 STATION object STATION_NAME object ELEVATION float64 LATITUDE float64 LONGITUDE float64 YEAR int64 JULIAN int64 dtype: object
# access the index of your dataframe boulder_daily_precip.index
DatetimeIndex(['2003-01-01', '2003-01-05', '2003-02-01', '2003-02-02', '2003-02-03', '2003-02-05', '2003-02-06', '2003-02-07', '2003-02-10', '2003-02-18', ... '2013-11-01', '2013-11-09', '2013-11-21', '2013-11-27', '2013-12-01', '2013-12-04', '2013-12-22', '2013-12-23', '2013-12-29', '2013-12-31'], dtype='datetime64[ns]', name='DATE', length=792, freq=None)
Once you have a dataframe setup with an index, you can begin to subset your data using the syntax:
# below you subset all of the data for 2013 - the first 5 rows are shown boulder_daily_precip['2013'].head()
|2013-01-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||1|
|2013-01-28||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||28|
|2013-01-29||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||29|
|2013-02-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||32|
|2013-02-14||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||45|
You can subset within a date range using the syntax below.
# you can subset this way boulder_daily_precip['2013-05-01':'2013-06-06']
|2013-05-01||1.4||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||121|
|2013-05-02||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||122|
|2013-05-08||0.4||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||128|
|2013-05-09||0.5||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||129|
|2013-05-20||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||140|
|2013-05-23||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||143|
|2013-05-29||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||149|
|2013-06-01||0.0||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||152|
|2013-06-05||0.1||COOP:050843||BOULDER 2 CO US||1650.5||40.0338||-105.2811||2013||156|
For this example, subset the data to the time period: January 1 2003 - October 31 2003.
# subset the data to a date range precip_boulder_AugOct = boulder_daily_precip['2003-01-01':'2003-10-31'] # did it work? print(precip_boulder_AugOct.index.min()) print(precip_boulder_AugOct.index.max())
2003-01-01 00:00:00 2003-10-31 00:00:00
Plot Subsetted Data
Once you’ve subsetted the data, you can plot the data to focus in on the new time period. Note that you need to call
.index to access the date column which is now an index rather than a regular index.
You can see that in the plot call below:
ax.scatter(precip_boulder_AugOct.index.values, precip_boulder_AugOct['DAILY_PRECIP'].values, color = 'purple')
You are now ready to plot your data.
# plot the data fig, ax = plt.subplots(figsize = (8,8)) ax.scatter(precip_boulder_AugOct.index.values, precip_boulder_AugOct['DAILY_PRECIP'].values, color = 'purple') # add titles and format ax.set_title('Daily Total Precipitation \nAug - Oct 2013 for Boulder Creek') ax.set_xlabel('Date') ax.set_ylabel('Precipitation (Inches)');
The plot above appears to be subsetted correctly in time. However the plot itself looks off. Any ideas what is going on? Complete the challenge below to test your knowledge.
Using everything you’ve learned in the previous lessons:
- Import the dataset:
- Clean the data by assigning
- Make sure the date column is a
- Plot your data
Some notes to help you along:
- Date: be sure to take of the date format when you import the data.
NoData Values: You know that the
no data value = 999.99. You can account for this when you read in the data. Remember how?
Your final plot should look something like the plot below.
Create plots for the following time subsets in the data before and after the flood:
Time period A: 2012-08-01 to 2012-11-01 Time period B: 2013-08-01 to 2013-11-01
When you create your plot, be sure to set the y limits to be the same for both plots so they are visually comparable..
How different was the rainfall in 2012 compared to 2013?