Lesson 1. Work With Datetime Format in Python - Time Series Data


Use Time Series Data in Python With Pandas - Intermediate earth data science textbook course module

Welcome to the first lesson in the Use Time Series Data in Python With Pandas module. Python provides a datetime object for storing and working with dates. Learn how to handle date fields using pandas to work with time series data in Python.

Chapter One - Introduction to Time Series Data in Python

In this chapter, you will learn how to work with the datetime object in Python, which is important for plotting and working with time series data. You will also learn how to work with “no data” values in Python.

Learning Objectives

After completing this chapter, you will be able to:

  • Import a time series dataset using pandas with dates converted to a datetime object in Python.
  • Use the datetime object to create easier-to-read time series plots and work with data across various timeframes (e.g. daily, monthly, yearly) in Python.
  • Explain the role of “no data” values and how the NaN value is used in Python to label “no data” values.
  • Set a “no data” value for a file when you import it into a pandas dataframe.

What You Need

You should have Conda setup on your computer and the Earth Analytics Python Conda environment. Follow the Set up Git, Bash, and Conda on your computer to install these tools.

Be sure that you have reviewed the Introduction to Earth Data Science textbook or are familiar with the Python programming language and the pandas package for working with dataframes.

Why Use Datetime Objects in Python

Dates can be tricky in any programming language. While you may see a date in a dataset and recognize it as something that can be quantified and related to time, a computer reads in numbers and characters. Often by default, date information is loaded as a string (i.e. a set of characters), rather than something that has an order in time.

Luckily, Python provides an easy way to work with dates using datetime objects, and you can actually convert columns in pandas dataframe that contain dates and times as strings into datetime objects.

Recall that objects are used in Python to provide a set of functionality and rules that apply to that specific object type such as:

  1. lists
  2. numpy arrays
  3. pandas dataframes and more

Just like these objects have distinct characteristics and functionality, the datetime object also provides unique functionality to help you to easily work with time series data and customize the dates on your plot, resulting in more visually appealing plots.

On this page, you will learn how to handle dates using the datetime object in Python with pandas, using a dataset of daily temperature (maximum in Fahrenheit) and total precipitation (inches) in July 2018 for Boulder, CO, provided by the National Oceanic and Atmospheric Administration (NOAA).

Import Packages and Get Data

To begin, import the necessary packages to work with pandas dataframe and download data. You will work with modules from pandas and matplotlib to plot dates more efficiently, and you will work with the seaborn package to make more attractive plots.

# Import necessary packages
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import earthpy as et

# Handle date time conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Use white grid plot background from seaborn
sns.set(font_scale=1.5, style="whitegrid")
# Download csv of temp (F) and precip (inches) in July 2018 for Boulder, CO
file_url = "https://ndownloader.figshare.com/files/12948515"
et.data.get_data(url=file_url)

# Set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# Define relative path to file
file_path = os.path.join("data", "earthpy-downloads",
                         "july-2018-temperature-precip.csv")

# Import file into pandas dataframe
boulder_july_2018 = pd.read_csv(file_path)
Downloading from https://ndownloader.figshare.com/files/12948515
# Display first few rows
boulder_july_2018.head()
datemax_tempprecip
02018-07-01870.00
12018-07-02920.00
22018-07-0390-999.00
32018-07-04870.00
42018-07-05840.24
# View dataframe info
boulder_july_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 3 columns):
date        31 non-null object
max_temp    31 non-null int64
precip      31 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 872.0+ bytes

View Data Types in Pandas Dataframes

Now that you have imported the data into a pandas dataframe, you can query the data types of the columns in the dataframe using the dataframe attribute .dtypes.

# View column data types
boulder_july_2018.dtypes
date         object
max_temp      int64
precip      float64
dtype: object

The .dtypes attribute indicates that the data columns in your pandas dataframe are stored as several different data types as follows:

  • date as object: A string of characters that are in quotes.
  • max_temp as int64 64 bit integer. This is a numeric value that will never contain decimal points.
  • precip as float64 - 64 bit float: This data type accepts data that are a wide variety of numeric formats including decimals (floating point values) and integers. Numeric also accept larger numbers than int will.

Investigate the data type in the date column further to see the data type or class of information it contains.

# Check data type of first value in date column
type(boulder_july_2018['date'][0])
str

Notice that while you may see this column as a date, Python stores the values as a type str or string.

Plot Dates as Strings

To understand why using datetime objects can help you to create better plots, begin by creating a standard plot using matplotlib, based on the date column (as a string) and the max_temp column.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.plot(boulder_july_2018['date'],
        boulder_july_2018['precip'],
        color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Total Precipitation\nBoulder, Colorado in July 2018")

plt.show()
Plot of precipitation in Boulder, CO without no data values removed.
Plot of precipitation in Boulder, CO without no data values removed.

First, you will notice that there are many negative values in this dataset - these are actually “no data” values that you will handle in a later section on this page.

Now, look closely at the dates on the x-axis. When you plot a string field for the x-axis, Python gets stuck trying to plot the all of the date labels. Each value is read as a string, and it is difficult to try to fit all of those values on the x axis efficiently.

You can avoid this problem by converting the dates from strings to a datetime object during the import of data into a pandas dataframe. Once the dates are converted to a datetime object, you can more easily customize the dates on your plot, resulting in a more visually appealing plot.

Import Date Column into Pandas Dataframe As Datetime Object

To import the dates as a datetime object, you can use the parse_dates parameter of the pd.read_csv() function that allows you to indicate that a particular column should be converted to a datetime object:

parse_dates = ['date_column']

If you have a single column that contain dates in your data, you can also set dates as the index for the dataframe using the index_col parameter:

index_col = ['date_column']

You will use this index in later lessons to allow you to quickly summarize and aggregate your data by date.

Now you can recreate the dataframe using the parameter parse_datesto convert the dates from strings to datetime values and index_col to set the index of the dataframe to the datetime values.

# Import data using datetime and set index to datetime
boulder_july_2018 = pd.read_csv(file_path,
                                parse_dates=['date'],
                                index_col=['date'])

boulder_july_2018.head()
max_tempprecip
date
2018-07-01870.00
2018-07-02920.00
2018-07-0390-999.00
2018-07-04870.00
2018-07-05840.24

Notice that during the import, you converted the column date to the type datetime and you also set the index of the dataframe as that datetime object.

So rather than the index being the original RangeIndex of values from 0 to 31, the index is now a DatetimeIndex.

# View dataframe info
boulder_july_2018.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31 entries, 2018-07-01 to 2018-07-31
Data columns (total 2 columns):
max_temp    31 non-null int64
precip      31 non-null float64
dtypes: float64(1), int64(1)
memory usage: 744.0 bytes

Recall that after you set an index from a column in a dataframe, that original column is no longer listed as a column but rather is identified as the index of the dataframe.

# View column data types
boulder_july_2018.dtypes
max_temp      int64
precip      float64
dtype: object

Don’t worry - the data are still there! To check the values of an index of a dataframe, you can use the attribute index:

# View dataframe index
boulder_july_2018.index
DatetimeIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04',
               '2018-07-05', '2018-07-06', '2018-07-07', '2018-07-08',
               '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12',
               '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16',
               '2018-07-17', '2018-07-18', '2018-07-19', '2018-07-20',
               '2018-07-21', '2018-07-22', '2018-07-23', '2018-07-24',
               '2018-07-25', '2018-07-26', '2018-07-27', '2018-07-28',
               '2018-07-29', '2018-07-30', '2018-07-31'],
              dtype='datetime64[ns]', name='date', freq=None)

Plot Dates From Pandas Dataframe Using Datetime

In matplotlib, there are slight differences in how bar and scatter plots read in data versus how line plots read in data.

When plotting with ax.bar() and ax.scatter(), numpy is used to concatenate (a fancy word for combine) an array that has been created and passed in for the x-axis and/or y-axis data. However, numpy cannot concatenate the datetime object with other values.

Thus, if you try to pass a datetime column or index directly to ax.bar() or ax.scatter(), you will receive an error.

Use Index Values Attribute to Plot Datetime

To avoid this error, you can call the attribute .values on the datetime index using:

df.index.values

Notice that here you use df.index to access the datetime column because you have assigned your date column to be an index for the dataframe.

Also, notice that the spacing on the x-axis looks better and that your x-axis date labels are easier to read, as Python knows how to only show incremental values using datetime, rather than plotting each and every date value.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(boulder_july_2018.index.values,
        boulder_july_2018['precip'],
        color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Total Precipitation\nBoulder, Colorado in July 2018")

plt.show()
Plot of precipitation with the x-axis dates formated as datetime.
Plot of precipitation with the x-axis dates formated as datetime.
# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.bar(boulder_july_2018.index.values,
        boulder_july_2018['precip'],
        color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Total Precipitation\nBoulder, Colorado in July 2018")

# Rotate tick marks on x-axis
plt.setp(ax.get_xticklabels(), rotation=45)

plt.show()
Bar plot showing daily precipitation with the x-axis dates cleaned up.
Bar plot showing daily precipitation with the x-axis dates cleaned up.

Note: you do not need to use .values when using an index that contains float values, rather than datetime objects, nor when creating a line graph using ax.plot().

However, for consistency of the code, the plot examples in this chapter will use index.values to create all plots using an index.

On later pages of this chapter, you will learn how to customize the date labels even more, such as displaying only month and day values or specifying the spacing for the time interval (i.e. frequency).

Work With No Data Values in Pandas Dataframe

As mentioned previously, you likely noticed that there are many negative values in this dataset, which are actually “no data” values.

Sometimes data are missing from a file due to errors in collection, inability to record a data point, or other reasons. For example, imagine a spreadsheet with cells that are blank. If the cells are blank, you don’t know for sure whether those data weren’t collected, or whether someone forgot to fill them in.

To account for data that are missing (not by mistake), you can put a value into those cells that represents “no data” to make it clear that these data are not usable for analysis or plotting.

Often, you’ll find a dataset that uses a specific value for “no data”. In many scientific disciplines, the value -999 is often used to indicate “no data” values.

In the example on this page, the data in july-2018-temperature-precip.csv contains “no data” values in the precip column using the value -999.

If you do not specify that the value -999 is the “no data” value, the values will be imported as real data, which will impact any statistics, calculations, and plots using the data.

For example, explore the data using describe() to see the impact of reading in those -999 values as true values.

# Both min and mean are affected by these negative, no data values
boulder_july_2018.describe()
max_tempprecip
count31.00000031.000000
mean88.129032-96.618065
std6.626925300.256388
min75.000000-999.000000
25%84.0000000.000000
50%88.0000000.000000
75%94.0000000.050000
max97.0000000.450000

The -999 values were imported as numeric values into the pandas dataframe when it was created, and thus, these values are included in the summary statistic.

To ensure nodata values are properly ignored in your summary statistics, you can specify a “no data” value during the import, so that those values are not read as true numeric values using the parameter na_values as follows:

na_values = value

For example:

na_values = -999

This will tell pandas to treat all values of -999 as no data, and thus, not to include them in the analysis or on a plot.

Now that you know how to handle no data values, recreate the dataframe once more by including a value for the parameter na_values.

# Import data using datetime and no data value
boulder_july_2018 = pd.read_csv(file_path,
                                parse_dates=['date'],
                                index_col=['date'],
                                na_values= [-999])

boulder_july_2018.head()
max_tempprecip
date
2018-07-01870.00
2018-07-02920.00
2018-07-0390NaN
2018-07-04870.00
2018-07-05840.24

Notice that the -999 value for precip on July 3rd has been replaced with NaN to indicate a no data value.

Now have a look at the summary statistics.

# Both min and mean now accurately reflect the true data
boulder_july_2018.describe()
max_tempprecip
count31.00000028.000000
mean88.1290320.065714
std6.6269250.120936
min75.0000000.000000
25%84.0000000.000000
50%88.0000000.000000
75%94.0000000.055000
max97.0000000.450000

And finally, plot the data one last time to see that the negative values -999 are no longer on the plot.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(boulder_july_2018.index.values,
        boulder_july_2018['precip'],
        color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Total Precipitation\nBoulder, Colorado in July 2018")

plt.show()
Bar plot showing daily precipitation with the x-axis dates cleaned up and no data values removed.
Bar plot showing daily precipitation with the x-axis dates cleaned up and no data values removed.

By using the na_values parameter, you told Python to ignore those “no data” values (which are now labeled as NaN) when it performs calculations on the data and when it plots the data.

Note: if there are multiple types of missing values in your dataset, you can add multiple values in the na_values parameter as follows:

na_values=['NA', ' ', -999])

which would specify that the “no data” values are “NA” as a string, an empty space as a string, or the numeric value 999.

Now know you how to import dates into pandas dataframes using the datetime object in Python and how to deal with no data values.

On the next pages of this chapter, you will learn more about working with datetime in Python to temporally subset and resample data as well as customize your plots even more.

Leave a Comment