In this tutorial, you will learn how to work with the datetime object in Python, which is important for plotting and working with time series data. You will also learn how to work with “no data” values in Python.
At the end of this activity, you will be able to:
- Import a time series dataset into Python using pandas with dates converted to a datetime object.
- Describe how you can use the datetime object to create easier-to-read time series plots in Python.
- Explain the role of “no data” values and how the NA value is used in Python to account for “no data” values.
- Set a “no data” value for a file when you import it into a pandas dataframe.
What You Need
You should have completed the lessons on Setting Up the Conda Environment. Be sure that you have a subdirectory called data under your earth-analytics directory.
Begin Working With Datetime Object in Python
Dates can be tricky in any programming language. While you may see a date and recognize it as something that can be quantified and related to time, a computer reads in numbers and characters, and often by default, loads date information as a string (i.e. a set of characters), rather than something that has an order in time.
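To see the difference between a date stored as text and a date stored as a point in time, consider the minimal sketch below. It uses Python's built-in datetime module and three made-up date strings (they are not values from the lesson dataset); the month/day/year format is an assumption chosen only for illustration.

```python
from datetime import datetime

# Hypothetical date strings, written month/day/year (not from the lesson data)
date_strings = ["7/9/2018", "7/10/2018", "7/2/2018"]

# Sorting the raw strings compares them character by character
sorted_as_text = sorted(date_strings)

# Parsing each string into a datetime lets Python order them in time
parsed = [datetime.strptime(s, "%m/%d/%Y") for s in date_strings]
sorted_as_dates = [d.strftime("%Y-%m-%d") for d in sorted(parsed)]

print(sorted_as_text)   # ['7/10/2018', '7/2/2018', '7/9/2018']
print(sorted_as_dates)  # ['2018-07-02', '2018-07-09', '2018-07-10']
```

As text, July 10 sorts before July 2 because the character "1" comes before "2". Once parsed, the dates sort in calendar order.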
In this lesson, you will learn how to handle dates in pandas using a dataset of daily maximum temperature and precipitation in July 2018 for Boulder, CO.
Begin by importing the necessary Python packages to set the working directory and download the file.
# import necessary packages
import os
import urllib.request

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import earthpy as et

# turn on interactive mode so figures display as they are created
plt.ion()

# set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# set standard plot parameters for uniform plotting
plt.rcParams['figure.figsize'] = (10, 6)

# prettier plotting with seaborn
sns.set(font_scale=1.5)
sns.set_style("whitegrid")

file_path = "data/colorado-flood/downloads/july-2018-temperature-precip.csv"

# download file from Earth Lab Figshare repository
urllib.request.urlretrieve(url='https://ndownloader.figshare.com/files/12948515',
                           filename=file_path)
('data/colorado-flood/downloads/july-2018-temperature-precip.csv', <http.client.HTTPMessage at 0x11b89e7b8>)
Next, import the data from data/colorado-flood/downloads/july-2018-temperature-precip.csv into a pandas dataframe and query the data types using the attribute .dtypes.
# import file into pandas dataframe
boulder_july = pd.read_csv(file_path)

# view first few rows of the data
boulder_july.head()

# view data types
boulder_july.dtypes

date         object
max_temp      int64
precip      float64
dtype: object
Data Types in Pandas Dataframes
The .dtypes attribute indicates that the data columns in your DataFrame are stored as several different data types, as follows:
- date as object: a string, i.e. characters that are in quotes.
- max_temp as int64: a 64-bit integer. This is a numeric value that never contains decimal points.
- precip as float64: a 64-bit float. This data type stores numbers with decimal points (floating point values) and can also represent whole numbers. It covers a much wider range of magnitudes than int64, although very large values are stored only approximately.
One Data Type Per Dataframe Column
A pandas dataframe column can only store one data type. This means that a column cannot store both numbers and strings. If a column contains all numbers and one character string, then when you import the data, every value in that column will be stored as a string.
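The one-type-per-column rule can be checked with a small sketch. The CSV text below is hypothetical (it is not the lesson dataset) and is read from memory with io.StringIO so that no file is needed; the column names site and reading are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV: the "reading" column holds two numbers and one word
csv_text = "site,reading\nA,1\nB,2\nC,unknown\n"
df = pd.read_csv(io.StringIO(csv_text))

# The single word forces pandas to keep the whole column as text
reading_is_numeric = pd.api.types.is_numeric_dtype(df["reading"])
value_types = {type(value).__name__ for value in df["reading"]}

print(df["reading"].dtype)   # a text dtype (object in many pandas versions)
print(reading_is_numeric)    # False
print(value_types)           # {'str'}
```

Because even the values 1 and 2 are now strings, calculations such as a mean would not work on this column until the non-numeric entry is dealt with.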
Storing variables using different data types is a strategic decision by Python (and other programming languages) that optimizes processing and storage. It allows:
- data to be processed more quickly and efficiently.
- the program (Python) to minimize the storage size.
Objects are used in Python to provide a set of functionality and rules that apply to that specific object type, such as pandas dataframes and more.
Python provides a datetime object for storing and working with dates, and you can convert columns in a pandas dataframe that contain dates and times as strings into datetime objects.
Investigate the data type in the date column further to see the data type or class of information it contains.
# query the data type of the first value in the date column
type(boulder_july['date'][0])
Notice that while you may see this column as a date, Python classifies it as a type str, or string.
You can easily convert the dates from strings to a datetime object during the import process, which you will see later in the lesson. Once the dates are converted to a datetime object, you can more easily customize the dates on your plot, resulting in a more visually appealing plot.
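Dates can also be converted after a file has been imported. The sketch below uses pd.to_datetime on a small hypothetical dataframe; this is an alternative to the import-time conversion used in this lesson, shown here only to illustrate what the conversion changes.

```python
import pandas as pd

# Hypothetical dataframe with dates stored as strings (not the lesson data)
df = pd.DataFrame({"date": ["2018-07-01", "2018-07-02", "2018-07-03"],
                   "precip": [0.0, 0.1, 0.25]})

# Before conversion, each date is a plain string
first_value_type_before = type(df["date"][0]).__name__

# Convert the string column to datetime after import
df["date"] = pd.to_datetime(df["date"])

# After conversion, pandas can work with the parts of each date
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["date"])
days = df["date"].dt.day.tolist()

print(first_value_type_before)  # str
print(is_datetime_after)        # True
print(days)                     # [1, 2, 3]
```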
Plot Dates as Strings
To understand why using datetime objects can help you to create better plots, begin by creating a plot using matplotlib, based on the date column (as a string) and the precip column.
# create the plot space upon which to plot the data
fig, ax = plt.subplots(figsize=(10, 10))

# add the x-axis and the y-axis to the plot
ax.plot(boulder_july['date'], boulder_july['precip'], color='red')

# rotate tick labels
plt.setp(ax.get_xticklabels(), rotation=45)

# set title and labels for axes
ax.set(xlabel="Date", ylabel="Precipitation (inches)", title="Precipitation\nBoulder, Colorado in July 2018");
Look closely at the dates on the x-axis. When you plot a string field for the x-axis, Python gets stuck trying to plot all of the date labels. Each value is read as a string, and it is difficult to fit all of those values on the x-axis efficiently.
You can avoid this problem by importing the data using a parameter of read_csv() that allows you to indicate that a particular column should be converted to a datetime:
parse_dates = ['date_column_name']
If you have a single column that contains dates in your data, you will also want to set the dates as the index column. You will use this in later lessons. The index column will allow you to quickly summarize and aggregate your data by date. To set the index, use the argument:
index_col = ['date_column_name']
Import Date Column As Datetime
# import file into pandas dataframe, identifying the date column to be converted to datetime
boulder_july_datetime = pd.read_csv(file_path, parse_dates=['date'], index_col=['date'])

# view data index
boulder_july_datetime.index
DatetimeIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04', '2018-07-05', '2018-07-06', '2018-07-07', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17', '2018-07-18', '2018-07-19', '2018-07-20', '2018-07-21', '2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25', '2018-07-26', '2018-07-27', '2018-07-28', '2018-07-29', '2018-07-30', '2018-07-31'], dtype='datetime64[ns]', name='date', freq=None)
Once your date column is set to be both datetime64 and an index, you will notice that the dataframe prints with that column on the left. Notice that the word “date”, which represents the column header, is LOWER than the other two column headings.
Also notice that the date no longer appears when you call dtypes. Don’t worry - the column is still there!
max_temp      int64
precip      float64
dtype: object
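The short sketch below shows where the date column goes once it becomes the index, and one thing a datetime index makes possible: selecting rows by date. The four rows of CSV text are made-up values (not the lesson file), read from memory so that the example stands on its own.

```python
import io
import pandas as pd

# Hypothetical rows shaped like the lesson data (values are made up)
csv_text = """date,max_temp,precip
2018-07-01,87,0.00
2018-07-02,92,0.10
2018-07-03,90,0.25
2018-07-04,87,0.00
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col=["date"])

# The date is no longer a regular column; it lives in the index
columns = list(df.columns)
index_name = df.index.name

# With a datetime index, rows can be selected by date label (both ends included)
subset = df.loc["2018-07-02":"2018-07-03"]

print(columns)      # ['max_temp', 'precip']
print(index_name)   # date
print(len(subset))  # 2
```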
Plot Dates Using Datetime
If you try to plot your data as a bar or scatter plot in matplotlib, you may get an error if you pass a pandas dataframe column of datetime directly into the plot function (whether this happens depends on the versions of matplotlib and pandas that you have installed).
This is because when plotting with these methods, numpy is used to concatenate (a fancy word for combine) the array that has been passed in for the x-axis and the array that has been passed in for the y-axis. numpy cannot concatenate the datetime object with other values.
Use Values Attribute to Plot Datetime
To avoid this error, you can call the attribute .values on the datetime column using:
dataframe.index.values
Notice that here you use dataframe.index to access the datetime column. This is because you have assigned your date column to be an index for the dataframe. Also notice that the spacing on the x-axis looks better and that your x-axis date labels are easier to read, as Python knows how to only show incremental values rather than each and every date value.
# create the plot space upon which to plot the data
fig, ax = plt.subplots()

# add the x-axis and the y-axis to the plot
ax.plot(boulder_july_datetime.index.values, boulder_july_datetime['precip'], color='red')

# rotate tick labels
plt.setp(ax.get_xticklabels(), rotation=45)

# set title and labels for axes
ax.set(xlabel="Date", ylabel="Precipitation (inches)", title="Precipitation\nBoulder, Colorado in July 2018");
# create the plot space upon which to plot the data
fig, ax = plt.subplots()

# add the x-axis and the y-axis to the plot
ax.bar(boulder_july_datetime.index.values, boulder_july_datetime['precip'], color='blue')

# rotate tick labels
plt.setp(ax.get_xticklabels(), rotation=45)

# set title and labels for axes
ax.set(xlabel="Date", ylabel="Precipitation (in)", title="Precipitation \nBoulder, Colorado in July 2018");
NOTE: you do not need to use .values when using a column that contains float objects rather than datetime objects, nor when creating a line graph. However, for consistency, the plot examples above use the same code to employ .values and create the plots.
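To make it clearer what .values hands to the plot function, the sketch below inspects the result on a small hypothetical dataframe with a datetime index (the values are made up). It only examines the array and does not draw a plot.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a datetime index (values are made up)
dates = pd.to_datetime(["2018-07-01", "2018-07-02", "2018-07-03"])
df = pd.DataFrame({"precip": [0.0, 0.1, 0.25]}, index=dates)

# .values returns the index as a plain numpy array
x_values = df.index.values

is_numpy_array = isinstance(x_values, np.ndarray)
is_datetime64 = np.issubdtype(x_values.dtype, np.datetime64)

print(type(df.index).__name__)  # DatetimeIndex
print(is_numpy_array)           # True
print(is_datetime64)            # True
```

In other words, .values swaps the pandas index for a numpy array of datetime64 values, which is the form passed to ax.plot() and ax.bar() in the examples above.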
You may have observed that the above plots did not look right. Explore the data further using the describe() dataframe method. Do you see any values that are questionable?
# notice any values that may seem off in the summary statistics below?
boulder_july_datetime.describe()
No Data Values
Sometimes data are missing from a file due to errors in collection, inability to record a data point, or other reasons.
Imagine a spreadsheet in Microsoft Excel with cells that are blank. If the cells are blank, you don’t know for sure whether those data weren’t collected, or something someone forgot to fill in. To account for data that are missing (not by mistake), you can put a value into those cells that represents “no data”.
Customize NoData Values
Often, you’ll find a dataset that uses a specific value for “no data”. In many scientific disciplines, the value -999 is used to indicate “no data” values. The data in july-2018-temperature-precip.csv contains “no data” values in the precip column using the value -999.
If you do not specify that the value -999 is the “no data” value, the values will be imported as real data, which will impact any statistics or calculations run on that column.
When you used the describe method above, the -999 values were imported as numeric values into the pandas dataframe when it was created, and thus these values are included in the summary statistics. To ensure “no data” values are properly ignored in your summary statistics, you can specify a “no data” value during the import, so that they are not read as numeric values, using the function argument:
na_values = no-data-value-here
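# import file into pandas dataframe, with a no data value specified
# the date column is set as the index so that it can be used for plotting below
boulder_july_datetime_nodata = pd.read_csv(file_path, parse_dates=['date'], index_col=['date'], na_values=['-999'])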
Now have a look at the summary statistics.
# view summary statistics of columns in dataframe
boulder_july_datetime_nodata.describe()
And finally, plot the data one last time.
# create the plot space upon which to plot the data
fig, ax = plt.subplots()

# add the x-axis and the y-axis to the plot
ax.bar(boulder_july_datetime_nodata.index.values, boulder_july_datetime_nodata['precip'], color='purple')

# rotate tick labels
plt.setp(ax.get_xticklabels(), rotation=45)

# set title and labels for axes
ax.set(xlabel="Date", ylabel="Precipitation (Inches)", title="Precipitation\nBoulder, Colorado in July 2018");
By using the na_values parameter, you told Python to ignore those “no data” values when it performs calculations on the data.
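The effect on a calculation can be seen in a small side-by-side sketch. The three rows below are hypothetical (not the lesson file) and are read from memory twice, once without and once with a “no data” value specified.

```python
import io
import pandas as pd

# Hypothetical rows where -999 marks a missing precipitation value
csv_text = """date,precip
2018-07-01,0.1
2018-07-02,-999
2018-07-03,0.3
"""

# Without na_values, -999 is treated as a real measurement
raw = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col=["date"])

# With na_values, -999 is read as a missing value (NaN)
clean = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col=["date"],
                    na_values=["-999"])

raw_mean = raw["precip"].mean()       # about -332.87, dominated by -999
clean_mean = clean["precip"].mean()   # about 0.2, the missing value is skipped
missing_count = int(clean["precip"].isna().sum())

print(raw_mean, clean_mean, missing_count)
```

The mean of the cleaned column is computed from the two real measurements only, because pandas skips missing values by default in summary calculations such as mean().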
Note: if there are multiple types of missing values in your dataset, you can extend what Python considers a missing value using multiple values in the na_values parameter as follows:
na_values=['NA', ' ', '-999']
In this example, the “no data” values are specified to be “NA”, an empty space, or the value -999.
Python skills to plot data using
- Use data that you previously downloaded to your data directory in the
Import the data as pandas dataframe indicating the appropriate column for the
dtypes attribute and create a bar plot of the precipitation data.