Lesson 3. Manipulate and Plot Pandas Dataframes


In this lesson, you will write Python code in Jupyter Notebook to describe, manipulate and plot data in pandas dataframes.

Learning Objectives

After completing this lesson, you will be able to:

  • Run functions that are inherent to pandas dataframes (i.e. methods)
  • Query automatically generated characteristics about pandas dataframes (i.e. attributes)
  • Create a plot using data in pandas dataframes

What You Need

Be sure you have completed the lesson on Importing CSV Files Into Pandas Dataframes.

The code below is available in the ea-bootcamp-day-5 repository that you cloned to earth-analytics-bootcamp under your home directory.

Methods and Attributes

Methods

Previous lessons introduced functions as commands that take inputs and use them to produce output. For example, you have used many functions, including the print() function to display the results of your code and to write messages about the results.

print("Message as text string goes here")

You have also used functions provided by Python packages such as numpy to run calculations on numpy arrays.

For example, you used np.mean() to calculate the average value of a specified numpy array. In these numpy functions, you explicitly provided the name of the variable as an input parameter.

print("Mean Value: ", np.mean(arrayname))

In Python, data structures such as pandas dataframes can provide built-in functions that are referred to as methods. Each data structure has its own set of methods, based on how the data is organized and the types of operations supported by the data structure.

A method is called by adding .method() after the name of the data structure (e.g. structurename.method()), rather than by providing the name as an input parameter (e.g. function(structurename)).

In this lesson, you will explore some methods that are provided with the pandas dataframe data structure.
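As a minimal sketch of the difference between the two calling styles, the example below computes the same mean once with a function and once with a method. The names `precip_array` and `precip_df` and their three values are made up for illustration and are not part of the lesson datasets.

```python
import numpy as np
import pandas as pd

# hypothetical example values (inches), not one of the lesson datasets
precip_array = np.array([0.70, 0.75, 1.85])
precip_df = pd.DataFrame({"precip": [0.70, 0.75, 1.85]})

# function call: the array is passed in as an input parameter
mean_from_function = np.mean(precip_array)

# method call: the method follows the dataframe name after a dot
# (for a dataframe, .mean() returns one value per numeric column)
mean_from_method = precip_df.mean()

print("Mean from function:", mean_from_function)
print("Mean from method:", mean_from_method["precip"])
```

Both calls produce the same number; only the way the data structure is supplied differs.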

Attributes

In addition to functions, you have also unknowingly worked with attributes, which are automatically created characteristics (i.e. metadata) about the data structure or object that you are working with.

For example, you used .shape to get the dimensions of a specific numpy array (e.g. arrayname.shape), which is an attribute that is automatically generated about the numpy array when it is created.

In this lesson, you will use attributes to get more information about pandas dataframes and run functions (i.e. methods) inherent to the pandas dataframes data structure to learn about the benefits of working with pandas dataframes.
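The sketch below shows how attributes might be queried on a pandas dataframe. Note that attributes are written without parentheses, unlike methods. The dataframe `example_df` is a hypothetical three-row example, not one of the lesson datasets.

```python
import pandas as pd

# small hypothetical dataframe for illustration only
example_df = pd.DataFrame({"months": ["Jan", "Feb", "Mar"],
                           "precip": [0.70, 0.75, 1.85]})

# attributes are queried without parentheses
print(example_df.shape)    # (number of rows, number of columns)
print(example_df.columns)  # the column names
print(example_df.dtypes)   # the data type of each column
```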

Begin Writing Your Code

From previous lessons, you know how to import the necessary Python packages to set your working directory and download the needed datasets using the os and urllib packages.

To work with pandas dataframes, you will also need to import the pandas package with the alias pd, and you will need to import the matplotlib.pyplot module with the alias plt to plot data. Begin by reviewing these tasks.

Import Packages

# import necessary Python packages
import os
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt

# print message after packages imported successfully
print("import of packages successful")
import of packages successful

Set Working Directory

Remember that you can check the current working directory using os.getcwd() and set the current working directory using os.chdir().

# set the working directory to the `earth-analytics-bootcamp` directory
# replace `jpalomino` with your username here and all paths in this lesson
os.chdir("/home/jpalomino/earth-analytics-bootcamp/")

# print the current working directory
os.getcwd()
'/home/jpalomino/earth-analytics-bootcamp'
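If you prefer not to type your username into each path, one possible approach is to build the path from your home directory. This is only a sketch: it assumes that earth-analytics-bootcamp sits directly under your home directory, and the variable names `home_dir` and `bootcamp_dir` are chosen here for illustration.

```python
import os

# look up the home directory of the current user
home_dir = os.path.expanduser("~")

# build the path to the bootcamp directory (assumed to be under the home directory)
bootcamp_dir = os.path.join(home_dir, "earth-analytics-bootcamp")

# only change the working directory if that directory actually exists
if os.path.isdir(bootcamp_dir):
    os.chdir(bootcamp_dir)

# print the current working directory
print(os.getcwd())
```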

Download Data

Recall that you can use the urllib package to download data from the Earth Lab Figshare.com repository.

For this lesson, you will download a .csv file containing the average monthly precipitation data for Boulder, CO, and another .csv file containing monthly precipitation for Boulder, CO in 2002 and 2013.

# use `urllib` to download files from Earth Lab figshare repository

# download .csv containing monthly average precipitation for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12710618", 
                           filename = "data/avg-precip-months-seasons.csv")

# download .csv containing monthly precipitation for Boulder, CO in 2002 and 2013
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12710621", 
                           filename = "data/precip-2002-2013-months-seasons.csv")

# print message that data downloads were successful
print("datasets downloaded successfully")
datasets downloaded successfully

Import Tabular Data Into Pandas Dataframes

You also learned how to import CSV files into pandas dataframes.

# import the monthly average precipitation values as a pandas dataframe
avg_precip = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/avg-precip-months-seasons.csv")

# import the monthly precipitation values in 2002 and 2013 as a pandas dataframe
precip_2002_2013 = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/precip-2002-2013-months-seasons.csv")

View Contents of Pandas Dataframes

Rather than seeing all of the data at once, you can choose to see the first few rows or the last few rows using the pandas dataframe methods .head() or .tail() (e.g. dataframe.tail()).

This capability can be very useful for large datasets which cannot easily be displayed within Jupyter Notebook.

# check the first few rows in `avg_precip`
avg_precip.head()
  months  precip seasons
0    Jan    0.70  Winter
1    Feb    0.75  Winter
2    Mar    1.85  Spring
3    Apr    2.93  Spring
4    May    3.05  Spring
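Both .head() and .tail() return five rows by default and also accept an optional number of rows. The sketch below rebuilds two columns of `avg_precip` by hand (using the values shown in this lesson) so that it runs without the CSV file; the name `avg_precip_sketch` is hypothetical.

```python
import pandas as pd

# recreate the `months` and `precip` columns by hand so the sketch runs without the CSV file
avg_precip_sketch = pd.DataFrame({
    "months": ["Jan", "Feb", "Mar", "Apr", "May", "June",
               "July", "Aug", "Sept", "Oct", "Nov", "Dec"],
    "precip": [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
               1.93, 1.62, 1.84, 1.31, 1.39, 0.84],
})

# last few rows (five by default)
last_rows = avg_precip_sketch.tail()

# first three rows only
first_three = avg_precip_sketch.head(3)

print(last_rows)
print(first_three)
```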

Describe Contents of Pandas Dataframes

You can use the method .info() to get more details, or metadata, about a pandas dataframe (e.g. dataframe.info()) such as the number of rows and columns and the column names.

# check the metadata about `avg_precip`
avg_precip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
months     12 non-null object
precip     12 non-null float64
seasons    12 non-null object
dtypes: float64(1), object(2)
memory usage: 368.0+ bytes

The output of the .info() method shows you the number of rows (or entries) and the number of columns, as well as the column names and the types of data they contain (e.g. float64, which is the default type for decimal (floating-point) values in pandas and numpy).

You can use other methods to produce summarized results about data values contained within the pandas dataframes.

For example, you can use the method .describe() to run summary statistics on the numeric columns in a pandas dataframe (e.g. dataframe.describe()), such as the count, mean, minimum and maximum values.

# run summary statistics on `avg_precip`
avg_precip.describe()
          precip
count  12.000000
mean    1.685833
std     0.764383
min     0.700000
25%     1.192500
50%     1.730000
75%     1.952500
max     3.050000

Recall that in the lessons on numpy arrays, you ran multiple functions to get the mean, minimum and maximum values of numpy arrays. This fast calculation of summary statistics is a clear benefit of using pandas dataframes over numpy arrays.

The .describe() method also provides the standard deviation (i.e. a measure of the amount of variation across the data) as well as the quantiles of the numeric columns, which tell you how the data are distributed between the minimum and maximum values (e.g. the 25% quantile indicates the cut-off for the lowest 25% of the values in the data).
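Each statistic reported by .describe() can also be requested on its own. The sketch below rebuilds the `precip` column by hand from the values in this lesson (the name `avg_precip_sketch` is hypothetical) and calls the pandas methods .mean(), .std() and .quantile() on that column.

```python
import pandas as pd

# precipitation values (inches) copied from the `avg_precip` tables in this lesson
avg_precip_sketch = pd.DataFrame({
    "precip": [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
               1.93, 1.62, 1.84, 1.31, 1.39, 0.84]
})

# individual summary statistics for the `precip` column
precip_mean = avg_precip_sketch.precip.mean()
precip_std = avg_precip_sketch.precip.std()

# the 25% quantile: the cut-off for the lowest quarter of the values
precip_q25 = avg_precip_sketch.precip.quantile(0.25)

print("Mean:", precip_mean)
print("Standard deviation:", precip_std)
print("25% quantile:", precip_q25)
```

The printed values should agree with the mean, std and 25% rows of the .describe() output, apart from rounding.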

Sort Data Values in Pandas Dataframes

Recall that in the lessons on numpy arrays, you could identify the minimum or maximum value, but not the month in which that value occurred. This is because the precip and months arrays were not connected in a way that would let you easily match a value to its month.

Using pandas dataframes, you can sort the values with the method .sort_values(), providing the column name and a parameter for ascending (e.g. dataframe.sort_values(by="columnname", ascending = True)).

Sort by the values in the precip column in descending order (ascending = False) to find the maximum value and its corresponding month.

# sort values in descending order to identify the month with maximum value for `precip` within `avg_precip`
avg_precip.sort_values(by="precip", ascending = False)
   months  precip seasons
4     May    3.05  Spring
3     Apr    2.93  Spring
5    June    2.02  Summer
6    July    1.93  Summer
2     Mar    1.85  Spring
8    Sept    1.84    Fall
7     Aug    1.62  Summer
10    Nov    1.39    Fall
9     Oct    1.31    Fall
11    Dec    0.84  Winter
1     Feb    0.75  Winter
0     Jan    0.70  Winter
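Note that .sort_values() returns a new, sorted dataframe and leaves the original unchanged unless you assign the result to a variable. The sketch below, which rebuilds two columns of `avg_precip` by hand under the hypothetical name `avg_precip_sketch`, saves the sorted result and then combines it with .head() to keep only the row with the maximum value.

```python
import pandas as pd

# recreate the `months` and `precip` columns by hand so the sketch runs without the CSV file
avg_precip_sketch = pd.DataFrame({
    "months": ["Jan", "Feb", "Mar", "Apr", "May", "June",
               "July", "Aug", "Sept", "Oct", "Nov", "Dec"],
    "precip": [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
               1.93, 1.62, 1.84, 1.31, 1.39, 0.84],
})

# save the sorted result to a new variable; `avg_precip_sketch` itself keeps its order
sorted_precip = avg_precip_sketch.sort_values(by="precip", ascending=False)

# the first row of the sorted dataframe holds the maximum value and its month
wettest_row = sorted_precip.head(1)

print(wettest_row)
```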

Run Calculations on Columns Within Pandas Dataframes

You can easily recalculate the values of a column within a pandas dataframe by setting the column equal to the result of the desired calculation (e.g. dataframe.column = dataframe.column + 4, which would add the number 4 to each value in the column).

You can use this capability to easily convert the values in the precip column from inches to millimeters (where one inch is equal to 25.4 millimeters).

# multiply the values in `precip` column to convert from inches to millimeters
avg_precip.precip = avg_precip.precip * 25.4

# print the values in `avg_precip`
avg_precip
   months  precip seasons
0     Jan  17.780  Winter
1     Feb  19.050  Winter
2     Mar  46.990  Spring
3     Apr  74.422  Spring
4     May  77.470  Spring
5    June  51.308  Summer
6    July  49.022  Summer
7     Aug  41.148  Summer
8    Sept  46.736    Fall
9     Oct  33.274    Fall
10    Nov  35.306    Fall
11    Dec  21.336  Winter
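The calculation above overwrites the original values in inches. If you would rather keep both units, one option is to store the result in a new column. This is a sketch with a hypothetical three-row dataframe and a hypothetical column name `precip_mm`; a new column has to be created with bracket notation, because the dataframe.column shortcut can only assign to columns that already exist.

```python
import pandas as pd

# small hypothetical dataframe with precipitation in inches
precip_sketch = pd.DataFrame({"months": ["Jan", "Feb", "Mar"],
                              "precip": [0.70, 0.75, 1.85]})

# create a new column in millimeters and leave the `precip` column in inches
precip_sketch["precip_mm"] = precip_sketch.precip * 25.4

print(precip_sketch)
```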

Plot Pandas Dataframes

In the previous lessons, you saw that it is easy to use multiple numpy arrays within the same plot but you have to make sure that the dimensions of the numpy arrays are compatible.

Pandas dataframes make it even easier to plot the data because the tabular structure is already built-in.

In fact, you do not have to create any new variables to plot data from pandas dataframes.

You can simply reuse your matplotlib.pyplot code from the numpy arrays lesson, using the dataframe and column names to plot data (e.g. dataframe.column) along each axis.

# set plot size for all plots that follow
plt.rcParams["figure.figsize"] = (8, 8)

# create the plot space upon which to plot the data
fig, ax = plt.subplots()

# add the x-axis and the y-axis to the plot
ax.bar(avg_precip.months, avg_precip.precip, color="grey")

# set plot title
ax.set(title="Average Monthly Precipitation in Boulder, CO")

# add labels to the axes
ax.set(xlabel="Month", ylabel="Precipitation (mm)");

[Figure: bar plot of average monthly precipitation (mm) in Boulder, CO]

Congratulations! You have now learned how to run methods and query attributes of pandas dataframes. You also recalculated values and created plots from pandas dataframes.

Optional Challenge 1

Test your Python skills to:

  1. Convert the precip_2002 column in precip_2002_2013 to millimeters (one inch = 25.4 millimeters).

  2. Create a blue line plot of monthly precipitation for Boulder, CO in 2002. Be sure to include a title and labels for the axes. If needed, refer to the lesson on Plot Data in Python with Matplotlib.

[Figure: blue line plot of monthly precipitation for Boulder, CO in 2002]

Optional Challenge 2

Test your Python skills to:

  1. Convert the precip_2013 column in precip_2002_2013 to millimeters (one inch = 25.4 millimeters).

  2. Create a blue scatter plot of monthly precipitation for Boulder, CO in 2013. Be sure to include a title and labels for the axes. If needed, refer to the lesson on Plot Data in Python with Matplotlib.

  3. Compare your plot for 2013 to the one for 2002.

    • Does the maximum precipitation occur in the same month?
    • What do you notice about the y-axis of the 2013 plot, as compared to the 2002 plot?

[Figure: blue scatter plot of monthly precipitation for Boulder, CO in 2013]
