Lesson 1. Data Wrangling With Pandas


Data Wrangling With Pandas Dataframes and Numpy Arrays in Python - Earth analytics bootcamp course module

Welcome to the first lesson in the Data Wrangling With Pandas Dataframes and Numpy Arrays in Python module. This tutorial walks you through wrangling data (e.g. subselect, combine and update) using pandas dataframes and numpy arrays.

In this lesson, you will learn how update, filter and group values in pandas dataframes and how to append new data to pandas dataframes.

Learning Objectives

After completing this lesson, you will be able to:

  • Run a function on a column within a pandas dataframe
  • Filter values in a pandas dataframe
  • Group values in a pandas dataframe
  • Append new data to a pandas dataframe

Data Wrangling With Pandas

Pandas dataframes offer many ways to update and select data. In this lesson, you will learn about some common data tasks that can be very useful to:

  1. update the values in specific columns using functions
  2. create new dataframes from selections and from grouping data
  3. create new dataframes by combining existing dataframes.

Begin by downloading and importing two pandas dataframes that contain temperature (Fahrenheit) for each month in 2010 to 2013 and 2014 to 2017.

# import necessary packages
import os
import pandas as pd
import urllib.request

# set the working directory to the `earth-analytics-bootcamp` directory
# replace `jpalomino` with your username here and all paths in this lesson
os.chdir("/home/jpalomino/earth-analytics-bootcamp/")

# download .csv containing monthly temperature in 2010 to 2013 for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12815417", 
                           filename = "data/monthly-temp-2010-to-2013.csv")

# download .csv containing monthly temperature in 2010 to 2013 for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12815420", 
                           filename = "data/monthly-temp-2014-to-2017.csv")

# import monthly temperature values for 2010 to 2013 as a pandas dataframe
monthly_temp_2010_to_2013 = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/monthly-temp-2010-to-2013.csv")

# import monthly temperature values for 2014 to 2017 as a pandas dataframe
monthly_temp_2014_to_2017 = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/monthly-temp-2014-to-2017.csv")

Review Structure of Dataframe

Recall that with the .head() method, you can see the structure of the data without having to print the entire dataframe.

Print .head() for each dataframe, and note their structures (i.e. how the data are organized).

# print .head() for 2010 to 2013
monthly_temp_2010_to_2013.head()
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January201033.0Winter51.575avg
1February201030.1Winter51.575avg
2March201042.7Spring51.575avg
3April201048.8Spring51.575avg
4May201053.9Spring51.575avg
# print .head() for 2014 to 2017
monthly_temp_2014_to_2017.head()
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January201434.6Winter51.2917avg
1February201432.0Winter51.2917avg
2March201443.6Spring51.2917avg
3April201449.8Spring51.2917avg
4May201456.6Spring51.2917avg

Apply Function to Column in Dataframe

Just like you can apply a function to a numpy array, you can apply a function to a specific column in a pandas dataframe using dataframe["column_name"].apply(function_name).

You specify a specific column in a pandas dataframe because you may have some columns for which the function cannot produce output.

For example, review this function to convert input values from Fahrenheit to Celsius.

# define function to convert value(s) from Fahrenheit to Celsius
def fah_to_cel(x):
    
    # convert values from Fahrenheit to Celsius using Celsius = ((Fahrenheit - 32) / 1.8)
    # can take single value, single value variable, or numpy array as input
    x = (x - 32) / 1.8
    
    # returns value(s) converted from Fahrenheit to Celsius
    return(x)    

As this function runs on numeric values, you cannot apply this function to a column with text strings.

For the monthly_temp_2010_to_2013 dataframe, you can run fah_to_cel on the column with temperature values by specifying the column name that contains these values: "Temp".

You can also replace the existing values in Temp column by setting the column equal to the output of the applied function.

# Replace values in `Temp` with the calculated values from the applied function `fah_to_cel`
monthly_temp_2010_to_2013["Temp"] = monthly_temp_2010_to_2013["Temp"].apply(fah_to_cel)

# print .head() for 2010 to 2013 to see updated values
monthly_temp_2010_to_2013.head()
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January20100.555556Winter51.575avg
1February2010-1.055556Winter51.575avg
2March20105.944444Spring51.575avg
3April20109.333333Spring51.575avg
4May201012.166667Spring51.575avg

Filter Values in Pandas Dataframe

Another useful data wrangling option is the ability to filter data from an existing pandas dataframe.

Filtering data is easily done using dataframe.column_name == "value". Your output will contain all rows that meet the criteria.

Again, you can also save the output to a new dataframe by creating a new variable and setting it equal to the output of the filter.

For example, you can filter for the values in the Month column that are equal to January in monthly_temp_2010_to_2013 and save the output to a new dataframe.

# # create new dataframe from filter on values in the `Month` column that are equal to `January`
jan_temp_2010_to_2013 = monthly_temp_2010_to_2013[monthly_temp_2010_to_2013.Month == "January"]

# print new dataframe
jan_temp_2010_to_2013 
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January20100.555556Winter51.5750avg
12January20110.611111Winter51.6583avg
24January20123.833333Winter54.6083warm
36January20130.555556Winter51.6583avg

You can also filter using a comparison operator on numeric values. For example, you can select all rows from the dataframe that have temperature greater than 20 degrees Celsius by filtering on the Temp column.

# create new dataframe from filter on values in the `Temp` column greater than 20 degrees Celsius
gt60_temp_2010_to_2013 = monthly_temp_2010_to_2013[monthly_temp_2010_to_2013.Temp > 20]

# print new dataframe
gt60_temp_2010_to_2013
MonthYearTempSeasonYear_AvgYear_Avg_Label
6July201022.500000Summer51.5750avg
7August201022.444444Summer51.5750avg
18July201123.055556Summer51.6583avg
19August201123.944444Summer51.6583avg
29June201223.444444Summer54.6083warm
30July201223.777778Summer54.6083warm
31August201222.888889Summer54.6083warm
41June201321.055556Summer51.6583avg
42July201322.333333Summer51.6583avg
43August201322.333333Summer51.6583avg

Group Values in Pandas Dataframe

In addition to filtering data by specific values, you can also group data that share a common value, in order to summarize the grouped data by another column.

For example, in the example dataframes, you could group the data by the Month and then run the describe() method on Temp. This would run .describe() on the temperature values for each month of data as a separate group.

To do this, you can use the following syntax: dataframe.groupby(['label_column'])[["value_column"]].describe().

In this example, the label_column on which you want to group data is Month and the value_column that you want to summarize is Temp.

# create new dataframe from the .describe() output of `Temp` based on the groupby on `Month` column
month_temp_2014_2017_summary = monthly_temp_2014_to_2017.groupby(["Month"])[["Temp"]].describe()

# print new dataframe
month_temp_2014_2017_summary
Temp
countmeanstdmin25%50%75%max
Month
April4.049.4500.59160848.948.97549.4049.87550.1
August4.069.7750.78475069.069.15069.8070.42570.5
December4.033.8501.86993832.032.82533.5034.52536.4
February4.037.9504.64937332.035.45038.7541.25042.3
January4.034.3501.76729532.233.62534.3535.07536.5
July4.072.6001.74164770.371.72573.0573.92574.0
June4.068.4251.76517266.267.77568.5069.15070.5
March4.045.7503.31712743.043.45044.8547.15050.3
May4.054.7001.84932452.453.67554.9055.92556.6
November4.043.5254.70203938.340.17544.1547.50047.5
October4.055.4503.02269251.554.35055.7556.85058.8
September4.065.4002.73374063.563.72564.3566.02569.4

In addition to running .describe() using a groupby, you can also run individual statistics such as .count() to get the number of rows belonging to a specific group (i.e. month) and other summary statistics such as .median(), .sum(), .mean(), etc, to calculate these summary statistics by a chosen group.

In the example below, .median() is being executed on Temp using a groupby on Month.

# create new dataframe from groupby calculation of .median() on `Temp`
month_temp_2014_2017_median = monthly_temp_2014_to_2017.groupby(["Month"])[["Temp"]].median()

# print new dataframe
month_temp_2014_2017_median
Temp
Month
April49.40
August69.80
December33.50
February38.75
January34.35
July73.05
June68.50
March44.85
May54.90
November44.15
October55.75
September64.35

Note the structure of the new dataframe. Does it look different than other dataframes you have seen?

Run the .info() to see that the index is now the month names, which you recall means that Month is no longer a column.

# print .info() on new dataframe
month_temp_2014_2017_median.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, April to September
Data columns (total 1 columns):
Temp    12 non-null float64
dtypes: float64(1)
memory usage: 192.0+ bytes

Reset Index After Group-by

You can easily reset the index after running a groupby using the syntax: dataframe.reset_index().

For example, running this syntax on month_temp_2014_2017_median will remove the month names as the index and recreate the column called Month.

Now the data is back to a familiar format but retains the values calculated with the groupby.

# create new dataframe with reset of index
month_temp_2014_2017_median_reset = month_temp_2014_2017_median.reset_index()

# print new dataframe
month_temp_2014_2017_median_reset
MonthTemp
0April49.40
1August69.80
2December33.50
3February38.75
4January34.35
5July73.05
6June68.50
7March44.85
8May54.90
9November44.15
10October55.75
11September64.35

Append More Data to Pandas Dataframe

You may also want to combine multiple pandas dataframes into one pandas dataframe, which is easily accomplished using .append().

To combine dataframes, you first want to make sure that your dataframes are compatible. You want to make sure that the column and row structures match and that the values are in the same units, etc.

Check Compatibility of Dataframes

Two easy ways to check compatibility are to review the output of .info() and .head() for each dataframe.

For example, check monthly_temp_2010_to_2013 and monthly_temp_2014_to_2017 to see if you can combine them into one dataframe for 2010 to 2017.

# print .info()
monthly_temp_2010_to_2013.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 6 columns):
Month             48 non-null object
Year              48 non-null int64
Temp              48 non-null float64
Season            48 non-null object
Year_Avg          48 non-null float64
Year_Avg_Label    48 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 2.3+ KB
# print .info()
monthly_temp_2014_to_2017.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 6 columns):
Month             48 non-null object
Year              48 non-null int64
Temp              48 non-null float64
Season            48 non-null object
Year_Avg          48 non-null float64
Year_Avg_Label    48 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 2.3+ KB
# print .head()
monthly_temp_2010_to_2013.head()
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January20100.555556Winter51.575avg
1February2010-1.055556Winter51.575avg
2March20105.944444Spring51.575avg
3April20109.333333Spring51.575avg
4May201012.166667Spring51.575avg
# print .head()
monthly_temp_2014_to_2017.head()
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January201434.6Winter51.2917avg
1February201432.0Winter51.2917avg
2March201443.6Spring51.2917avg
3April201449.8Spring51.2917avg
4May201456.6Spring51.2917avg

Notice that while the row and column structures match, the units of monthly_temp_2014_to_2017 have not yet been converted to Celsius.

Run the the function fah_to_cel on the Temp column of monthly_temp_2014_to_2017 to convert the units.

# Replace values in `Temp` with the calculated values from the applied function `fah_to_cel`
monthly_temp_2014_to_2017["Temp"] = monthly_temp_2014_to_2017["Temp"].apply(fah_to_cel)

# print .head() to see updated values
monthly_temp_2014_to_2017.head()
MonthYearTempSeasonYear_AvgYear_Avg_Label
0January20141.444444Winter51.2917avg
1February20140.000000Winter51.2917avg
2March20146.444444Spring51.2917avg
3April20149.888889Spring51.2917avg
4May201413.666667Spring51.2917avg

Append Dataframe

Now, that the dataframes have compatible units, you can to append monthly_temp_2014_to_2017 to the bottom of monthly_temp_2010_to_2013, so that the data is in order.

To do this, use the following syntax: combined_dataframe = dataframe_1.append(dataframe_2, ignore_index=True).

You are adding the parameter ignore_index=True to automatically assign the index of the new values to match the dataframe to which they are appended.

In this example, the index of monthly_temp_2010_to_2013 is the row index, starting at 0 and ending with 47. Thus, each row appended from monthly_temp_2014_to_2017 will be assigned a new row index, starting at 48 to 95.

# append monthly_temp_2014_to_2017 to the bottom of monthly_temp_2010_to_2013
monthly_temp_2010_to_2017 = monthly_temp_2010_to_2013.append(monthly_temp_2014_to_2017, ignore_index=True)

# print .info() to see new number of rows
monthly_temp_2010_to_2017.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 6 columns):
Month             96 non-null object
Year              96 non-null int64
Temp              96 non-null float64
Season            96 non-null object
Year_Avg          96 non-null float64
Year_Avg_Label    96 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 4.6+ KB

The new combined dataframe now contains 96 entries (i.e. rows), as each input dataframe contained 48 entries.

Again, you can check that the original monthly_temp_2010_to_2013 dataframe was not changed.

# print .info() to see number of rows
monthly_temp_2010_to_2013.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 6 columns):
Month             48 non-null object
Year              48 non-null int64
Temp              48 non-null float64
Season            48 non-null object
Year_Avg          48 non-null float64
Year_Avg_Label    48 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 2.3+ KB

You have now learned the basics of data wrangling with pandas dataframes to update, filter and group values in pandas dataframes and to combine multiple pandas dataframes.

Optional Challenge 1

Test your Python skills to:

  1. Run the function fah_to_cel to recalculate the values in the column Year_Avg within monthly_temp_2010_to_2017.

  2. Print the first few rows of the dataframe using the .head() method.

MonthYearTempSeasonYear_AvgYear_Avg_Label
0January20100.555556Winter10.875avg
1February2010-1.055556Winter10.875avg
2March20105.944444Spring10.875avg
3April20109.333333Spring10.875avg
4May201012.166667Spring10.875avg

Optional Challenge 2

Test your Python skills to:

  1. Group the values in monthly_temp_2010_to_2017 by Month and calculate the mean of the tempertature values (Temp). Save to a new dataframe and print the new dataframe.
Temp
Month
April9.555556
August21.944444
December0.972222
February1.548611
January1.347222
July22.736111
June20.576389
March7.375000
May13.076389
November6.215278
October11.965278
September18.520833

Optional Challenge 3

Test your Python skills to:

  1. Group the values in monthly_temp_2010_to_2017 by Year_Avg_Label and calculate the count of the temperature values (Temp) (i.e. how many rows have each value of Year_Avg_Label). Save to a new dataframe.

  2. Reset the index of the new dataframe, so that Year_Avg_Label returns to being a column. Save to a new dataframe and print the new dataframe.

  3. What does this count actually mean?

Year_Avg_LabelTemp
0avg48
1warm48

Leave a Comment