Lesson 1. Data Wrangling With Pandas
Data Wrangling With Pandas Dataframes and Numpy Arrays in Python - Earth analytics bootcamp course module
Welcome to the first lesson in the Data Wrangling With Pandas Dataframes and Numpy Arrays in Python module. This tutorial walks you through wrangling data (e.g. subselect, combine and update) using pandas dataframes and numpy arrays.In this lesson, you will learn how update, filter and group values in pandas dataframes
and how to append new data to pandas dataframes
.
Learning Objectives
After completing this lesson, you will be able to:
- Run a function on a column within a
pandas dataframe
- Filter values in a
pandas dataframe
- Group values in a
pandas dataframe
- Append new data to a
pandas dataframe
Data Wrangling With Pandas
Pandas dataframes
offer many ways to update and select data. In this lesson, you will learn about some common data tasks that can be very useful to:
- update the values in specific columns using functions
- create new dataframes from selections and from grouping data
- create new dataframes by combining existing dataframes.
Begin by downloading and importing two pandas dataframes
that contain temperature (Fahrenheit) for each month in 2010 to 2013 and 2014 to 2017.
# import necessary packages
import os
import pandas as pd
import urllib.request
# set the working directory to the `earth-analytics-bootcamp` directory
# replace `jpalomino` with your username here and all paths in this lesson
os.chdir("/home/jpalomino/earth-analytics-bootcamp/")
# download .csv containing monthly temperature in 2010 to 2013 for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12815417",
filename = "data/monthly-temp-2010-to-2013.csv")
# download .csv containing monthly temperature in 2010 to 2013 for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12815420",
filename = "data/monthly-temp-2014-to-2017.csv")
# import monthly temperature values for 2010 to 2013 as a pandas dataframe
monthly_temp_2010_to_2013 = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/monthly-temp-2010-to-2013.csv")
# import monthly temperature values for 2014 to 2017 as a pandas dataframe
monthly_temp_2014_to_2017 = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/monthly-temp-2014-to-2017.csv")
Review Structure of Dataframe
Recall that with the .head()
method, you can see the structure of the data without having to print the entire dataframe.
Print .head()
for each dataframe, and note their structures (i.e. how the data are organized).
# print .head() for 2010 to 2013
monthly_temp_2010_to_2013.head()
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2010 | 33.0 | Winter | 51.575 | avg |
1 | February | 2010 | 30.1 | Winter | 51.575 | avg |
2 | March | 2010 | 42.7 | Spring | 51.575 | avg |
3 | April | 2010 | 48.8 | Spring | 51.575 | avg |
4 | May | 2010 | 53.9 | Spring | 51.575 | avg |
# print .head() for 2014 to 2017
monthly_temp_2014_to_2017.head()
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2014 | 34.6 | Winter | 51.2917 | avg |
1 | February | 2014 | 32.0 | Winter | 51.2917 | avg |
2 | March | 2014 | 43.6 | Spring | 51.2917 | avg |
3 | April | 2014 | 49.8 | Spring | 51.2917 | avg |
4 | May | 2014 | 56.6 | Spring | 51.2917 | avg |
Apply Function to Column in Dataframe
Just like you can apply a function to a numpy array
, you can apply a function to a specific column in a pandas dataframe
using dataframe["column_name"].apply(function_name)
.
You specify a specific column in a pandas dataframe
because you may have some columns for which the function cannot produce output.
For example, review this function to convert input values from Fahrenheit to Celsius.
# define function to convert value(s) from Fahrenheit to Celsius
def fah_to_cel(x):
# convert values from Fahrenheit to Celsius using Celsius = ((Fahrenheit - 32) / 1.8)
# can take single value, single value variable, or numpy array as input
x = (x - 32) / 1.8
# returns value(s) converted from Fahrenheit to Celsius
return(x)
As this function runs on numeric values, you cannot apply this function to a column with text strings.
For the monthly_temp_2010_to_2013
dataframe, you can run fah_to_cel
on the column with temperature values by specifying the column name that contains these values: "Temp"
.
You can also replace the existing values in Temp
column by setting the column equal to the output of the applied function.
# Replace values in `Temp` with the calculated values from the applied function `fah_to_cel`
monthly_temp_2010_to_2013["Temp"] = monthly_temp_2010_to_2013["Temp"].apply(fah_to_cel)
# print .head() for 2010 to 2013 to see updated values
monthly_temp_2010_to_2013.head()
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2010 | 0.555556 | Winter | 51.575 | avg |
1 | February | 2010 | -1.055556 | Winter | 51.575 | avg |
2 | March | 2010 | 5.944444 | Spring | 51.575 | avg |
3 | April | 2010 | 9.333333 | Spring | 51.575 | avg |
4 | May | 2010 | 12.166667 | Spring | 51.575 | avg |
Filter Values in Pandas Dataframe
Another useful data wrangling option is the ability to filter data from an existing pandas dataframe
.
Filtering data is easily done using dataframe.column_name == "value"
. Your output will contain all rows that meet the criteria.
Again, you can also save the output to a new dataframe by creating a new variable and setting it equal to the output of the filter.
For example, you can filter for the values in the Month
column that are equal to January
in monthly_temp_2010_to_2013
and save the output to a new dataframe.
# # create new dataframe from filter on values in the `Month` column that are equal to `January`
jan_temp_2010_to_2013 = monthly_temp_2010_to_2013[monthly_temp_2010_to_2013.Month == "January"]
# print new dataframe
jan_temp_2010_to_2013
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2010 | 0.555556 | Winter | 51.5750 | avg |
12 | January | 2011 | 0.611111 | Winter | 51.6583 | avg |
24 | January | 2012 | 3.833333 | Winter | 54.6083 | warm |
36 | January | 2013 | 0.555556 | Winter | 51.6583 | avg |
You can also filter using a comparison operator on numeric values. For example, you can select all rows from the dataframe that have temperature greater than 20 degrees Celsius by filtering on the Temp
column.
# create new dataframe from filter on values in the `Temp` column greater than 20 degrees Celsius
gt60_temp_2010_to_2013 = monthly_temp_2010_to_2013[monthly_temp_2010_to_2013.Temp > 20]
# print new dataframe
gt60_temp_2010_to_2013
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
6 | July | 2010 | 22.500000 | Summer | 51.5750 | avg |
7 | August | 2010 | 22.444444 | Summer | 51.5750 | avg |
18 | July | 2011 | 23.055556 | Summer | 51.6583 | avg |
19 | August | 2011 | 23.944444 | Summer | 51.6583 | avg |
29 | June | 2012 | 23.444444 | Summer | 54.6083 | warm |
30 | July | 2012 | 23.777778 | Summer | 54.6083 | warm |
31 | August | 2012 | 22.888889 | Summer | 54.6083 | warm |
41 | June | 2013 | 21.055556 | Summer | 51.6583 | avg |
42 | July | 2013 | 22.333333 | Summer | 51.6583 | avg |
43 | August | 2013 | 22.333333 | Summer | 51.6583 | avg |
Group Values in Pandas Dataframe
In addition to filtering data by specific values, you can also group data that share a common value, in order to summarize the grouped data by another column.
For example, in the example dataframes, you could group the data by the Month
and then run the describe()
method on Temp
. This would run .describe()
on the temperature values for each month of data as a separate group.
To do this, you can use the following syntax: dataframe.groupby(['label_column'])[["value_column"]].describe()
.
In this example, the label_column
on which you want to group data is Month
and the value_column
that you want to summarize is Temp
.
# create new dataframe from the .describe() output of `Temp` based on the groupby on `Month` column
month_temp_2014_2017_summary = monthly_temp_2014_to_2017.groupby(["Month"])[["Temp"]].describe()
# print new dataframe
month_temp_2014_2017_summary
Temp | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Month | ||||||||
April | 4.0 | 49.450 | 0.591608 | 48.9 | 48.975 | 49.40 | 49.875 | 50.1 |
August | 4.0 | 69.775 | 0.784750 | 69.0 | 69.150 | 69.80 | 70.425 | 70.5 |
December | 4.0 | 33.850 | 1.869938 | 32.0 | 32.825 | 33.50 | 34.525 | 36.4 |
February | 4.0 | 37.950 | 4.649373 | 32.0 | 35.450 | 38.75 | 41.250 | 42.3 |
January | 4.0 | 34.350 | 1.767295 | 32.2 | 33.625 | 34.35 | 35.075 | 36.5 |
July | 4.0 | 72.600 | 1.741647 | 70.3 | 71.725 | 73.05 | 73.925 | 74.0 |
June | 4.0 | 68.425 | 1.765172 | 66.2 | 67.775 | 68.50 | 69.150 | 70.5 |
March | 4.0 | 45.750 | 3.317127 | 43.0 | 43.450 | 44.85 | 47.150 | 50.3 |
May | 4.0 | 54.700 | 1.849324 | 52.4 | 53.675 | 54.90 | 55.925 | 56.6 |
November | 4.0 | 43.525 | 4.702039 | 38.3 | 40.175 | 44.15 | 47.500 | 47.5 |
October | 4.0 | 55.450 | 3.022692 | 51.5 | 54.350 | 55.75 | 56.850 | 58.8 |
September | 4.0 | 65.400 | 2.733740 | 63.5 | 63.725 | 64.35 | 66.025 | 69.4 |
In addition to running .describe()
using a groupby, you can also run individual statistics such as .count()
to get the number of rows belonging to a specific group (i.e. month) and other summary statistics such as .median()
, .sum()
, .mean()
, etc, to calculate these summary statistics by a chosen group.
In the example below, .median()
is being executed on Temp
using a groupby on Month
.
# create new dataframe from groupby calculation of .median() on `Temp`
month_temp_2014_2017_median = monthly_temp_2014_to_2017.groupby(["Month"])[["Temp"]].median()
# print new dataframe
month_temp_2014_2017_median
Temp | |
---|---|
Month | |
April | 49.40 |
August | 69.80 |
December | 33.50 |
February | 38.75 |
January | 34.35 |
July | 73.05 |
June | 68.50 |
March | 44.85 |
May | 54.90 |
November | 44.15 |
October | 55.75 |
September | 64.35 |
Note the structure of the new dataframe. Does it look different than other dataframes you have seen?
Run the .info()
to see that the index is now the month names, which you recall means that Month
is no longer a column.
# print .info() on new dataframe
month_temp_2014_2017_median.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, April to September
Data columns (total 1 columns):
Temp 12 non-null float64
dtypes: float64(1)
memory usage: 192.0+ bytes
Reset Index After Group-by
You can easily reset the index after running a groupby
using the syntax: dataframe.reset_index()
.
For example, running this syntax on month_temp_2014_2017_median
will remove the month names as the index and recreate the column called Month
.
Now the data is back to a familiar format but retains the values calculated with the groupby
.
# create new dataframe with reset of index
month_temp_2014_2017_median_reset = month_temp_2014_2017_median.reset_index()
# print new dataframe
month_temp_2014_2017_median_reset
Month | Temp | |
---|---|---|
0 | April | 49.40 |
1 | August | 69.80 |
2 | December | 33.50 |
3 | February | 38.75 |
4 | January | 34.35 |
5 | July | 73.05 |
6 | June | 68.50 |
7 | March | 44.85 |
8 | May | 54.90 |
9 | November | 44.15 |
10 | October | 55.75 |
11 | September | 64.35 |
Append More Data to Pandas Dataframe
You may also want to combine multiple pandas dataframes
into one pandas dataframe
, which is easily accomplished using .append()
.
To combine dataframes, you first want to make sure that your dataframes are compatible. You want to make sure that the column and row structures match and that the values are in the same units, etc.
Check Compatibility of Dataframes
Two easy ways to check compatibility are to review the output of .info()
and .head()
for each dataframe.
For example, check monthly_temp_2010_to_2013
and monthly_temp_2014_to_2017
to see if you can combine them into one dataframe for 2010 to 2017.
# print .info()
monthly_temp_2010_to_2013.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 6 columns):
Month 48 non-null object
Year 48 non-null int64
Temp 48 non-null float64
Season 48 non-null object
Year_Avg 48 non-null float64
Year_Avg_Label 48 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 2.3+ KB
# print .info()
monthly_temp_2014_to_2017.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 6 columns):
Month 48 non-null object
Year 48 non-null int64
Temp 48 non-null float64
Season 48 non-null object
Year_Avg 48 non-null float64
Year_Avg_Label 48 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 2.3+ KB
# print .head()
monthly_temp_2010_to_2013.head()
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2010 | 0.555556 | Winter | 51.575 | avg |
1 | February | 2010 | -1.055556 | Winter | 51.575 | avg |
2 | March | 2010 | 5.944444 | Spring | 51.575 | avg |
3 | April | 2010 | 9.333333 | Spring | 51.575 | avg |
4 | May | 2010 | 12.166667 | Spring | 51.575 | avg |
# print .head()
monthly_temp_2014_to_2017.head()
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2014 | 34.6 | Winter | 51.2917 | avg |
1 | February | 2014 | 32.0 | Winter | 51.2917 | avg |
2 | March | 2014 | 43.6 | Spring | 51.2917 | avg |
3 | April | 2014 | 49.8 | Spring | 51.2917 | avg |
4 | May | 2014 | 56.6 | Spring | 51.2917 | avg |
Notice that while the row and column structures match, the units of monthly_temp_2014_to_2017
have not yet been converted to Celsius.
Run the the function fah_to_cel
on the Temp
column of monthly_temp_2014_to_2017
to convert the units.
# Replace values in `Temp` with the calculated values from the applied function `fah_to_cel`
monthly_temp_2014_to_2017["Temp"] = monthly_temp_2014_to_2017["Temp"].apply(fah_to_cel)
# print .head() to see updated values
monthly_temp_2014_to_2017.head()
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2014 | 1.444444 | Winter | 51.2917 | avg |
1 | February | 2014 | 0.000000 | Winter | 51.2917 | avg |
2 | March | 2014 | 6.444444 | Spring | 51.2917 | avg |
3 | April | 2014 | 9.888889 | Spring | 51.2917 | avg |
4 | May | 2014 | 13.666667 | Spring | 51.2917 | avg |
Append Dataframe
Now, that the dataframes have compatible units, you can to append monthly_temp_2014_to_2017
to the bottom of monthly_temp_2010_to_2013
, so that the data is in order.
To do this, use the following syntax: combined_dataframe = dataframe_1.append(dataframe_2, ignore_index=True)
.
You are adding the parameter ignore_index=True
to automatically assign the index of the new values to match the dataframe to which they are appended.
In this example, the index of monthly_temp_2010_to_2013
is the row index, starting at 0
and ending with 47
. Thus, each row appended from monthly_temp_2014_to_2017
will be assigned a new row index, starting at 48
to 95
.
# append monthly_temp_2014_to_2017 to the bottom of monthly_temp_2010_to_2013
monthly_temp_2010_to_2017 = monthly_temp_2010_to_2013.append(monthly_temp_2014_to_2017, ignore_index=True)
# print .info() to see new number of rows
monthly_temp_2010_to_2017.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 6 columns):
Month 96 non-null object
Year 96 non-null int64
Temp 96 non-null float64
Season 96 non-null object
Year_Avg 96 non-null float64
Year_Avg_Label 96 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 4.6+ KB
The new combined dataframe now contains 96 entries (i.e. rows), as each input dataframe contained 48 entries.
Again, you can check that the original monthly_temp_2010_to_2013
dataframe was not changed.
# print .info() to see number of rows
monthly_temp_2010_to_2013.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 6 columns):
Month 48 non-null object
Year 48 non-null int64
Temp 48 non-null float64
Season 48 non-null object
Year_Avg 48 non-null float64
Year_Avg_Label 48 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 2.3+ KB
You have now learned the basics of data wrangling with pandas dataframes
to update, filter and group values in pandas dataframes
and to combine multiple pandas dataframes
.
Optional Challenge 1
Test your Python
skills to:
Run the function
fah_to_cel
to recalculate the values in the columnYear_Avg
withinmonthly_temp_2010_to_2017
.Print the first few rows of the dataframe using the
.head()
method.
Month | Year | Temp | Season | Year_Avg | Year_Avg_Label | |
---|---|---|---|---|---|---|
0 | January | 2010 | 0.555556 | Winter | 10.875 | avg |
1 | February | 2010 | -1.055556 | Winter | 10.875 | avg |
2 | March | 2010 | 5.944444 | Spring | 10.875 | avg |
3 | April | 2010 | 9.333333 | Spring | 10.875 | avg |
4 | May | 2010 | 12.166667 | Spring | 10.875 | avg |
Optional Challenge 2
Test your Python
skills to:
- Group the values in
monthly_temp_2010_to_2017
byMonth
and calculate the mean of the tempertature values (Temp
). Save to a new dataframe and print the new dataframe.
Temp | |
---|---|
Month | |
April | 9.555556 |
August | 21.944444 |
December | 0.972222 |
February | 1.548611 |
January | 1.347222 |
July | 22.736111 |
June | 20.576389 |
March | 7.375000 |
May | 13.076389 |
November | 6.215278 |
October | 11.965278 |
September | 18.520833 |
Optional Challenge 3
Test your Python
skills to:
Group the values in
monthly_temp_2010_to_2017
byYear_Avg_Label
and calculate the count of the temperature values (Temp
) (i.e. how many rows have each value ofYear_Avg_Label
). Save to a new dataframe.Reset the index of the new dataframe, so that
Year_Avg_Label
returns to being a column. Save to a new dataframe and print the new dataframe.What does this count actually mean?
Year_Avg_Label | Temp | |
---|---|---|
0 | avg | 48 |
1 | warm | 48 |
Share on
Twitter Facebook Google+ LinkedIn
Leave a Comment