Lesson 4. Selections From Pandas Dataframes


In this lesson, you will write Python code in Jupyter Notebook to select data from pandas dataframes using specific values as well as indexing.

Learning Objectives

After completing this lesson, you will be able to:

  • Describe the two types of indexing for pandas dataframes: location-based and label-based
  • Select data from pandas dataframes using location-based indexing
  • Select data from pandas dataframes using specific values
  • Select data from pandas dataframes using label-based indexing

What You Need

Be sure you have completed the lesson on Importing CSV Files Into Pandas Dataframes and Manipulate and Plot Pandas Dataframes.

The code below is available in the ea-bootcamp-day-5 repository that you cloned to earth-analytics-bootcamp under your home directory.

Selections From Pandas Dataframes

In the lesson introducing pandas dataframes, you learned that these data structures have an inherent tabular structure (i.e. rows and columns with header names) that support selecting data with indexing, such as selecting individual cells identified by their location at the intersection of rows and columns.

You can also select data from pandas dataframes without knowing the location of that data within the pandas dataframe, using specific values such as a column name or data value.

In this lesson, you will review how indexing works for pandas dataframes and you will learn how to select data from pandas dataframes using specific values as well as indexing.

Indexing For Pandas Dataframes

There are two kinds of indexing in pandas dataframes: location-based and label-based.

After working with indexing for Python lists and numpy arrays, you are familiar with location-based indexing.

You already know that Python location-based indexing begins with [0], and you have learned how to use location-based indexing to query data within Python lists or numpy arrays.

You can also use location-based indexing to query pandas dataframes using .iloc and providing the row and column selection as ranges (i.e. start and stop locations along the rows and columns).

The range provided is inclusive of the first value, but not the second value. This means that you need to use the range [0:1] to select the first index, so your selection begins at [0] but does not include [1].

For example, you can select the first row and the first column of a pandas dataframes providing the range [0:1] for the row selection and then providing the range [0:1] for the column selection.

dataframe.iloc[0:1, 0:1]

In addition to location-based indexing, pandas dataframes can also be queried using label-based indexing. This feature of pandas dataframes is very useful because you can create an index for pandas dataframes based on a list of values that you want to use for organizing and querying your data.

For example, you can create an index from a specific column of values, and then use .loc to select data from the pandas dataframes using a value that is found in that index.


dataframe.setindex("column")
dataframe.loc["value"]

In this lesson, you will use both location and label indexing to select data from pandas dataframes.

Begin Writing Your Code

By now, you are familiar with importing the necessary Python packages to set your working directory and download the needed datasets using the os and urllib packages. You also need to import the pandas package with the alias pd.

Begin by reviewing these tasks.

Import Packages

# import necessary Python packages
import os
import urllib.request
import pandas as pd

# print message after packages imported successfully
print("import of packages successful")
import of packages successful

Set Working Directory

Remember that you can check the current working directory using os.getcwd() and set the current working directory using os.chdir().

# set the working directory to the `earth-analytics-bootcamp` directory
# replace `jpalomino` with your username here and all paths in this lesson
os.chdir("/home/jpalomino/earth-analytics-bootcamp/")

# print the current working directory
os.getcwd()
'/home/jpalomino/earth-analytics-bootcamp'

Download Data

Recall that you can use the urllib package to download data from the Earth Lab Figshare.com repository.

For this lesson, you will use the files from the prevous lesson: a .csv file containing the average monthly precipitation data for Boulder, CO, and another .csv file containing monthly precipitation for Boulder, CO in 2002 and 2013.

# use `urllib` download files from Earth Lab figshare repository

# download .csv containing monthly average precipitation for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12710618", 
                           filename = "data/avg-precip-months-seasons.csv")

# download .csv containing monthly precipitation for Boulder, CO in 2002 and 2013
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12710621", 
                           filename = "data/precip-2002-2013-months-seasons.csv")

# print message that data downloads were successful
print("datasets downloaded successfully")
datasets downloaded successfully

Import Tabular Data Into Pandas Dataframes

You also know how to import CSV files into pandas dataframes.

# import the monthly average precipitation values as a pandas dataframe
avg_precip = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/avg-precip-months-seasons.csv")

# import the monthly precipitation values in 2002 and 2013 as a pandas dataframe
precip_2002_2013 = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/precip-2002-2013-months-seasons.csv")

Selections Using Location Index (.iloc)

You can use .iloc to select individual rows and columns or a series of rows and columns by providing the range (i.e. start and stop locations along the rows and columns) that you want to select.

Recall that in Python indexing begins with [0] and that the range you provide is inclusive of the first value, but not the second value.

This means that you can use dataframe.iloc[0:1, 0:1] to select the cell value at the intersection of the first row and first column of the dataframe.

# select first row and first column
avg_precip.iloc[0:1, 0:1]
months
0Jan

You can expand the range for either the row index or column index to select more data.

For example, you can select the first two rows of the first column using dataframe.iloc[0:2, 0:1] or the first columns of the first row using dataframe.iloc[0:1, 0:2].

# select first two rows and first column
avg_precip.iloc[0:2, 0:1]
months
0Jan
1Feb
# select first row and first two columns
avg_precip.iloc[0:1, 0:2]
monthsprecip
0Jan0.7

You can also use iloc to select an entire row or an entire column by leaving the other range without values.

For example, you can use dataframe.iloc[0:1, :] to select the first row of a dataframe and all of the columns, or dataframe.iloc[ :, 0:1] to select the first column of a dataframe and all of the rows.

# select first row with all columns
avg_precip.iloc[0:1, :]
monthsprecipseasons
0Jan0.7Winter
# select first column with all rows
avg_precip.iloc[:, 0:1]
months
0Jan
1Feb
2Mar
3Apr
4May
5June
6July
7Aug
8Sept
9Oct
10Nov
11Dec

Selections Using Specific Criteria Values

In addition to location-based indexing, you can also select data from pandas dataframes using specific values.

For example, you can select all data from a specific column in pandas dataframes using dataframe.[["columnname"]].

avg_precip[["months"]]
months
0Jan
1Feb
2Mar
3Apr
4May
5June
6July
7Aug
8Sept
9Oct
10Nov
11Dec

Notice that your results are also a pandas dataframe.

You can also select all data from multiple columns in pandas dataframes using dataframe.[["columnname", "columnname"]]. As the results of your selection are also a pandas dataframe, you can assign the results to a new pandas dataframe.

For example, you can create a new pandas dataframe that only contains the months and seasons columns, effectively dropping the precip values.

avg_precip_text = avg_precip[['months', 'seasons']]

avg_precip_text
monthsseasons
0JanWinter
1FebWinter
2MarSpring
3AprSpring
4MaySpring
5JuneSummer
6JulySummer
7AugSummer
8SeptFall
9OctFall
10NovFall
11DecWinter

If you call your original pandas dataframe, you will see it is unchanged.

avg_precip
monthsprecipseasons
0Jan0.70Winter
1Feb0.75Winter
2Mar1.85Spring
3Apr2.93Spring
4May3.05Spring
5June2.02Summer
6July1.93Summer
7Aug1.62Summer
8Sept1.84Fall
9Oct1.31Fall
10Nov1.39Fall
11Dec0.84Winter

You can also select data based on specific values within a column using dataframe[dataframe.columnname == "value"]. This will return all rows containing that value within the specified column.

If you are selecting data using a text string column, you need to provide the value within parentheses (e.g. "textvalue").

For example, you can select all rows that have a seasons value of Summer.

avg_precip[avg_precip.seasons == "Summer"]
monthsprecipseasons
5June2.02Summer
6July1.93Summer
7Aug1.62Summer

You can also select data based on numeric values; these selections do not require the use of parentheses.

For example, you can select all rows that have a specific value in precip such as 1.62.

avg_precip[avg_precip.precip == 1.62]
monthsprecipseasons
7Aug1.62Summer

Selections Using Label Index (.loc)

In addition to selecting data based on specific values, you can also create new index based on a list of values that you want to use for organizing and querying your data.

For example, you can create an index from a specific column of values using dataframe.setindex("column").

Data Tip: Creating a new index will restructure the data by replacing the default location indexing (i.e. [0]) with the new index. This also means the column used to create the index is no longer a functional column, but rather an index of the dataframe.

# create index using the values in the column `months`
avg_precip = avg_precip.set_index("months")

# print the data in precip_df to see the new structure
avg_precip
precipseasons
months
Jan0.70Winter
Feb0.75Winter
Mar1.85Spring
Apr2.93Spring
May3.05Spring
June2.02Summer
July1.93Summer
Aug1.62Summer
Sept1.84Fall
Oct1.31Fall
Nov1.39Fall
Dec0.84Winter

Test that months no longer functions as a column by atempting to select all data using that column name, as you did previously when months was still a column.

# Uncomment and run the line below to see the error message that appears
# avg_precip[["months"]]

Notice that the error message indicates that the value months is not in the index. This is because months is actually now the index!

After setting an index, you can use .loc to select data from the pandas dataframes using a value that is found in that index.

# select data for `Aug` using the new index `months` 
avg_precip.loc[["Aug"]]
precipseasons
months
Aug1.62Summer

Congratulations! You have now learned how to select data from pandas dataframes using specific values as well as location-based and label-based indexing.

Optional Challenge

Test your Python skills to:

  1. Use the .describe() method to summarize the precipitation values in precip_2002_2013. Note the maximum values in 2002 and 2013.

  2. Use indexing to create two new dataframes:
    • one containing the month with the maximum value in 2002
    • one containing the month with the maximum value in 2013
  3. Compare these new dataframes.
    • Do they occur in the same season?
    • What do you notice about the precipitation value for the maximum month in 2013 (Sept), as compared to that same month in 2002?
precip_2002precip_2013
count12.00000012.000000
mean1.1566672.845833
std0.9611014.953130
min0.0200000.270000
25%0.3800000.582500
50%1.1250001.265000
75%1.5050002.345000
max3.20000018.160000
monthsprecip_2002precip_2013seasons
4May3.22.66Spring
monthsprecip_2002precip_2013seasons
8Sept1.5218.16Fall

Leave a Comment