Lesson 2. Import CSV Files Into Pandas Dataframes


In this lesson, you will write Python code in Jupyter Notebook to import tabular data from text files (.csv) into pandas dataframes.

Learning Objectives

After completing this lesson, you will be able to:

  • Explain how .csv files are used to store and organize tabular data
  • Import tabular data from .csv files to pandas dataframes

What You Need

Be sure you have completed the lessons on Importing Text Files Into Numpy Arrays and on Intro to Pandas Dataframes.

Be sure that you have a subdirectory called data under your earth-analytics-bootcamp directory. For help with this task, please see the challenge for the lesson on Intro to Shell.

The code below is available in the ea-bootcamp-day-5 repository that you cloned to earth-analytics-bootcamp under your home directory.

Pandas Dataframes

In the lesson introducing pandas dataframes, you learned that these data structures that are inherently tabular, meaning that all values (or cells) have a row index and a column index, even if the data only has one row and/or one column.

You also learned that unlike numpy arrays, pandas dataframes are two-dimensional by default and are composed of rows and columns. Each column in a pandas dataframe can have a labeled name (i.e. header name) and can contain a different type of data from its neighboring columns.

You also learned that due to its inherent tabular structure, you can query and run calculations on pandas dataframes across an entire row, an entire column, or a specific cell or series of cells based on either location and attribute values.

In this lesson, you will learn how to import tabular data from text files (.csv) into pandas dataframes, so you can take advantage of the benefits of working with pandas dataframes.

CSV Files For Tabular Data

In this lesson, you will work with tabular data that orignate from comma delimited (.csv), or CSV files. As you learned in the lessons on numpy arrays, CSV files are a very common file format used to collect and organize scientific data.

You also learned that unlike plain-text files which simply list out the values on separate lines without any symbols or delimiters, CSV files use commas (or some other delimiter like tab spaces or semi-colons) to indicate separate values.

CSV files also support labeled names for the columns, referred to as headers. This means that CSV files can easily support multiple columns of related data.

Furthermore, these columns are data do not all have to be of the same type (i.e. all numeric or text strings).

For example, data for the average monthly precipitation data for Boulder, CO and the month names can actually be stored together in a comma delimited (.csv) file.

months, precip
Jan, 0.70
Feb, 0.75
Mar, 1.85
Apr, 2.93
May, 3.05
June, 2.02
July, 1.93
Aug, 1.62
Sept, 1.84
Oct, 1.31
Nov, 1.39
Dec, 0.84

Due to its tabular structure with headers, CSV files are very useful for collecting and organizing datasets that contain related data of different types and across multiple locations and/or timeframes.

In this lesson, you will import tabular data from:

  • a csv file containing the average monthly precipitation data for Boulder, CO and the month and season names

Begin Writing Your Code

From previous lessons, you know how to import the necessary Python packages to set your working directory and download the needed datasets using the os and urllib packages.

To work with pandas dataframes, you will also need to import the pandas package with the alias pd. Begin by reviewing these tasks.

Import Packages

# import necessary Python packages
import os
import urllib.request
import pandas as pd

# print message after packages imported successfully
print("import of packages successful")
import of packages successful

Set Working Directory

Remember that you can check the current working directory using os.getcwd() and set the current working directory using os.chdir().

# set the working directory to the `earth-analytics-bootcamp` directory
# replace `jpalomino` with your username here and all paths in this lesson
os.chdir("/home/jpalomino/earth-analytics-bootcamp/")

# print the current working directory
os.getcwd()
'/home/jpalomino/earth-analytics-bootcamp'

Download Data

Recall that you can use the urllib package to download data from the Earth Lab Figshare.com repository.

For this lesson, you will download a .csv file containing the average monthly precipitation data for Boulder, CO and the month and season names.

# use `urllib` download files from Earth Lab figshare repository

# download .csv containing monthly average precipitation for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12710618", 
                           filename = "data/avg-precip-months-seasons.csv")

# print message that data downloads were successful
print("datasets downloaded successfully")
datasets downloaded successfully

Import Tabular Data Into Pandas Dataframes

Using the read_csv() function from the pandas package, you can import tabular data from CSV files into pandas dataframe by specifying a parameter value for the file name (e.g. pd.read_csv("filename.csv")).

Remember that you gave pandas alias of pd, so you will use pd to call pandas functions. Be sure to update the path to the CSV file to your home directory.

# import the monthly average precipitation values as a pandas dataframe
avg_precip = pd.read_csv("/home/jpalomino/earth-analytics-bootcamp/data/avg-precip-months-seasons.csv")

Recall from previous lessons that you can check the type of any data structure using type(variablename).

For pandas dataframes, you can also easily see the data by simply calling the name of the pandas dataframe. No print() function needed.

# print the type for the pandas dataframe
print(type(avg_precip))

# print the values in the pandas dataframe
avg_precip
<class 'pandas.core.frame.DataFrame'>
monthsprecipseasons
0Jan0.70Winter
1Feb0.75Winter
2Mar1.85Spring
3Apr2.93Spring
4May3.05Spring
5June2.02Summer
6July1.93Summer
7Aug1.62Summer
8Sept1.84Fall
9Oct1.31Fall
10Nov1.39Fall
11Dec0.84Winter

As you can see, the months and precip data can exist together in the same pandas dataframe, which differs from numpy arrays. You can see that there is also a column for seasons containing text strings.

You can also see that the indexing still begins with [0], as it does for Python lists and numpy arrays.

Notice that you did not have to use the print() function to see the contents of the pandas dataframe, and that it is displayed with clear tabular formatting.

Congratulations! You have now learned how to import tabular data from CSV files into pandas dataframes.

Optional Challenge

Test your Python skills to:

  1. Download a .csv file containing the monthly precipitation for Boulder, CO for the years 2002 and 2013 and the month and season names (precip-2002-2013-months-seasons.csv)from https://ndownloader.figshare.com/files/12710621.
    • Be sure to assign a useful variable name that is short but indicative of what it contains (e.g. precip_2002_2013).
  2. Import the data from this .csv file into a pandas dataframe.

  3. Print the data type of your new pandas dataframe as well as its contents.
<class 'pandas.core.frame.DataFrame'>
monthsprecip_2002precip_2013seasons
0Jan1.070.27Winter
1Feb0.441.13Winter
2Mar1.501.72Spring
3Apr0.204.14Spring
4May3.202.66Spring
5June1.180.61Summer
6July0.091.03Summer
7Aug1.441.40Summer
8Sept1.5218.16Fall
9Oct2.442.24Fall
10Nov0.780.29Fall
11Dec0.020.50Winter

Leave a Comment