Lesson 2. Use Tabular Data for Earth Data Science


Learning Objectives

At the end of this activity, you will be able to:

  • Define the structure of tabular data.
  • Describe the difference between the two common types of tabular text file formats: txt and csv files.
  • Be able to list some commonly used scientific data types that often are downloaded in a tabular format.

What is Tabular Data?

Tabular data are data that are stored in a row / column format. Columns (and sometimes rows) are often identified by headers, which if named correctly, explain what is in that row or column. You may already be familiar with spreadsheet tools such as Excel and Google Sheets that can be used to open tabular data.

Tabular Data Structure

In the example below, you see a table of values that represent precipitation for 3 days. The headers in the data include

  • day and
  • precipitation-mm
dayprecipitation-mm
monday0
tuesday1
wednesday5

The tabular data above contains 4 rows - the first of which (row 1) is a header row and subsequent rows contain data. The table also has 2 columns.

Common Tabular Data File Types: .csv and .txt

Tabular data can be downloaded in many different file formats. Spreadsheet formats include .xls and xlsx which can be directly opened in Microsoft Excel. When you are downloading Earth and Environmental data, you will often see tablular data stored in file formats including:

  • .csv: Comma Separated Values - This file has each column separated (delimited) by a comma.
  • .txt: A basic text file. In a txt file, often the delimiter (the thing that separates out each column) can vary. Delimiters are discussed below in more detail.

These formats are text based and often can be opened in a text editor like Atom or Notepad. They can be then imported into Python using Pandas for further exploration and processing.

Data Tip: The challenge with graphical user interface (GUI) based tools like Excel is that they often have limitations when it comes to working with larger files. Further, it becomes difficult to recreate workflows implemented in Excel because you are often pressing buttons rather than scripting workflows. You can use Open Source Python to implement any workflow you might implement in Excel and that workflow can become fully sharable and reproducible!

Text Files & Delimiters

A delimiter refers to the character that defines the boundary for different sets of information. In a text file, the delimiter defines the boundary between columns. A line break (a return) defines each row.

Below you will find an example of a space delimited text file. In the example below, each column of data is separated by a comma ,. The data also include a header row which is also separated by commas.

site_code, year, month, day, hour, minute, second, time_decimal, value, value_std_dev  
BRW,1973,1,1,0,0,0,1973.0,-999.99,-99.99
BRW,1973,2,1,0,0,0,1973.0849315068492,-999.99,-99.99 
BRW,1973,3,1,0,0,0,1973.1616438356164,-999.99,-99.99 

Here is an example of a space delimited text file. In the example below, each column of data are separated by a single space.

site_code year month day hour minute second time_decimal value value_std_dev  
BRW 1973 1 1 0 0 0 1973.0 -999.99 -99.99
BRW 1973 2 1 0 0 0 1973.0849315068492 -999.99 -99.99 
BRW 1973 3 1 0 0 0 1973.1616438356164 -999.99 -99.99 

There are many different types of delimiters including:

  • tabs
  • commas
  • 1 (or more) spaces

Sometimes you will find other characters used as delimiters but the above-listed options are the most common.

Data Tip: The .csv file format is most often delimited by a comma. Hence the name:

Earth and Environmental Data That Are Stored In Text File Format

There are many different types of data that are stored in text and tabular file formats. Below you will see a few different examples of data that are provided in this format. You will also explore some of the cleanup steps that you need to import and begin to work with the data.

Data Tip: Not all text files store tabular text (character) based data. The .asc file format is a text based format that stores spatial raster data.

# Import packages
import os
import matplotlib.pyplot as plt
import pandas as pd

If you have a url that directly links directly to a file online, you can open it directly using pandas .read_csv(). Have a look at the data below - and notice that is has:

  • 3 columns: months, precip and seasons
  • 12 rows: notice that the first row is numered as 0. This is because indexing in Python always starts at 0 rather than 1.

Data Tip: You can learn more about zero-based indexing in the chapter on lists in this textbook

# Download and open the .csv file using Pandas
avg_monthly_precip = pd.read_csv(
    "https://ndownloader.figshare.com/files/12710618")

# View the data that you just downloaded and opened
avg_monthly_precip
monthsprecipseasons
0Jan0.70Winter
1Feb0.75Winter
2Mar1.85Spring
3Apr2.93Spring
4May3.05Spring
5June2.02Summer
6July1.93Summer
7Aug1.62Summer
8Sept1.84Fall
9Oct1.31Fall
10Nov1.39Fall
11Dec0.84Winter

In Pandas, this table format is referred to as a dataframe. You can view some stats about the dataframe including the number of columns and rows in the data using .info().

avg_monthly_precip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   months   12 non-null     object 
 1   precip   12 non-null     float64
 2   seasons  12 non-null     object 
dtypes: float64(1), object(2)
memory usage: 416.0+ bytes

Finally, you can easily plot the data using .plot().

# Plot the data
f, ax = plt.subplots()
avg_monthly_precip.plot(x="months",
                        y="precip",
                        title="Precipitation (mm) for One Year",
                        ax=ax)
ax.set(xlabel='Month',
       ylabel='Precipitation (mm)')
plt.show()

## Challenge

  • Use Python to determine the type of data stored in avg_monthly_precip

HINT: you learned how to determine the type of an object in the variables lessons last week.

## Challenge

In most programming languages, you can customize the options for how a function runs by using parameters. Examples of parameters in the plot above include:

  • x="months" - tell python which data to place on the x-axis
  • y="precip" - tell python which data to place on the y-axis

Above you created a line plot. You can use the kind="" parameter to modify the type of plot that pandas created. You can use the color="" parameter to specify a color that you wish to use for each bar in the plot.

Do the following:

  1. copy the code below.
  2. Add kind="bar" to the .plot() method.
  3. Specify the color of each bar using the color="" parameter.

Run your code and see what the final plot looks like. You can select any color that you wish to complete your plot.

Use this link to find a list of colors (open it in a new browser tab!) https://het.as.utexas.edu/HET/Software/Matplotlib/api/colors_api.html

f, ax = plt.subplots()
avg_monthly_precip.plot(x="months",
                        y="precip",
                        title="Precipitation (mm) for One Year",
                        ax=ax)
ax.set(xlabel='Month',
       ylabel='Precipitation (mm)')
plt.show()

Cleaning Tabular Text Files So You Can Open Them in Python

Missing Data Values & Headers in Text Files

Not all text files are as simple as the example above. Many text files have several lines of header text above the data that provide you with useful information about the data itself. This data is referred to as metadata.

Also, often times, there are data missing from the data that were collected. These missing values will be identified using a specific value that is hopefully documented in the metadata for that file.

Next you will explore some temperature data that need to be cleaned up.

Data Tip: You can visit the NOAA NCDC website to learn more about the data you are using below.

  • Miami, Florida CSV: https://www.ncdc.noaa.gov/cag/city/time-series/USW00012839-tmax-12-12-1895-2020.csv
  • Seattle, Washington CSV: https://www.ncdc.noaa.gov/cag/city/time-series/USW00013895-tmax-1-5-1895-2020.csv
# Open temperature data for Miami, Florida
miami_temp_url = "https://www.ncdc.noaa.gov/cag/city/time-series/USW00012839-tmax-12-12-1895-2020.csv" 

miami_temp = pd.read_csv(miami_temp_url)
miami_temp
MiamiFloridaMaximum TemperatureJanuary-December
0Units: Degrees FahrenheitNaNNaNNaN
1Missing: -99NaNNaNNaN
2DateValueNaNNaN
319481284.5NaNNaN
419491283.9NaNNaN
...............
7020151285.6NaNNaN
7120161284.5NaNNaN
7220171285.1NaNNaN
7320181285.0NaNNaN
7420191285.6NaNNaN

75 rows × 4 columns

Notice that the data above contain a few extra rows of information. This information however is important for you to understand.

  • Missing: -99 – this is the value that represents the “no data” value. Misisng data might occur if a sensor stops working or a measurement isn’t recorded. You will want to remove any missing data values.
  • Units: Degrees Fahrenheit – it’s always important to first understand the units of the data before you try to interpret what the data are showing!

Below you will use all of the information stored in the header to import your data. You will also remove the first few rows of data because they don’t actually contain any data values. These rows contain metadata.

Function Parameters in Python

A parameter refers to an option that you can specify when running a function in Python. You can adjust the parameters associated with importing your data in the same way that you adjusted the plot type and colors above.

Below you use:

  • skiprows=: to tell Python to skip the first 3 rows of your data
  • na_values=: to tell Python to reassign any missing data values to “NA”

NA refers to missing data. When you specify a value as NA (NaN or Not a Number in Python), it will not be included in plots or any mathematical operations.

Data Tip: You can learn more about no data values in Pandas in the intermediate earth data science textbook

# Open the Miami data skipping the first 3 rows and setting no data values
miami_temp = pd.read_csv(miami_temp_url,
                        skiprows=3,
                        na_values=-99)

# View the first 5 rows of the data
miami_temp.head()
DateValue
019481284.5
119491283.9
219501283.2
319511283.6
419521284.2

## Challenge

Now that you have imported temperature data for Miami, plot the data using the code example above!! In your plot code, set Date as your x-axis value and Value column as your y-axis value.

## Challenge

Use the link below to open and plot temperature data for Seattle, Washington.

https://www.ncdc.noaa.gov/cag/city/time-series/USW00013895-tmax-1-5-1895-2020.csv

## Challenge – OPTIONAL

Copy the code below into your code. Run the code. It should download and open a new dataset that has CO2 emissions for a field site in Barrow, Alaska (brw).

The code below will download your data into your working directory. You should be able to open it using the filename co2-emissions-barrow.csv.

# Download greenhouse gas CO2 data
import urllib.request

greenhouse_gas_url = "ftp://aftp.cmdl.noaa.gov/data/trace_gases/co2/in-situ/surface/brw/co2_brw_surface-insitu_1_ccgg_MonthlyData.txt"

urllib.request.urlretrieve(url=greenhouse_gas_url,
                           filename="co2-emissions-barrow.csv")

Once you have downloaded the data

  1. Read the data in using pandas read_file()
  2. The data has some additional rows of information stored as metadata. You will need to use the skiprows= parameter to skip those metadata rows and properly import the data. HINT: remember when you use skiprows to consider 0-based indexing.
  3. Finally plot the data using pandas. plot the "time_decimal" column on the x-axis and "value" on the y-axis.

Leave a Comment