Lesson 4. Work with tabular spreadsheet data in Python
This lesson introduces the data.frame
which is very similar to working with a spreadsheet in Python
.
Learning Objectives
At the end of this activity, you will be able to:
- Open
.csv
or text file containing tabular (spreadsheet) formatted data inPython
. - Quickly plot the data using the Pandas function
.plot()
- Quickly plot the data using Matplotlib plotting
What You Need
You need Python 3.x
and Jupyter notebooks
to complete this tutorial. Also you should have an earth-analytics
directory setup on your computer with a /data
directory with it.
In the homework from week 1, you used the code below to create a report with in Jupyter Notebooks
.
import numpy as np
import pandas as pd
import urllib
import os
from matplotlib import pyplot as plt
# Force notebooks to plot figures inline (in the notebook)
plt.ion()
# be sure to set your working directory using os.chdir() put your file path in the parenthesis
Be sure to set your working directory
os.chdir("path-to-you-dir-here/earth-analytics/data")
# download data from figshare (note - we did this in a previous lesson)
urllib.request.urlretrieve(url='https://ndownloader.figshare.com/files/7010681',
filename= 'data/boulder-precip.csv')
('data/boulder-precip.csv', <http.client.HTTPMessage at 0x110028390>)
Remember that earlier in the lessons you learned that the code above urllib.request.urlretrieve()
is used to download a datafile. In this case, the data are stored on Figshare - a popular data repository that is free to use if your data are cumulatively smaller than 20gb.
Remember that urllib.request.urlretrieve()
function has two function ARGUMENTS:
- url: this is the path to the data file that you wish to download
- filename: this is the location on your computer (in this case:
/data
) and name of the file when saved (in this case: boulder-precip.csv). So you downloaded a file from a url on figshare do your data directory. You named that fileboulder-precip.csv
.
Next, you read in the data using the function: pd.read_csv()
.
boulder_precip = pd.read_csv('data/colorado-flood/boulder-precip.csv')
boulder_precip.head()
Unnamed: 0 | DATE | PRECIP | |
---|---|---|---|
0 | 756 | 2013-08-21 | 0.1 |
1 | 757 | 2013-08-26 | 0.1 |
2 | 758 | 2013-08-27 | 0.1 |
3 | 759 | 2013-09-01 | 0.0 |
4 | 760 | 2013-09-09 | 0.1 |
# view the structure of the data.frame
boulder_precip.dtypes
Unnamed: 0 int64
DATE object
PRECIP float64
dtype: object
Optional challenge
What is the format associated with each column for the boulder_precip
data.frame? Describe the attributes of each format. Can you perform math on each column? Why or why not?
Introduction to the DataFrame
When you read data into Python using pd.read_csv()
it imports it into a DataFrame format. A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values and more) in columns. It is similar to a spreadsheet or an SQL database table.
A dataframe is composed of columns and rows. Each column in a dataFrame object has the same number of rows. Each cell in a dataframe is located or defined by a x,y (column, row) index value.
Remember that in python, this index value begins at 0!
A data frame can be created manually, however most commonly they are generated when you important a text file or spreadsheet into Python using the Pandas function pd.read_csv
.
Extracting / Specifying “columns” By Name
You can extract one single column from a dataFrame using the syntax:
data_frame_name['columnNameHere']
as follows:
# view the date column of the data frame using its name (or header)
boulder_precip['DATE']
0 2013-08-21
1 2013-08-26
2 2013-08-27
3 2013-09-01
4 2013-09-09
5 2013-09-10
6 2013-09-11
7 2013-09-12
8 2013-09-13
9 2013-09-15
10 2013-09-16
11 2013-09-22
12 2013-09-23
13 2013-09-27
14 2013-09-28
15 2013-10-01
16 2013-10-04
17 2013-10-11
Name: DATE, dtype: object
# view the precip column
boulder_precip['PRECIP']
0 0.1
1 0.1
2 0.1
3 0.0
4 0.1
5 1.0
6 2.3
7 9.8
8 1.9
9 1.4
10 0.4
11 0.1
12 0.3
13 0.3
14 0.1
15 0.0
16 0.9
17 0.1
Name: PRECIP, dtype: float64
View Structure of a Data Frame
You can explore the format of your data frame too. For instance, you can see how many rows and columns your dataframe has using the shape attribute. Here the shape of the object is returned as an array containing two numbers (# of rows, # of columns)
# view the number of rows and columns in your dataframe
boulder_precip.shape
(18, 3)
<div class="notice--warning" markdown="1">
## <i class="fa fa-pencil-square-o" aria-hidden="true"></i> ## Optional challenge
Using your DataFrame `boulder_precip`, try out the attributes & methods below to see what they return.
* boulder_precip.columns
* boulder_precip.shape
Take note of the output of shape - what format does it return the shape of the DataFrame in? HINT: [More on tuples, here.](https://docs.python.org/3/tutorial/datastructures.html)
* `boulder_precip.head()`
* What does `boulder_precip.head(15)` do?
* `boulder_precip.tail()`
</div>
# view the data structure of the data frame
boulder_precip.dtypes
Unnamed: 0 int64
DATE object
PRECIP float64
dtype: object
Calculate dataframe statistics
You can quickly calculate summary statistics too. First let’s explore the column names using .columns.values
. You can use the .describe()
function to get summary statistics about numeric columns in your data.
# view column names
boulder_precip.columns.values
array(['Unnamed: 0', 'DATE', 'PRECIP'], dtype=object)
# view summary statistics - for all columns
boulder_precip.describe()
Unnamed: 0 | PRECIP | |
---|---|---|
count | 18.000000 | 18.000000 |
mean | 764.500000 | 1.055556 |
std | 5.338539 | 2.288905 |
min | 756.000000 | 0.000000 |
25% | 760.250000 | 0.100000 |
50% | 764.500000 | 0.200000 |
75% | 768.750000 | 0.975000 |
max | 773.000000 | 9.800000 |
# view summary statistics - for just the precip column
boulder_precip['PRECIP'].describe()
count 18.000000
mean 1.055556
std 2.288905
min 0.000000
25% 0.100000
50% 0.200000
75% 0.975000
max 9.800000
Name: PRECIP, dtype: float64
# view a list of just the unique precipitation values
pd.unique(boulder_precip['PRECIP'])
array([0.1, 0. , 1. , 2.3, 9.8, 1.9, 1.4, 0.4, 0.3, 0.9])
Plot data
You can quickly plot your data too. Note that you are using the .plot()
function, which comes from Pandas.
# setup the plot
boulder_precip.plot('DATE', 'PRECIP', color = 'purple');

boulder_precip.plot.bar('DATE', 'PRECIP', color = 'purple');

# convert DATE field to a datetime structure
boulder_precip['DATE'] = pd.to_datetime(boulder_precip['DATE'])
boulder_precip.dtypes
Unnamed: 0 int64
DATE datetime64[ns]
PRECIP float64
dtype: object
Let’s now work with the matplotlib and take a little time to customize your plots. Below you add the following arguments to your plot:
- title: add a title to your plot
- legend = False: turn off the legend for the plot
- kind = bar: create a bar plot
fig, ax = plt.subplots()
ax.bar(boulder_precip['DATE'].values, boulder_precip['PRECIP'].values, data=boulder_precip, color = 'purple')
ax.set_title("Total Daily Precipitation")
ax.set_xlabel("Hour", fontsize=12)
ax.set_ylabel("Total Precip (inches)", fontsize=12)
plt.setp(ax.get_xticklabels(), rotation=45);

Share on
Twitter Facebook Google+ LinkedIn
Leave a Comment