Lesson 3. Manipulate, Summarize and Plot Numpy Arrays


In this lesson, you will write Python code in Jupyter Notebook to manipulate and summarize numpy arrays using the numpy package. You will also plot numpy arrays using matplotlib.pyplot.

Learning Objectives

After completing this hands-on exercise, you will be able to:

  • Explain the difference between one-dimensional and two-dimensional numpy arrays
  • Use indexing to select data from these numpy arrays
  • Run arithmetic (e.g. addition, multiplication) operations on these numpy arrays
  • Summarize one-dimensional numpy arrays (e.g. averages, maximum values)
  • Create plots using one-dimensional numpy arrays

What You Need

Be sure that you have completed the previous lesson on Import Text Data Into Numpy Arrays.

The code below is available in the ea-bootcamp-day-4 repository that you cloned to earth-analytics-bootcamp under your home directory.

Indexing For Numpy Arrays

In the lessons on lists, you learned that Python indexing begins with [0], and that you can use indexing to query the value of items within Python lists.

You can also access elements (i.e. values) in numpy arrays using indexing.

One-dimensional Numpy Arrays

For one-dimensional numpy arrays, you only need to specific one index value to access the elements in the numpy array (e.g. arrayname[index,]).

The example below is an one-dimensional array that has 3 elements, or values.

avg_monthly_precip = numpy.array([0.70, 0.75, 1.85])

You can use avg_monthly_precip[2,] to get the third element in (1.85) from this one-dimensional numpy array.

Recall that you are using use the index [2] for the third place because Python indexing begins with [0], not with [1].

Two-dimensional Numpy Arrays

With two-dimensional arrays, you need to specify to both a row index and a column index.

The example below is a two-dimensional array with 2 rows and 3 columns.

precip_2002_2013 = numpy.array([[1.07, 0.44, 1.5],
                              [0.27, 1.13, 1.72]])

You can use precip_2002_2013[1, 2] to get the element in the second row, third column (1.72) of this two-dimensional numpy array.

Just like you saw for the one-dimensional numpy array, you use the index [1,2] for the second row and third column because Python indexing begins with [0], not with [1]

In this lesson, you will use indexing to select elements within one-dimensional and two-dimensional numpy arrays, and you will learn how to manipulate, summarize, and plot these numpy arrays.

You will use the same datasets from the previous lesson on importing text data:

  • a .txt file containing the average monthly precipitation data for Boulder, CO
  • a .csv file containing the monthly precipitation for Boulder, CO for the years 2002 and 2013

Begin Writing Your Code

Import Packages

From the previous lesson, you have already learned how to import the necessary packages to set the working directory and download the needed datasets using the os and urllib packages.

To work with numpy arrays, you will also need to import the numpy package with the alias np, and you will need to import the matplotlib.pyplot module with the alias plt to plot data. Begin by reviewing these tasks.

# import necessary Python packages
import os
import numpy as np
import urllib.request
import matplotlib.pyplot as plt

# print message after packages imported successfully
print("import of packages successful")
import of packages successful

Set Working Directory

Remember that you can check the current working directory using os.getcwd() and set the current working directory using os.chdir().

# set the working directory to the `earth-analytics-bootcamp` directory
# replace `jpalomino` with your username here and all paths in this lesson
os.chdir("/home/jpalomino/earth-analytics-bootcamp/")

# print the current working directory
os.getcwd()
'/home/jpalomino/earth-analytics-bootcamp'

Download Data

In the previous lesson, you used the urllib package to download data from the Earth Lab Figshare.com repository. You will use these same datasets in this lesson.

# use `urllib` download files from Earth Lab figshare repository

# download .txt containing monthly average precipitation for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12565616", 
                           filename = "data/avg-monthly-precip.txt")

# download .txt containing month names
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12565619", 
                           filename = "data/months.txt")

# download .csv containing monthly average precipitation for Boulder, CO
urllib.request.urlretrieve(url = "https://ndownloader.figshare.com/files/12707792", 
                           filename = "data/monthly-precip-2002-2013.csv")


# print message that data downloads were successful
print("datasets downloaded successfully")
datasets downloaded successfully

Import Data Into Numpy Arrays

You also already learned how to import data from text files into numpy arrays. Be sure to update the paths for the files to your home directory.

# import the monthly average values from `avg-monthly-precip.txt` as a numpy array
avg_monthly_precip = np.loadtxt(fname = "/home/jpalomino/earth-analytics-bootcamp/data/avg-monthly-precip.txt")

# import the names of the months from month.txt as a numpy array
months = np.genfromtxt("/home/jpalomino/earth-analytics-bootcamp/data/months.txt", dtype='str')

# import the monthly average values from `monthly-precip-2002-2013.csv` as a numpy array
precip_2002_2013 = np.loadtxt(fname= "/home/jpalomino/earth-analytics-bootcamp/data/monthly-precip-2002-2013.csv", delimiter = ",")

Describe Contents of Numpy Arrays

To begin working with numpy arrays, it is helpful to get some more details about the contents of data, such as the number of rows and columns in the data.

You can use .shape after the variable name of the numpy array (e.g. variablename.shape) to get its dimensions (i.e. number of rows and columns).

# print the dimensions of months
print(months.shape)
(12,)

Use .shape to compare the dimensions of avg_monthly_precip versus precip_2002_2013.

# print the dimensions of avg_monthly_precip
print(avg_monthly_precip.shape)

# print the dimensions of precip_2002_2013
print(precip_2002_2013.shape)
(12,)
(2, 12)

The output for avg_monthly_precip indicates that it is composed of 12 elements along one-dimension. In fact, this numpy arrays is one-dimensional, meaning that all values exist within a single vector or list.

The output for precip_2002_2013 indicates that it is composed of 2 rows and 12 columns. This is two-dimensional numpy array that has two observations - one for the year 2002 and another for the year 2013 - and 12 measurements for observation - one for each month of the year.

Use Indexing to Query Numpy Arrays

One-dimensional Numpy Arrays

By listing the dimensions of avg_monthly_precip using .shape, you know that it contains 12 elements along one dimension (e.g. [12,]).

As this numpy array is one-dimensional, you can leave the second parameter blank when use indexing to access elements in this numpy array (e.g. precip[X,]).

For example, because indexing in Python begins with [0], you can use the index [11,] to query the last element in avg_monthly_precip.

# select the last element in `avg_monthly_precip` using the index [11,]
avg_monthly_precip[11,]
0.84

Check what happens when you query for an index location that does not exist in the array, say the index [12,].

# change the value below from 11 to 12 to check what happens when you query for an index location that does not exist
avg_monthly_precip[11,]
0.84

You can also select a series of values from one-dimensional numpy arrays such as the third, fourth and fifth values.

Note that the index structure is inclusive of the first index value, but not the second index value. You are providing a start index value for the selection and an end index value that is not included in the selection.

avg_monthly_precip[2:5]
array([1.85, 2.93, 3.05])

Two-dimensional Numpy Arrays

Using .shape, you also saw that precip_2002_2013 has row count of 2 with a column count of 12.

Because precip_2002_2013 is a two-dimensional numpy array, you need to specify both a row index and a column index to select elements in the numpy array

For example, because indexing in Python begins with [0], you can use the index [0,0] to query the first element in precip_2002_2013 (i.e. first row, first column).

# select the element in the first row, first column in the array
precip_2002_2013[0,0]
1.07

Or, use the index [1,11] to query the last element in precip_2002_2013 (i.e. last row, last column).

# select the element in the last row, last column
precip_2002_2013[1,11]
0.5

For two-dimensional numpy arrays, you can also use a series for the row index and/or column index to select multiple elements using the index structure [rowindex : rowindex, columnindex : columnindex].

Like with the one-dimensional arrays, the index structure is inclusive of the first index, but not the second index. Again, you are providing a start index value for the selection and an end index value that is not included in the selection.

For example, you can use the index [0:1, 0:3] to select the first row and the first three columns (again because Python indexing begins with [0]).

# select the first row and the first three columns
precip_2002_2013[0:1, 0:3]
array([[1.07, 0.44, 1.5 ]])

If you wanted to include the second row and fourth column, you would need to use the index [0:2, 0:4].

# select the first two rows and the first four columns
precip_2002_2013[0:2, 0:4]
array([[1.07, 0.44, 1.5 , 0.2 ],
       [0.27, 1.13, 1.72, 4.14]])

You can also store selected data as a new numpy array.

For example, you can create a new numpy array for the precipitation data in 2002 by selecting the first row of values from precip_2002_2013.

# select the first row and all twelve columns of monthly values
precip_2002 = precip_2002_2013[0:1, 0:12]

# print data in `precip_2002`
precip_2002
array([[1.07, 0.44, 1.5 , 0.2 , 3.2 , 1.18, 0.09, 1.44, 1.52, 2.44, 0.78,
        0.02]])

You can check the .shape of the new array to see that it has remained a two-dimensional array, but it only has one row of data, not two like precip_2002_2013.

# print dimensions of `precip_2002`
precip_2002.shape
(1, 12)

Run Calculations on Numpy Arrays

Numpy arrays calculations highlight the major differences between Python lists and numpy arrays.

Recall that in lessons on variables and lists, you created separate variables for each monthly average precipitation value to convert it to millimeters (e.g. jan = 0.70 * 25.4), and then you created a new list containing all of these converted monthly values.

Numpy arrays make it easy to run calculations on data as needed, while Python lists do not support these kinds of calculations.

Numpy arrays support mathematical operations on an element-by-element basis, meaning that you can actually run one operation (e.g. * 25.4) on the entire array with a single line of code.

Review this primary difference betweens lists and numpy arrays below.

# Uncomment the code below to run it. Note: this code will result in an error, as you cannot run this operation on a list
#preciplist = [0.70, 0.75, 1.85, 2.93, 3.05, 2.02, 1.93, 1.62, 1.84, 1.31, 1.39, 0.84]
#preciplist = preciplist * 25.4
# print the values in the array `avg_monthly_precip`
print(avg_monthly_precip)

# multiply each element in the array `avg_monthly_precip` by 25.4
# assign the results to a new array also called `avg_monthly_precip`
avg_monthly_precip = avg_monthly_precip * 25.4

# print the values in the new array `avg_monthly_precip`
print(avg_monthly_precip)
[0.7  0.75 1.85 2.93 3.05 2.02 1.93 1.62 1.84 1.31 1.39 0.84]
[17.78  19.05  46.99  74.422 77.47  51.308 49.022 41.148 46.736 33.274
 35.306 21.336]

See how easy these calculations can be with numpy arrays! These arithmetic calculations will work on any numpy array, including multi-dimensional numpy arrays.

Recall the previous lessons on variables and lists. Instead of creating separate variables for each month to run these calculations, you can now create a single numpy array imported from avg-monthly-precip.txt and run a single multiplication operation on the entire numpy array to the convert the values from inches to millimeters.

Summarize Data in Numpy Arrays

Another great feature of numpy arrays is the ability to run summary statistics (e.g. calculating averages, finding min or max values) across the entire array of values. Lists do not support this functionality either.

For example, you can use the mean() function in numpy to calculate the average value across an array (e.g. np.mean(arrayname)). You can also store results as a new variable.

# calculate the mean and store the result as a new variable
mean_avg_precip = np.mean(avg_monthly_precip)

# you can expand the print statement to include a text string to label the data output
print("mean of average monthly precipitation:", mean_avg_precip)
mean of average monthly precipitation: 42.820166666666665

Similarly, we can use min() and max() to find the minimum and maximum values in an array.

# find the min value and store the result as a new variable
min_avg_precip = np.min(avg_monthly_precip)

# find the max value and store the result as a new variable
max_avg_precip = np.max(avg_monthly_precip)

# print these values along with a message that labels each result
print("minimum of average monthly precipitation:", min_avg_precip)
print("maximum of average monthly precipitation:", max_avg_precip)
minimum of average monthly precipitation: 17.779999999999998
maximum of average monthly precipitation: 77.46999999999998

Notice that in this code, you can only identify the value that is the minimum or maximum but not the month in which the value occurred. This is because precip and months are not connected in an easy way that would allow you to determine the month that matches the values.

You could use indexing to determine the index location of the maximum value in precip and then query that same index location in months, but rest assured, there is an easier way to do this!

In future lessons on pandas dataframes, you will learn how to work with data in a tabular structure, so that precip values are linked with their corresponding month names.

Plot Numpy Arrays

Since you have now completed an easy calculation to convert the precipitation values using numpy array calculations, you can use this numpy array to plot the precipitation data, rather than relying on Python lists.

In order to use multiple numpy arrays within the same plot, you need to make sure that the dimensions of the arrays are compatible.

You have already done this by checking the .shape of avg_monthly_precip and months, which indicates that both have 12 elements along one dimension ((12,)).

You can re-use your matplotlib code from the lesson on plotting with matplotlib to create the same plot of average monthly precipitation in Boulder, CO using numpy arrays. Recall that you can set the color in the plot (e.g. grey).

# set plot size for all plots that follow
plt.rcParams["figure.figsize"] = (8, 8)

# create the plot space upon which to plot the data
fig, ax = plt.subplots()

# add the x-axis and the y-axis to the plot
ax.bar(months, avg_monthly_precip, color="grey")

# set plot title
ax.set(title="Average Monthly Precipitation in Boulder, CO")

# add labels to the axes
ax.set(xlabel="Month", ylabel="Precipitation (mm)");

png

Note that precip_2002 is still two dimensional array, so you cannot use it to plot data against months, which is a one-dimensional array.

In future lessons, you will learn how to convert two-dimensional numpy arrays to one-dimensional numpy arrays.

Congratulations! You have learned how to use indexing to select data from one-dimensional and two-dimensional numpy arrays, and how to run calculations and summary statistics on these numpy arrays. You also learned how to plot data from one-dimensional numpy arrays.

Optional Challenge

Test your Python skills to:

  1. Convert the data values in precip_2002_2013 from inches to millimeters (one inch = 25.4 millimeters).

  2. Create a new numpy array for 2013 by selecting all data values in the last row in precip_2002_2013 (i.e. data for the year 2013).

  3. Calculate the minimum, mean, and maximum values for 2013.

  4. Print these values along with a message that labels each result (e.g. mean precipitation in 2013:).

array([[  6.858,  28.702,  43.688, 105.156,  67.564,  15.494,  26.162,
         35.56 , 461.264,  56.896,   7.366,  12.7  ]])
minimum precipitation in 2013: 6.858
mean precipitation in 2013: 72.28416666666665
maximum precipitation in 2013: 461.26399999999995

Leave a Comment