Lesson 1. Intro to Pandas Dataframes


Intro to Working With Pandas Dataframes in Python - Earth analytics bootcamp course module

Welcome to the first lesson in the Intro to Working With Pandas Dataframes in Python module. This tutorial walks you through importing tabular data (.csv) to pandas dataframes as well as summarizing, plotting, and running calculations on pandas dataframes.

In this lesson, you will learn about another data structure commonly used for tabular scientific data - pandas dataframes - and the key characteristics that distinguish this data structure from numpy arrays.

Learning Objectives

After completing this lesson, you will be able to:

  • Describe the data structure of pandas dataframes
  • Explain how pandas dataframes differ from numpy arrays

What You Need

Be sure that you have completed the lessons on Numpy Arrays.

Pandas Dataframes

In the lessons introducing Python lists and numpy arrays, you learned that both of these data structures are complex, meaning that they can store collections of values instead of just single values.

You also learned that while Python lists are flexible and can store data items of various types (e.g. integers, floats, text strings), numpy arrays require all data elements to be of the same type.

However, because of this requirement, numpy arrays can provide more functionality for running calculations such as element-by-element arithmetic operations (e.g. multiplication of each element in the numpy array by the same value) that Python lists do not support.
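As a quick reminder of that difference, here is a minimal sketch. The variable names are illustrative only, and the values are borrowed from the precipitation example used later in this lesson.

```python
import numpy as np

# The same three values stored as a Python list and as a numpy array
precip_list = [0.70, 0.75, 1.85]
precip_array = np.array(precip_list)

# Multiplying a list by an integer repeats the list;
# it does not multiply the values inside it
repeated = precip_list * 2

# Multiplying a numpy array applies the operation to each element
doubled = precip_array * 2

print(len(repeated))  # the list now holds six items
print(doubled)        # each value in the array has been doubled
```

The list ends up twice as long with its values unchanged, while the numpy array keeps its length and every element is multiplied.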

In today’s lessons, you will learn about another commonly used data structure for scientific data - pandas dataframes - which provide even more functionality for working with tabular data (i.e. data organized using rows and columns).

Pandas dataframes are data structures composed of rows and columns that can have header names, and each column can hold a different type of data (e.g. the first column containing integers and the second column containing text strings).

months      precip
January     0.70
February    0.75
March       1.85

Each value in a pandas dataframe is referred to as a cell that has a specific row index and column index within the tabular structure.
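The small table above can be built by hand to see this structure in practice. This is a minimal sketch, not the method used later in the module (which imports data from .csv files); the name avg_monthly_precip is an assumption chosen for illustration.

```python
import pandas as pd

# Build the example table from a dictionary:
# each key becomes a column header, each list becomes that column's values
avg_monthly_precip = pd.DataFrame({
    "months": ["January", "February", "March"],
    "precip": [0.70, 0.75, 1.85],
})

print(avg_monthly_precip)

# Each column keeps its own data type (text in months, floats in precip)
print(avg_monthly_precip.dtypes)

# A single cell is identified by [row index, column index];
# row 0, column 1 is the precip value for January
cell = avg_monthly_precip.iloc[0, 1]
print(cell)
```

Note that pandas adds a default row index (0, 1, 2) on the left when the dataframe is printed, and that indexing starts at 0 as it does for lists and numpy arrays.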

Distinguishing Characteristics of Pandas Dataframes

These characteristics (i.e. tabular format with header names for columns or rows) make pandas dataframes very versatile, not only for storing different types of data but also for maintaining the relationships between cells across the same row and/or column.

Recall that in the lessons on numpy arrays, you could not easily connect the values across precip and months using numpy arrays. Within pandas dataframes, the relationship between the value January in the months column and the value 0.70 in the precip column is maintained.

These two values (January and 0.70) are considered the same record, representing the same observation in the pandas dataframe.
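One way to see that a record stays together is to select a whole row, or to reorder the rows. The sketch below rebuilds the example table under the assumed name avg_monthly_precip and uses the standard pandas .iloc and .sort_values() calls.

```python
import pandas as pd

avg_monthly_precip = pd.DataFrame({
    "months": ["January", "February", "March"],
    "precip": [0.70, 0.75, 1.85],
})

# Selecting the first row returns both values of the same record
first_record = avg_monthly_precip.iloc[0]
print(first_record["months"], first_record["precip"])

# Sorting by precip reorders whole rows,
# so each month stays attached to its own precip value
sorted_precip = avg_monthly_precip.sort_values("precip", ascending=False)
print(sorted_precip)
```

With two separate numpy arrays, sorting the precip values would leave the months array in its original order, and the pairing between them would be lost unless you reordered both arrays yourself.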

In addition, pandas dataframes differ from numpy arrays in other key ways:

  1. Unlike numpy arrays, each column in a pandas dataframe can have a labeled name (i.e. header name such as months) and can contain a different type of data from its neighboring columns.

  2. Each cell within the pandas dataframe can be identified by its combined row and column index (e.g. [row index, column index]). All cells have both a row index and a column index, even if there is only one row and/or one column in the pandas dataframe.

  3. In addition to indexing by location, you can also query for data within pandas dataframes based on specific values or attributes.

  4. Because of this tabular indexing, you can query and run calculations on pandas dataframes across an entire row, an entire column, or a specific cell or series of cells based on either location or attribute values.

  5. Due to their inherent tabular structure, pandas dataframes also allow for cells to have null or blank values.
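The characteristics listed above can be sketched in a few lines. The April row with a missing value is hypothetical and was added only to illustrate null cells; it is not part of the lesson's data, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Example table plus one hypothetical record (April) with a missing value
avg_monthly_precip = pd.DataFrame({
    "months": ["January", "February", "March", "April"],
    "precip": [0.70, 0.75, 1.85, np.nan],
})

# Indexing by location: [row index, column index]
march_precip = avg_monthly_precip.iloc[2, 1]

# Querying by value: keep only the rows where precip is greater than 0.72
# (a missing value never satisfies the comparison, so April is excluded)
wetter_months = avg_monthly_precip[avg_monthly_precip["precip"] > 0.72]

# Running a calculation across an entire column;
# by default, mean() skips missing values
mean_precip = avg_monthly_precip["precip"].mean()

# Counting the null cells in a column
missing_count = avg_monthly_precip["precip"].isna().sum()

print(march_precip)
print(wetter_months)
print(mean_precip)
print(missing_count)
```

The query returns a new dataframe that contains the complete records for the matching rows, so the month names come along with the precip values.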

In the lessons that follow, you will review these benefits of working with pandas dataframes.
