In this lesson, you will learn about another data structure commonly used for tabular scientific data -
pandas dataframes - and the key characteristics that distinguish this data structure from numpy arrays.
After completing this lesson, you will be able to:
- Describe the data structure of pandas dataframes.
- Explain how pandas dataframes differ from numpy arrays.
What You Need
Be sure that you have completed the lessons on Numpy Arrays.
In the lessons introducing
Python lists and
numpy arrays, you learned that both of these data structures are complex, meaning that they can store collections of values, instead of just single values.
You also learned that while
Python lists are flexible and can store data items of various types (e.g. integers, floats, text strings),
numpy arrays require all data elements to be of the same type.
However, because of this requirement,
numpy arrays can provide more functionality for running calculations such as element-by-element arithmetic operations (e.g. multiplication of each element in the
numpy array by the same value) that
Python lists do not support.
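As a quick reminder of that difference, here is a minimal sketch. The precipitation values and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical monthly precipitation values in inches (illustrative only)
precip_list = [0.70, 0.75, 1.85]
precip_array = np.array(precip_list)

# Multiplying a numpy array by a number scales every element
precip_mm = precip_array * 25.4

# Multiplying a list by an integer repeats the list instead of scaling it
repeated = precip_list * 2

# Multiplying a list by a float is not supported and raises a TypeError
try:
    precip_list * 25.4
    list_supports_float = True
except TypeError:
    list_supports_float = False

print(precip_mm)
print(repeated)
print(list_supports_float)
```

The array returns a new array of scaled values, while the list either repeats its contents or raises an error, depending on the type of the multiplier.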
In today’s lessons, you will learn about another commonly used data structure for scientific data -
pandas dataframes - which provide even more functionality for working with tabular data (i.e. data organized using rows and columns).
Pandas dataframes are data structures that are composed of rows and columns that can have header names, and the columns in
pandas dataframes can be different types (e.g. the first column containing integers and the second column containing text strings).
Each value in a
pandas dataframe is referred to as a cell that has a specific row index and column index within the tabular structure.
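The structure described above can be sketched with a small example. The column names months and precip, and the values in them, are assumptions made for illustration; any tabular data with named columns would work the same way.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column
precip_df = pd.DataFrame({
    "months": ["January", "February", "March"],
    "precip": [0.70, 0.75, 1.85],
})

# Each column keeps its own data type
print(precip_df.dtypes)

# Header names for the columns
column_names = list(precip_df.columns)

# Number of rows and columns in the table
shape = precip_df.shape

print(column_names)
print(shape)
```

Note that the two columns hold different types of data, which a single numpy array of one data type could not do without converting the values.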
Distinguishing Characteristics of Pandas Dataframes
These characteristics (i.e. tabular format with header names for columns or rows) make
pandas dataframes very versatile, not only for storing different data types, but also for maintaining the relationships between cells across the same row and/or column.
Recall that in the lessons on
numpy arrays, you could not easily connect the values across
numpy arrays. Within
pandas dataframes, the relationship between the value
January in the
months column and the value
0.70 in the
precip column is maintained.
These two values (January and 0.70) are considered part of the same record, representing the same observation in the pandas dataframe.
In addition, pandas dataframes differ from numpy arrays in other key ways:
- Unlike numpy arrays, each column in a pandas dataframe can have a labeled name (i.e. header name such as months) and can contain a different type of data from its neighboring columns.
- Each cell within a pandas dataframe can be identified by its combined row and column index (e.g. [row index, column index]). All cells have both a row index and a column index, even if there is only one row and/or one column in the pandas dataframe.
- In addition to indexing by location, you can also query for data within pandas dataframes based on specific values or attributes.
- Because of this tabular indexing, you can query and run calculations on pandas dataframes across an entire row, an entire column, or a specific cell or series of cells based on either location or attribute values.
- Due to their inherent tabular structure, pandas dataframes also allow cells to have null or blank values.
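The characteristics listed above can be sketched in a few lines. This is only an illustration under stated assumptions: apart from January and 0.70, the months, values, and variable names are hypothetical, and the blank April value is included on purpose to show a null cell.

```python
import numpy as np
import pandas as pd

# Hypothetical data; April is left blank (NaN) on purpose
precip_df = pd.DataFrame({
    "months": ["January", "February", "March", "April"],
    "precip": [0.70, 0.75, 1.85, np.nan],
})

# Select one cell by location: [row index, column index]
jan_precip = precip_df.iloc[0, 1]

# Select records by value: rows where precip is greater than 0.72
wet_months = precip_df[precip_df["precip"] > 0.72]

# The months and precip values in a row stay together as one record
jan_record = precip_df[precip_df["months"] == "January"]

# Run a calculation on an entire column (missing values are skipped by default)
mean_precip = precip_df["precip"].mean()

# Count the null values in each column
missing_counts = precip_df.isna().sum()

print(jan_precip)
print(wet_months)
print(jan_record)
print(mean_precip)
print(missing_counts)
```

Here, iloc selects by location, while the comparisons inside the square brackets select by value. Both return the full record, so the month name and its precipitation value are never separated.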
In the lessons that follow, you will review these benefits of working with pandas dataframes.