# Lesson 4. Handle Missing Data in R

## Learning objectives

At the end of this activity, you will be able to:

• Understand why it is important to make note of missing data values.
• Define what an NA value is in R and how it is used in a vector.

## What you need

In the previous lesson you attempted to plot the first file’s worth of data by time. However, the plot did not turn out as planned. There were at least two values that likely represent missing data values:

• missing and
• 999.99

In this lesson, you will learn how to handle missing data values in R using readr and some basic data exploration approaches.

## Missing Data Values

Sometimes, your data are missing values. Imagine a spreadsheet in Microsoft Excel with cells that are blank. If the cells are blank, you don’t know for sure whether those data weren’t collected, or someone forgot to fill them in. To indicate that data are missing (not by mistake) you can put a value in those cells that represents no data.

The R programming language uses NA to represent missing data values.
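To make this concrete, here is a small sketch of how NA behaves in a vector. The precipitation values are made up for illustration and are not taken from the lesson data.

```r
# a made-up vector of precipitation values with one missing entry
precip <- c(0.1, 0.2, NA, 0.5)

# is.na() returns TRUE wherever a value is missing
is.na(precip)

# most summary functions return NA if any value is missing ...
mean(precip)

# ... unless you ask them to remove NA values first
mean(precip, na.rm = TRUE)
```

Because NA propagates through calculations like this, it is worth checking for missing values before you summarize or plot a column.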

Lucky for us, readr makes it easy to deal with missing data values too. To account for these, we use the na argument of read_csv():

na = "value_to_change_to_na_here"

You can also send na a vector of missing data values, like this: na = c("value1", "value2")
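The sketch below shows the na argument on a tiny file, assuming the readr package is installed. The file contents, the tmp_csv path and the toy_data name are all made up for illustration.

```r
library(readr)

# write a tiny, made-up csv to a temporary file (for illustration only)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("DATE,HPCP",
             "2003-01-01,0.1",
             "2003-01-02,missing",
             "2003-01-03,999.99",
             "2003-01-04,0.3"),
           tmp_csv)

# read the file and convert both placeholder values to NA
toy_data <- read_csv(tmp_csv, na = c("missing", "999.99"))

# HPCP is now numeric, with NA where the placeholders were
toy_data$HPCP
```

Keep in mind that the default for na is c("", "NA"). If you supply your own vector and still want those strings treated as missing, add them to it.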

# load libraries
library(readr)
library(ggplot2)
library(dplyr)


Let’s go through our workflow again, but this time account for missing values. First, let’s have a look at the unique values contained in our HPCP column.

# import data using readr
# grab first url from the file
first_csv <- all_paths$url[1]

# open data
year_one <- read_csv(first_csv)

# view unique values in HPCP field
unique(year_one$HPCP)
## [1] "0"       "0.2"     "0.1"     "999.99"  "missing" "0.3"     "0.9"
## [8] "0.5"


Next, we can create a vector of missing data values. We can see that we have 999.99 and missing as possible NA values.

# define all missing data values in a vector
na_values <- c("missing", "999.99")

# use the na argument to read in the csv
year_one <- read_csv(first_csv,
                     na = na_values)
unique(year_one$HPCP)
## [1] 0.0 0.2 0.1  NA 0.3 0.9 0.5
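After converting placeholder values to NA, it can help to check how many values are actually missing. The sketch below uses a made-up numeric vector (hpcp) standing in for the HPCP column; the values and counts are illustrative, not results from the lesson data.

```r
# made-up values standing in for the HPCP column after import
hpcp <- c(0.0, 0.2, 0.1, NA, 0.3, NA, 0.9, NA, 0.5)

# count the missing values
sum(is.na(hpcp))

# proportion of values that are missing
mean(is.na(hpcp))

# summary() reports the number of NA values alongside the usual statistics
summary(hpcp)
```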


Once you have specified possible missing data values, try to plot again.

year_one %>%
  ggplot(aes(x = DATE, y = HPCP)) +
  geom_point() +
  theme_bw() +
  labs(x = "Date",
       y = "Precipitation",
       title = "Precipitation Over Time")


Note that when ggplot encounters missing data values, it tells you with a warning message:

Warning message:
Removed 3 rows containing missing values (geom_point).
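If you would rather handle the missing rows yourself than rely on the warning, two common options are sketched below. The toy_precip data frame is made up for illustration and stands in for year_one.

```r
library(ggplot2)
library(dplyr)

# made-up data standing in for year_one
toy_precip <- data.frame(
  DATE = as.Date("2003-01-01") + 0:4,
  HPCP = c(0.1, NA, 0.3, NA, 0.5)
)

# option 1: drop rows with missing precipitation before plotting
precip_plot <- toy_precip %>%
  filter(!is.na(HPCP)) %>%
  ggplot(aes(x = DATE, y = HPCP)) +
  geom_point() +
  theme_bw()

# option 2: keep the data as is and let the geom drop NA values silently
quiet_plot <- ggplot(toy_precip, aes(x = DATE, y = HPCP)) +
  geom_point(na.rm = TRUE) +
  theme_bw()
```

Option 1 makes the removal an explicit step in your workflow, which is easier to spot when you reread the code later. Option 2 only hides the warning, so use it once you already know why the values are missing.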


The mutate() function allows you to add a new column to a data.frame. The month() function in the lubridate package converts a date or datetime object to a month value (1-12), as follows:

mutate(the_month = month(date_field_here))
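Here is a minimal sketch of that pattern on a small, made-up data frame, assuming the dplyr and lubridate packages are installed. The toy_precip name and its values are hypothetical.

```r
library(dplyr)
library(lubridate)

# made-up dates and precipitation values
toy_precip <- data.frame(
  DATE = as.Date(c("2003-01-15", "2003-02-03", "2003-02-20", "2003-07-04")),
  HPCP = c(0.1, 0.2, NA, 0.5)
)

# add a column holding the month number (1-12) of each date
toy_precip <- toy_precip %>%
  mutate(the_month = month(DATE))

# the new column contains one month number per row
toy_precip$the_month
```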

Create a plot that summarizes total precipitation by month for the first csv file that we have worked with through this lesson. Use everything that you have learned so far to do this.

Your final plot should look like the one below:

HINTS:

The bar plot was created using the following ggplot elements:

geom_bar(stat = "identity", fill = "darkorchid4") + theme_bw()