In these materials we will learn how to implement our data transform steps using the tidyverse libraries within R.
This page provides a succinct example for each of our data transforms. Additional material on R Tidyverse, including logging into our class server and uploading files, is available on the Tidyverse Extras page in the Accessing Rstudio section.
Excluding rows: filter
In R Tidyverse when we are excluding rows based on conditions, we are performing a filter.
When we provide a list in R it is done with a little c operator. Characters go inside double quotes, but numbers can be written without double quotes. Like this:
c("California", "Maharashtra")c(3, 6, 2, 999)
We can also reverse the whole expression using the ! operator. This usually comes right at the start, just after the filter.
R Tidyverse uses select to allow the user to choose which columns to keep. For example, if we only want to show the age and state columns we can do:
customers |>select(customer_age, customer_state)
But we can equally say “everything other than customer_name” which we write using the ! exclamation point, meaning “not customer_name”
customers |>select(!customer_name)
select has a whole set of helper functions, like starts_with("a") or contains("b") which let you sub-set the columns based on the column headers.
New columns: mutate
We can create new columns using mutate. A basic mutate will take each value in a column and do something with it. For example we could apply a 10% discount to the price of our sporting goods store.
We can also use mutate together with str_glue to flexibly combine columns and other pieces of text. For example, we could add a $ sign at the front of each value in price_in_dollars along with .00 at the end.
str_glue can reference multiple columns, flexibly combining text from them. Column names are used inside curly brackets (e.g., {product_name} will be replaced with each value from a column).
products |>mutate(product_label =str_glue("{product_name} just ${price_in_dollars}.00. Buy today!"))
A syntax note: the Pipe character
R tidyverse links actions taken on the whole dataframe together. The little bit of code at the end of the line is the “pipe” character. When we read it aloud, though, we just say “then”. So I would read the code above as
Take the products data frame, then do a mutate to add a product_label column by gluing product_name and price_in dollars with some template text
In R tidyverse you will see two forms of the pipe character; both are used widely. The older form is %>% and the newer form is |>. In these notes I’ll mostly use |> but you will likely see my use %>% from time to time. You will definitely see %>% on websites; I bet GenAI uses it too.
Extract/separate data from within cells
We will use the separate_wider_delim. Here is data that has the product numbers:
R tidyverse provides a case_when function that is used inside mutate. The format is condition ~ output using the squigly character called “tilde” ~ to separate a test that uses data for the row and the output that should be used if the row matches that test. The test has to return either TRUE or FALSE; the first TRUE causes a match and the other rows aren’t checked. If nothing matches then we say that the row “falls through” to the bottom row, which provides a default option.
products %>%mutate(product_label =case_when( price_in_dollars >100~str_glue("Treat yourself. Invest ${price_in_dollars} for a {product_name} today!"),TRUE~str_glue("{product_name} just ${price_in_dollars}.00. Buy today!") ))
Notice that we don’t actually check anything on the last line of the case_when we just say TRUE which means that if a row makes it to here then it must match (regardless of what is in the row).
Recode/lookup: case_match or join
To recode data we can replace one or more values using case_match. This is like case_when but the conditions only check a single field. Instead of TRUE at the bottom, we provide a .default value, which is the name of the column we are recoding.
There is a tidyverse function that is called recode and it was intended for doing these sort of quick “find and replace” operations … but as of September 2024 the documentation says that recode is lifecycle:superseded by case_match.
If you have replacement values in another table then you can do a lookup against that table using left_join.
country_codes <-tribble(~canonical_name, ~alternative_name,"United States", "United States","United States", "USA","United States", "US","United States", "United States of America","South Korea", "South Korea","South Korea", "Korea, South",) country_medals <-tribble(~country, ~gold_medal_count,"USA", 34,"South Korea", 13)country_region <-tribble(~english_name, ~region,"United States of America", "North America","Korea, South", "East Asia")
Here we can lookup the canonical name for both the medal counts and region using left_join
# Lookup canonical name using the english_name columncountry_region_canonical <- country_region |>left_join(country_codes, by =join_by(english_name == alternative_name)) |>select(canonical_name, region) # Medals has the alternative name in a different columncountry_medals_canonical <- country_medals |>left_join(country_codes, by =join_by(country == alternative_name)) |>select(canonical_name, gold_medal_count) # Hey now we can join these tables as they both have canonical name columncountry_medals_canonical |>left_join(country_region_canonical)
Nest and unnest
We can unnest a comma-separated list within a cell. This increases the number of rows. As in these example data:
The relevant function is separate_longer_delim. Notice the “longer” in the name; in tidyverse this implies that the output will add rows to the dataframe.
We will think of the reverse of this (re-nesting) as a type of group_by so we turn to that next.
Group and Aggregate
We will use group_by and summarize. group_by specifies the grouping variable and summarize specifies how we reduce the number of rows, usually to a single row per group.
Recall that grouping is akin to sorting a dataframe, then drawing lines across it where the value in the grouping column changes.
title
series
format
weight_in_grams
Harry Potter 1
Harry Potter
paperback
1456
Harry Potter 3
Harry Potter
hardcover
2604
Harry Potter 2
Harry Potter
hardcover
2900
Two Towers
Lord of the Rings
paperback
1800
Return of the King
Lord of the Rings
hardcover
1784
We can obtain the total weight of each group of books:
As we summarize a group we can calculate more than a single new column. A common need is to count the number of rows in a group. The grain of the books_with_weights data frame here is “one row represents a book”, so counting the rows produces a count of books.
In addition to counting rows, we can also reduce our groups by counting the number of unique things in a column, within a group, using n_distinct() rather than n().
For example if we wanted to know how many different manufacturers have cars in this dataset each year, we could use n_distinct()
year_produced
manufacturer
model
2014
Ferrari
458 Italia
2014
Lamborghini
Gallardo
2015
Ferrari
458 Speciale
2015
Ferrari
458 Spider
2015
Ferrari
California
2015
Ferrari
FF
2015
Ferrari
F12Berlinetta
2015
Ferrari
LaFerrari
2015
Lamborghini
Aventador
2015
Lamborghini
Huracan
2015
Audi
R8
2016
Ferrari
488 GTB
2016
Nissan
GT-R
2016
Bentley
Continental GT
2016
Maserati
Granturismo
When we summarize the group by counting rows we get the number of models in each group, but we can also look within a column and count the number of distinct things. Above I’ve highlighted the contents of the manufacturer column within the 2015 group, while there are 9 different rows, there are only three distinct strings: “Ferrari”, “Lamborghini”, and “Audi”.
If we want to select multiple rows from within groups we can do that using the slice_ functions after a group_by. We tell R to sort within each group and then take the top few rows.
#We can also ask a subtly different question with the `slice_` functions. We can ask for the rows that have the smallest or largest values of a variable. Here the sorting within the group happens inside the `slice_max/min` function using `order_by`. Here we are saying,# > Find the top 3 values of miles_per_gallon, then select all the rows that have one of those values.#my_cars %>% # group_by(year_produced) %>% # slice_max(n = 2, desc(miles_per_gallon_highway))
The slice_ functions include slice_sample which takes random rows from within groups. That can be very useful for making reduced datasets if analyses are taking a long time or space.
Another approach is to create a new column called rank_in_group while grouping, then use it to filter on later. Here we follow group_by with mutate instead of summarize which results in creating a value during grouping which is then added to each and every row.
We can take that wide dataframe longer using pivot_longer. We have to tell it which column headers to move into the data table. Here I say “all columns except country” by using the bang character: !. We also provide a new name for the column with the old headers (names_to = "year") and a name for the column that holds the old values (values_to = "urban_population")
The inverse move, taking longer data back to wide is done with pivot_wider as shown here. We have to specify the columns that should become headers and the columns that have the values.
In the tidyverse the word “pivot” has a clear association with these moves to longer or wider data.
Yet in the Excel world, a “pivot” has an equally clear association with a different move: the Pivot Table. A Pivot Table is, as we will see, almost always a Group By and Aggregate.
So beware that these words can have specific meanings in different frameworks … that is why in this course our data transformation is called “Wider/Longer” rather than “pivot”