R Tidyverse Introduction
In these materials we will learn how to implement our data transform steps using the tidyverse libraries within R.

This page provides a succinct example for each of our data transforms. Additional material on R Tidyverse, including logging into our class server and uploading files, is available on the Tidyverse Extras page in the Accessing RStudio section.
Excluding rows: filter
In R Tidyverse when we are excluding rows based on conditions, we are performing a filter.
The code that defines the `customers` data frame:

```r
library(tidyverse)  # provides tribble(), filter(), mutate(), and friends

customers <- tribble(
  ~customer_name, ~customer_age, ~customer_state,
  "Jihong Li", 34, "California",
  "Sarah Brown", 66, "Idaho",
  "Rupert Brown", 89, "California",
  "Ravi Shastri", 62, "Maharashtra"
)
```

```r
customers |>
  filter(customer_state == "California" & customer_age > 80)
```
We can use boolean logic inside the filter, adding conditions using `&`. When we want to match against multiple options we can use `%in%`.
```r
customers |>
  filter(customer_state %in% c("California", "Maharashtra"))
```
When we provide a list of values in R we use the little `c()` function. Characters go inside double quotes, but numbers can be written without double quotes. Like this:

```r
c("California", "Maharashtra")
c(3, 6, 2, 999)
```
We can also reverse the whole expression using the `!` operator. This usually comes right at the start of the condition, just inside the filter.

```r
customers |>
  filter(!customer_state %in% c("California", "Maharashtra"))
```
Excluding columns: select
R Tidyverse uses `select` to allow the user to choose which columns to keep. For example, if we only want to show the age and state columns we can do:

```r
customers |>
  select(customer_age, customer_state)
```
But we can equally say "everything other than `customer_name`", which we write using the `!` exclamation point, meaning "not `customer_name`":

```r
customers |>
  select(!customer_name)
```
`select` has a whole set of helper functions, like `starts_with("a")` or `contains("b")`, which let you subset the columns based on the column headers.
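For example, with our `customers` data frame (a quick sketch of my own; all of its column names happen to start with "customer_", so the first call keeps everything):

```r
# Keep every column whose name starts with "customer_"
customers |>
  select(starts_with("customer_"))

# Keep every column whose name contains "state"
customers |>
  select(contains("state"))
```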
New columns: mutate
We can create new columns using `mutate`. A basic mutate will take each value in a column and do something with it. For example we could apply a 10% discount to the price of our sporting goods store.
The code that defines the `products` data frame:

```r
products <- tribble(
  ~product_num, ~product_name, ~price_in_dollars,
  "SPR-004456", "Skateboard", 121,
  "SPR-003356", "Basketball", 34,
  "SPR-004499", "Tennis balls", 12,
  "FTW-098756", "Tennis shoes", 79
)

products
```

```r
products |>
  mutate(discount_price = price_in_dollars * 0.9)
```
We can also use `mutate` together with `str_glue` to flexibly combine columns and other pieces of text. For example, we could add a `$` sign at the front of each value in `price_in_dollars` along with `.00` at the end.

```r
products |>
  mutate(formatted_price = str_glue("${price_in_dollars}.00"))
```
`str_glue` can reference multiple columns, flexibly combining text from them. Column names are used inside curly brackets (e.g., `{product_name}` will be replaced with each value from that column).

```r
products |>
  mutate(product_label = str_glue("{product_name} just ${price_in_dollars}.00. Buy today!"))
```
R Tidyverse links actions taken on the whole dataframe together using the little bit of code at the end of each line: the "pipe" character. When we read it aloud, though, we just say "then". So I would read the code above as:

> Take the products data frame, then do a mutate to add a product_label column by gluing product_name and price_in_dollars with some template text.
In R Tidyverse you will see two forms of the pipe character; both are used widely. The older form is `%>%` and the newer form is `|>`. In these notes I'll mostly use `|>` but you will likely see me use `%>%` from time to time. You will definitely see `%>%` on websites; I bet GenAI uses it too.
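For everyday piping like we do in this course, the two forms behave the same way (a quick sketch of my own):

```r
# These two lines do the same thing: pass products into filter()
products %>% filter(price_in_dollars > 50)
products |> filter(price_in_dollars > 50)
```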
Extract/separate data from within cells
We will use `separate_wider_delim`. Our `products` data frame already has the product numbers, so we can get the `product_code` separated out like this:

```r
products |>
  separate_wider_delim(product_num, delim = "-", names = c("product_code", "product_num_no_code"))
```
There is also `separate_wider_regex`, which lets you specify a regex for each field. Learn more about regex here: https://regexone.com/

```r
products |>
  separate_wider_regex(product_num,
                       patterns = c(product_code = "\\w+",        # matches SPR
                                    ".",                           # matches the -
                                    product_num_no_code = "\\d+")) # matches 004456
```
Treating cases differently: case_when
R Tidyverse provides a `case_when` function that is used inside `mutate`. The format is `condition ~ output`, using the squiggly character called "tilde" (`~`) to separate a test that uses data from the row and the output that should be used if the row matches that test. The test has to return either `TRUE` or `FALSE`; the first condition that returns `TRUE` causes a match and the remaining conditions aren't checked for that row. If nothing matches then we say that the row "falls through" to the bottom condition, which provides a default option.
```r
products %>%
  mutate(product_label = case_when(
    price_in_dollars > 100 ~ str_glue("Treat yourself. Invest ${price_in_dollars} for a {product_name} today!"),
    TRUE ~ str_glue("{product_name} just ${price_in_dollars}.00. Buy today!")
  ))
```
Notice that we don't actually check anything on the last line of the `case_when`; we just say `TRUE`, which means that if a row makes it to here then it must match (regardless of what is in the row).
Recode/lookup: case_match or join
To recode data we can replace one or more values using `case_match`. This is like `case_when` but the conditions only check a single column. Instead of `TRUE` at the bottom, we provide a `.default` value; setting it to the column we are recoding leaves any unmatched values unchanged.
```r
tribble(
  ~city_name,
  "Austin",
  "Sydney",
  "Bangalore",
  "New Amsterdam"
) %>%
  mutate(city_new_name = case_match(city_name,
                                    "Bangalore" ~ "Bengaluru",
                                    "New Amsterdam" ~ "New York",
                                    .default = city_name))
```
There is a tidyverse function called `recode` that was intended for doing these sorts of quick "find and replace" operations … but as of September 2024 the documentation says that `recode` is superseded by `case_match` (its lifecycle badge is "superseded").
If you have replacement values in another table then you can do a lookup against that table using `left_join`.
```r
country_codes <- tribble(
  ~canonical_name, ~alternative_name,
  "United States", "United States",
  "United States", "USA",
  "United States", "US",
  "United States", "United States of America",
  "South Korea", "South Korea",
  "South Korea", "Korea, South"
)

country_medals <- tribble(
  ~country, ~gold_medal_count,
  "USA", 34,
  "South Korea", 13
)

country_region <- tribble(
  ~english_name, ~region,
  "United States of America", "North America",
  "Korea, South", "East Asia"
)
```
Here we can look up the canonical name for both the medal counts and the regions using `left_join`:
```r
# Look up the canonical name using the english_name column
country_region_canonical <- country_region |>
  left_join(country_codes, by = join_by(english_name == alternative_name)) |>
  select(canonical_name, region)

# country_medals has the alternative name in a different column
country_medals_canonical <- country_medals |>
  left_join(country_codes, by = join_by(country == alternative_name)) |>
  select(canonical_name, gold_medal_count)

# Hey, now we can join these tables as they both have a canonical_name column
country_medals_canonical |>
  left_join(country_region_canonical)
```
Nest and unnest
We can unnest a comma-separated list within a cell. This increases the number of rows, as in these example data:
The code that defines the `courses` data frame:

```r
courses <- tribble(
  ~course_name, ~students_enrolled,
  "Divination", "Harry, Ron",
  "Herbology", "Hermione, Harry",
  "Flying Lessons", "Harry, Hermione, Ron"
)

courses
```
The relevant function is `separate_longer_delim`. Notice the "longer" in the name; in the tidyverse this implies that the output will add rows to the dataframe.

```r
courses |>
  separate_longer_delim(students_enrolled, delim = ", ")
```
We will think of the reverse of this (re-nesting) as a type of `group_by`, so we turn to that next; a preview sketch is shown below.
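Here is one way that re-nesting could look. This sketch is my own (it is not in the original notes) and it assumes we collapse each group back into a comma-separated string with stringr's `str_flatten`:

```r
# Un-nest, then group by course and glue the student names back into a single cell
courses |>
  separate_longer_delim(students_enrolled, delim = ", ") |>
  group_by(course_name) |>
  summarize(students_enrolled = str_flatten(students_enrolled, collapse = ", "))
```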
Group and Aggregate
We will use `group_by` and `summarize`. `group_by` specifies the grouping variable and `summarize` specifies how we reduce the number of rows, usually to a single row per group.
Recall that grouping is akin to sorting a dataframe, then drawing lines across it where the value in the grouping column changes.
title | series | format | weight_in_grams |
---|---|---|---|
Harry Potter 1 | Harry Potter | paperback | 1456 |
Harry Potter 3 | Harry Potter | hardcover | 2604 |
Harry Potter 2 | Harry Potter | hardcover | 2900 |
Two Towers | Lord of the Rings | paperback | 1800 |
Return of the King | Lord of the Rings | hardcover | 1784 |
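The notes don't show the code that defines `books_with_weights`; if you want to run the examples yourself, the table above can be transcribed directly into a tribble:

```r
# A transcription of the table above, so the examples below are runnable
books_with_weights <- tribble(
  ~title, ~series, ~format, ~weight_in_grams,
  "Harry Potter 1", "Harry Potter", "paperback", 1456,
  "Harry Potter 3", "Harry Potter", "hardcover", 2604,
  "Harry Potter 2", "Harry Potter", "hardcover", 2900,
  "Two Towers", "Lord of the Rings", "paperback", 1800,
  "Return of the King", "Lord of the Rings", "hardcover", 1784
)
```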
We can obtain the total weight of each group of books:
```r
books_with_weights |>
  group_by(series) |>
  summarize(series_weight = sum(weight_in_grams))
```
As we summarize a group we can calculate more than a single new column. A common need is to count the number of rows in a group. The grain of the `books_with_weights` data frame here is "one row represents a book", so counting the rows produces a count of books.
```r
books_with_weights |>
  group_by(series) |>
  summarize(series_weight = sum(weight_in_grams),
            book_count = n())
```
In addition to counting rows, we can also reduce our groups by counting the number of unique things in a column, within a group, using `n_distinct()` rather than `n()`.

For example, if we wanted to know how many different manufacturers have cars in this dataset each year, we could use `n_distinct()`:
year_produced | manufacturer | model |
---|---|---|
2014 | Ferrari | 458 Italia |
2014 | Lamborghini | Gallardo |
2015 | Ferrari | 458 Speciale |
2015 | Ferrari | 458 Spider |
2015 | Ferrari | California |
2015 | Ferrari | FF |
2015 | Ferrari | F12Berlinetta |
2015 | Ferrari | LaFerrari |
2015 | Lamborghini | Aventador |
2015 | Lamborghini | Huracan |
2015 | Audi | R8 |
2016 | Ferrari | 488 GTB |
2016 | Nissan | GT-R |
2016 | Bentley | Continental GT |
2016 | Maserati | Granturismo |
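The code that defines `my_cars` isn't shown in these notes either. A transcription of the rows above is enough to run the `n_distinct()` example (the real dataset presumably has more columns, such as the `miles_per_gallon_highway` column used in the window-aggregate examples further down, which aren't shown here):

```r
# A transcription of the table above; only these three columns are visible in the notes
my_cars <- tribble(
  ~year_produced, ~manufacturer, ~model,
  2014, "Ferrari", "458 Italia",
  2014, "Lamborghini", "Gallardo",
  2015, "Ferrari", "458 Speciale",
  2015, "Ferrari", "458 Spider",
  2015, "Ferrari", "California",
  2015, "Ferrari", "FF",
  2015, "Ferrari", "F12Berlinetta",
  2015, "Ferrari", "LaFerrari",
  2015, "Lamborghini", "Aventador",
  2015, "Lamborghini", "Huracan",
  2015, "Audi", "R8",
  2016, "Ferrari", "488 GTB",
  2016, "Nissan", "GT-R",
  2016, "Bentley", "Continental GT",
  2016, "Maserati", "Granturismo"
)
```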
When we summarize the group by counting rows we get the number of models in each group, but we can also look within a column and count the number of distinct things. Consider the contents of the `manufacturer` column within the 2015 group: while there are nine different rows, there are only three distinct strings: "Ferrari", "Lamborghini", and "Audi".
```r
my_cars |>
  group_by(year_produced) |>
  summarize(model_count = n(),
            manufacturer_count = n_distinct(manufacturer))
```
Window Aggregates
If we want to select multiple rows from within groups we can do that using the `slice_` functions after a `group_by`. We tell R to sort within each group and then take the top few rows.
```r
my_cars %>%
  group_by(year_produced) %>%
  arrange(desc(miles_per_gallon_highway), .by_group = TRUE) %>%
  slice_head(n = 3)
```
We can also ask a subtly different question with the `slice_` functions. We can ask for the rows that have the smallest or largest values of a variable. Here the sorting within the group happens inside the `slice_max`/`slice_min` function using `order_by`. Here we are saying:

> Find the top 3 values of miles_per_gallon_highway, then select all the rows that have one of those values.

```r
my_cars %>%
  group_by(year_produced) %>%
  slice_max(miles_per_gallon_highway, n = 3)
```
The `slice_` functions include `slice_sample`, which takes random rows from within groups. That can be very useful for making reduced datasets when analyses are taking too long or too much space.
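A minimal sketch of that (my example, not from the original notes):

```r
# Take up to 2 random rows from each year_produced group
my_cars %>%
  group_by(year_produced) %>%
  slice_sample(n = 2)
```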
Another approach is to create a new rank column (below, `rank_in_year_group`) while grouping, then use it to filter on later. Here we follow `group_by` with `mutate` instead of `summarize`, which creates a value during grouping that is then added to each and every row.
```r
my_cars %>%
  group_by(year_produced) %>%
  mutate(rank_in_year_group = row_number(desc(miles_per_gallon_highway))) %>%
  arrange(year_produced, rank_in_year_group) %>%
  filter(rank_in_year_group <= 3)
```
You can also use more sophisticated ranking functions instead of `row_number`; see https://dplyr.tidyverse.org/reference/row_number.html
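For example, `min_rank` gives tied values the same rank, so a group with a tie at third place keeps all of the tied rows. A sketch of how that might look:

```r
# min_rank(): rows that tie on miles_per_gallon_highway share the same rank
my_cars %>%
  group_by(year_produced) %>%
  mutate(rank_in_year_group = min_rank(desc(miles_per_gallon_highway))) %>%
  filter(rank_in_year_group <= 3)
```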
Wider/Longer
Tidyverse provides the `pivot_longer` and `pivot_wider` functions.
The code that defines the `my_world_bank_pop` data frame:

```r
my_world_bank_pop <- world_bank_pop %>%
  filter(indicator == "SP.URB.TOTL" &
           country %in% c("AUS", "USA", "IND")) %>%
  mutate(country = case_match(country,
                              "AUS" ~ "Australia",
                              "IND" ~ "India",
                              "USA" ~ "United States")) %>%
  select(country, "2010", "2011", "2012", "2013")

my_world_bank_pop
```
We can take that wide dataframe longer using `pivot_longer`. We have to tell it which column headers to move into the data table. Here I say "all columns except `country`" by using the bang character: `!`. We also provide a new name for the column that will hold the old headers (`names_to = "year"`) and a name for the column that holds the old values (`values_to = "urban_population"`).
```r
world_bank_long <- my_world_bank_pop |>
  pivot_longer(cols = !country, names_to = "year", values_to = "urban_population")

world_bank_long
```
The inverse move, taking longer data back to wide, is done with `pivot_wider`, as shown here. We have to specify which column should become the headers and which column holds the values.
```r
world_bank_long |>
  pivot_wider(id_cols = country, names_from = "year", values_from = "urban_population")
```
There is really good detail on these functions available here https://tidyr.tidyverse.org/articles/pivot.html
In the tidyverse the word “pivot” has a clear association with these moves to longer or wider data.
Yet in the Excel world, a “pivot” has an equally clear association with a different move: the Pivot Table. A Pivot Table is, as we will see, almost always a Group By and Aggregate.
So beware that these words can have specific meanings in different frameworks … that is why in this course our data transformation is called "Wider/Longer" rather than "pivot".
R tidyverse quick reference
transform | primary_command |
---|---|
Exclude rows | filter |
Exclude columns | select |
New columns | mutate |
Combining columns | mutate + str_glue |
Extract data from cells | separate_wider_delim |
Treating cases differently | case_when |
Recode/lookup | case_match |
Un-nest | separate_longer_delim |
Nest | mutate + list |
Aggregate | summarize + sum |
Group Aggregate | group_by + summarize |
Window Aggregate | group_by + slice_max |
Longer | pivot_longer |
Wider | pivot_wider |