R Tidyverse Introduction

In these materials we will learn how to implement our data transform steps using the tidyverse libraries within R.

This page provides a succinct example for each of our data transforms. Additional material on R Tidyverse, including logging into our class server and uploading files, is available on the Tidyverse Extras page in the Accessing Rstudio section.

Excluding rows: filter

In R Tidyverse when we are excluding rows based on conditions, we are performing a filter.

Show the code that defines Customer data frame

customers <- tribble(
  ~customer_name, ~customer_age, ~customer_state,
  "Jihong Li", 34, "California",
  "Sarah Brown", 66, "Idaho",
  "Rupert Brown", 89, "California",
  "Ravi Shastri", 62, "Maharashtra"
)

customers |> 
  filter(customer_state == "California" & customer_age > 80)

We can use boolean logic inside the filter adding conditions using &. When we want to match against multiple options we can use `%in%.

customers |> 
  filter(customer_state %in% c("California", "Maharashtra"))

Note

When we provide a list in R it is done with a little c operator. Characters go inside double quotes, but numbers can be written without double quotes. Like this:

c("California", "Maharashtra")
c(3, 6, 2, 999)

We can also reverse the whole expression using the ! operator. This usually comes right at the start, just after the filter.

customers |> 
  filter(!customer_state %in% c("California", "Maharashtra"))

Excluding columns: select

R Tidyverse uses select to allow the user to choose which columns to keep. For example, if we only want to show the age and state columns we can do:

customers |>
  select(customer_age, customer_state)

But we can equally say “everything other than customer_name” which we write using the ! exclamation point, meaning “not customer_name”

customers |>
  select(!customer_name)

select has a whole set of helper functions, like starts_with("a") or contains("b") which let you sub-set the columns based on the column headers.

New columns: mutate

We can create new columns using mutate. A basic mutate will take each value in a column and do something with it. For example we could apply a 10% discount to the price of our sporting goods store.

Show the code that defines products data frame

products <- tribble(
  ~product_num, ~product_name, ~price_in_dollars, 
  "SPR-004456", "Skateboard", 121, 
  "SPR-003356", "Basketball", 34,
  "SPR-004499", "Tennis balls", 12,
  "FTW-098756", "Tennis shoes", 79
)

products

products |>
  mutate(discount_price = price_in_dollars * 0.9)

We can also use mutate together with str_glue to flexibly combine columns and other pieces of text. For example, we could add a $ sign at the front of each value in price_in_dollars along with .00 at the end.

products |>
  mutate(formated_price = str_glue("${price_in_dollars}.00"))

str_glue can reference multiple columns, flexibly combining text from them. Column names are used inside curly brackets (e.g., {product_name} will be replaced with each value from a column).

products |> 
  mutate(product_label = str_glue("{product_name} just ${price_in_dollars}.00. Buy today!"))

A syntax note: the Pipe character

R tidyverse links actions taken on the whole dataframe together. The little bit of code at the end of the line is the “pipe” character. When we read it aloud, though, we just say “then”. So I would read the code above as

Take the products data frame, then do a mutate to add a product_label column by gluing product_name and price_in dollars with some template text

In R tidyverse you will see two forms of the pipe character; both are used widely. The older form is %>% and the newer form is |>. In these notes I’ll mostly use |> but you will likely see my use %>% from time to time. You will definitely see %>% on websites; I bet GenAI uses it too.

Extract/separate data from within cells

We will use the separate_wider_delim. Here is data that has the product numbers:

We can get the product_code separated like this:

products |>
  separate_wider_delim(product_num, delim = "-", names = c("product_code", "product_num_no_code"))

There is also separate_wider_regex which lets you specify regex for different fields. Learn more about regex here: https://regexone.com/

products |>
  separate_wider_regex(product_num, patterns = c(product_code = "\\w+", # matches SPR
                                                 ".", # matches _
                                                 product_num_no_code = "\\d+")) # matches 004567

Treating cases differently: case_when

R tidyverse provides a case_when function that is used inside mutate. The format is condition ~ output using the squigly character called “tilde” ~ to separate a test that uses data for the row and the output that should be used if the row matches that test. The test has to return either TRUE or FALSE; the first TRUE causes a match and the other rows aren’t checked. If nothing matches then we say that the row “falls through” to the bottom row, which provides a default option.

products %>% 
  mutate(product_label = case_when(
    price_in_dollars > 100 ~ str_glue("Treat yourself. Invest ${price_in_dollars} for a {product_name} today!"),
    TRUE                   ~ str_glue("{product_name} just ${price_in_dollars}.00. Buy today!")
    ))

Notice that we don’t actually check anything on the last line of the case_when we just say TRUE which means that if a row makes it to here then it must match (regardless of what is in the row).

Recode/lookup: case_match or join

To recode data we can replace one or more values using case_match. This is like case_when but the conditions only check a single field. Instead of TRUE at the bottom, we provide a .default value, which is the name of the column we are recoding.

tribble(
  ~city_name,
  "Austin",
  "Sydney",
  "Bangalore",
  "New Amsterdam"
) %>% 
  mutate(city_new_name = case_match(city_name,
                            "Bangalore" ~ "Bangaluru",
                            "New Amsterdam" ~ "New York",
                            .default = city_name))

Caution

There is a tidyverse function that is called recode and it was intended for doing these sort of quick “find and replace” operations … but as of September 2024 the documentation says that recode is lifecycle:superseded by case_match.

If you have replacement values in another table then you can do a lookup against that table using left_join.

country_codes <- tribble(
  ~canonical_name, ~alternative_name,
  "United States", "United States",
  "United States", "USA",
  "United States", "US",
  "United States", "United States of America",
  "South Korea", "South Korea",
  "South Korea", "Korea, South",
) 

country_medals <- tribble(
  ~country, ~gold_medal_count,
  "USA", 34,
  "South Korea", 13
)

country_region <- tribble(
  ~english_name, ~region,
  "United States of America", "North America",
  "Korea, South", "East Asia"
)

Here we can lookup the canonical name for both the medal counts and region using left_join

# Lookup canonical name using the english_name column
country_region_canonical <- country_region |>
  left_join(country_codes, by = join_by(english_name == alternative_name)) |>
  select(canonical_name, region) 

# Medals has the alternative name in a different column
country_medals_canonical <- country_medals |>
  left_join(country_codes, by = join_by(country == alternative_name)) |>
  select(canonical_name, gold_medal_count) 

# Hey now we can join these tables as they both have canonical name column
country_medals_canonical |>
  left_join(country_region_canonical)

Nest and unnest

We can unnest a comma-separated list within a cell. This increases the number of rows. As in these example data:

Show the code that defines courses data frame

courses <- tribble(
  ~course_name, ~students_enrolled,
  "Divination", "Harry, Ron",
  "Herbology", "Hermione, Harry",
  "Flying Lessons", "Harry, Hermione, Ron"
)

courses

The relevant function is separate_longer_delim. Notice the “longer” in the name; in tidyverse this implies that the output will add rows to the dataframe.

courses |>
  separate_longer_delim(students_enrolled, delim = ", ")

We will think of the reverse of this (re-nesting) as a type of group_by so we turn to that next.

Group and Aggregate

We will use group_by and summarize. group_by specifies the grouping variable and summarize specifies how we reduce the number of rows, usually to a single row per group.

Recall that grouping is akin to sorting a dataframe, then drawing lines across it where the value in the grouping column changes.

title	series	format	weight_in_grams
Harry Potter 1	Harry Potter	paperback	1456
Harry Potter 3	Harry Potter	hardcover	2604
Harry Potter 2	Harry Potter	hardcover	2900
Two Towers	Lord of the Rings	paperback	1800
Return of the King	Lord of the Rings	hardcover	1784

We can obtain the total weight of each group of books:

books_with_weights |>
  group_by(series) |>
  summarize(series_weight = sum(weight_in_grams))

As we summarize a group we can calculate more than a single new column. A common need is to count the number of rows in a group. The grain of the books_with_weights data frame here is “one row represents a book”, so counting the rows produces a count of books.

books_with_weights |>
  group_by(series) |>
  summarize(series_weight = sum(weight_in_grams),
            book_count = n())

In addition to counting rows, we can also reduce our groups by counting the number of unique things in a column, within a group, using n_distinct() rather than n().

For example if we wanted to know how many different manufacturers have cars in this dataset each year, we could use n_distinct()

year_produced	manufacturer	model
2014	Ferrari	458 Italia
2014	Lamborghini	Gallardo
2015	Ferrari	458 Speciale
2015	Ferrari	458 Spider
2015	Ferrari	California
2015	Ferrari	FF
2015	Ferrari	F12Berlinetta
2015	Ferrari	LaFerrari
2015	Lamborghini	Aventador
2015	Lamborghini	Huracan
2015	Audi	R8
2016	Ferrari	488 GTB
2016	Nissan	GT-R
2016	Bentley	Continental GT
2016	Maserati	Granturismo

When we summarize the group by counting rows we get the number of models in each group, but we can also look within a column and count the number of distinct things. Above I’ve highlighted the contents of the manufacturer column within the 2015 group, while there are 9 different rows, there are only three distinct strings: “Ferrari”, “Lamborghini”, and “Audi”.

my_cars |>
  group_by(year_produced) |>
  summarize(model_count = n(),
            manufacturer_count = n_distinct(manufacturer))

Window Aggregates

If we want to select multiple rows from within groups we can do that using the slice_ functions after a group_by. We tell R to sort within each group and then take the top few rows.

my_cars %>% 
  group_by(year_produced) %>% 
  arrange(desc(miles_per_gallon_highway), .by_group = TRUE) %>% 
  slice_head(n = 3)

#We can also ask a subtly different question with the `slice_` functions.  We can ask for the rows that have the smallest or largest values of a variable.  Here the sorting within the group happens inside the `slice_max/min` function using `order_by`.  Here we are saying,

# > Find the top 3 values of miles_per_gallon, then select all the rows that have one of those values.

#my_cars %>% 
#  group_by(year_produced) %>% 
#  slice_max(n = 2, desc(miles_per_gallon_highway))

The slice_ functions include slice_sample which takes random rows from within groups. That can be very useful for making reduced datasets if analyses are taking a long time or space.

Another approach is to create a new column called rank_in_group while grouping, then use it to filter on later. Here we follow group_by with mutate instead of summarize which results in creating a value during grouping which is then added to each and every row.

my_cars %>% 
  group_by(year_produced) %>%
  mutate(rank_in_year_group = row_number(desc(miles_per_gallon_highway))) |>
  arrange(year_produced, rank_in_year_group) |> 
  filter(rank_in_year_group <= 3)

You can also use more sophisticated ranking functions instead of row_number, see https://dplyr.tidyverse.org/reference/row_number.html

Wider/Longer

Tidyverse provides the pivot_longer and pivot_wider functions.

Show the code that defines my_world_bank_pop data frame

my_world_bank_pop <- world_bank_pop %>% 
  filter(indicator == "SP.URB.TOTL" & 
           country %in% c("AUS", "USA", "IND")) %>% 
  mutate(country = case_match(country,
                              "AUS" ~ "Australia",
                              "IND" ~ "Indonesia",
                              "USA" ~ "United States")) %>% 
  select(country, "2010", "2011", "2012", "2013")

my_world_bank_pop

We can take that wide dataframe longer using pivot_longer. We have to tell it which column headers to move into the data table. Here I say “all columns except country” by using the bang character: !. We also provide a new name for the column with the old headers (names_to = "year") and a name for the column that holds the old values (values_to = "urban_population")

world_bank_long <- my_world_bank_pop |>
  pivot_longer(cols = !country, names_to = "year", values_to = "urban_population")

world_bank_long

The inverse move, taking longer data back to wide is done with pivot_wider as shown here. We have to specify the columns that should become headers and the columns that have the values.

world_bank_long |>
  pivot_wider(id_cols = country, names_from = "year", values_from = "urban_population")

There is really good detail on these functions available here https://tidyr.tidyverse.org/articles/pivot.html

Caution

In the tidyverse the word “pivot” has a clear association with these moves to longer or wider data.

Yet in the Excel world, a “pivot” has an equally clear association with a different move: the Pivot Table. A Pivot Table is, as we will see, almost always a Group By and Aggregate.

So beware that these words can have specific meanings in different frameworks … that is why in this course our data transformation is called “Wider/Longer” rather than “pivot”

R tidyverse quick reference

transform	primary_command
Exclude rows	filter
Exclude columns	select
New columns	mutate
Combining columns	mutate + str_glue
Extract data from cells	separate_wider_delim
Treating cases differently	case_when
Recode/lookup	case_match
Un-nest	separate_longer_delim
Nest	mutate + list
Aggregate	summarize + sum
Group Aggregate	group_by + summarize
Window Aggregate	group_by + slice_max
Longer	pivot_longer
Wider	pivot_wider