Tidyverse exercises

In this exercise we will learn to manage synonyms using some of the data moves we have learned.

A synonym is an alternative name for something. Good examples come up often with people’s names (James Howison vs James L Howison) and names of geographic areas (TX vs Texas). When we bring together disparate data sources, we often need to make these connections.

Here we have some example data:

library(tidyverse) # provides tribble(), left_join(), select()

state_founding <- tribble(
  ~state_abbrev, ~state_joined_union_year,
  "TX", 1845,
  "NY", 1788,
  "AZ", 1912
) 
state_founding
state_area <- tribble(
  ~state, ~state_area_sq_m, ~state_area_percent,
  "Texas", 268596, 7.06,
  "New York", 54555, 1.43,
  "Arizona", 113990, 3
) 
state_area

As literate humans with cultural competency we know that “TX” and “Texas” refer to the same thing in this context. Yet the computer sees those as completely different pieces of text.
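To make that concrete, here is a minimal sketch of what happens if we join on the raw columns without any mapping. The names founding, area, and naive_join are made up for this illustration, and the tables are trimmed copies of the ones above.

```r
library(dplyr)
library(tibble)

# Trimmed copies of the example tables (illustrative names, not part of the exercise).
founding <- tribble(
  ~state_abbrev, ~state_joined_union_year,
  "TX", 1845,
  "NY", 1788,
  "AZ", 1912
)
area <- tribble(
  ~state, ~state_area_sq_m,
  "Texas", 268596,
  "New York", 54555,
  "Arizona", 113990
)

# String comparison is exact, so an abbreviation never equals the full name.
"TX" == "Texas"

# Joining directly on the mismatched keys finds no partners,
# so the area column is NA in every row.
naive_join <- left_join(founding, area, by = c("state_abbrev" = "state"))
naive_join
```

The join runs without an error, which is what makes this kind of mismatch easy to miss: the only symptom is a column full of NA values.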

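In the end we want an output table with one row per state and these columns, in this order: state_cannonical_name, state_joined_union_year, state_area_sq_m, state_area_percent.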

Make a plan.

We need to break this problem down into a few abstract steps.

  1. First, decide on the name that we want to use in the end. Here we chose state_cannonical_name.
  2. We need to ensure that both tables have this column, correctly mapped to the right value.
  3. We can then join the tables using that column.
  4. We can ensure that the columns are in the right order.

Here are parts 3 and 4 of the solution.

###
# Your code will go here
###

left_join(state_founding_mapped, state_area_mapped, by = "state_cannonical_name") |>
  select(-state, -state_abbrev) |>
  select(state_cannonical_name, everything())

Now you will do part 2 of the plan. Here we need to add a column to each of the tables. There are two approaches: recode and lookup. Each uses new data that creates the connection between “TX” and “Texas” and so on. We may be able to get this mapping from elsewhere, or generate it using AI (I mean maybe? But who’s going to check it’s right?), but here we will build one by hand.

Recode

First, use the recode approach. Here you will need to build your map in code. Look at the R teaching materials to identify the function you will need.

Note

Recode is appropriate when there are just a few items.
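As a rough illustration of the recode idea, here is a sketch on different data, so the state exercise stays open. It assumes dplyr's recode() is a suitable function; the teaching materials may point to another one. The authors table and its column names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data: the same person recorded under two spellings.
authors <- tribble(
  ~author_name, ~paper_id,
  "James Howison", 1,
  "James L Howison", 2
)

# The map lives in the code itself, as old = new pairs.
# Values that are not listed pass through unchanged.
authors_mapped <- authors |>
  mutate(author_canonical_name = recode(author_name,
    "James L Howison" = "James Howison"
  ))
authors_mapped
```

Recent dplyr releases mark recode() as superseded and offer case_match() as an alternative. Either way the mapping is written inline, which is why this approach suits short lists.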

Lookup

Second, use the lookup approach. Here we build a table, then use that table.

Note

Lookup is appropriate when there are more than a few items, and particularly when we can find mappings in external data sources.

The first column, state_cannonical_name, has the name that we want to map to, and the second column, state_alt_name, has a name that we might find in the data. Notice that we map all the names we might find in the data, including those that are already ‘right’.

state_names <- tribble(
  ~state_cannonical_name, ~state_alt_name,
  "Texas", "Texas",
  "Texas", "TX",
  "New York", "New York",
  "New York", "NY",
  "Arizona", "Arizona",
  "Arizona", "AZ"
) 
state_names
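
A lookup table like this is typically applied with a join. The sketch below shows the pattern on a hypothetical authors table rather than the state tables, so it is an illustration of the technique and not the exercise solution. All table and column names in it are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical data: the same person recorded under two spellings.
authors <- tribble(
  ~author_name, ~paper_id,
  "James Howison", 1,
  "James L Howison", 2
)

# Lookup table in the same shape as state_names: canonical name first,
# then every spelling we might meet, including the already-correct one.
author_names <- tribble(
  ~author_canonical_name, ~author_alt_name,
  "James Howison", "James Howison",
  "James Howison", "James L Howison"
)

# Match the spelling found in the data against the alternative names.
authors_mapped <- left_join(
  authors, author_names,
  by = c("author_name" = "author_alt_name")
)
authors_mapped

# Any spelling missing from the lookup table would show up here as NA.
unmapped <- filter(authors_mapped, is.na(author_canonical_name))
unmapped
```

Including the already-correct spelling in the lookup table matters here: without that row, the left join would leave author_canonical_name as NA for the rows that were right to begin with.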