SQL Transforms

SQL queries can also be used for data wrangling. SQL is a language, and there are lots of different database implementations. You might have heard of MySQL, PostgreSQL, Oracle, Microsoft SQL Server, or cloud implementations like Snowflake.

In this course we will be working with DuckDB. DuckDB is an “in-notebook” SQL package. This means that we don’t have to run a database server separately from our notebooks.

The SQL we will learn will be largely standard and should work across all databases. Where we use things that are specific to DuckDB I will highlight that, and provide pointers to explanations in other SQL implementations.

library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)

# create / connect to database file
# drv <- duckdb(dbdir = "my_database.duckdb")
# or you can have a temporary in memory database without a file.
drv <- duckdb()
con <- dbConnect(drv)

We can write data frames into the database, where they are known as tables.

data(gtcars)

gtcars_renamed <- gtcars %>% 
  select(manufacturer = mfr, everything())

# Write the dataset to DuckDB
dbWriteTable(con, "gtcars", gtcars_renamed, overwrite = TRUE)
customers <- tribble(
  ~customer_name, ~customer_age, ~customer_state,
  "Jihong Li", 34, "California",
  "Sarah Brown", 66, "Idaho",
  "Rupert Brown", 89, "California",
  "Ravi Shastri", 62, "Maharashtra"
) 

dbWriteTable(con, "customers", customers, overwrite = TRUE)

## and disconnect
# dbDisconnect(con, shutdown=TRUE)

Excluding rows and columns

SELECT *
FROM customers

FROM specifies the table that we want to query. Here we query a table called customers

SELECT enables us to exclude columns. To see all columns we use the special character * (pronounced star).

The output produced looks just like any data frame. Notice the “4 records” at the bottom? That is what a row is called in SQL, a record.

SELECT *
FROM customers
4 records
customer_name customer_age customer_state
Jihong Li 34 California
Sarah Brown 66 Idaho
Rupert Brown 89 California
Ravi Shastri 62 Maharashtra
Caution

SQL queries are particular about the order of parts. Most SQL implementations will give an error if we write FROM before SELECT (or if we just write FROM without a SELECT). DuckDB, funnily enough, will allow either one of those. But … when in SQL land stick with the typical ordering.

Excluding rows

The WHERE clause allows us to specify conditions that each row must meet. SQL uses a single = for these tests (rather than the double == used in other frameworks). Text items in the query are enclosed in single quotes, thus 'California' but numbers can be written on their own.

SELECT *
FROM customers
WHERE customer_state = 'California'
  AND customer_age > 80
1 records
customer_name customer_age customer_state
Rupert Brown 89 California

SQL also allows the approach of providing a list of alternatives, using IN. This is much preferred to using OR.

SELECT *
FROM customers
WHERE customer_state IN ('California', 'Maharashtra')
3 records
customer_name customer_age customer_state
Jihong Li 34 California
Rupert Brown 89 California
Ravi Shastri 62 Maharashtra

The opposite of IN is also available: NOT IN

SELECT *
FROM customers
WHERE customer_state NOT IN ('California', 'Maharashtra')
1 records
customer_name customer_age customer_state
Sarah Brown 66 Idaho

If you are thinking “Hey, that is sort of like English” then you are on to something. When SQL was first defined, waaaaay back in the late 1970s, the thought was that “managers” (as in not programmers) would be able to use this language. Today, with GenAI, we may finally be at a point where natural language prompts can write SQL queries (sometimes, maybe, but can you tell if it’s correct?).

Excluding columns

In SQL we specify the columns that we want to see using the SELECT clause, and providing a comma separated list.

SELECT customer_state, customer_name
FROM customers
4 records
customer_state customer_name
California Jihong Li
Idaho Sarah Brown
California Rupert Brown
Maharashtra Ravi Shastri

Generally speaking in SQL we haven’t been able to do more advanced sub-setting of columns (such as saying “everything other than customer_name” or “all columns that start with numbers”), although some dialects of SQL do offer terms like EXCEPT or EXCLUDE.

Creating new columns

SQL does allow the creation of new result columns, using the SELECT clause and AS to change names.

products <- tribble(
  ~product_num, ~product_name, ~price_in_dollars, 
  "SPR-004456", "Skateboard", 121, 
  "SPR-003356", "Basketball", 34,
  "SPR-004499", "Tennis balls", 12,
  "FTW-098756", "Tennis shoes", 79
)

dbWriteTable(con, "products", products, overwrite = TRUE)
SELECT *
FROM products
4 records
product_num product_name price_in_dollars
SPR-004456 Skateboard 121
SPR-003356 Basketball 34
SPR-004499 Tennis balls 12
FTW-098756 Tennis shoes 79

Here is how we would create a new column for a discounted price; we provide a little piece of code that is applied to every row. Here I’ve also swapped to putting each term in the SELECT on its own line to make things clearer (SQL is not fussy about whitespace).

SELECT product_num, 
       product_name, 
       price_in_dollars * 0.9 AS discounted_price
FROM products
4 records
product_num product_name discounted_price
SPR-004456 Skateboard 108.9
SPR-003356 Basketball 30.6
SPR-004499 Tennis balls 10.8
FTW-098756 Tennis shoes 71.1
Caution

Above I said that SQL SELECT queries can create new result columns. Here result is doing some heavy lifting.

When we execute an SQL SELECT query we get back results (a ResultSet) but we do not change the database at all. SELECT queries are read only. So while we can create a new column, that happens after the data is pulled from the database table and the new column only exists in the results of the query.

This makes sense in SQL land because the core usage is not inside an individual analyst’s notebook or computer, but in shared databases provided by centralized servers running on their own hardware. Changing the data would change it for everyone; if you changed a column name in a table, then likely everyone else’s queries would break. The website that the database backed would stop working and you’d, uh, definitely get woken up to fix it.

If we want the results of a query to persist in the database we do have the option of creating new tables, using the result of a SELECT query. We can say that the table should be temporary (not persist past the end of our session), or the table could be permanent.

CREATE OR REPLACE TEMP TABLE discounted_products AS
  SELECT product_num, 
         product_name, 
         price_in_dollars * 0.9 AS discounted_price
  FROM products

If we are getting into changing column names or structures in tables in a database that anyone else might use … then we are in the world of migrations which is its own fascinating area, with a set of version control technologies and social/managerial practices.

Combining columns

We can combine columns using SQL SELECT (again not changing the table just the results). This uses CONCAT and visits each row. We can use column names but we can also provide little chunks of text to be used with each line.

SELECT product_num,
       product_name,
       CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!') AS product_label
FROM products
4 records
product_num product_name product_label
SPR-004456 Skateboard Skateboard just $121.0.00. Buy today!
SPR-003356 Basketball Basketball just $34.0.00. Buy today!
SPR-004499 Tennis balls Tennis balls just $12.0.00. Buy today!
FTW-098756 Tennis shoes Tennis shoes just $79.0.00. Buy today!

Treating different cases differently

SQL can do different things for different rows, using CASE in the SELECT clause. For example we can create a different product_label for the more expensive items.

mutate(product_label = case_when( price_in_dollars > 100 ~ str_glue(“Treat yourself. Invest ${price_in_dollars} for a {product_name} today!”), TRUE ~ str_glue(“{product_name} just ${price_in_dollars}.00. Buy today!”) )

SELECT product_num,
       product_name,
       CASE WHEN 
         price_in_dollars > 100 THEN 
           CONCAT('Treat yourself. Invest $', price_in_dollars, ' for a ', product_name, ' today!') 
         ELSE
           CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!') 
         END
       AS product_label
FROM products
4 records
product_num product_name product_label
SPR-004456 Skateboard Treat yourself. Invest $121.0 for a Skateboard today!
SPR-003356 Basketball Basketball just $34.0.00. Buy today!
SPR-004499 Tennis balls Tennis balls just $12.0.00. Buy today!
FTW-098756 Tennis shoes Tennis shoes just $79.0.00. Buy today!

Extract/Separate

You can extract content out of cells within a column using string functions in SQL. They do vary a bit by dialect, but STRING_SPLIT is frequently available and is in DuckDB.

SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
       STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
       product_name
FROM products
4 records
product_code product_num_no_code product_name
SPR 004456 Skateboard
SPR 003356 Basketball
SPR 004499 Tennis balls
FTW 098756 Tennis shoes

You might notice that you have to repeat the STRING_SPLIT function twice. That’s just the way SQL works.

There are also functions that enable extract using regex. Here we are using what are called “capturing parens”. More on Regex, including capturing parens at https://regexone.com/

SELECT REGEXP_EXTRACT(product_num, '^(\w\w\w)') AS product_code,
       REGEXP_EXTRACT(product_num, '(\d+)') AS product_num_no_code,
       product_name
FROM products
4 records
product_code product_num_no_code product_name
SPR 004456 Skateboard
SPR 003356 Basketball
SPR 004499 Tennis balls
FTW 098756 Tennis shoes
Note

Are you getting errors about unknown columns? SQL dialects differ on whether you can use double quotes like " for strings, or whether you have to use single quotes (like '). DuckDB requires single quotes; MySQL will allow double quotes.

This is because SQL needs to know whether something is a column name or not. If there are no special characters in column names then you can just write the column name directly (like product_name above). Spaces, particularly, cause problems. If a column name has a space in it, then SQL needs some way of knowing that two words are actually one thing.

DuckDB uses the double quotes for that: so "product name" would make DuckDB think you are providing a column name (which it probably won’t find, if you meant a string).

MySQL can allow strings to have double quotes, because MySQL uses backticks to show a column name, e.g., `product name`.

Nest and unnest

SELECT *
FROM courses
3 records
course_name students_enrolled
Divination Harry, Ron
Herbology Hermione, Harry
Flying Lessons Harry, Hermione, Ron

DuckDB has a function that can create new rows from parts of a string that is split up, that function is called … UNNEST

SELECT course_name,
       UNNEST(STRING_SPLIT(students_enrolled, ',')) AS student_enrolled
FROM courses
7 records
course_name student_enrolled
Divination Harry
Divination Ron
Herbology Hermione
Herbology Harry
Flying Lessons Harry
Flying Lessons Hermione
Flying Lessons Ron
Note

In the code above we have a function call inside another function call. STRING_SPLIT is called inside UNNEST.

Progammers learn to read lines like these “inside out”. That is their eyes seek out the inner-most part of the function call, then they work outwards. So they might read this as “split the text in the column into a list, then pass that list to the UNNEST function”.

It can be easier to read if you use more lines and indents

SELECT course_name,
       UNNEST(
         STRING_SPLIT(
           students_enrolled, ','
         )
       ) AS student_enrolled
FROM courses

This is also what the pipe operator |> or %>% in R is helping with: instead of reading function calls inside out, we can write them as a pipeline. E.g.,

students_enrolled |>
  string_split(",") |>
  unnest()

which could be read as “take the string students_enrolled, then split it, then unnest the result”

Programming with pipes, is sometimes called “chaining” because you are linking together functions like links in a chain … as we will see there is an equivalent in Python called “method chaining,” in fact we’ll be learning what is called “chained pandas”.

Which sounds sad (poor animals, chained up) but is actually very nifty

Wider / Longer

my_world_bank_pop <- world_bank_pop %>% 
  #janitor::clean_names() %>% 
  filter(indicator == "SP.URB.TOTL" & 
           country %in% c("AUS", "USA", "IND")) %>% 
  mutate(country = case_match(country,
                              "AUS" ~ "Australia",
                              "IND" ~ "Indonesia",
                              "USA" ~ "United States")) %>% 
  select(country, "2010", "2011", "2012", "2013")

dbWriteTable(con, "populations_wide", my_world_bank_pop, overwrite = TRUE)



dbWriteTable(con, "populations_long", overwrite = TRUE,
             my_world_bank_pop |> 
               pivot_longer(cols = !country, names_to = "year", values_to = "population"))

Historically many dialects of SQL did not have a convenient approach to transformations involving taking a table wider or longer. MySQL Generally those sort of transformations had to be done using programming languages outside SQL, often Python or Perl, or GUI based ETL tools (Extract, Transform, Load)

Today, though, most SQL dialects offer an ability to take data wider or longer. In DuckDB the functions are called “UNPIVOT” (which takes data from wider formats to longer formats) and “PIVOT” which takes data from longer formats to wider formats.

Caution

Remember that the word pivot can be tricky. In Excel a “pivot table” is a kind of group_by/aggregate with the added ability to distribute one grouping variable into new columns. We will learn more about that, but for now just keep in mind that different communities use pivot in slightly different, partially overlapping ways.

And almost all communities using the word do not understand that others use it differently …

We can make a wider format longer using UNPIVOT like this

UNPIVOT populations_wide
ON 2010, 2011, 2012, 2013
INTO
    NAME year
    VALUE population;
Displaying records 1 - 10
country year population
Australia 2010 18767085
Australia 2011 19056040
Australia 2012 19414834
Australia 2013 19775013
Indonesia 2010 383721793
Indonesia 2011 393333604
Indonesia 2012 403171286
Indonesia 2013 413200994
United States 2010 249849720
United States 2011 252208133

That approach requires you to manually specify the column names, but there is an approach that allows you to specify the reverse (columns not to pivot) which we discused as knowing which are the “identifier” columns. In this example, the one column that we want to treat specially is country.

UNPIVOT populations_wide
ON COLUMNS(* EXCLUDE(country))
INTO
    NAME year
    VALUE population;
Displaying records 1 - 10
country year population
Australia 2010 18767085
Australia 2011 19056040
Australia 2012 19414834
Australia 2013 19775013
Indonesia 2010 383721793
Indonesia 2011 393333604
Indonesia 2012 403171286
Indonesia 2013 413200994
United States 2010 249849720
United States 2011 252208133
Note

The specific syntax for wider/longer operations in different SQL dialects varies a lot; it’s kind of a mess.

Happily, GenAI like ChatGPT and GitHub CoPilot, seem to be very good at consuming documentation and tutorials and providing well targeted snippets that can help us with specific syntax and specific dialects.

But when we are building prompts, we have to know what we are asking for. The language in this course around data transformations, abstracted from any particular framework, language, or dialect, will hopefully help you not only talk to other people about data transforms, but talk to AI as well!

We can transform longer tables to wider tables using PIVOT (which has PIVOT_WIDER as a handy synonym).

PIVOT_WIDER populations_long
ON year 
USING first(population);
3 records
country 2010 2011 2012 2013
Indonesia 383721793 393333604 403171286 413200994
Australia 18767085 19056040 19414834 19775013
United States 249849720 252208133 254614421 256953576

It is possible to achieve a similar outcome using CASE statements; in SQL dialects that don’t have a PIVOT you will see people do it this way.

SELECT
    country,
    SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
    SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
    SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
    SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013",
FROM
    populations_long
GROUP BY
    country
3 records
country 2010 2011 2012 2013
Indonesia 383721793 393333604 403171286 413200994
Australia 18767085 19056040 19414834 19775013
United States 249849720 252208133 254614421 256953576

The CASE query works row by row through this table. Let’s focus just on the Australia rows. And we will only do two years. And we will take the SUM off :)

First let’s focus only on two rows of input:

To understand how this is working let’s look just at the first row.

SELECT *
FROM populations_long
WHERE country = 'Australia'
LIMIT 2
2 records
country year population
Australia 2010 18767085
Australia 2011 19056040

The CASE query works row by row. To see what’s happening we will take off the SUM part.

SELECT
    country,
    CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
    CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011"
FROM
    populations_long
WHERE country = 'Australia'
LIMIT 2
2 records
country 2010 2011
Australia 18767085 0
Australia 0 19056040

The CASE statement looks at the first row (Australia, 2010, 18767085). Each part of the SELECT executes with those data as input. So the SELECT country produces “Australia”. The first CASE asks “Does year equal 2010” and the answer is yes, so it outputs 18767085 (the value of population for that row) and puts it in a column called “2010”. The second CASE then executes, but still with the same input data. Since year is not equal to 2010, this outputs 0 for a column called “2011”.

If we do that for all the years but still just Australia.

SELECT
    country,
    CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
    CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011",
    CASE WHEN year = '2012' THEN population ELSE 0 END AS "2012",
    CASE WHEN year = '2013' THEN population ELSE 0 END AS "2013",
FROM
    populations_long
WHERE country = 'Australia'
4 records
country 2010 2011 2012 2013
Australia 18767085 0 0 0
Australia 0 19056040 0 0
Australia 0 0 19414834 0
Australia 0 0 0 19775013

Each row in the original table has turned into three zeros and one population number (for the year that was in that row).

Hopefully now we can see what we need to do is “squeeze” up the rows, so that the zeros disappear and each year column just has the large number in it. Thankfully we have a way to do this, using GROUP BY and aggregate. In this case SUM will work because if we add the three 0s to the population number we end up with the population number.

SELECT
  country,
  SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
  SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
  SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
  SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013",
FROM
  populations_long
GROUP BY
  country
3 records
country 2010 2011 2012 2013
Indonesia 383721793 393333604 403171286 413200994
Australia 18767085 19056040 19414834 19775013
United States 249849720 252208133 254614421 256953576

This works, but is it … non-intuitive and manual. So huzzah for modern SQL dialects having a native PIVOT feature!

Exercises

  1. Load this data frame into a DuckDB table.
data(starwars)
  1. What characters are from the planet “Tatooine”?

  2. Which characters have grey hair? (Note: you will need to use STRING_SPLIT)