SQL Transforms
SQL queries can also be used for data wrangling. SQL is a language, and there are lots of different database implementations. You might have heard of MySQL, PostgreSQL, Oracle, Microsoft SQL Server, or cloud implementations like Snowflake.
In this course we will be working with DuckDB. DuckDB is an “in-notebook” SQL package. This means that we don’t have to run a database server separately from our notebooks.
The SQL we will learn will be largely standard and should work across all databases. Where we use things that are specific to DuckDB I will highlight that, and provide pointers to explanations in other SQL implementations.
library(DBI)
library(duckdb)
library(tidyverse)   # for %>%, tribble(), and pivot_longer()
library(gt)          # provides the gtcars dataset
options(duckdb.enable_rstudio_connection_pane=TRUE)
# create / connect to database file
# drv <- duckdb(dbdir = "my_database.duckdb")
# or you can have a temporary in memory database without a file.
drv <- duckdb()
con <- dbConnect(drv)

We can write data frames into the database, where they are known as tables.
data(gtcars)
gtcars_renamed <- gtcars %>%
select(manufacturer = mfr, everything())
# Write the dataset to DuckDB
dbWriteTable(con, "gtcars", gtcars_renamed, overwrite = TRUE)

customers <- tribble(
~customer_name, ~customer_age, ~customer_state,
"Jihong Li", 34, "California",
"Sarah Brown", 66, "Idaho",
"Rupert Brown", 89, "California",
"Ravi Shastri", 62, "Maharashtra"
)
dbWriteTable(con, "customers", customers, overwrite = TRUE)
## and disconnect
# dbDisconnect(con, shutdown=TRUE)

Excluding rows and columns
SELECT *
FROM customers

FROM specifies the table that we want to query. Here we query a table called customers.
SELECT enables us to exclude columns. To see all columns we use the special character * (pronounced star).
The output produced looks just like any data frame. Notice the “4 records” at the bottom? That is what a row is called in SQL, a record.
SELECT *
FROM customers

| customer_name | customer_age | customer_state |
|---|---|---|
| Jihong Li | 34 | California |
| Sarah Brown | 66 | Idaho |
| Rupert Brown | 89 | California |
| Ravi Shastri | 62 | Maharashtra |
SQL queries are particular about the order of parts. Most SQL implementations will give an error if we write FROM before SELECT (or if we just write FROM without a SELECT). DuckDB, funnily enough, will allow either one of those. But … when in SQL land stick with the typical ordering.
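As a sketch, the typical ordering of the clauses we will meet in this course is (only SELECT and FROM are required; GROUP BY and ORDER BY come later):

```sql
SELECT column_list        -- which columns to show
FROM table_name           -- which table to query
WHERE conditions          -- which rows to keep
GROUP BY columns          -- how to group rows for aggregates
ORDER BY columns          -- how to sort the results
```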
Excluding rows
The WHERE clause allows us to specify conditions that each row must meet. SQL uses a single = for these tests (rather than the double == used in other frameworks). Text items in the query are enclosed in single quotes, thus 'California' but numbers can be written on their own.
SELECT *
FROM customers
WHERE customer_state = 'California'
AND customer_age > 80

| customer_name | customer_age | customer_state |
|---|---|---|
| Rupert Brown | 89 | California |
SQL also allows the approach of providing a list of alternatives, using IN. This is much preferred to using OR.
SELECT *
FROM customers
WHERE customer_state IN ('California', 'Maharashtra')

| customer_name | customer_age | customer_state |
|---|---|---|
| Jihong Li | 34 | California |
| Rupert Brown | 89 | California |
| Ravi Shastri | 62 | Maharashtra |
The opposite of IN is also available: NOT IN
SELECT *
FROM customers
WHERE customer_state NOT IN ('California', 'Maharashtra')

| customer_name | customer_age | customer_state |
|---|---|---|
| Sarah Brown | 66 | Idaho |
If you are thinking “Hey, that is sort of like English” then you are on to something. When SQL was first defined, waaaaay back in the late 1970s, the thought was that “managers” (as in not programmers) would be able to use this language. Today, with GenAI, we may finally be at a point where natural language prompts can write SQL queries (sometimes, maybe, but can you tell if it’s correct?).
Excluding columns
In SQL we specify the columns that we want to see using the SELECT clause, providing a comma-separated list.
SELECT customer_state, customer_name
FROM customers

| customer_state | customer_name |
|---|---|
| California | Jihong Li |
| Idaho | Sarah Brown |
| California | Rupert Brown |
| Maharashtra | Ravi Shastri |
Generally speaking, standard SQL does not support more advanced sub-setting of columns (such as saying "everything other than customer_name" or "all columns that start with numbers"), although some dialects do offer terms like EXCEPT or EXCLUDE.
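DuckDB is one such dialect: its EXCLUDE modifier asks for every column except the ones named. Note that this is DuckDB-specific and will not run on most other databases.

```sql
SELECT * EXCLUDE (customer_name)
FROM customers
```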
Creating new columns
SQL does allow the creation of new result columns, using the SELECT clause and AS to change names.
products <- tribble(
~product_num, ~product_name, ~price_in_dollars,
"SPR-004456", "Skateboard", 121,
"SPR-003356", "Basketball", 34,
"SPR-004499", "Tennis balls", 12,
"FTW-098756", "Tennis shoes", 79
)
dbWriteTable(con, "products", products, overwrite = TRUE)

SELECT *
FROM products

| product_num | product_name | price_in_dollars |
|---|---|---|
| SPR-004456 | Skateboard | 121 |
| SPR-003356 | Basketball | 34 |
| SPR-004499 | Tennis balls | 12 |
| FTW-098756 | Tennis shoes | 79 |
Here is how we would create a new column for a discounted price; we provide a little piece of code that is applied to every row. Here I’ve also swapped to putting each term in the SELECT on its own line to make things clearer (SQL is not fussy about whitespace).
SELECT product_num,
product_name,
price_in_dollars * 0.9 AS discounted_price
FROM products

| product_num | product_name | discounted_price |
|---|---|---|
| SPR-004456 | Skateboard | 108.9 |
| SPR-003356 | Basketball | 30.6 |
| SPR-004499 | Tennis balls | 10.8 |
| FTW-098756 | Tennis shoes | 71.1 |
Above I said that SQL SELECT queries can create new result columns. Here result is doing some heavy lifting.
When we execute an SQL SELECT query we get back results (a ResultSet) but we do not change the database at all. SELECT queries are read only. So while we can create a new column, that happens after the data is pulled from the database table and the new column only exists in the results of the query.
This makes sense in SQL land because the core usage is not inside an individual analyst’s notebook or computer, but in shared databases provided by centralized servers running on their own hardware. Changing the data would change it for everyone; if you changed a column name in a table, then likely everyone else’s queries would break. The website that the database backed would stop working and you’d, uh, definitely get woken up to fix it.
If we want the results of a query to persist in the database we do have the option of creating new tables, using the result of a SELECT query. We can say that the table should be temporary (not persist past the end of our session), or the table could be permanent.
CREATE OR REPLACE TEMP TABLE discounted_products AS
SELECT product_num,
product_name,
price_in_dollars * 0.9 AS discounted_price
FROM products

If we are getting into changing column names or structures in tables in a database that anyone else might use … then we are in the world of migrations, which is its own fascinating area, with a set of version control technologies and social/managerial practices.
Combining columns
We can combine columns using SQL SELECT (again not changing the table, just the results). This uses CONCAT and visits each row. We can use column names, but we can also provide little chunks of literal text to be used with each row.
SELECT product_num,
product_name,
CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!') AS product_label
FROM products

| product_num | product_name | product_label |
|---|---|---|
| SPR-004456 | Skateboard | Skateboard just $121.0.00. Buy today! |
| SPR-003356 | Basketball | Basketball just $34.0.00. Buy today! |
| SPR-004499 | Tennis balls | Tennis balls just $12.0.00. Buy today! |
| FTW-098756 | Tennis shoes | Tennis shoes just $79.0.00. Buy today! |
Treating different cases differently
SQL can do different things for different rows, using CASE in the SELECT clause. For example we can create a different product_label for the more expensive items.
For comparison, here is the equivalent in R:
mutate(product_label = case_when(
  price_in_dollars > 100 ~ str_glue("Treat yourself. Invest ${price_in_dollars} for a {product_name} today!"),
  TRUE ~ str_glue("{product_name} just ${price_in_dollars}.00. Buy today!")
))
SELECT product_num,
product_name,
CASE WHEN
price_in_dollars > 100 THEN
CONCAT('Treat yourself. Invest $', price_in_dollars, ' for a ', product_name, ' today!')
ELSE
CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!')
END
AS product_label
FROM products

| product_num | product_name | product_label |
|---|---|---|
| SPR-004456 | Skateboard | Treat yourself. Invest $121.0 for a Skateboard today! |
| SPR-003356 | Basketball | Basketball just $34.0.00. Buy today! |
| SPR-004499 | Tennis balls | Tennis balls just $12.0.00. Buy today! |
| FTW-098756 | Tennis shoes | Tennis shoes just $79.0.00. Buy today! |
Extract/Separate
You can extract content out of cells within a column using string functions in SQL. They do vary a bit by dialect, but STRING_SPLIT is frequently available and is in DuckDB.
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
product_name
FROM products

| product_code | product_num_no_code | product_name |
|---|---|---|
| SPR | 004456 | Skateboard |
| SPR | 003356 | Basketball |
| SPR | 004499 | Tennis balls |
| FTW | 098756 | Tennis shoes |
You might notice that you have to call the STRING_SPLIT function twice. That's just the way SQL works.
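If the repetition bothers you, one workaround (a sketch, not required) is to do the split once in a subquery and index into the resulting list in the outer query:

```sql
SELECT parts[1] AS product_code,
       parts[2] AS product_num_no_code,
       product_name
FROM (
    SELECT STRING_SPLIT(product_num, '-') AS parts,
           product_name
    FROM products
) AS split_products
```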
There are also functions that enable extraction using regular expressions (regex). Here we are using what are called "capturing parens". More on regex, including capturing parens, at https://regexone.com/
SELECT REGEXP_EXTRACT(product_num, '^(\w\w\w)') AS product_code,
REGEXP_EXTRACT(product_num, '(\d+)') AS product_num_no_code,
product_name
FROM products

| product_code | product_num_no_code | product_name |
|---|---|---|
| SPR | 004456 | Skateboard |
| SPR | 003356 | Basketball |
| SPR | 004499 | Tennis balls |
| FTW | 098756 | Tennis shoes |
Are you getting errors about unknown columns? SQL dialects differ on whether you can use double quotes like " for strings, or whether you have to use single quotes (like '). DuckDB requires single quotes; MySQL will allow double quotes.
This is because SQL needs to know whether something is a column name or not. If there are no special characters in column names then you can just write the column name directly (like product_name above). Spaces, particularly, cause problems. If a column name has a space in it, then SQL needs some way of knowing that two words are actually one thing.
DuckDB uses the double quotes for that: so "product name" would make DuckDB think you are providing a column name (which it probably won’t find, if you meant a string).
MySQL can allow strings to have double quotes, because MySQL uses backticks to show a column name, e.g., `product name`.
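Putting the quoting rules together in one DuckDB query: single quotes mark a string value, double quotes mark a column name.

```sql
SELECT "customer_name"                  -- double quotes: a column name
FROM customers
WHERE customer_state = 'California'     -- single quotes: a string value
```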
Nest and unnest
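The queries below assume a courses table with the contents shown in the output. It was presumably written into the database with dbWriteTable like the earlier tables; if you want to recreate it directly in SQL, one way is:

```sql
CREATE OR REPLACE TEMP TABLE courses AS
SELECT * FROM (VALUES
    ('Divination',     'Harry, Ron'),
    ('Herbology',      'Hermione, Harry'),
    ('Flying Lessons', 'Harry, Hermione, Ron')
) AS t(course_name, students_enrolled);
```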
SELECT *
FROM courses

| course_name | students_enrolled |
|---|---|
| Divination | Harry, Ron |
| Herbology | Hermione, Harry |
| Flying Lessons | Harry, Hermione, Ron |
DuckDB has a function that can create new rows from the parts of a split-up string. That function is called … UNNEST
SELECT course_name,
UNNEST(STRING_SPLIT(students_enrolled, ',')) AS student_enrolled
FROM courses

| course_name | student_enrolled |
|---|---|
| Divination | Harry |
| Divination | Ron |
| Herbology | Hermione |
| Herbology | Harry |
| Flying Lessons | Harry |
| Flying Lessons | Hermione |
| Flying Lessons | Ron |
In the code above we have a function call inside another function call. STRING_SPLIT is called inside UNNEST.
Programmers learn to read lines like these "inside out". That is, their eyes seek out the inner-most part of the function call, then they work outwards. So they might read this as "split the text in the column into a list, then pass that list to the UNNEST function".
It can be easier to read if you use more lines and indents:
SELECT course_name,
UNNEST(
STRING_SPLIT(
students_enrolled, ','
)
) AS student_enrolled
FROM courses

This is also what the pipe operator |> or %>% in R is helping with: instead of reading function calls inside out, we can write them as a pipeline. E.g.,
students_enrolled |>
string_split(",") |>
unnest()

which could be read as "take the string students_enrolled, then split it, then unnest the result".
Programming with pipes is sometimes called "chaining" because you are linking together functions like links in a chain … as we will see, there is an equivalent in Python called "method chaining"; in fact we'll be learning what is called "chained pandas".
Which sounds sad (poor animals, chained up) but is actually very nifty.
Wider / Longer
my_world_bank_pop <- world_bank_pop %>%
#janitor::clean_names() %>%
filter(indicator == "SP.URB.TOTL" &
country %in% c("AUS", "USA", "IND")) %>%
mutate(country = case_match(country,
"AUS" ~ "Australia",
"IND" ~ "India",
"USA" ~ "United States")) %>%
select(country, "2010", "2011", "2012", "2013")
dbWriteTable(con, "populations_wide", my_world_bank_pop, overwrite = TRUE)
dbWriteTable(con, "populations_long", overwrite = TRUE,
my_world_bank_pop |>
pivot_longer(cols = !country, names_to = "year", values_to = "population"))

Historically, many dialects of SQL did not have a convenient approach to transformations involving taking a table wider or longer. Generally those sorts of transformations had to be done using programming languages outside SQL, often Python or Perl, or GUI-based ETL tools (Extract, Transform, Load).
Today, though, most SQL dialects offer an ability to take data wider or longer. In DuckDB the functions are called “UNPIVOT” (which takes data from wider formats to longer formats) and “PIVOT” which takes data from longer formats to wider formats.
Remember that the word pivot can be tricky. In Excel a “pivot table” is a kind of group_by/aggregate with the added ability to distribute one grouping variable into new columns. We will learn more about that, but for now just keep in mind that different communities use pivot in slightly different, partially overlapping ways.
And almost all communities using the word do not understand that others use it differently …
We can make a wider format longer using UNPIVOT like this
UNPIVOT populations_wide
ON 2010, 2011, 2012, 2013
INTO
NAME year
VALUE population;

| country | year | population |
|---|---|---|
| Australia | 2010 | 18767085 |
| Australia | 2011 | 19056040 |
| Australia | 2012 | 19414834 |
| Australia | 2013 | 19775013 |
| India | 2010 | 383721793 |
| India | 2011 | 393333604 |
| India | 2012 | 403171286 |
| India | 2013 | 413200994 |
| United States | 2010 | 249849720 |
| United States | 2011 | 252208133 |
That approach requires you to manually specify the column names, but there is an approach that allows you to specify the reverse (the columns not to pivot), which we discussed as knowing which are the "identifier" columns. In this example, the one column that we want to treat specially is country.
UNPIVOT populations_wide
ON COLUMNS(* EXCLUDE(country))
INTO
NAME year
VALUE population;

| country | year | population |
|---|---|---|
| Australia | 2010 | 18767085 |
| Australia | 2011 | 19056040 |
| Australia | 2012 | 19414834 |
| Australia | 2013 | 19775013 |
| India | 2010 | 383721793 |
| India | 2011 | 393333604 |
| India | 2012 | 403171286 |
| India | 2013 | 413200994 |
| United States | 2010 | 249849720 |
| United States | 2011 | 252208133 |
The specific syntax for wider/longer operations in different SQL dialects varies a lot; it’s kind of a mess.
Happily, GenAI tools like ChatGPT and GitHub Copilot seem to be very good at consuming documentation and tutorials and providing well-targeted snippets that can help us with specific syntax and specific dialects.
But when we are building prompts, we have to know what we are asking for. The language in this course around data transformations, abstracted from any particular framework, language, or dialect, will hopefully help you not only talk to other people about data transforms, but talk to AI as well!
We can transform longer tables to wider tables using PIVOT (which has PIVOT_WIDER as a handy synonym).
PIVOT_WIDER populations_long
ON year
USING first(population);

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| Australia | 18767085 | 19056040 | 19414834 | 19775013 |
| United States | 249849720 | 252208133 | 254614421 | 256953576 |
| India | 383721793 | 393333604 | 403171286 | 413200994 |
It is possible to achieve a similar outcome using CASE statements; in SQL dialects that don’t have a PIVOT you will see people do it this way.
SELECT
country,
SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
FROM
populations_long
GROUP BY
country

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| Australia | 18767085 | 19056040 | 19414834 | 19775013 |
| United States | 249849720 | 252208133 | 254614421 | 256953576 |
| India | 383721793 | 393333604 | 403171286 | 413200994 |
The CASE query works row by row through this table. Let's focus just on the Australia rows, use only two years, and take the SUM off :)
First let's look at only two rows of input:
SELECT *
FROM populations_long
WHERE country = 'Australia'
LIMIT 2

| country | year | population |
|---|---|---|
| Australia | 2010 | 18767085 |
| Australia | 2011 | 19056040 |
Now let's run the CASE expressions over those rows, with the SUM part taken off.
SELECT
country,
CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011"
FROM
populations_long
WHERE country = 'Australia'
LIMIT 2

| country | 2010 | 2011 |
|---|---|---|
| Australia | 18767085 | 0 |
| Australia | 0 | 19056040 |
The CASE statement looks at the first row (Australia, 2010, 18767085). Each part of the SELECT executes with those data as input. So the SELECT country produces “Australia”. The first CASE asks “Does year equal 2010” and the answer is yes, so it outputs 18767085 (the value of population for that row) and puts it in a column called “2010”. The second CASE then executes, but still with the same input data. Since year is not equal to 2010, this outputs 0 for a column called “2011”.
If we do that for all the years, but still just for Australia:
SELECT
country,
CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011",
CASE WHEN year = '2012' THEN population ELSE 0 END AS "2012",
CASE WHEN year = '2013' THEN population ELSE 0 END AS "2013"
FROM
populations_long
WHERE country = 'Australia'

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| Australia | 18767085 | 0 | 0 | 0 |
| Australia | 0 | 19056040 | 0 | 0 |
| Australia | 0 | 0 | 19414834 | 0 |
| Australia | 0 | 0 | 0 | 19775013 |
Each row in the original table has turned into three zeros and one population number (for the year that was in that row).
Hopefully now we can see that what we need to do is "squeeze" up the rows, so that the zeros disappear and each year column just has the large number in it. Thankfully we have a way to do this, using GROUP BY and an aggregate. In this case SUM works because adding the three 0s to the population number just gives back the population number.
SELECT
country,
SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
FROM
populations_long
GROUP BY
country

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| India | 383721793 | 393333604 | 403171286 | 413200994 |
| Australia | 18767085 | 19056040 | 19414834 | 19775013 |
| United States | 249849720 | 252208133 | 254614421 | 256953576 |
This works, but it is … non-intuitive and manual. So huzzah for modern SQL dialects having a native PIVOT feature!
Exercises
- Load this data frame into a DuckDB table.
data(starwars)

- What characters are from the planet "Tatooine"?
- Which characters have grey hair? (Note: you will need to use STRING_SPLIT)