SQL Transforms
SQL queries can also be used for data wrangling. SQL is a language, and there are lots of different database implementations. You might have heard of MySQL, PostgreSQL, Oracle, Microsoft SQL Server, or cloud implementations like Snowflake.
In this course we will be working with DuckDB. DuckDB is an “in-notebook” SQL package. This means that we don’t have to run a database server separately from our notebooks.
The SQL we will learn will be largely standard and should work across all databases. Where we use things that are specific to DuckDB I will highlight that, and provide pointers to explanations in other SQL implementations.
library(DBI)
library(duckdb)
library(tidyverse)   # for %>%, tribble(), and pivot_longer()
library(gt)          # provides the gtcars dataset
options(duckdb.enable_rstudio_connection_pane=TRUE)
# create / connect to database file
# drv <- duckdb(dbdir = "my_database.duckdb")
# or you can have a temporary in memory database without a file.
drv <- duckdb()
con <- dbConnect(drv)

We can write data frames into the database, where they are known as tables.
data(gtcars)
gtcars_renamed <- gtcars %>%
select(manufacturer = mfr, everything())
# Write the dataset to DuckDB
dbWriteTable(con, "gtcars", gtcars_renamed, overwrite = TRUE)

customers <- tribble(
~customer_name, ~customer_age, ~customer_state,
"Jihong Li", 34, "California",
"Sarah Brown", 66, "Idaho",
"Rupert Brown", 89, "California",
"Ravi Shastri", 62, "Maharashtra"
)
dbWriteTable(con, "customers", customers, overwrite = TRUE)
## and disconnect
# dbDisconnect(con, shutdown=TRUE)

Excluding rows and columns
SELECT *
FROM customers

FROM specifies the table that we want to query. Here we query a table called customers.
SELECT enables us to exclude columns. To see all columns we use the special character * (pronounced star).
The output produced looks just like any data frame. Notice the “4 records” at the bottom? That is what a row is called in SQL, a record.
SELECT *
FROM customers

| customer_name | customer_age | customer_state |
|---|---|---|
| Jihong Li | 34 | California |
| Sarah Brown | 66 | Idaho |
| Rupert Brown | 89 | California |
| Ravi Shastri | 62 | Maharashtra |
SQL queries are particular about the order of parts. Most SQL implementations will give an error if we write FROM before SELECT (or if we just write FROM without a SELECT). DuckDB, funnily enough, will allow either one of those. But … when in SQL land stick with the typical ordering.
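As a sketch, the typical ordering of the clauses we will meet in this course is (only SELECT and FROM are required; GROUP BY and ORDER BY come later):

```sql
SELECT column_list        -- which columns to show
FROM table_name           -- which table to query
WHERE conditions          -- which rows to keep
GROUP BY columns          -- how to group rows for aggregates
ORDER BY columns          -- how to sort the results
```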
Excluding rows
The WHERE clause allows us to specify conditions that each row must meet. SQL uses a single = for these tests (rather than the double == used in other frameworks). Text items in the query are enclosed in single quotes, thus 'California' but numbers can be written on their own.
SELECT *
FROM customers
WHERE customer_state = 'California'
AND customer_age > 80

| customer_name | customer_age | customer_state |
|---|---|---|
| Rupert Brown | 89 | California |
SQL also allows the approach of providing a list of alternatives, using IN. This is much preferred to using OR.
SELECT *
FROM customers
WHERE customer_state IN ('California', 'Maharashtra')

| customer_name | customer_age | customer_state |
|---|---|---|
| Jihong Li | 34 | California |
| Rupert Brown | 89 | California |
| Ravi Shastri | 62 | Maharashtra |
The opposite of IN is also available: NOT IN
SELECT *
FROM customers
WHERE customer_state NOT IN ('California', 'Maharashtra')

| customer_name | customer_age | customer_state |
|---|---|---|
| Sarah Brown | 66 | Idaho |
If you are thinking “Hey, that is sort of like English” then you are on to something. When SQL was first defined, waaaaay back in the late 1970s, the thought was that “managers” (as in not programmers) would be able to use this language. Today, with GenAI, we may finally be at a point where natural language prompts can write SQL queries (sometimes, maybe, but can you tell if it’s correct?).
Excluding columns
In SQL we specify the columns that we want to see using the SELECT clause, providing a comma-separated list.
SELECT customer_state, customer_name
FROM customers

| customer_state | customer_name |
|---|---|
| California | Jihong Li |
| Idaho | Sarah Brown |
| California | Rupert Brown |
| Maharashtra | Ravi Shastri |
Generally speaking, standard SQL does not support more advanced sub-setting of columns (such as saying "everything other than customer_name" or "all columns that start with numbers"), although some dialects do offer terms like EXCEPT or EXCLUDE.
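DuckDB is one such dialect: its EXCLUDE modifier asks for every column except the ones named. Note that this is DuckDB-specific and will not run on most other databases.

```sql
SELECT * EXCLUDE (customer_name)
FROM customers
```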
Creating new columns
SQL does allow the creation of new result columns, using the SELECT clause and AS to change names.
products <- tribble(
~product_num, ~product_name, ~price_in_dollars,
"SPR-004456", "Skateboard", 121,
"SPR-003356", "Basketball", 34,
"SPR-004499", "Tennis balls", 12,
"FTW-098756", "Tennis shoes", 79
)
dbWriteTable(con, "products", products, overwrite = TRUE)

SELECT *
FROM products

| product_num | product_name | price_in_dollars |
|---|---|---|
| SPR-004456 | Skateboard | 121 |
| SPR-003356 | Basketball | 34 |
| SPR-004499 | Tennis balls | 12 |
| FTW-098756 | Tennis shoes | 79 |
Here is how we would create a new column for a discounted price; we provide a little piece of code that is applied to every row. Here I’ve also swapped to putting each term in the SELECT on its own line to make things clearer (SQL is not fussy about whitespace).
SELECT product_num,
product_name,
price_in_dollars * 0.9 AS discounted_price
FROM products

| product_num | product_name | discounted_price |
|---|---|---|
| SPR-004456 | Skateboard | 108.9 |
| SPR-003356 | Basketball | 30.6 |
| SPR-004499 | Tennis balls | 10.8 |
| FTW-098756 | Tennis shoes | 71.1 |
Above I said that SQL SELECT queries can create new result columns. Here result is doing some heavy lifting.
When we execute an SQL SELECT query we get back results (a ResultSet) but we do not change the database at all. SELECT queries are read only. So while we can create a new column, that happens after the data is pulled from the database table and the new column only exists in the results of the query.
This makes sense in SQL land because the core usage is not inside an individual analyst’s notebook or computer, but in shared databases provided by centralized servers running on their own hardware. Changing the data would change it for everyone; if you changed a column name in a table, then likely everyone else’s queries would break. The website that the database backed would stop working and you’d, uh, definitely get woken up to fix it.
If we want the results of a query to persist in the database we do have the option of creating new tables, using the result of a SELECT query. We can say that the table should be temporary (not persist past the end of our session), or the table could be permanent.
CREATE OR REPLACE TEMP TABLE discounted_products AS
SELECT product_num,
product_name,
price_in_dollars * 0.9 AS discounted_price
FROM products

If we are getting into changing column names or structures in tables in a database that anyone else might use … then we are in the world of migrations, which is its own fascinating area, with a set of version control technologies and social/managerial practices.
Combining columns
We can combine columns using SQL SELECT (again not changing the table, just the results). This uses CONCAT and visits each row. We can use column names, but we can also provide little chunks of literal text to be used with each row.
SELECT product_num,
product_name,
CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!') AS product_label
FROM products

| product_num | product_name | product_label |
|---|---|---|
| SPR-004456 | Skateboard | Skateboard just $121.0.00. Buy today! |
| SPR-003356 | Basketball | Basketball just $34.0.00. Buy today! |
| SPR-004499 | Tennis balls | Tennis balls just $12.0.00. Buy today! |
| FTW-098756 | Tennis shoes | Tennis shoes just $79.0.00. Buy today! |
Treating different cases differently
SQL can do different things for different rows, using CASE in the SELECT clause. For example we can create a different product_label for the more expensive items.
For comparison, here is the equivalent in R:
mutate(product_label = case_when(
  price_in_dollars > 100 ~ str_glue("Treat yourself. Invest ${price_in_dollars} for a {product_name} today!"),
  TRUE ~ str_glue("{product_name} just ${price_in_dollars}.00. Buy today!")
))
SELECT product_num,
product_name,
CASE WHEN
price_in_dollars > 100 THEN
CONCAT('Treat yourself. Invest $', price_in_dollars, ' for a ', product_name, ' today!')
ELSE
CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!')
END
AS product_label
FROM products

| product_num | product_name | product_label |
|---|---|---|
| SPR-004456 | Skateboard | Treat yourself. Invest $121.0 for a Skateboard today! |
| SPR-003356 | Basketball | Basketball just $34.0.00. Buy today! |
| SPR-004499 | Tennis balls | Tennis balls just $12.0.00. Buy today! |
| FTW-098756 | Tennis shoes | Tennis shoes just $79.0.00. Buy today! |
Extract/Separate
You can extract content out of cells within a column using string functions in SQL. They do vary a bit by dialect, but STRING_SPLIT is frequently available and is in DuckDB.
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
product_name
FROM products

| product_code | product_num_no_code | product_name |
|---|---|---|
| SPR | 004456 | Skateboard |
| SPR | 003356 | Basketball |
| SPR | 004499 | Tennis balls |
| FTW | 098756 | Tennis shoes |
You might notice that you have to call the STRING_SPLIT function twice. That's just the way SQL works.
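If the repetition bothers you, one workaround (a sketch, not required) is to do the split once in a subquery and index into the resulting list in the outer query:

```sql
SELECT parts[1] AS product_code,
       parts[2] AS product_num_no_code,
       product_name
FROM (
    SELECT STRING_SPLIT(product_num, '-') AS parts,
           product_name
    FROM products
) AS split_products
```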
There are also functions that enable extraction using regular expressions (regex). Here we are using what are called "capturing parens". More on regex, including capturing parens, at https://regexone.com/
SELECT REGEXP_EXTRACT(product_num, '^(\w\w\w)') AS product_code,
REGEXP_EXTRACT(product_num, '(\d+)') AS product_num_no_code,
product_name
FROM products

| product_code | product_num_no_code | product_name |
|---|---|---|
| SPR | 004456 | Skateboard |
| SPR | 003356 | Basketball |
| SPR | 004499 | Tennis balls |
| FTW | 098756 | Tennis shoes |
Are you getting errors about unknown columns? SQL dialects differ on whether you can use double quotes like " for strings, or whether you have to use single quotes (like '). DuckDB requires single quotes; MySQL will allow double quotes.
This is because SQL needs to know whether something is a column name or not. If there are no special characters in column names then you can just write the column name directly (like product_name above). Spaces, particularly, cause problems. If a column name has a space in it, then SQL needs some way of knowing that two words are actually one thing.
DuckDB uses the double quotes for that: so "product name" would make DuckDB think you are providing a column name (which it probably won’t find, if you meant a string).
MySQL can allow strings to have double quotes, because MySQL uses backticks to show a column name, e.g., `product name`.
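Putting the quoting rules together in one DuckDB query: single quotes mark a string value, double quotes mark a column name.

```sql
SELECT "customer_name"                  -- double quotes: a column name
FROM customers
WHERE customer_state = 'California'     -- single quotes: a string value
```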
Nest and unnest
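The queries below assume a courses table with the contents shown in the output. It was presumably written into the database with dbWriteTable like the earlier tables; if you want to recreate it directly in SQL, one way is:

```sql
CREATE OR REPLACE TEMP TABLE courses AS
SELECT * FROM (VALUES
    ('Divination',     'Harry, Ron'),
    ('Herbology',      'Hermione, Harry'),
    ('Flying Lessons', 'Harry, Hermione, Ron')
) AS t(course_name, students_enrolled);
```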
SELECT *
FROM courses

| course_name | students_enrolled |
|---|---|
| Divination | Harry, Ron |
| Herbology | Hermione, Harry |
| Flying Lessons | Harry, Hermione, Ron |
DuckDB has a function that can create new rows from the parts of a split-up string. That function is called … UNNEST
SELECT course_name,
UNNEST(STRING_SPLIT(students_enrolled, ',')) AS student_enrolled
FROM courses

| course_name | student_enrolled |
|---|---|
| Divination | Harry |
| Divination | Ron |
| Herbology | Hermione |
| Herbology | Harry |
| Flying Lessons | Harry |
| Flying Lessons | Hermione |
| Flying Lessons | Ron |
In the code above we have a function call inside another function call. STRING_SPLIT is called inside UNNEST.
Programmers learn to read lines like these "inside out". That is, their eyes seek out the inner-most part of the function call, then they work outwards. So they might read this as "split the text in the column into a list, then pass that list to the UNNEST function".
It can be easier to read if you use more lines and indents:
SELECT course_name,
UNNEST(
STRING_SPLIT(
students_enrolled, ','
)
) AS student_enrolled
FROM courses

This is also what the pipe operator |> or %>% in R is helping with: instead of reading function calls inside out, we can write them as a pipeline. E.g.,
students_enrolled |>
string_split(",") |>
unnest()

which could be read as "take the string students_enrolled, then split it, then unnest the result".
Programming with pipes is sometimes called "chaining" because you are linking together functions like links in a chain … as we will see, there is an equivalent in Python called "method chaining"; in fact we'll be learning what is called "chained pandas".
Which sounds sad (poor animals, chained up) but is actually very nifty.
Wider / Longer
my_world_bank_pop <- world_bank_pop %>%
#janitor::clean_names() %>%
filter(indicator == "SP.URB.TOTL" &
country %in% c("AUS", "USA", "IND")) %>%
mutate(country = case_match(country,
"AUS" ~ "Australia",
"IND" ~ "India",
"USA" ~ "United States")) %>%
select(country, "2010", "2011", "2012", "2013")
dbWriteTable(con, "populations_wide", my_world_bank_pop, overwrite = TRUE)
dbWriteTable(con, "populations_long", overwrite = TRUE,
my_world_bank_pop |>
pivot_longer(cols = !country, names_to = "year", values_to = "population"))

Historically, many dialects of SQL did not have a convenient approach to transformations involving taking a table wider or longer. Generally those sorts of transformations had to be done using programming languages outside SQL, often Python or Perl, or GUI-based ETL tools (Extract, Transform, Load).
Today, though, most SQL dialects offer an ability to take data wider or longer. In DuckDB the functions are called “UNPIVOT” (which takes data from wider formats to longer formats) and “PIVOT” which takes data from longer formats to wider formats.
Remember that the word pivot can be tricky. In Excel a “pivot table” is a kind of group_by/aggregate with the added ability to distribute one grouping variable into new columns. We will learn more about that, but for now just keep in mind that different communities use pivot in slightly different, partially overlapping ways.
And almost all communities using the word do not understand that others use it differently …
We can make a wider format longer using UNPIVOT like this
UNPIVOT populations_wide
ON 2010, 2011, 2012, 2013
INTO
NAME year
VALUE population;

| country | year | population |
|---|---|---|
| Australia | 2010 | 18767085 |
| Australia | 2011 | 19056040 |
| Australia | 2012 | 19414834 |
| Australia | 2013 | 19775013 |
| India | 2010 | 383721793 |
| India | 2011 | 393333604 |
| India | 2012 | 403171286 |
| India | 2013 | 413200994 |
| United States | 2010 | 249849720 |
| United States | 2011 | 252208133 |
That approach requires you to manually specify the column names, but there is an approach that allows you to specify the reverse (the columns not to pivot), which we discussed as knowing which are the "identifier" columns. In this example, the one column that we want to treat specially is country.
UNPIVOT populations_wide
ON COLUMNS(* EXCLUDE(country))
INTO
NAME year
VALUE population;

| country | year | population |
|---|---|---|
| Australia | 2010 | 18767085 |
| Australia | 2011 | 19056040 |
| Australia | 2012 | 19414834 |
| Australia | 2013 | 19775013 |
| India | 2010 | 383721793 |
| India | 2011 | 393333604 |
| India | 2012 | 403171286 |
| India | 2013 | 413200994 |
| United States | 2010 | 249849720 |
| United States | 2011 | 252208133 |
The specific syntax for wider/longer operations in different SQL dialects varies a lot; it’s kind of a mess.
Happily, GenAI tools like ChatGPT and GitHub Copilot seem to be very good at consuming documentation and tutorials and providing well-targeted snippets that can help us with specific syntax and specific dialects.
But when we are building prompts, we have to know what we are asking for. The language in this course around data transformations, abstracted from any particular framework, language, or dialect, will hopefully help you not only talk to other people about data transforms, but talk to AI as well!
We can transform longer tables to wider tables using PIVOT (which has PIVOT_WIDER as a handy synonym).
PIVOT_WIDER populations_long
ON year
USING first(population);

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| Australia | 18767085 | 19056040 | 19414834 | 19775013 |
| United States | 249849720 | 252208133 | 254614421 | 256953576 |
| India | 383721793 | 393333604 | 403171286 | 413200994 |
It is possible to achieve a similar outcome using CASE statements; in SQL dialects that don’t have a PIVOT you will see people do it this way.
SELECT
country,
SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
FROM
populations_long
GROUP BY
country

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| Australia | 18767085 | 19056040 | 19414834 | 19775013 |
| United States | 249849720 | 252208133 | 254614421 | 256953576 |
| India | 383721793 | 393333604 | 403171286 | 413200994 |
The CASE query works row by row through this table. Let's focus just on the Australia rows, use only two years, and take the SUM off :)
First let's look at only two rows of input:
SELECT *
FROM populations_long
WHERE country = 'Australia'
LIMIT 2

| country | year | population |
|---|---|---|
| Australia | 2010 | 18767085 |
| Australia | 2011 | 19056040 |
Now let's run the CASE expressions over those rows, with the SUM part taken off.
SELECT
country,
CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011"
FROM
populations_long
WHERE country = 'Australia'
LIMIT 2

| country | 2010 | 2011 |
|---|---|---|
| Australia | 18767085 | 0 |
| Australia | 0 | 19056040 |
The CASE statement looks at the first row (Australia, 2010, 18767085). Each part of the SELECT executes with those data as input. So the SELECT country produces “Australia”. The first CASE asks “Does year equal 2010” and the answer is yes, so it outputs 18767085 (the value of population for that row) and puts it in a column called “2010”. The second CASE then executes, but still with the same input data. Since year is not equal to 2010, this outputs 0 for a column called “2011”.
If we do that for all the years, but still just for Australia:
SELECT
country,
CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011",
CASE WHEN year = '2012' THEN population ELSE 0 END AS "2012",
CASE WHEN year = '2013' THEN population ELSE 0 END AS "2013"
FROM
populations_long
WHERE country = 'Australia'

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| Australia | 18767085 | 0 | 0 | 0 |
| Australia | 0 | 19056040 | 0 | 0 |
| Australia | 0 | 0 | 19414834 | 0 |
| Australia | 0 | 0 | 0 | 19775013 |
Each row in the original table has turned into three zeros and one population number (for the year that was in that row).
Hopefully now we can see that what we need to do is "squeeze" up the rows, so that the zeros disappear and each year column just has the large number in it. Thankfully we have a way to do this, using GROUP BY and an aggregate. In this case SUM works because adding the three 0s to the population number just gives back the population number.
SELECT
country,
SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
FROM
populations_long
GROUP BY
country

| country | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|
| India | 383721793 | 393333604 | 403171286 | 413200994 |
| Australia | 18767085 | 19056040 | 19414834 | 19775013 |
| United States | 249849720 | 252208133 | 254614421 | 256953576 |
This works, but it is … non-intuitive and manual. So huzzah for modern SQL dialects having a native PIVOT feature!
Exercises
- Load this data frame into a DuckDB table.
data(starwars)

- What characters are from the planet "Tatooine"?
- Which characters have grey hair? (Note: you will need to use STRING_SPLIT)