SQL Transforms
SQL queries can also be used for data wrangling. SQL is a language, and there are lots of different database implementations. You might have heard of MySQL, PostgreSQL, Oracle, Microsoft SQL Server, or cloud implementations like Snowflake.
In this course we will be working with DuckDB. DuckDB is an “in-notebook” SQL package. This means that we don’t have to run a database server separately from our notebooks.
The SQL we will learn will be largely standard and should work across all databases. Where we use things that are specific to DuckDB I will highlight that, and provide pointers to explanations in other SQL implementations.
library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)
# create / connect to database file
# drv <- duckdb(dbdir = "my_database.duckdb")
# or you can have a temporary in memory database without a file.
drv <- duckdb()
con <- dbConnect(drv)
We can write data frames into the database, where they are known as tables.
data(gtcars)
gtcars_renamed <- gtcars %>%
  select(manufacturer = mfr, everything())
# Write the dataset to DuckDB
dbWriteTable(con, "gtcars", gtcars_renamed, overwrite = TRUE)
customers <- tribble(
  ~customer_name, ~customer_age, ~customer_state,
  "Jihong Li", 34, "California",
  "Sarah Brown", 66, "Idaho",
  "Rupert Brown", 89, "California",
  "Ravi Shastri", 62, "Maharashtra"
)
dbWriteTable(con, "customers", customers, overwrite = TRUE)
## and disconnect
# dbDisconnect(con, shutdown=TRUE)
Excluding rows and columns
SELECT *
FROM customers
FROM specifies the table that we want to query. Here we query a table called customers.
SELECT enables us to exclude columns. To see all columns we use the special character * (pronounced star).
The output produced looks just like any data frame. Notice the “4 records” at the bottom? That is what a row is called in SQL, a record.
SELECT *
FROM customers
customer_name | customer_age | customer_state |
---|---|---|
Jihong Li | 34 | California |
Sarah Brown | 66 | Idaho |
Rupert Brown | 89 | California |
Ravi Shastri | 62 | Maharashtra |
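Because this query is standard SQL, it runs unchanged on other engines too. As a quick sketch (not part of the course's R/DuckDB setup), here is the same customers table and query using Python's built-in sqlite3 module; the variable names are just for this illustration.

```python
import sqlite3

# Rebuild the customers table in an in-memory SQLite database.
# sqlite3 ships with Python, so this is a zero-install way to try
# standard SQL; note it is a different engine than DuckDB.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customers (
        customer_name  TEXT,
        customer_age   INTEGER,
        customer_state TEXT
    )
""")
con.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Jihong Li", 34, "California"),
     ("Sarah Brown", 66, "Idaho"),
     ("Rupert Brown", 89, "California"),
     ("Ravi Shastri", 62, "Maharashtra")],
)

# SELECT * FROM customers returns every column of every record.
rows = con.execute("SELECT * FROM customers").fetchall()
print(len(rows))  # 4
```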
SQL queries are particular about the order of parts. Most SQL implementations will give an error if we write FROM before SELECT (or if we just write FROM without a SELECT). DuckDB, funnily enough, will allow either of those. But … when in SQL land, stick with the typical ordering.
Excluding rows
The WHERE clause allows us to specify conditions that each row must meet. SQL uses a single = for these tests (rather than the double == used in other frameworks). Text items in the query are enclosed in single quotes, thus 'California', but numbers can be written on their own.
SELECT *
FROM customers
WHERE customer_state = 'California'
AND customer_age > 80
customer_name | customer_age | customer_state |
---|---|---|
Rupert Brown | 89 | California |
SQL also allows the approach of providing a list of alternatives, using IN. This is much preferred to using OR.
SELECT *
FROM customers
WHERE customer_state IN ('California', 'Maharashtra')
customer_name | customer_age | customer_state |
---|---|---|
Jihong Li | 34 | California |
Rupert Brown | 89 | California |
Ravi Shastri | 62 | Maharashtra |
The opposite of IN is also available: NOT IN.
SELECT *
FROM customers
WHERE customer_state NOT IN ('California', 'Maharashtra')
customer_name | customer_age | customer_state |
---|---|---|
Sarah Brown | 66 | Idaho |
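These filters are portable as well. A sketch of both the AND condition and NOT IN, again using Python's built-in sqlite3 as a stand-in engine (an illustration, not the course setup):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers "
            "(customer_name TEXT, customer_age INTEGER, customer_state TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Jihong Li", 34, "California"),
     ("Sarah Brown", 66, "Idaho"),
     ("Rupert Brown", 89, "California"),
     ("Ravi Shastri", 62, "Maharashtra")],
)

# Single = for comparison; strings in single quotes, numbers bare.
elderly = con.execute(
    "SELECT customer_name FROM customers "
    "WHERE customer_state = 'California' AND customer_age > 80"
).fetchall()

# NOT IN excludes every state in the list.
others = con.execute(
    "SELECT customer_name FROM customers "
    "WHERE customer_state NOT IN ('California', 'Maharashtra')"
).fetchall()

print(elderly)  # [('Rupert Brown',)]
print(others)   # [('Sarah Brown',)]
```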
If you are thinking “Hey, that is sort of like English” then you are on to something. When SQL was first defined, waaaaay back in the late 1970s, the thought was that “managers” (as in not programmers) would be able to use this language. Today, with GenAI, we may finally be at a point where natural language prompts can write SQL queries (sometimes, maybe, but can you tell if it’s correct?).
Excluding columns
In SQL we specify the columns that we want to see using the SELECT clause, providing a comma-separated list.
SELECT customer_state, customer_name
FROM customers
customer_state | customer_name |
---|---|
California | Jihong Li |
Idaho | Sarah Brown |
California | Rupert Brown |
Maharashtra | Ravi Shastri |
Generally speaking, in SQL we haven't been able to do more advanced sub-setting of columns (such as saying "everything other than customer_name" or "all columns that start with numbers"), although some dialects of SQL do offer terms like EXCEPT or EXCLUDE.
Creating new columns
SQL does allow the creation of new result columns, using the SELECT clause and AS to change names.
products <- tribble(
  ~product_num, ~product_name, ~price_in_dollars,
  "SPR-004456", "Skateboard", 121,
  "SPR-003356", "Basketball", 34,
  "SPR-004499", "Tennis balls", 12,
  "FTW-098756", "Tennis shoes", 79
)
dbWriteTable(con, "products", products, overwrite = TRUE)
SELECT *
FROM products
product_num | product_name | price_in_dollars |
---|---|---|
SPR-004456 | Skateboard | 121 |
SPR-003356 | Basketball | 34 |
SPR-004499 | Tennis balls | 12 |
FTW-098756 | Tennis shoes | 79 |
Here is how we would create a new column for a discounted price; we provide a little piece of code that is applied to every row. Here I’ve also swapped to putting each term in the SELECT on its own line to make things clearer (SQL is not fussy about whitespace).
SELECT product_num,
       product_name,
       price_in_dollars * 0.9 AS discounted_price
FROM products
product_num | product_name | discounted_price |
---|---|---|
SPR-004456 | Skateboard | 108.9 |
SPR-003356 | Basketball | 30.6 |
SPR-004499 | Tennis balls | 10.8 |
FTW-098756 | Tennis shoes | 71.1 |
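The per-row expression with AS works the same way in other engines. A sqlite3 sketch (illustration only; the column and variable names are mine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products "
            "(product_num TEXT, product_name TEXT, price_in_dollars REAL)")
con.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("SPR-004456", "Skateboard", 121),
     ("SPR-003356", "Basketball", 34),
     ("SPR-004499", "Tennis balls", 12),
     ("FTW-098756", "Tennis shoes", 79)],
)

# The expression runs once per row; discounted_price exists only in
# the query results, not in the stored table.
rows = con.execute(
    "SELECT product_name, price_in_dollars * 0.9 AS discounted_price "
    "FROM products"
).fetchall()
for name, discounted in rows:
    print(name, round(discounted, 2))
```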
Above I said that SQL SELECT queries can create new result columns. Here result is doing some heavy lifting.
When we execute an SQL SELECT query we get back results (a ResultSet) but we do not change the database at all. SELECT queries are read only. So while we can create a new column, that happens after the data is pulled from the database table, and the new column only exists in the results of the query.
This makes sense in SQL land because the core usage is not inside an individual analyst’s notebook or computer, but in shared databases provided by centralized servers running on their own hardware. Changing the data would change it for everyone; if you changed a column name in a table, then likely everyone else’s queries would break. The website that the database backed would stop working and you’d, uh, definitely get woken up to fix it.
If we want the results of a query to persist in the database we do have the option of creating new tables, using the result of a SELECT query. We can say that the table should be temporary (not persist past the end of our session), or the table could be permanent.
CREATE OR REPLACE TEMP TABLE discounted_products AS
SELECT product_num,
       product_name,
       price_in_dollars * 0.9 AS discounted_price
FROM products
If we are getting into changing column names or structures in tables in a database that anyone else might use … then we are in the world of migrations which is its own fascinating area, with a set of version control technologies and social/managerial practices.
Combining columns
We can combine columns using SQL SELECT (again, not changing the table, just the results). This uses CONCAT and visits each row. We can use column names, but we can also provide little chunks of text to be used with each line.
SELECT product_num,
       product_name,
       CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!') AS product_label
FROM products
product_num | product_name | product_label |
---|---|---|
SPR-004456 | Skateboard | Skateboard just $121.0.00. Buy today! |
SPR-003356 | Basketball | Basketball just $34.0.00. Buy today! |
SPR-004499 | Tennis balls | Tennis balls just $12.0.00. Buy today! |
FTW-098756 | Tennis shoes | Tennis shoes just $79.0.00. Buy today! |
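Concatenation is one place where dialects differ: the standard SQL operator is ||, and older SQLite builds have no CONCAT function at all. A sketch of the same label built with || in sqlite3 (illustration only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products "
            "(product_num TEXT, product_name TEXT, price_in_dollars INTEGER)")
con.execute("INSERT INTO products VALUES ('SPR-003356', 'Basketball', 34)")

# || is SQLite's (and standard SQL's) concatenation operator; the
# integer price is cast to text automatically.
label = con.execute(
    "SELECT product_name || ' just $' || price_in_dollars || '.00. Buy today!' "
    "FROM products"
).fetchone()[0]
print(label)  # Basketball just $34.00. Buy today!
```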
Treating different cases differently
SQL can do different things for different rows, using CASE in the SELECT clause. For example we can create a different product_label for the more expensive items.
SELECT product_num,
       product_name,
       CASE WHEN price_in_dollars > 100
            THEN CONCAT('Treat yourself. Invest $', price_in_dollars, ' for a ', product_name, ' today!')
            ELSE CONCAT(product_name, ' just $', price_in_dollars, '.00. Buy today!')
       END AS product_label
FROM products
product_num | product_name | product_label |
---|---|---|
SPR-004456 | Skateboard | Treat yourself. Invest $121.0 for a Skateboard today! |
SPR-003356 | Basketball | Basketball just $34.0.00. Buy today! |
SPR-004499 | Tennis balls | Tennis balls just $12.0.00. Buy today! |
FTW-098756 | Tennis shoes | Tennis shoes just $79.0.00. Buy today! |
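CASE itself is very portable. A minimal sqlite3 sketch of the same pattern (the 'splurge'-style labels here are my own shorthand, not the course data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products "
            "(product_num TEXT, product_name TEXT, price_in_dollars INTEGER)")
con.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("SPR-004456", "Skateboard", 121),
     ("SPR-003356", "Basketball", 34)],
)

# CASE picks a different expression for each row, based on the condition.
rows = con.execute("""
    SELECT product_name,
           CASE WHEN price_in_dollars > 100
                THEN 'Treat yourself.'
                ELSE 'Buy today!'
           END AS pitch
    FROM products
""").fetchall()
print(rows)  # [('Skateboard', 'Treat yourself.'), ('Basketball', 'Buy today!')]
```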
Extract/Separate
You can extract content out of cells within a column using string functions in SQL. They do vary a bit by dialect, but STRING_SPLIT is frequently available and is in DuckDB.
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
       STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
       product_name
FROM products
product_code | product_num_no_code | product_name |
---|---|---|
SPR | 004456 | Skateboard |
SPR | 003356 | Basketball |
SPR | 004499 | Tennis balls |
FTW | 098756 | Tennis shoes |
You might notice that you have to repeat the STRING_SPLIT function twice. That's just the way SQL works, although one common workaround is to do the split once in a subquery and then index into the result twice.
There are also functions that enable extraction using regexes. Here we are using what are called "capturing parens". More on regex, including capturing parens, at https://regexone.com/
SELECT REGEXP_EXTRACT(product_num, '^(\w\w\w)') AS product_code,
       REGEXP_EXTRACT(product_num, '(\d+)') AS product_num_no_code,
       product_name
FROM products
product_code | product_num_no_code | product_name |
---|---|---|
SPR | 004456 | Skateboard |
SPR | 003356 | Basketball |
SPR | 004499 | Tennis balls |
FTW | 098756 | Tennis shoes |
Are you getting errors about unknown columns? SQL dialects differ on whether you can use double quotes (") for strings, or whether you have to use single quotes ('). DuckDB requires single quotes; MySQL will allow double quotes.
This is because SQL needs to know whether something is a column name or not. If there are no special characters in column names then you can just write the column name directly (like product_name above). Spaces, particularly, cause problems. If a column name has a space in it, then SQL needs some way of knowing that two words are actually one thing.
DuckDB uses the double quotes for that: so "product name" would make DuckDB think you are providing a column name (which it probably won't find, if you meant a string).
MySQL can allow strings to have double quotes, because MySQL uses backticks to show a column name, e.g., `product name`.
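SQLite follows the same convention: double quotes name a column, single quotes make a string. A tiny sketch (illustration only; note that SQLite, as a compatibility quirk, will quietly fall back to treating a double-quoted token as a string if no matching column exists, which is one more reason to always use single quotes for strings):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_name TEXT, customer_state TEXT)")
con.execute("INSERT INTO customers VALUES ('Sarah Brown', 'Idaho')")

# Double quotes name the customer_name column; single quotes make
# 'Idaho' a string value to compare against.
row = con.execute(
    'SELECT "customer_name" FROM customers '
    "WHERE customer_state = 'Idaho'"
).fetchone()
print(row)  # ('Sarah Brown',)
```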
Nest and unnest
SELECT *
FROM courses
course_name | students_enrolled |
---|---|
Divination | Harry, Ron |
Herbology | Hermione, Harry |
Flying Lessons | Harry, Hermione, Ron |
DuckDB has a function that can create new rows from parts of a string that is split up, that function is called … UNNEST
SELECT course_name,
       UNNEST(STRING_SPLIT(students_enrolled, ',')) AS student_enrolled
FROM courses
course_name | student_enrolled |
---|---|
Divination | Harry |
Divination | Ron |
Herbology | Hermione |
Herbology | Harry |
Flying Lessons | Harry |
Flying Lessons | Hermione |
Flying Lessons | Ron |
In the code above we have a function call inside another function call: STRING_SPLIT is called inside UNNEST.
Programmers learn to read lines like these "inside out". That is, their eyes seek out the inner-most part of the function call, then they work outwards. So they might read this as "split the text in the column into a list, then pass that list to the UNNEST function".
It can be easier to read if you use more lines and indents:
SELECT course_name,
       UNNEST(
         STRING_SPLIT(
           students_enrolled,
           ','
         )
       ) AS student_enrolled
FROM courses
This is also what the pipe operator |> or %>% in R is helping with: instead of reading function calls inside out, we can write them as a pipeline. E.g.,
students_enrolled |>
  string_split(",") |>
  unnest()
which could be read as “take the string students_enrolled, then split it, then unnest the result”
Programming with pipes is sometimes called "chaining" because you are linking together functions like links in a chain … as we will see there is an equivalent in Python called "method chaining"; in fact we'll be learning what is called "chained pandas".
Which sounds sad (poor animals, chained up) but is actually very nifty.
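DuckDB's UNNEST is handy, and dialects without it need workarounds. For instance, SQLite has no UNNEST or STRING_SPLIT, but its json_each() table-valued function can expand a JSON array into rows. A sketch (illustration only; it assumes your Python's bundled SQLite includes the JSON functions, which recent builds do, and it stores the enrolment list as a JSON array rather than a comma-separated string):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE courses (course_name TEXT, students_enrolled TEXT)")
# Store the list as a JSON array so json_each() can unnest it.
con.executemany(
    "INSERT INTO courses VALUES (?, ?)",
    [("Divination", '["Harry", "Ron"]'),
     ("Herbology", '["Hermione", "Harry"]')],
)

# json_each() produces one row per array element; joining it against
# courses gives one (course, student) row per enrolment.
rows = con.execute("""
    SELECT course_name, value AS student_enrolled
    FROM courses, json_each(students_enrolled)
""").fetchall()
print(rows)
```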
Wider / Longer
my_world_bank_pop <- world_bank_pop %>%
  # janitor::clean_names() %>%
  filter(indicator == "SP.URB.TOTL" &
           country %in% c("AUS", "USA", "IND")) %>%
  mutate(country = case_match(country,
                              "AUS" ~ "Australia",
                              "IND" ~ "Indonesia",
                              "USA" ~ "United States")) %>%
  select(country, "2010", "2011", "2012", "2013")
dbWriteTable(con, "populations_wide", my_world_bank_pop, overwrite = TRUE)
dbWriteTable(con, "populations_long", overwrite = TRUE,
             my_world_bank_pop |>
               pivot_longer(cols = !country, names_to = "year", values_to = "population"))
Historically, many dialects of SQL did not have a convenient approach to transformations that take a table wider or longer. Generally those sorts of transformations had to be done using programming languages outside SQL, often Python or Perl, or GUI-based ETL (Extract, Transform, Load) tools.
Today, though, most SQL dialects offer an ability to take data wider or longer. In DuckDB the functions are called “UNPIVOT” (which takes data from wider formats to longer formats) and “PIVOT” which takes data from longer formats to wider formats.
Remember that the word pivot can be tricky. In Excel a "pivot table" is a kind of group_by/aggregate with the added ability to distribute one grouping variable into new columns. We will learn more about that, but for now just keep in mind that different communities use pivot in slightly different, partially overlapping ways.
And almost all communities using the word do not understand that others use it differently …
We can make a wider format longer using UNPIVOT, like this:
UNPIVOT populations_wide
ON "2010", "2011", "2012", "2013"
INTO
    NAME year
    VALUE population;
country | year | population |
---|---|---|
Australia | 2010 | 18767085 |
Australia | 2011 | 19056040 |
Australia | 2012 | 19414834 |
Australia | 2013 | 19775013 |
Indonesia | 2010 | 383721793 |
Indonesia | 2011 | 393333604 |
Indonesia | 2012 | 403171286 |
Indonesia | 2013 | 413200994 |
United States | 2010 | 249849720 |
United States | 2011 | 252208133 |
That approach requires you to manually specify the column names, but there is an approach that allows you to specify the reverse (the columns not to pivot), which we discussed as knowing which are the "identifier" columns. In this example, the one column that we want to treat specially is country.
UNPIVOT populations_wide
ON COLUMNS(* EXCLUDE(country))
INTO
    NAME year
    VALUE population;
country | year | population |
---|---|---|
Australia | 2010 | 18767085 |
Australia | 2011 | 19056040 |
Australia | 2012 | 19414834 |
Australia | 2013 | 19775013 |
Indonesia | 2010 | 383721793 |
Indonesia | 2011 | 393333604 |
Indonesia | 2012 | 403171286 |
Indonesia | 2013 | 413200994 |
United States | 2010 | 249849720 |
United States | 2011 | 252208133 |
The specific syntax for wider/longer operations in different SQL dialects varies a lot; it’s kind of a mess.
Happily, GenAI like ChatGPT and GitHub CoPilot, seem to be very good at consuming documentation and tutorials and providing well targeted snippets that can help us with specific syntax and specific dialects.
But when we are building prompts, we have to know what we are asking for. The language in this course around data transformations, abstracted from any particular framework, language, or dialect, will hopefully help you not only talk to other people about data transforms, but talk to AI as well!
We can transform longer tables to wider tables using PIVOT (which has PIVOT_WIDER as a handy synonym).
PIVOT_WIDER populations_long
ON year
USING first(population);
country | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|
Australia | 18767085 | 19056040 | 19414834 | 19775013 |
United States | 249849720 | 252208133 | 254614421 | 256953576 |
Indonesia | 383721793 | 393333604 | 403171286 | 413200994 |
It is possible to achieve a similar outcome using CASE statements; in SQL dialects that don't have a PIVOT you will see people do it this way.
SELECT country,
       SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
       SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
       SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
       SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
FROM populations_long
GROUP BY country
country | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|
Australia | 18767085 | 19056040 | 19414834 | 19775013 |
United States | 249849720 | 252208133 | 254614421 | 256953576 |
Indonesia | 383721793 | 393333604 | 403171286 | 413200994 |
The CASE query works row by row through this table. To understand how it works, let's focus just on the Australia rows, use only two years, and take the SUM off :)
First, let's look at just two rows of input:
SELECT *
FROM populations_long
WHERE country = 'Australia'
LIMIT 2
country | year | population |
---|---|---|
Australia | 2010 | 18767085 |
Australia | 2011 | 19056040 |
The CASE query works row by row. To see what's happening we will take off the SUM part.
SELECT country,
       CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
       CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011"
FROM populations_long
WHERE country = 'Australia'
LIMIT 2
country | 2010 | 2011 |
---|---|---|
Australia | 18767085 | 0 |
Australia | 0 | 19056040 |
The CASE statement looks at the first row (Australia, 2010, 18767085). Each part of the SELECT executes with those data as input. So the SELECT country produces "Australia". The first CASE asks "Does year equal 2010?" and the answer is yes, so it outputs 18767085 (the value of population for that row) and puts it in a column called "2010". The second CASE then executes, but still with the same input data. Since year (2010) is not equal to 2011, this outputs 0 for a column called "2011".
Now let's do that for all the years, but still just Australia:
SELECT country,
       CASE WHEN year = '2010' THEN population ELSE 0 END AS "2010",
       CASE WHEN year = '2011' THEN population ELSE 0 END AS "2011",
       CASE WHEN year = '2012' THEN population ELSE 0 END AS "2012",
       CASE WHEN year = '2013' THEN population ELSE 0 END AS "2013"
FROM populations_long
WHERE country = 'Australia'
country | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|
Australia | 18767085 | 0 | 0 | 0 |
Australia | 0 | 19056040 | 0 | 0 |
Australia | 0 | 0 | 19414834 | 0 |
Australia | 0 | 0 | 0 | 19775013 |
Each row in the original table has turned into three zeros and one population number (for the year that was in that row).
Hopefully now we can see what we need to do is "squeeze" up the rows, so that the zeros disappear and each year column just has the large number in it. Thankfully we have a way to do this, using GROUP BY and aggregate. In this case SUM will work because if we add the three 0s to the population number we end up with the population number.
SELECT country,
       SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
       SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
       SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
       SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
FROM populations_long
GROUP BY country
country | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|
Indonesia | 383721793 | 393333604 | 403171286 | 413200994 |
Australia | 18767085 | 19056040 | 19414834 | 19775013 |
United States | 249849720 | 252208133 | 254614421 | 256953576 |
This works, but it is … non-intuitive and manual. So huzzah for modern SQL dialects having a native PIVOT feature!
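SQLite is one of those dialects without a PIVOT, so the CASE trick above is exactly what you would reach for there. A sketch with Python's built-in sqlite3, using just the Australia rows (illustration only, not part of the course setup):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE populations_long "
            "(country TEXT, year TEXT, population INTEGER)")
con.executemany(
    "INSERT INTO populations_long VALUES (?, ?, ?)",
    [("Australia", "2010", 18767085),
     ("Australia", "2011", 19056040),
     ("Australia", "2012", 19414834),
     ("Australia", "2013", 19775013)],
)

# Each CASE scatters one year's population into its own column;
# SUM + GROUP BY then squeezes the zeros away.
wide = con.execute("""
    SELECT country,
           SUM(CASE WHEN year = '2010' THEN population ELSE 0 END) AS "2010",
           SUM(CASE WHEN year = '2011' THEN population ELSE 0 END) AS "2011",
           SUM(CASE WHEN year = '2012' THEN population ELSE 0 END) AS "2012",
           SUM(CASE WHEN year = '2013' THEN population ELSE 0 END) AS "2013"
    FROM populations_long
    GROUP BY country
""").fetchall()
print(wide)  # [('Australia', 18767085, 19056040, 19414834, 19775013)]
```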
Exercises
- Load this data frame into a DuckDB table.
data(starwars)
- What characters are from the planet "Tatooine"?
- Which characters have grey hair? (Note: you will need to use STRING_SPLIT.)