SQL Transforms Extras
Importing CSV files directly into DuckDB
First we set up our connection to DuckDB in the notebook, using code like this:
library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)
# create / connect to in memory database (not stored in a file)
drv <- duckdb()
con <- dbConnect(drv)
Then, up until now, we’ve been using R to read data files into memory, then using the dbWriteTable
function to write the data frame from R into a table in DuckDB.
We can, however, read data files directly into DuckDB, without loading them using R. Here we load a file about Olympic medalists, from a broader project described at https://edjnet.github.io/OlympicsGoNUTS/
CREATE TABLE medalists AS
SELECT * FROM '../datasets/2024_medalists_all.csv';
DuckDB has an impressive CSV sniffer that tries to figure out the correct datatypes for columns and other aspects of the CSV files, but you can obtain more control as well. Full details are here: https://duckdb.org/docs/data/csv/overview.html
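If the sniffer guesses wrong, you can call DuckDB's read_csv function with explicit options. Here is a minimal sketch; the option values are assumptions about this particular file rather than something the sniffer reported:

CREATE OR REPLACE TABLE medalists AS
SELECT *
FROM read_csv('../datasets/2024_medalists_all.csv',
              header = true,       -- first row contains column names
              delim = ',',         -- field separator
              sample_size = -1);   -- sniff column types using the whole file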
While much of SQL querying is standardized, loading data into SQL databases is not, so recognize that these commands are DuckDB specific.
In other databases, such as Postgres or MySQL, there are slightly varying approaches, usually using the COPY
command. See for example https://www.postgresqltutorial.com/postgresql-tutorial/import-csv-file-into-posgresql-table/
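For instance, a rough sketch of the Postgres equivalent might look like the following (the file path is hypothetical, and the medalists table would need to be created with matching columns first):

COPY medalists
FROM '/path/to/2024_medalists_all.csv'
WITH (FORMAT csv, HEADER true);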
DuckDB even allows us to query directly from a CSV file, without actually loading this into a permanent table. DuckDB parses the CSV file and loads it into a temporary table, just for this one query.
For example, we can find all the 2024 medalists from South Korea directly using:
SELECT medalist_name, medal, event_part_of_sport
FROM '../datasets/2024_medalists_all.csv'
WHERE country_medal = 'South Korea'
| medalist_name | medal | event_part_of_sport |
|---|---|---|
| Kim Woo-jin | gold | archery |
| Lee Woo-seok | bronze | archery |
| Kim Je-deok | gold | archery |
| Kim Woo-jin | gold | archery |
| Lee Woo-seok | gold | archery |
| Lim Si-hyeon | gold | archery |
| Nam Su-hyeon | silver | archery |
| Jeon Hun-young | gold | archery |
| Lim Si-hyeon | gold | archery |
| Nam Su-hyeon | gold | archery |
Arranging SQL transforms into a pipeline
Unlike R tidyverse and Python Pandas, SQL does not lend itself cleanly to creating transformation pipelines. Historically, SQL databases tended to just have the final schema in place, and other tools were used to transform the data on the way into the database. This is known as “Extract-Transform-Load” or ETL https://en.wikipedia.org/wiki/Extract,_transform,_load
In recent years, though, efforts to undertake transformations using SQL have accelerated, drawing on the very large data capabilities that are part of modern SQL databases.
There are four options, all relying on the concept of creating tables from the result of a SELECT
query. Those tables can then be used in subsequent queries.
CREATE TABLE AS SELECT
The first we have already seen:
CREATE OR REPLACE TABLE products_split AS
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
       STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
       product_name
FROM products
We can then go on to query from the new products_split table.
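For instance, a quick sketch of such a follow-up query (the 'FTW' product code is the one that appears in the CTE example further below):

SELECT product_code, product_num_no_code, product_name
FROM products_split
WHERE product_code = 'FTW'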
Sub-queries
We can also specify sub-queries, which are sort of unnamed temporary tables. The syntax sticks a whole query right in the middle of other queries, so it is generally only used for short elements, like this:
SELECT title, weight_in_grams AS lightest_book
FROM books_with_weights
WHERE weight_in_grams = (
SELECT MIN(weight_in_grams) FROM books_with_weights
)
Generally, any complicated query should not be written as a sub-query; instead, we turn to the CTE.
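As a preview of the next section, here is a sketch of the same lightest-book lookup written with a CTE instead of a sub-query (min_weight_cte is a name made up for this example):

WITH min_weight_cte AS (
    SELECT MIN(weight_in_grams) AS min_weight
    FROM books_with_weights
)
SELECT title, weight_in_grams
FROM books_with_weights, min_weight_cte
WHERE weight_in_grams = min_weight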
CTEs (aka WITH)
Another approach does the transformations at query time; you can think of this as a preamble to an SQL query which creates temporary tables.
WITH products_split_cte AS (
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
       STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
       product_name
FROM products
)
SELECT *
FROM products_split_cte
WHERE product_code = 'FTW'
| product_code | product_num_no_code | product_name |
|---|---|---|
| FTW | 098756 | Tennis shoes |
You can even create multiple transformations via WITH, each referencing the last. For example, here we first split out the product codes, creating a temporary products_split_cte; then we use that table to count the number of items with each code, creating a temporary product_counts_cte. Then we query product_counts_cte as though it were a table in the database.
WITH
products_split_cte AS (
    SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
           STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
           product_name
    FROM products
),
product_counts_cte AS (
    SELECT product_code, COUNT(*) AS product_count
    FROM products_split_cte
    GROUP BY product_code
)
SELECT *
FROM product_counts_cte
WHERE product_code = 'SPR'
| product_code | product_count |
|---|---|
| SPR | 3 |
You can use multiple CTEs in the WITH block to build out a complex pipeline.
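For instance, a hypothetical third step (popular_codes_cte is a made-up name) could filter the counts produced by the previous CTE:

WITH
products_split_cte AS (
    SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
           STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
           product_name
    FROM products
),
product_counts_cte AS (
    SELECT product_code, COUNT(*) AS product_count
    FROM products_split_cte
    GROUP BY product_code
),
popular_codes_cte AS (
    SELECT product_code, product_count
    FROM product_counts_cte
    WHERE product_count >= 2
)
SELECT *
FROM popular_codes_cte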
What is a CTE anyway?
When we use the WITH block to create temporary tables, those are called CTEs. If we look it up we'll find that CTE stands for Common Table Expression … but that doesn't really explain anything. Perhaps the real reason they are called this is lost in time, but the consensus seems to be that these temporary table definitions are "common" to the rest of the query (in the sense of being available to other parts of the query). See https://stackoverflow.com/questions/43564050/why-is-common-table-expression-called-common
The key thing here is to recognize that when people say, "Use a CTE", they mean "define a temporary table using the WITH block at the top of your query."
I find it useful to signal that I’m referencing a temporary table by adding the suffix _cte
to the end of the tables I create this way.
Directed Acyclic Graphs (DAGs) like DBT
SQL transforms are used in large scale data work, usually called Data Engineering. In these situations there is a team that is transforming data within a shared store (sometimes called a Data Warehouse or Data Lake). The Data Engineers then create Data Products for analysts to reference directly. You can think of the Data Products as the end points of a transformation pipeline.
At scale, some of the transformations can be slow and even costly (if additional servers have to be spun up to execute large queries). A WITH block is local and convenient, but you will want to hold on to the results of a large transformation so that others can query it. Directly creating tables using CREATE TABLE AS SELECT can store those transformations, but this quickly leads to out-of-date transforms, version issues, or just lots of scattered tables with unclear origins.
Into this space step DAGs. DAG stands for Directed Acyclic Graph, and it is a way of describing data pipelines by specifying dependencies between tables. A DAG runs outside the SQL database and is able to optimize the steps needed to update tables.
Examples of tools built around DAGs include DBT (Data Build Tool), Apache Airflow, and Dagster. They are sometimes called "Data Orchestration" tools.
While these tools are outside the scope of this course, we can get a quick feel for DBT in this example. DBT works by having a set of separate text files. Inside each file is a SELECT
query. DBT runs that query and creates a table with the same name as the file. E.g.,
filename: products_split.sql
contents:
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
product_name
FROM products
When run, DBT would use the SELECT
query results and create a table called products_split
.
DBT becomes more powerful than CTEs or manually created tables because it can have references from one file to another, signalling that there is a dependency between them.
For example,
filename: product_counts.sql
contents:
SELECT product_code, COUNT(*) AS product_count
FROM {{ ref('products_split') }}
GROUP BY product_code
The DBT special code here is {{ ref('products_split') }}
. DBT is able to look at this and know that it has to figure out how to create products_split
, which it does by finding the file with that name and executing the code inside it, creating the table needed before product_counts.sql can be executed.
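To get a feel for a longer chain, a hypothetical third model file could reference product_counts in the same way, and DBT would work out that it has to build products_split, then product_counts, before this file:

filename: popular_codes.sql
contents:
SELECT product_code, product_count
FROM {{ ref('product_counts') }}
WHERE product_count >= 2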
In this way DBT and other DAG tools allow the creation of complex pipelines … but the ordering is specified through dependencies, rather than procedurally through the order of code as with R tidyverse or Python Pandas.
You can learn more about DBT on DataCamp at https://www.datacamp.com/courses/introduction-to-dbt