SQL Transforms Extras

Importing CSV files directly into DuckDB

First we set up our connection to DuckDB in the notebook, using code like this:

library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)

# create / connect to in memory database (not stored in a file)
drv <- duckdb()
con <- dbConnect(drv)

Up until now, we’ve been using R to read data files into memory, then using the dbWriteTable() function to write the data frame from R into a table in DuckDB.

We can, however, read data files directly into DuckDB, without loading them through R first. Here we load a file about Olympic medalists, from a broader project described at https://edjnet.github.io/OlympicsGoNUTS/

CREATE TABLE medalists AS
    SELECT * FROM '../datasets/2024_medalists_all.csv';

DuckDB has an impressive CSV sniffer that tries to figure out the correct datatypes for columns and other aspects of the CSV files, but you can obtain more control as well. Full details are here: https://duckdb.org/docs/data/csv/overview.html
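For instance, here is a minimal sketch using DuckDB's read_csv function; the header, delimiter, and type overrides shown are illustrative choices rather than anything this particular file requires:

-- read the CSV with explicit options instead of relying on the sniffer
CREATE OR REPLACE TABLE medalists AS
    SELECT *
    FROM read_csv('../datasets/2024_medalists_all.csv',
                  header = true,
                  delim = ',',
                  types = {'medalist_name': 'VARCHAR', 'medal': 'VARCHAR'});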

Caution

While much of SQL querying is standardized, loading data into SQL databases is not, so recognize that these commands are DuckDB specific.

Other databases take slightly different approaches: PostgreSQL uses the COPY command, while MySQL uses LOAD DATA INFILE. See for example https://www.postgresqltutorial.com/postgresql-tutorial/import-csv-file-into-posgresql-table/
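As a rough sketch of the PostgreSQL equivalent (the column list here is abbreviated to the fields used in this section, and the file path is a placeholder):

-- PostgreSQL: the target table must already exist before COPY can fill it
CREATE TABLE medalists (
    medalist_name TEXT,
    medal TEXT,
    event_part_of_sport TEXT,
    country_medal TEXT
);

COPY medalists
FROM '/path/to/2024_medalists_all.csv'
WITH (FORMAT csv, HEADER true);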

DuckDB even allows us to query a CSV file directly, without ever loading it into a permanent table. DuckDB parses the CSV file and treats it as a temporary table, just for that one query.

For example, we can find all the 2024 Medalists from South Korea directly using:

SELECT medalist_name, medal, event_part_of_sport
FROM '../datasets/2024_medalists_all.csv'
WHERE country_medal = 'South Korea'
Displaying records 1 - 10
medalist_name    medal    event_part_of_sport
Kim Woo-jin      gold     archery
Lee Woo-seok     bronze   archery
Kim Je-deok      gold     archery
Kim Woo-jin      gold     archery
Lee Woo-seok     gold     archery
Lim Si-hyeon     gold     archery
Nam Su-hyeon     silver   archery
Jeon Hun-young   gold     archery
Lim Si-hyeon     gold     archery
Nam Su-hyeon     gold     archery

Arranging SQL transforms into a pipeline

Unlike R tidyverse and Python Pandas, SQL does not lend itself cleanly to creating transformation pipelines. Historically, SQL databases tended to just have the final schema in place, and other tools were used to transform the data on the way into the database. This is known as “Extract-Transform-Load” or ETL https://en.wikipedia.org/wiki/Extract,_transform,_load

In recent years, though, efforts to undertake transformations using SQL have accelerated, drawing on the very large data capabilities that are part of modern SQL databases.

There are four options, all relying on the concept of creating tables from the result of a SELECT query. Those tables can then be used in subsequent queries.

CREATE TABLE AS SELECT

The first we have already seen:

CREATE OR REPLACE TABLE products_split AS
  SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
         STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
         product_name
  FROM products

We can then go on to query the new products_split table like any other.
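For example, a short query against products_split (using the same 'FTW' product code that appears in the CTE example below):

SELECT product_code, product_num_no_code, product_name
FROM products_split
WHERE product_code = 'FTW'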

Sub-queries

We can also specify sub-queries, which act as a sort of unnamed temporary table. The syntax embeds a whole query inside another query, so it is generally only used for short elements, like this:

SELECT title, weight_in_grams AS lightest_book
FROM books_with_weights
WHERE weight_in_grams = (
  SELECT MIN(weight_in_grams) FROM books_with_weights
)
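A sub-query can also stand in for a whole table in the FROM clause, which is where the “unnamed temporary table” idea is easiest to see. Here is a sketch reusing the products table from above; the split alias is just for illustration:

SELECT product_code, COUNT(*) AS product_count
FROM (
    SELECT STRING_SPLIT(product_num, '-')[1] AS product_code
    FROM products
) AS split
GROUP BY product_code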

Generally, though, any complicated query should not be written as a sub-query; instead we turn to the CTE.

CTEs (aka WITH)

A third approach does the transformations at query time; you can think of this as a preamble to an SQL query which creates temporary tables.

WITH products_split_cte AS (
      SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
         STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
         product_name
      FROM products
  )
SELECT *
FROM products_split_cte
WHERE product_code = 'FTW'
1 records
product_code  product_num_no_code  product_name
FTW           098756               Tennis shoes

You can even create multiple transformations via WITH, each referencing the last. For example, here we first split out the product codes, creating a temporary products_split_cte; then we use that table to count the number of items with each code, creating a temporary product_counts_cte. Finally we query product_counts_cte as though it were a table in the database.

WITH 
  products_split_cte AS (
      SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
         STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
         product_name
      FROM products
  ),
  product_counts_cte AS (
      SELECT product_code, COUNT(*) AS product_count
      FROM products_split_cte
      GROUP BY product_code
  )
  
SELECT *
FROM product_counts_cte
WHERE product_code = 'SPR'
1 records
product_code  product_count
SPR           3

You can use multiple CTEs in the WITH block to build out a complex pipeline.

What is a CTE anyway?

When we use the WITH block to create temporary tables, those are called CTEs. If we look it up we’ll find that CTE stands for Common Table Expression … but that doesn’t really explain anything. Perhaps the real reason they are called this is lost in time, but the consensus seems to be that these temporary table definitions are “common” to the rest of the query (in the sense of available to other parts of the query). See https://stackoverflow.com/questions/43564050/why-is-common-table-expression-called-common

The key thing here is to recognize that when people say “use a CTE”, they mean “define a temporary table using the WITH block at the top of your query”.

I find it useful to signal that I’m referencing a temporary table by adding the suffix _cte to the end of the tables I create this way.

Directed Acyclic Graphs (DAGs) like DBT

SQL transforms are used in large-scale data work, usually called Data Engineering. In these situations a team transforms data within a shared store (sometimes called a Data Warehouse or Data Lake). The Data Engineers then create Data Products for analysts to reference directly. You can think of the Data Products as the end points of a transformation pipeline.

At scale, some of the transformations can be slow and even costly (if additional servers have to be spun up to execute large queries). A WITH block is local and convenient, but you definitely want to hold on to the results of a large transformation so that others can query it. Directly creating tables using CREATE TABLE AS SELECT can store transformations, but this quickly leads to out-of-date transforms, version issues, or just lots of scattered tables with unclear origins.

Into this space step DAGs. DAG stands for Directed Acyclic Graph, and it is a way of describing data pipelines by specifying dependencies between tables. A DAG tool runs outside the SQL database and is able to optimize the steps needed to update the tables.

Examples of DAG-based tools include DBT (Data Build Tool), Apache Airflow, and Dagster. These are sometimes called “Data Orchestration” tools.

While these tools are outside the scope of this course, we can get a quick feel for DBT in this example. DBT works by having a set of separate text files. Inside each file is a SELECT query. DBT runs that query and creates a table with the same name as the file. E.g.,

filename: products_split.sql
contents:
SELECT STRING_SPLIT(product_num, '-')[1] AS product_code,
       STRING_SPLIT(product_num, '-')[2] AS product_num_no_code,
       product_name
FROM products

When run, DBT would use the SELECT query results and create a table called products_split.

DBT becomes more powerful than CTEs or manually created tables because one file can reference another, signalling that there is a dependency between them.

For example,

filename: product_counts.sql
contents:
SELECT product_code, COUNT(*) AS product_count
FROM {{ ref('products_split') }}
GROUP BY product_code

The DBT special code here is {{ ref('products_split') }}. DBT sees this and knows it must first create products_split, which it does by finding the file with that name and executing the query inside it, building the table that product_counts.sql needs before it can be executed.
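Roughly speaking, when DBT runs it swaps the ref() for the name of the table it has built (the exact database, schema, and quoting depend on the project configuration), so the compiled query ends up looking something like this:

-- compiled query (approximately): the ref() has been replaced by the built table
SELECT product_code, COUNT(*) AS product_count
FROM products_split
GROUP BY product_code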

In this way DBT and other DAGs allow the creation of complex pipelines … but the ordering is specified through dependencies, rather than procedurally through the order of code as with R tidyverse or Python Pandas.

You can learn more about DBT on DataCamp at https://www.datacamp.com/courses/introduction-to-dbt