SQL Aggregate and Grouping

library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)

# Using a temporary in memory database without a file.
drv <- duckdb()
# drv <- duckdb(dbdir = "my_database.duckdb")
con <- dbConnect(drv)

SQL Aggregate and Group

First we load our toy data on books.

books_with_weights <- tribble(
  ~title, ~series, ~format, ~weight_in_grams,
  "Harry Potter 1", "Harry Potter", "paperback", 1456,
  "Two Towers", "Lord of the Rings","paperback", 1800,
  "Harry Potter 3", "Harry Potter","hardcover", 2604,
  "Harry Potter 2", "Harry Potter","hardcover", 2900,
  "Return of the King", "Lord of the Rings", "hardcover", 1784
)

# Write the dataset to DuckDB
dbWriteTable(con, "books_with_weights", books_with_weights, overwrite = TRUE)

Recall that aggregation takes a set of rows and reduces them to a single value. In SQL this happens in the SELECT clause. Remember that the SELECT clause happens after the WHERE clause … so we are always aggregating the rows that are left.

We can obtain the total weight of each group of books using the SUM function in the SELECT clause and an AS alias to rename the columns. Here I show the build up of the SQL query using a semi-colon between each query, all the queries execute, but only the output from the last query shows in Rstudio.

SELECT *
FROM books_with_weights
;
SELECT weight_in_grams
FROM books_with_weights
5 records
weight_in_grams
1456
1800
2604
2900
1784

The aggregate function works to reduce this column to a single value.

SELECT SUM(weight_in_grams) AS total_weight
FROM books_with_weights
1 records
total_weight
10544

If we have a WHERE clause this changes the rows that are involved (because SELECT executes after WHERE)

SELECT *
FROM books_with_weights
;
SELECT *
FROM books_with_weights
WHERE series = 'Harry Potter'
3 records
title series format weight_in_grams
Harry Potter 1 Harry Potter paperback 1456
Harry Potter 3 Harry Potter hardcover 2604
Harry Potter 2 Harry Potter hardcover 2900
SELECT SUM(weight_in_grams) AS total_weight
FROM books_with_weights
WHERE series = 'Harry Potter'
1 records
total_weight
6960

Three forms of COUNT

In SQL counting is done with COUNT. The grain of the books_with_weights data frame here is “one row represents a book”, so counting the rows produces a count of books. In SQL we count rows using COUNT(*).

SELECT SUM(weight_in_grams) AS total_weight,
       COUNT(*) AS book_count
FROM books_with_weights
1 records
total_weight book_count
10544 5

SQL has three forms of COUNT, but one is forbidden

  1. COUNT(*) which counts the number of rows.
  2. COUNT(DISTINCT column) which counts the number of distinct values in a column.
  3. COUNT(column) counts the number of values that are not NULL.

COUNT(column) is the forbidden one. I will always grade that as wrong. COUNT(column) is forbidden because it is subject to subtle and confusing errors (and communicates poor understanding of what is being done).

Consider the differences between the three forms of COUNT on this list:

id, name
--------
1, Domhog
2, NULL
3, Domhog
4, Shuyen
  • COUNT(*) would give the answer 4 (there are 4 rows).
  • COUNT(DISTINCT name) would give the answer 2 (just ‘Domhog’ and ‘Shuyen’)
  • COUNT(name) would give the answer 3 (there are 3 non-null values in the name column.)

I advise people to avoid COUNT(column) because the semantics are confusing in a query. I mean that what people are intending to accomplish is unclear. More, though, I haven’t come across a time when it is necessary.

Whenever you are tempted to use COUNT(column), ask yourself whether COUNT(*) is clearer and is actually what you mean. Most counting queries are questions about the number of rows (after some conditions/joins), not the number of non-NULL values.

ps. I’m very interested to find an example where COUNT(column) is the ideal thing, so please email me if you see one. Only thing I can think of is something like counting records with relationships. e.g. a table like people(id, name, sig_other_id) where those not in a relationship have NULL for their sig_other_id.

--How many people have sig others?
SELECT COUNT(sig_other_id) AS num_in_relationships
FROM people

But I think even that would be much clearer as:

SELECT COUNT(*) as num_in_relationships
FROM people
WHERE people.sig_other_id IS NOT NULL

So you can always re-write COUNT(column) as COUNT(*) with a WHERE ... IS NOT NULL clause. Much clearer.

If we wanted to know how many different manufacturers have cars in the gtcars dataset, in R tidyverse we used n_distinct(manufacter) but in SQL we would use COUNT(DISTINCT manufacturer)

data(gtcars)

gtcars_renamed <- gtcars %>% 
  select(manufacturer = mfr, everything())

# Write the dataset to DuckDB
dbWriteTable(con, "gtcars", gtcars_renamed, overwrite = TRUE)
SELECT COUNT(DISTINCT manufacturer) AS num_manufacters
FROM gtcars
1 records
num_manufacters
19

A warning on MIN and MAX

A note about MIN and MAX vs ORDER BY LIMIT 1

SQL offers aggregate functions called MIN() and MAX(). These seem straight forward, the find the minimum or maximum value in a column.

SELECT MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
1 records
lightest_book
1456

Which will, correctly, return 1456, the lowest value in the column.

However questions about the ‘lowest’, ‘highest’, ‘earliest’ usually also want another part of the row with that value. e.g., ‘What’s the title of the lightest book?’ or ‘Who bought the earliest ticket?’, ‘which ticket number had the lowest price?’

These queries are better answered using ORDER BY and LIMIT 1. Using MIN and MAX the intuitive way can silently give a totally incorrect answer. DuckDB will refuse to run this query, but MySQL will run it, it will look believeable but will have actually chosen a random value for title.

SELECT title,
       MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
Error: rapi_prepare: Failed to prepare query SELECT title,
       MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
Error: Binder Error: column "title" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(title)" if the exact value of "title" is not important.
LINE 1: SELECT title,
               ^

This bears repeating: If you use MIN or MAX in the SELECT clause with any other column you will not get the value in the row that had the lowest value, you will get a random value from the other column. I know that this is hard to believe. When an aggregate function is used in SELECT then we must be collapsing a group of rows … if we put a “bare” column name in the SELECT together with an aggregate function, then the system needs a way to pick a value. If the term is in the GROUP BY then it is clear, since all the values must be the same, but if the column is not then it will pick one at random …. or DuckDB asks us to force that behavior by using the ANY_VALUE(column) function, or throws an error and refuses to run.

While you should use ORDER BY and LIMIT 1, you can use MIN or MAX but you have to use it via IN (or a sub-query). Here think of the question as two steps: ‘What is the minimum price? And what the ticketnumbers and prices for those tickets which had that price?’

-- First get the minimum value
SELECT SELECT MIN(weight_in_grams) FROM books_with_weights
;
-- That returns a single value which you then can use.
-- You get the server to 'remember' it by surrounding the query with parens.
SELECT title, weight_in_grams AS lightest_book
FROM books_with_weights
WHERE weight_in_grams = (
  SELECT MIN(weight_in_grams) FROM books_with_weights
)
Error: {"exception_type":"Parser","exception_message":"syntax error at or near \"SELECT\"","position":"38","error_subtype":"SYNTAX_ERROR"}

But hey if you use ORDER BY and LIMIT you can easily get the top five or bottom whatever (as long as ties don’t matter), so just don’t use MIN or MAX in the SELECT clause, unless it’s a sub-query.

SELECT title, weight_in_grams AS lightest_book
FROM books_with_weights
ORDER BY weight_in_grams ASC
LIMIT 5

I would only use MIN or MAX when ties really matter, and only in a sub-query. Even then there is a more direct approach to ranking, see Window Aggregates below.

GROUP BY

Recall that grouping is akin to sorting a dataframe, then drawing lines across it where the value in the grouping column changes.

title series format weight_in_grams
Harry Potter 1 Harry Potter paperback 1456
Harry Potter 3 Harry Potter hardcover 2604
Harry Potter 2 Harry Potter hardcover 2900
Two Towers Lord of the Rings paperback 1800
Return of the King Lord of the Rings hardcover 1784

In SQL we will build up grouping queries using ORDER BY to see the rows before they are collapsed. To find the number of books in each series we will do:

SELECT *
FROM books_with_weights
;
SELECT *
FROM books_with_weights
ORDER BY series
;
5 records
title series format weight_in_grams
Harry Potter 1 Harry Potter paperback 1456
Harry Potter 3 Harry Potter hardcover 2604
Harry Potter 2 Harry Potter hardcover 2900
Two Towers Lord of the Rings paperback 1800
Return of the King Lord of the Rings hardcover 1784

We will then change the ORDER BY to GROUP BY then add the columns in the GROUP BY to the start of the select and finally add the aggregate function.

SELECT series,
FROM books_with_weights
GROUP BY series
;
SELECT series, SUM(weight_in_grams) AS total_weight
FROM books_with_weights
GROUP BY series
2 records
series total_weight
Lord of the Rings 3584
Harry Potter 6960

As before we can have multiple aggregate functions collapsing the same groups.

SELECT series,
FROM books_with_weights
GROUP BY series
;
SELECT series, 
       SUM(weight_in_grams) AS total_weight,
       COUNT(*) AS book_count
FROM books_with_weights
GROUP BY series
2 records
series total_weight book_count
Harry Potter 6960 3
Lord of the Rings 3584 2

COUNT(DISTINCT column) in group

We can also look within a column and count the number of distinct things, as a way of collapsing the group. Switching to the cars data, we can count the number of different manufacturers for each year.

SELECT *
FROM gtcars
;
SELECT *
FROM gtcars
ORDER BY year
;
SELECT year,
FROM gtcars
GROUP BY year
;
SELECT year, COUNT(DISTINCT manufacturer)
FROM gtcars
GROUP BY year
4 records
year count(DISTINCT manufacturer)
2015 3
2014 2
2016 13
2017 8

Post Group and aggregate (ORDER BY and HAVING)

GROUP BY happens after FROM and WHERE (but before SELECT). Since WHERE happens early in the query, we can’t use it to exclude results from these results. And that is commonly a need, for queries that ask “Does any group have more than 1?”

We could pipeline these results into a temporary table, and then use WHERE on that, but SQL has a short cut here: the HAVING clause, which is a second WHERE that excutes against the results of a query with a GROUP BY. For example we could find years in which more than 5 manufactuers produced models:

SELECT year, COUNT(DISTINCT manufacturer) AS model_count
FROM gtcars
GROUP BY year
HAVING COUNT(DISTINCT manufacturer) > 5
2 records
year model_count
2016 13
2017 8

We can use ORDER BY to change the order that the aggregated results are displayed (which is to say that ORDER BY happens after GROUP BY and even after HAVING).

SELECT year, COUNT(DISTINCT manufacturer) AS model_count
FROM gtcars
GROUP BY year
HAVING COUNT(DISTINCT manufacturer) > 5
ORDER BY year
2 records
year model_count
2016 13
2017 8

GROUP BY ALL

Above I made a big deal out of the step from ORDER BY to GROUP BY having two parts: change to GROUP BY and copy all the fields to the start of the SELECT. This is important, but it does mean that all the columns are repeated twice, which when queries are edited can lead to annoying mistakes.

DuckDB offers an improvement to SQL to address this, GROUP BY ALL which means that we can put our grouping columns in a single place (the SELECT) and then GROUP BY ALL uses all of the non-aggregate columns in the SELECT.

DuckDB documentation uses this example (based on starwars).

SELECT
    systems,
    planets,
    cities,
    cantinas,
    sum(scum + villainy) AS total_scum_and_villainy
FROM star_wars_locations
GROUP BY ALL;
-- GROUP BY systems, planets, cities, cantinas

Check out the duckdb documentation for other SQL improvements: https://duckdb.org/2022/05/04/friendlier-sql.html. But be mindful that these techniques may not be available in other dialects of SQL.

Window Aggregates

In R window aggregates (which return more than one row per group) are a fairly straightforward extention of the syntax used for basic group bys. For example, we can use arrange to sort within groups, then use the slice_head function to get just a few rows per group.

gtcars %>% 
  group_by(year) %>% 
  arrange(desc(mpg_h), .by_group = TRUE) %>% 
  slice_head(n = 3)

In SQL, though, the concepts are the same but the syntax varies quite a bit from simple GROUP BY.

We have to use two steps. The first gets our rankings within a group (for each and every row), the second uses those rankings to exclude rows below a certain rank.

Getting the rank within a group uses the new syntax of `OVER (PARTITION BY … ORDER BY)

SELECT *
FROM gtcars
;
SELECT model,
       year,
       mpg_h
FROM gtcars 
ORDER BY year, mpg_h DESC
;
SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars 
Displaying records 1 - 10
model year mpg_h rank_in_year
Vantage 2016 19 1
Dawn 2016 19 2
Granturismo 2016 21 3
Rapide S 2016 21 4
Vanquish 2016 21 5
Wraith 2016 21 6
488 GTB 2016 22 7
GT-R 2016 22 8
M5 2016 22 9
M6 2016 22 10

The second step should be straightforward, using WHERE rank_in_year <=3. But as with GROUP BY the order that queries execute means that the WHERE happens before the PARTITION. For reasons that are not clear to me, HAVING cannot be used here either (only with GROUP BY). So we have to use one of the temporary table approaches (sub-query, CTE, or CREATE OR REPLACE TABLE AS SELECT, see Arranging SQL transforms into a pipeline) so that we can immediately query our results.

Here I use a temp table:

SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars 
;
CREATE OR REPLACE TABLE tmp_mpg_ranks AS 
  SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
  FROM gtcars 
;
SELECT *
FROM tmp_mpg_ranks
WHERE rank_in_year <= 3
ORDER BY year
Displaying records 1 - 10
model year mpg_h rank_in_year
458 Italia 2014 17 1
Gallardo 2014 20 2
FF 2015 16 1
F12Berlinetta 2015 16 2
LaFerrari 2015 16 3
Vantage 2016 19 1
Dawn 2016 19 2
Granturismo 2016 21 3
GTC4Lusso 2017 17 1
GT 2017 18 2

Many will prefer to avoid creating a table for what is really just part of a query, so they will use a sub-query, where we copy the working ranking query into where the table name would normally go, like this:

SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars 
;
SELECT *
FROM (
  SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
  FROM gtcars 
)
WHERE rank_in_year <= 3
ORDER BY year
Displaying records 1 - 10
model year mpg_h rank_in_year
458 Italia 2014 17 1
Gallardo 2014 20 2
FF 2015 16 1
F12Berlinetta 2015 16 2
LaFerrari 2015 16 3
Vantage 2016 19 1
Dawn 2016 19 2
Granturismo 2016 21 3
GTC4Lusso 2017 17 1
GT 2017 18 2

As with r tidyverse, you can also use more sophisticated ranking functions instead of row_number, such as RANK or DENSE_RANK see https://duckdb.org/docs/sql/functions/window_functions.html

Exercises

Do the five exercises on SUM and COUNT on SQLZoo