SQL Aggregate and Grouping
library(DBI)
library(duckdb)
library(tidyverse)  # for tribble(), %>%, and select() used below
options(duckdb.enable_rstudio_connection_pane=TRUE)
# Using a temporary in-memory database without a file.
drv <- duckdb()
# drv <- duckdb(dbdir = "my_database.duckdb")
con <- dbConnect(drv)
First we load our toy data on books.
books_with_weights <- tribble(
  ~title, ~series, ~format, ~weight_in_grams,
"Harry Potter 1", "Harry Potter", "paperback", 1456,
"Two Towers", "Lord of the Rings","paperback", 1800,
"Harry Potter 3", "Harry Potter","hardcover", 2604,
"Harry Potter 2", "Harry Potter","hardcover", 2900,
"Return of the King", "Lord of the Rings", "hardcover", 1784
)
# Write the dataset to DuckDB
dbWriteTable(con, "books_with_weights", books_with_weights, overwrite = TRUE)
Recall that aggregation takes a set of rows and reduces them to a single value. In SQL this happens in the SELECT clause. Remember that the SELECT clause happens after the WHERE clause … so we are always aggregating the rows that are left.
We can obtain the total weight of each group of books using the SUM function in the SELECT clause and an AS alias to rename the columns. Here I show the build-up of the SQL query, using a semi-colon between each query; all the queries execute, but only the output from the last query shows in RStudio.
SELECT *
FROM books_with_weights
;SELECT weight_in_grams
FROM books_with_weights
weight_in_grams |
---|
1456 |
1800 |
2604 |
2900 |
1784 |
The aggregate function works to reduce this column to a single value.
SELECT SUM(weight_in_grams) AS total_weight
FROM books_with_weights
total_weight |
---|
10544 |
If we have a WHERE clause this changes the rows that are involved (because SELECT executes after WHERE).
SELECT *
FROM books_with_weights
;SELECT *
FROM books_with_weights
WHERE series = 'Harry Potter'
title | series | format | weight_in_grams |
---|---|---|---|
Harry Potter 1 | Harry Potter | paperback | 1456 |
Harry Potter 3 | Harry Potter | hardcover | 2604 |
Harry Potter 2 | Harry Potter | hardcover | 2900 |
SELECT SUM(weight_in_grams) AS total_weight
FROM books_with_weights
WHERE series = 'Harry Potter'
total_weight |
---|
6960 |
Three forms of COUNT
In SQL counting is done with COUNT. The grain of the books_with_weights data frame here is “one row represents a book”, so counting the rows produces a count of books. In SQL we count rows using COUNT(*).
SELECT SUM(weight_in_grams) AS total_weight,
COUNT(*) AS book_count
FROM books_with_weights
total_weight | book_count |
---|---|
10544 | 5 |
SQL has three forms of COUNT, but one is forbidden:

- COUNT(*) counts the number of rows.
- COUNT(DISTINCT column) counts the number of distinct values in a column.
- COUNT(column) counts the number of values that are not NULL.

COUNT(column) is the forbidden one. I will always grade that as wrong. COUNT(column) is forbidden because it is subject to subtle and confusing errors (and communicates poor understanding of what is being done).
Consider the differences between the three forms of COUNT on this list:
id, name
--------
1, Domhog
2, NULL
3, Domhog
4, Shuyen
- COUNT(*) would give the answer 4 (there are 4 rows).
- COUNT(DISTINCT name) would give the answer 2 (just ‘Domhog’ and ‘Shuyen’).
- COUNT(name) would give the answer 3 (there are 3 non-NULL values in the name column).
I advise people to avoid COUNT(column) because the semantics are confusing in a query. I mean that what people are intending to accomplish is unclear. More, though, I haven’t come across a time when it is necessary.

Whenever you are tempted to use COUNT(column), ask yourself whether COUNT(*) is clearer and is actually what you mean. Most counting queries are questions about the number of rows (after some conditions/joins), not the number of non-NULL values.
ps. I’m very interested to find an example where COUNT(column) is the ideal thing, so please email me if you see one. The only thing I can think of is something like counting records with relationships, e.g. a table like people(id, name, sig_other_id) where those not in a relationship have NULL for their sig_other_id.
--How many people have sig others?
SELECT COUNT(sig_other_id) AS num_in_relationships
FROM people
But I think even that would be much clearer as:
SELECT COUNT(*) AS num_in_relationships
FROM people
WHERE people.sig_other_id IS NOT NULL
So you can always re-write COUNT(column) as COUNT(*) with a WHERE ... IS NOT NULL clause. Much clearer.
If we wanted to know how many different manufacturers have cars in the gtcars dataset, in R tidyverse we used n_distinct(manufacturer), but in SQL we would use COUNT(DISTINCT manufacturer).
library(gt)  # the gtcars dataset comes from the gt package
data(gtcars)

gtcars_renamed <- gtcars %>%
  select(manufacturer = mfr, everything())
# Write the dataset to DuckDB
dbWriteTable(con, "gtcars", gtcars_renamed, overwrite = TRUE)
SELECT COUNT(DISTINCT manufacturer) AS num_manufacturers
FROM gtcars
num_manufacturers |
---|
19 |
A warning on MIN and MAX (vs ORDER BY ... LIMIT 1)
SQL offers aggregate functions called MIN() and MAX(). These seem straightforward: they find the minimum or maximum value in a column.
SELECT MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
lightest_book |
---|
1456 |
Which will, correctly, return 1456, the lowest value in the column.
However, questions about the ‘lowest’, ‘highest’, or ‘earliest’ usually also want another part of the row with that value, e.g. ‘What’s the title of the lightest book?’, ‘Who bought the earliest ticket?’, or ‘Which ticket number had the lowest price?’
These queries are better answered using ORDER BY and LIMIT 1. Using MIN and MAX the intuitive way can silently give a totally incorrect answer. DuckDB will refuse to run this query, but MySQL will run it; the result will look believable but will have actually chosen an arbitrary value for title.
SELECT title,
MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
Error: rapi_prepare: Failed to prepare query SELECT title,
MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
Error: Binder Error: column "title" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(title)" if the exact value of "title" is not important.
LINE 1: SELECT title,
^
This bears repeating: if you use MIN or MAX in the SELECT clause with any other column, you will not get the value in the row that had the lowest value; you will get an arbitrary value from the other column. I know that this is hard to believe. When an aggregate function is used in SELECT we must be collapsing a group of rows … if we put a “bare” column name in the SELECT together with an aggregate function, then the system needs a way to pick a value. If the column is in the GROUP BY then it is clear, since all the values in the group must be the same; but if the column is not, then the system will pick one arbitrarily … or, as DuckDB does, ask us to force that behavior by using the ANY_VALUE(column) function, or throw an error and refuse to run.
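To see DuckDB's ANY_VALUE escape hatch in action, here is a sketch against the books table above. Note that the title returned is arbitrary; it is not guaranteed to belong to the lightest book, which is exactly why this is rarely what you want.

```sql
-- ANY_VALUE tells DuckDB we accept an arbitrary title from the collapsed rows.
-- The title returned is NOT guaranteed to be the lightest book's title.
SELECT ANY_VALUE(title) AS some_title,
       MIN(weight_in_grams) AS lightest_book
FROM books_with_weights
```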
While you should use ORDER BY and LIMIT 1, you can use MIN or MAX, but you have to use it via IN (or a sub-query). Here think of the question as two steps: ‘What is the minimum price?’ and ‘What are the ticket numbers and prices for the tickets which had that price?’
-- First get the minimum value
SELECT MIN(weight_in_grams) FROM books_with_weights
;-- That returns a single value which you then can use.
-- You get the server to 'remember' it by surrounding the query with parens.
SELECT title, weight_in_grams AS lightest_book
FROM books_with_weights
WHERE weight_in_grams = (
SELECT MIN(weight_in_grams) FROM books_with_weights
)
title | lightest_book |
---|---|
Harry Potter 1 | 1456 |
But hey, if you use ORDER BY and LIMIT you can easily get the top five or bottom whatever (as long as ties don’t matter), so just don’t use MIN or MAX in the SELECT clause, unless it’s in a sub-query.
SELECT title, weight_in_grams AS lightest_book
FROM books_with_weights
ORDER BY weight_in_grams ASC
LIMIT 5
I would only use MIN or MAX when ties really matter, and only in a sub-query. Even then there is a more direct approach to ranking, see Window Aggregates below.
GROUP BY
Recall that grouping is akin to sorting a dataframe, then drawing lines across it where the value in the grouping column changes.
title | series | format | weight_in_grams |
---|---|---|---|
Harry Potter 1 | Harry Potter | paperback | 1456 |
Harry Potter 3 | Harry Potter | hardcover | 2604 |
Harry Potter 2 | Harry Potter | hardcover | 2900 |
Two Towers | Lord of the Rings | paperback | 1800 |
Return of the King | Lord of the Rings | hardcover | 1784 |
In SQL we will build up grouping queries using ORDER BY to see the rows before they are collapsed. To find the number of books in each series we will do:
SELECT *
FROM books_with_weights
;SELECT *
FROM books_with_weights
ORDER BY series
;
title | series | format | weight_in_grams |
---|---|---|---|
Harry Potter 1 | Harry Potter | paperback | 1456 |
Harry Potter 3 | Harry Potter | hardcover | 2604 |
Harry Potter 2 | Harry Potter | hardcover | 2900 |
Two Towers | Lord of the Rings | paperback | 1800 |
Return of the King | Lord of the Rings | hardcover | 1784 |
We will then change the ORDER BY to GROUP BY, then add the columns in the GROUP BY to the start of the SELECT, and finally add the aggregate function.
SELECT series,
FROM books_with_weights
GROUP BY series
;SELECT series, SUM(weight_in_grams) AS total_weight
FROM books_with_weights
GROUP BY series
series | total_weight |
---|---|
Lord of the Rings | 3584 |
Harry Potter | 6960 |
As before we can have multiple aggregate functions collapsing the same groups.
SELECT series,
FROM books_with_weights
GROUP BY series
;SELECT series,
SUM(weight_in_grams) AS total_weight,
COUNT(*) AS book_count
FROM books_with_weights
GROUP BY series
series | total_weight | book_count |
---|---|---|
Harry Potter | 6960 | 3 |
Lord of the Rings | 3584 | 2 |
COUNT(DISTINCT column) in group
We can also look within a column and count the number of distinct things, as a way of collapsing the group. Switching to the cars data, we can count the number of different manufacturers for each year.
SELECT *
FROM gtcars
;SELECT *
FROM gtcars
ORDER BY year
;SELECT year,
FROM gtcars
GROUP BY year
;SELECT year, COUNT(DISTINCT manufacturer)
FROM gtcars
GROUP BY year
year | count(DISTINCT manufacturer) |
---|---|
2015 | 3 |
2014 | 2 |
2016 | 13 |
2017 | 8 |
Post Group and aggregate (ORDER BY and HAVING)
GROUP BY happens after FROM and WHERE (but before SELECT). Since WHERE happens early in the query, we can’t use it to exclude rows from the grouped results. And that is commonly a need, for queries that ask “Does any group have more than 1?”
We could pipeline these results into a temporary table and then use WHERE on that, but SQL has a shortcut here: the HAVING clause, which is a second WHERE that executes against the results of a query with a GROUP BY. For example we could find years in which more than 5 manufacturers produced models:
SELECT year, COUNT(DISTINCT manufacturer) AS model_count
FROM gtcars
GROUP BY year
HAVING COUNT(DISTINCT manufacturer) > 5
year | model_count |
---|---|
2016 | 13 |
2017 | 8 |
We can use ORDER BY to change the order that the aggregated results are displayed (which is to say that ORDER BY happens after GROUP BY and even after HAVING).
SELECT year, COUNT(DISTINCT manufacturer) AS model_count
FROM gtcars
GROUP BY year
HAVING COUNT(DISTINCT manufacturer) > 5
ORDER BY year
year | model_count |
---|---|
2016 | 13 |
2017 | 8 |
GROUP BY ALL
Above I made a big deal out of the step from ORDER BY to GROUP BY having two parts: change to GROUP BY and copy all the fields to the start of the SELECT. This is important, but it does mean that all the columns are repeated twice, which when queries are edited can lead to annoying mistakes.
DuckDB offers an improvement to SQL to address this: GROUP BY ALL, which means that we can put our grouping columns in a single place (the SELECT) and then GROUP BY ALL uses all of the non-aggregate columns in the SELECT.
DuckDB documentation uses this example (based on starwars).
SELECT
    systems,
    planets,
    cities,
    cantinas,
    SUM(scum + villainy) AS total_scum_and_villainy
FROM star_wars_locations
GROUP BY ALL;
-- GROUP BY systems, planets, cities, cantinas
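Applied to our own books table, the same shortcut might look like this sketch, grouping by series and format at once:

```sql
SELECT series,
       format,
       SUM(weight_in_grams) AS total_weight
FROM books_with_weights
GROUP BY ALL
-- equivalent to: GROUP BY series, format
```

The grouping columns now appear only in the SELECT, so editing the query later means changing them in one place.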
Check out the duckdb documentation for other SQL improvements: https://duckdb.org/2022/05/04/friendlier-sql.html. But be mindful that these techniques may not be available in other dialects of SQL.
Window Aggregates
In R, window aggregates (which return more than one row per group) are a fairly straightforward extension of the syntax used for basic group-bys. For example, we can use arrange to sort within groups, then use the slice_head function to get just a few rows per group.
gtcars %>%
  group_by(year) %>%
  arrange(desc(mpg_h), .by_group = TRUE) %>%
  slice_head(n = 3)
In SQL, though, the concepts are the same but the syntax varies quite a bit from simple GROUP BY.
We have to use two steps. The first gets our rankings within a group (for each and every row), the second uses those rankings to exclude rows below a certain rank.
Getting the rank within a group uses the new syntax of OVER (PARTITION BY … ORDER BY).
SELECT *
FROM gtcars
;SELECT model,
       year,
       mpg_h
FROM gtcars
ORDER BY year, mpg_h DESC
;SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars
model | year | mpg_h | rank_in_year |
---|---|---|---|
Vantage | 2016 | 19 | 1 |
Dawn | 2016 | 19 | 2 |
Granturismo | 2016 | 21 | 3 |
Rapide S | 2016 | 21 | 4 |
Vanquish | 2016 | 21 | 5 |
Wraith | 2016 | 21 | 6 |
488 GTB | 2016 | 22 | 7 |
GT-R | 2016 | 22 | 8 |
M5 | 2016 | 22 | 9 |
M6 | 2016 | 22 | 10 |
The second step should be straightforward, using WHERE rank_in_year <= 3. But as with GROUP BY, the order that queries execute means that the WHERE happens before the PARTITION. For reasons that are not clear to me, HAVING cannot be used here either (it only works with GROUP BY). So we have to use one of the temporary table approaches (sub-query, CTE, or CREATE OR REPLACE TABLE AS SELECT; see Arranging SQL transforms into a pipeline) so that we can immediately query our results.
Here I use a temp table:
SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars
;CREATE OR REPLACE TABLE tmp_mpg_ranks AS
SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars
;SELECT *
FROM tmp_mpg_ranks
WHERE rank_in_year <= 3
ORDER BY year
model | year | mpg_h | rank_in_year |
---|---|---|---|
458 Italia | 2014 | 17 | 1 |
Gallardo | 2014 | 20 | 2 |
FF | 2015 | 16 | 1 |
F12Berlinetta | 2015 | 16 | 2 |
LaFerrari | 2015 | 16 | 3 |
Vantage | 2016 | 19 | 1 |
Dawn | 2016 | 19 | 2 |
Granturismo | 2016 | 21 | 3 |
GTC4Lusso | 2017 | 17 | 1 |
GT | 2017 | 18 | 2 |
Many will prefer to avoid creating a table for what is really just part of a query, so they will use a sub-query, where we copy the working ranking query into where the table name would normally go, like this:
SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
FROM gtcars
;SELECT *
FROM (
    SELECT model,
           year,
           mpg_h,
           row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
    FROM gtcars
)
WHERE rank_in_year <= 3
ORDER BY year
model | year | mpg_h | rank_in_year |
---|---|---|---|
458 Italia | 2014 | 17 | 1 |
Gallardo | 2014 | 20 | 2 |
FF | 2015 | 16 | 1 |
F12Berlinetta | 2015 | 16 | 2 |
LaFerrari | 2015 | 16 | 3 |
Vantage | 2016 | 19 | 1 |
Dawn | 2016 | 19 | 2 |
Granturismo | 2016 | 21 | 3 |
GTC4Lusso | 2017 | 17 | 1 |
GT | 2017 | 18 | 2 |
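The CTE form mentioned earlier keeps everything in one statement without creating a table: the WITH clause names the ranking query, and the outer query filters it. A sketch of the same query as a CTE:

```sql
-- The CTE (WITH clause) acts like a temporary, named result set.
WITH mpg_ranks AS (
    SELECT model,
           year,
           mpg_h,
           row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS rank_in_year
    FROM gtcars
)
SELECT *
FROM mpg_ranks
WHERE rank_in_year <= 3
ORDER BY year
```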
As with the R tidyverse, you can also use more sophisticated ranking functions instead of row_number, such as RANK or DENSE_RANK; see https://duckdb.org/docs/sql/functions/window_functions.html
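For example, where row_number breaks ties arbitrarily (the two 19-mpg_h cars in 2016 got ranks 1 and 2 above), dense_rank gives tied rows the same rank. A sketch comparing the two side by side:

```sql
SELECT model,
       year,
       mpg_h,
       row_number() OVER (PARTITION BY year ORDER BY mpg_h) AS row_num,
       dense_rank() OVER (PARTITION BY year ORDER BY mpg_h) AS dense
FROM gtcars
ORDER BY year, mpg_h
```

With dense_rank, both 19-mpg_h cars in 2016 get rank 1 and all the 21-mpg_h cars get rank 2, so a WHERE dense <= 3 filter would keep every car tied at the top three mpg_h values rather than exactly three rows.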
Exercises
Do the five exercises on SUM and COUNT on SQLZoo