We can rename columns in our results table using the
AS keyword. In this way we are not limited to directly copying the names of the columns from the tables. For example, when we select just the “red” objects we can rename the column in the results to reflect that:
SELECT * FROM colors, objects WHERE color.name = "red" AND colors.id = objects.color_id SELECT objects.name AS "name_of_red_objects" FROM colors, objects WHERE color.name = "red" AND colors.id = objects.color_id
It is good practice to follow the same conventions one would in naming columns: using lowercase and connecting words with underscores.
Thus far we’ve learned how to narrow down rows and columns to obtain answers to queries. We’ve narrowed down rows using conditions (
LIMIT), and columns using
In this module we’re going to learn how to work with columns of data. Aggregate functions “collapse” the values in a column, returning a result. The functions go into the SELECT part.
One of the simplest is to add together all the of the values using
SUM. This figure shows part of the tickets table from our
If we wanted to know the total of the values in the
tickets.price column we can get that with the three steps below.
-- identify the rows we want to work with SELECT * FROM tickets -- identify just the column we want to work with SELECT tickets.price FROM tickets -- Collapse the column vertically to a single value. SELECT SUM(tickets.price) FROM tickets
Which would give us a result of
72 (1 row)
When we use an aggregate function it is convenient to employ an alias to give a name to the result:
SELECT SUM(tickets.price) AS total_of_tickets_price FROM tickets
There are a set of these functions that we can apply:
SUM for addition,
AVG for the average of the numbers in the column. There are also the
MIN for minimum,
MAX for maximum, but be wary of those (more on that below).
Note that there is no
MEDIAN function nor a
MODE function in sql. Both can be done but it is complicated and once one is doing real statistics it is better to move out of SQL and use
Another useful function is
DISTINCT which reduces the column by removing duplicates. Unlike other aggregation functions
DISTINCT does not use parens.
-- Who made purchases of tickets? People might have purchased twice -- but only show their name once. -- Start with purchases to be sure that the person made a purchase. -- This table has one row per purchase. SELECT * FROM purchases -- Join to people to get names SELECT * FROM purchases, people WHERE purchases.person_id = people.id -- We still have a row per purchase, so if someone made more -- than one they'd show up twice. SELECT people.name FROM purchases, people WHERE purchases.person_id = people.id -- remove possible duplicates SELECT DISTINCT people.name FROM purchases, people WHERE purchases.person_id = people.id
Counting values in a column (or rows in a table) is an important form of aggregation function. At first you might not think of that as aggregating, but it does take a column (or a table) and reduce it down to a single value, just as functions like
There are three forms of COUNT, each subtly different from each other.
The three forms of COUNT are:
COUNT(*)counts the number of rows.
COUNT(column)counts the number of non-NULL values in a column.
COUNT(DISTINCT column)counts the unique values in a column, after removing NULLs.
COUNT(*) can be useful for simple tables:
SELECT * FROM tickets SELECT COUNT(*) AS count_of_tickets FROM tickets
Or used with data conditions to count the number of things that meet some condition, such as how many bands had fees higher than $100?
SELECT * FROM bands SELECT * FROM bands WHERE bands.fee > 100 SELECT COUNT(*) as number_of_expensive_bands FROM bands WHERE bands.fee > 100
You almost always want to use an
AS alias with
COUNT(*) is also useful after joins. For example, one can find the number of performances at the “BMI” venue:
SELECT * FROM venues -- find just the BMI venue. SELECT * FROM venues WHERE venues.name = "BMI" -- join over to performances. Returns a row for each -- performance at BMI. SELECT * FROM venues, performances WHERE venues.name = "BMI" AND venues.id = performances.venue_id SELECT COUNT(*) AS number_performances_at_bmi FROM venues, performances WHERE venues.name = "BMI" AND venues.id = performances.venue_id
COUNT(DISTINCT column) is also very useful. For example, counting the number of bands that play at the BMI venue. As above you have to join to performances to ensure that a performance took place, but given that bands might play more than one performance you could end up with more than one row for each band.
DISTINCT helps resolve that.
-- following on from above. This has one row per performance. SELECT * FROM venues, performances WHERE venues.name = "BMI" AND venues.id = performances.venue_id --- performances has a field about the bands, band_id. -- If the id of a band appears in band_id then they played at BMI, but -- if they played more than once their id would show up more than once. -- We can resolve that using COUNT DISTINCT SELECT COUNT(DISTINCT performances.band_id) FROM venues, performances WHERE venues.name = "BMI" AND venues.id = performances.venue_id -- Note that even though the question was about bands, we didn't need -- to join the bands table, since performances.band_id is about bands.
Generally speaking counting rows or distinct ids is a better approach than counting distinct names because many things in the world can have the same name (as the Office Space movie, filmed in Austin, showed us memorably about ‘Michael Bolton’.)
To reiterate, consider the differences between the three forms of
COUNT on this list:
id, name -------- 1, Domhog 2, NULL 3, Domhog 4, Shuyen
COUNT(*)would give the answer
4(there are 4 rows).
COUNT(name)would give the answer
3(there are 3 non-null values in the name column.)
COUNT(DISTINCT name)would give the answer
2(just “Domhog” and “Shuyen”)
I advise people to avoid
COUNT(column) because I find the semantics confusing in a query. I mean that what people are intending to accomplish is unclear. More, though, I haven’t come across a time when it is necessary.
Whenever you are tempted to use
COUNT(column), ask yourself whether
COUNT(*) is clearer and what you actually mean. Most counting queries are questions about the number of rows (after some conditions/joins), not the number of non-NULL values.
ps. I’m very interested to find an example where
COUNT(column) is the ideal thing, so please email me if you see one. Only thing I can think of is something like counting records with relationships. e.g. a table like
people(id, name, sig_other_id) where those not in a relationship have
NULL for their
--How many people have sig others? SELECT COUNT(sig_other_id) FROM people
But I think even that would be much clearer as:
SELECT COUNT(*) as sig_other_count FROM people WHERE people.sig_other_id IS NOT NULL
The aggregate functions
MAX() seem straight forward, the find the minimum or maximum value in a column.
SELECT MIN(tickets.price) AS lowest_ticket_price FROM tickets
Which will, correctly, return
5, the lowest value in the table.
However questions about the “lowest”, “highest”, “earliest” usually also want another part of the row with that value. e.g., “Who bought the earliest ticket?”, “which ticket number had the lowest price?”
These queries are better answered using
ORDER BY and
LIMIT 1. Using
MAX the intuitive way will silently give a totally incorrect answer.
This bears repeating: If you use
MAX in the
SELECT clause with any other column you will not get the value in the row that had the lowest value, you will get a random value from the other column. I know that this is hard to believe. We’ll talk about why when we get to groups.
While you should use
ORDER BY and
LIMIT 1, you can use
MAX but you have to use it via IN (or a sub-query). Here think of the question as two steps: “What is the minimum price? And what the ticketnumbers and prices for those tickets which had that price?”
-- First get the minimum value SELECT MIN(tickets.price) FROM tickets -- That returns a single value which you then can use. -- You get the server to "remember" it by surrounding the query with parens. SELECT tickets.ticketnum, tickets.price FROM tickets WHERE tickets.price = (SELECT MIN(tickets.price) FROM tickets) -- or, using IN (which also works with DISTINCT) SELECT tickets.ticketnum, tickets.price FROM tickets WHERE tickets.price IN (SELECT MIN(tickets.price) FROM tickets)
But hey if you use
ORDER BY and
LIMIT you can easily get the top five or bottom whatever (as long as ties don’t matter), so just don’t use
MAX in the
SELECT clause, unless it’s a sub-query.
SELECT tickets.ticketnum, tickets.price FROM tickets ORDER BY tickets.price ASC LIMIT 5
I would only use MIN or MAX when ties really matter, and only in a sub-query.
You may find the first five exercises on SUM and COUNT on SQLZoo useful.