Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Moving across many tables with joins in SQL

This class shows how to use our database designs to “move” across our databases, following the lines in the design to join tables. You can think of the lines as paths that you can follow from one table, through others, gathering up the data that you need to provide the answer to your query. As a path, hopefully it is clear that you can’t skip a table, to get from one to the other you have to follow the lines.

When I think about queries I think about start and end points. The start point is the table that has the data given in the question. The end point has the data that the question is asking about. For example, using the Objects and Colors example:

And the question:

What color is the mug?

The start point is the data we are given: “mug”. That data is in the objects.name field, and so our starting point is the objects table.

The end point is what we’re being asked about, in this case “the color”. That data is in the colors.name column, so our end point is the colors table.

At this point I look at the ERD and “place myself” in the objects table:

SELECT *
FROM objects

I can the apply the starting condition:

SELECT *
FROM objects
WHERE objects.name = "mug"

From here I need to “move” to the colors table, by joining it in. To do this I have to think about the join condition, which I can figure out from the relational vocab. For has_many / belongs_to the foreign key is always in the table to the left of belongs_to. Therefore the foreign key is in the objects table. This is enough information to create the join condition, since the foreign key is always named after the table it links to and must be color_id. Thus the join condition is:

objects.color_id = colors.id

I can now use this to add to my query:

SELECT *
FROM objects
WHERE objects.name = "mug"

-- Joining in colors

SELECT *
FROM objects, colors
WHERE objects.name = "mug"
  AND objects.color_id = colors.id

After checking that the join worked sensibly (running the query and examining the rows), at this point I have reached my destination field: colors.name and I can write the SELECT part:

SELECT *
FROM objects, colors
WHERE objects.name = "mug"
  AND objects.color_id = colors.id

-- Now get just the colors.name column
SELECT colors.name
FROM objects, colors
WHERE objects.name = "mug"
  AND objects.color_id = colors.id

The process works the same when moving across more than two tables. Here is our full Objects database:

and a query:

What color are the clicky objects?

The start point here is “clicky” which we would find in the noises.name table. The end point (what we’re being asked to answer) is in colors.name table. Thus we begin in the noises table and will have to move across to the colors table. We can’t get there directly, so we have to follow the lines.

SELECT *
FROM noises

SELECT *
FROM noises
WHERE noises.name = "clicky"

Now we have to join in the next table, which is objects. Looking at the relational vocab I can see that the foreign key is in objects which means it must be called objects.noise_id. The order one writes a join condition in doesn’t matter, so I like to write mine “from” the table I’m “in” to the table I’m “moving to”. Given that we are “in” noises this makes the join condition noises.id = objects.noise_id.

SELECT *
FROM noises
WHERE noises.name = "clicky"

-- Join in the objects table.

SELECT *
FROM noises, objects
WHERE noises.name = "clicky"
  AND noises.id = objects.noise_id

We run that query and check that everythign worked. We then have to continue to move, now moving from the objects table to the colors table. Again we examine the relational vocab. Object is to the right of belongs_to and so the foreign key is again in the objects table. Therefore it must be called objects.color_id and our join condition must be objects.color_id = colors.id.

SELECT *
FROM noises, objects
WHERE noises.name = "clicky"
  AND noises.id = objects.noise_id

-- moving over to colors
SELECT *
FROM noises, objects, colors
WHERE noises.name = "clicky"
  AND noises.id = objects.noise_id
  AND objects.color_id = colors.id

Again, running that query to ensure we arrived in colors ok and then we can change our SELECT part to give us just what we were asked for (the color).

SELECT *
FROM noises, objects, colors
WHERE noises.name = "clicky"
  AND noises.id = objects.noise_id
  AND objects.color_id = colors.id

-- Now pulling out just the answer we were asked for.
SELECT colors.name
FROM noises, objects, colors
WHERE noises.name = "clicky"
  AND noises.id = objects.noise_id
  AND objects.color_id = colors.id

Using the alternative JOIN syntax, which uses identical join conditions but just places them in the FROM clause the whole sequence would be:

SELECT *
FROM noises

SELECT *
FROM noises
WHERE noises.name = "clicky"

SELECT *
FROM noises

SELECT *
FROM noises
  JOIN objects
    ON noises.id = objects.noise_id
WHERE noises.name = "clicky"

-- Now add colors. This is an addition JOIN / ON part.
-- In this syntax I find the question of what the 3rd table is
-- being joined to a bit hard to think about. The answer is that
-- it is being joined to the combination of the first two tables.
-- If you build up the query step by step, executing each step,
-- then this is much clearer.
SELECT *
FROM noises
  JOIN objects
    ON noises.id = objects.noise_id
  JOIN colors
    ON objects.color_id = colors.id
WHERE noises.name = "clicky"

-- And now the SELECT part.

SELECT colors.name
FROM noises
  JOIN objects
    ON noises.id = objects.noise_id
  JOIN colors
    ON objects.color_id = colors.id
WHERE noises.name = "clicky"

Joining habtm.

Joining across a habtm relationship follows the same rules but one has to remember that the foreign keys are in the extra table that doesn’t show up in the Entity diagram. We call that table a “Simple Association Table” and it only has the two foreign keys.

Consider this habtm relationship:

To move from Book to Genre we have to go through the simple association table. Recall that the simple association table is named by the alphabetical combination of the Entities, so in this case it is called books_genres.

To illustrate we’ll build up a query to answer the question:

What genres is the book "Wuthering Heights" in?

The start point is books and the end point is genres.

SELECT *
FROM books

SELECT *
FROM books
WHERE books.name = "Wuthering Heights"

Now we want to follow the line to genres but to do so we have to go via the simple association table (books_genres), joining it in on the way. We also know that the foreign keys are always in the simple association table. So our foreign key here is books_genres.book_id which connects to books.id that makes our join condition books.id = books_genres.book_id.

SELECT *
FROM books
WHERE books.name = "Wuthering Heights"

-- move into the simple assocation table

SELECT *
FROM books, books_genres
WHERE books.name = "Wuthering Heights"
  AND books.id = books_genres.book_id

Now we are “in” books_genres and need to move onward to genres. We swap to the other foreign key in books_genres which is books_genres.genre_id and link to the primary key for genres making the join condition: books_genres.genre_id = genres.id

SELECT *
FROM books, books_genres
WHERE books.name = "Wuthering Heights"
  AND books.id = books_genres.book_id

-- move out of books_genres to genres
SELECT *
FROM books, books_genres, genres
WHERE books.name = "Wuthering Heights"
  AND books.id = books_genres.book_id
  AND books_genres.genre_id = genres.id

-- Finally we can change our SELECT part.
SELECT genres.name
FROM books, books_genres, genres
WHERE books.name = "Wuthering Heights"
  AND books.id = books_genres.book_id
  AND books_genres.genre_id = genres.id

Using the JOIN syntax (exactly the same join conditions):

SELECT *
FROM books

SELECT *
FROM books
WHERE books.name = "Wuthering Heights"

SELECT *
FROM books
  JOIN books_genres
    ON books.id = books_genres.book_id
WHERE books.name = "Wuthering Heights"

-- and now moving to genres.

SELECT *
FROM books
  JOIN books_genres
    ON books.id = books_genres.book_id
  JOIN genres
    ON books_genres.genre_id = genres.id
WHERE books.name = "Wuthering Heights"

-- and finally changing the SELECT.

SELECT genres.name
FROM books
  JOIN books_genres
    ON books.id = books_genres.book_id
  JOIN genres
    ON books_genres.genre_id = genres.id
WHERE books.name = "Wuthering Heights"

Adding conditions “en route”

Thus far we’ve talked about queries where we have a data condition at our starting point, such as finding the “red” objects. To answer those we apply a condition, such as colors.name = "red" in our starting table and then move across the ER diagram, joining tables as needed.

However we can also apply data conditions at places “along the route,” as we add in tables using join conditions. Take, for example, this has_many through relationship:

And this question:

Which people have subscriptions to "The Week" ending this year (2015)?

In this question we’re given two pieces of data: the name of a magazine (“The Week”) as well as (part of) a date, and we’re asked to supply people’s names. The magazine part is just like our queries above, we can apply it in the “starting” table, which would be magazines. The end point is also similar to above, because we’re asked for all the people’s names, so we’ll be joining across to people. However, the date part is a little different. The date fields are in the middle table (subscriptions), so we will add that condition as we “pass through” that table.

One other thing: because the question is asking about subscriptions that end in 2015 we’ll use that data in the end_year field, rather than the start_year field.

-- starting table
SELECT *
FROM magazines

-- add starting condition
SELECT *
FROM magazines
WHERE magazines.name = "The Week"

-- join over to subscriptions
-- subscription to right of belongs_to, thus foreign key there.
SELECT *
FROM magazines, subscriptions
WHERE magazines.name = "The Week"
  AND magazines.id = subscriptions.magazine_id

-- Now apply year condition, before moving to people.
SELECT *
FROM magazines, subscriptions
WHERE magazines.name = "The Week"
  AND magazines.id = subscriptions.magazine_id
  AND subscriptions.end_year = 2015

-- Now add people with the join condition.
SELECT *
FROM magazines, subscriptions, people
WHERE magazines.name = "The Week"
  AND magazines.id = subscriptions.magazine_id
  AND subscriptions.end_year = 2015
  AND subscriptions.person_id = people.id

-- And finally, change the SELECT part to just what we were asked for.
SELECT people.name
FROM magazines, subscriptions, people
WHERE magazines.name = "The Week"
  AND magazines.id = subscriptions.magazine_id
  AND subscriptions.end_year = 2015
  AND subscriptions.person_id = people.id

Hopefully you can see that addressing these queries is a matter of:

  1. Identifying data you are given. (e.g., “2015” and “The Week”)
  2. Finding the relevant fields in the database for each. (e.g., “subscriptions ending in” –> subscriptions.end_year)
  3. Constructing data conditions for each (e.g., subscriptions.end_year = 2015)
  4. Identifying the data you are being asked for (e.g., “Which people …”)
  5. Finding relevant fields for the answer (“Which people …” –> people.name)
  6. Locating starting and ending tables and a path between them.
  7. Navigating across the ER diagram, applying conditions as you move through the relevant tables (always executing the query)
  8. Converting the SELECT * to the fields identified in step 5.

I find the “moving across the ERD” way of thinking about building up queries to be useful and I recommend it to you. I find it useful for two reasons: it helps me break the problem down into steps, and building the query up step by step helps me locate errors.

However it is worth noting that the actual order in which conditions and joins are placed in the final query is irrelevant. So it is equally valid (logically identical, in fact) to think of joining all the tables that you need, then applying all the data conditions to that one large table. Going a step further back it is equally valid to think of creating one giant cross-product table (by putting the tables into the FROM clause and getting all possible combinations of their rows) and then applying the data conditions, and applying the join conditions last.

Nonetheless, for that approach to be accurate you would still have to ensure that you have all the join conditions needed and that you hadn’t included any tables without relevant join conditions. And that requires you to inspect the database design to see how the tables are related to each other.

So, there’s nothing special about the order that the conditions and joins are applied, but thinking about “moving across the ERD” will help you put queries together correctly.