Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Topics

Introduction to SELECT queries

In this class we’re going to cover the basics of SQL’s SELECT query. SELECT works by applying filters to the table, hiding data you don’t want to see and returning data that you do want to see. The returned data ends up in a new table which has just the columns and rows that you asked for.

In the first class we used cards to build up our table representation. We then used the big floppy paper templates to look at a row one by one. We asked a question which either gave us an answer of true or false. If the answer was true then we copied the data from that row into a new row on the other side of the desks in the classroom, into a table called results. This is precisely how a SELECT query works.

The steps in a SELECT query are:

  1. Identify the table you are asking for data from
  2. Specify the conditions you want to ask for each row
  3. Specify the columns you want to copy from those rows

These correspond to different parts of the query. Somewhat oddly, though, the parts aren’t in the same order above.

SELECT [columns]
FROM   [tables]
WHERE  [conditions about rows]

I wait until the end to specify the columns I want to see, so I start my queries with SELECT * where the star means “all columns”.

I build my queries up step by step inside the file in Atom, copying each step to the SQL tab in phpmyadmin. This is a little laborious but as our queries become more complex (especially as we work with more than one table) you’ll find that this habit serves you very well.

The WHERE clause: picking rows using conditions

On the holden server you have access to a database called class_cars which has just one table (cars). These queries will run against it. Each row in the database represents a single car model and there are 234 car models in the database (I reduced the full set by only including cars produced in 1999 and 2008). On each row we have data about one car, such as the year it was produced (cars.year), the manufacturer (cars.manufacturer) (e.g., audi, ford), the model (e.g., a4, focus). Using a select query we can narrow the database to find cars that match our query.

Let’s say we are interested in cars that have four cylinders. The cars.cylinders column has that information. We start by confirming that we are addressing the right table by executing this query:

SELECT *
FROM cars

This asks for all the data in the table and will return 234 rows and all the columns. There is no WHERE clause, which means that there are no conditions on the rows and that is why all of them are returned. When you run it in phpmyadmin it will look like this:

If we want to add some conditions we can add a WHERE clause:

SELECT *
FROM cars
WHERE cars.cylinders = 4

Now we will only see rows where the condition in the WHERE clause is true. It’s just the same as using the paper template: the computer looks at one row at a time and checks the number in the cylinders column. If that number is a 4 then the whole row is copied into the results. If you run that query on phpmyadmin, by copying the query above and pasting it into the SQL tab (go ahead and try it now) you’ll see that the results have 81 rows (although phpmyadmin breaks the 81 rows up into three pages of 4 pages of 25 each, you can change the results per page at the bottom). Every one of the rows has a 4 in the cylinders column.

We can add more conditions to our WHERE clause to reduce the number of rows shown even more. For example, let’s say that from those 4 cylinder cars we want to find older cars, produced earlier than the year 2000. We can do that by adding a condition to the WHERE clause:

SELECT *
FROM cars
WHERE cars.cylinders = 4
  AND cars.year < 2000

Two things to notice here. The AND and the < (less than) sign. First we use the word AND between our two conditions. I place it under the WHERE, indenting it by two spaces to show that it is still part of the WHERE clause and causing the conditions to line up down the page. You could equally see this with the WHERE clause written on one line:

SELECT *
FROM cars
WHERE cars.cylinders = 4 AND cars.year < 2000

But I prefer to have one condition per line, to keep things visually neat.

The < (less than) sign will allow more than one value in the row, as long as it is below 2000. (It would exclude rows with year = 2000, since we used < (less than) and not <= which is less than or equal to.). If we run this query we see our results reduced from 81 to 45 cars.

Conditions with different data types

Recall that our columns in SQL have datatypes associated with them. The datatype has some impact on how we write our conditions. When we are querying number columns (INT, DECIMAL) we don’t use quotes around the data we supply. See how there are no quotes around 4 and 2000 in this query:

SELECT *
FROM cars
WHERE cars.cylinders = 4
  AND cars.year < 2000

But if we’re writing a condition on a non-number field we need to use quotes around the bit we supply. e.g.,

SELECT *
FROM cars
WHERE cars.manufacturer = "Audi"

We also need quotes around date fields cars.production_date = "2008-06-06". You can use either single or double quotes. I use double mostly, unless I have a quote character in the bit I’m supplying then I use single quotes.

Boolean Algebra (resolving AND and OR)

The WHERE clause in the query above is asking three questions of each row, each of which resolves to TRUE or FALSE. Three? I only see two! (you say). Well here are the three:

  1. Is the value for cylinders on this row equal to 4? –> TRUE or FALSE
  2. Is the value for year on this row less than 2000 –> TRUE or FALSE
  3. Was the answer to questions 1 and 2 both TRUE? –> TRUE or FALSE.

This is called Boolean Algebra, which you may have heard of. It’s a way of reducing a combination of conditions down to a single TRUE or FALSE value. That’s what happens here, in the first round we reduce each condition to a TRUE or FALSE (e.g., TRUE AND FALSE) and in the second round we reduce again (TRUE AND FALSE –> FALSE (because both were not TRUE)).

This ends up more intuitive than you might be feeling at the moment. The key thing is that when you join conditions with AND all of the conditions must be true for the row to be in the results.

In addition to AND we can also join conditions with OR. When we do that we are saying that we want the row if any of the conditions are TRUE. OR ends up used much less than AND in sql. Let’s say that we wanted all the cars but only if they were made by Audi or Ford. We would write:

SELECT *
FROM cars
WHERE cars.manufacturer = "Audi"
   OR cars.manufacturer = "Ford"

That works fine and we get 43 results. Things can get tricky with OR if you go beyond two conditions, mixing AND and OR. Consider these three queries:

SELECT *
FROM cars
WHERE cars.year = 2008
  AND cars.manufacturer = "Audi"

which returns 9 rows. And this:

SELECT *
FROM cars
WHERE cars.year = 2008
  AND cars.manufacturer = "Ford"

which returns 10 rows. So we know that if we combine the 2008 Ford cars and 2008’ Audi cars we should get 19 rows. Yet:

SELECT *
FROM cars
WHERE cars.year = 2008
  AND cars.manufacturer = "Ford"
   OR cars.manufacturer = "Audi"

gives us 28 results. So we have 9 extra results. Where are they coming from? See this screenshot:

Look closely at the first row; it shows a value of 1999 for year. Why is it in the results (since 1999 most definitely is not equal to 2008).

The reason is to do with how the boolean algebra resolves. Remember that each condition is resolved first. In the figures below each clause resolves to the value written below it. So for that row (“Audi” and 1999) we would get:

cars.year = 2008 AND cars.manufacturer = "Ford" OR cars.manufacturer = "Audi"
       FALSE     AND              FALSE         OR          TRUE

The cars.year = 2008 resolves to the FALSE written under it and so on. But when we come to resolve the AND and ORs in boolean algebra the AND is resolved before the OR and thus we get:

FALSE AND FALSE OR TRUE
     FALSE      OR TRUE
               TRUE

The FALSE AND FALSE on the first line is resolved first and become a FALSE, leaving us with FALSE OR TRUE which resolves to TRUE and thus the 1999 Audi is included in the results.

What our query actually asked for was “Find me all the Fords built in 2008, and all the Audis (regardless of the year)” so as soon as a row was TRUE for the right-hand side of the OR it was included.

You can get around this by changing the order of things using parenthesis (). So we could change the order that things are resolved with:

SELECT *
FROM cars
WHERE cars.year = 2008
  AND ( cars.manufacturer = "Ford" OR cars.manufacturer = "Audi" )

Which would require every row to have 2008 for its year and either Ford or Audi for its manufacturer. This is very useful. I encourage you to always use parens whenever you use any ORs in your queries. It just makes things much easier.

Use IN rather than OR (most of the time)

Even easier still is using the IN keyword, which you should do whenever you have a list of alternatives for a value:

SELECT *
FROM cars
WHERE cars.year = 2008
  AND cars.manufacturer IN ("Ford", "Audi")

which returns the 19 rows that we expected.

Other operators

Above we’ve used two operators inside the parts of our WHERE: the equals sing =, the < sign, and the IN keyword. There are many other operators, such as >, >= and so on.

There are also operators that do the reverse of these. Thus I can also say “not equal”: != or <> or the reverse of IN which is NOT IN (values from the list that follows will be excluded from the results.).

Finally there are a set of useful operators, such as BETWEEN operator that simplifies range queries. eg. cars.cylinders BETWEEN 4 AND 6 or, if we were using dates: cars.production_date BETWEEN '2008-03-01' AND '2009-06-05'

You can find other operators in the mysql documentation or in various tutorials online. Google “mysql operators”.

The SELECT clause: specifying columns

The first clause in the query specifies the columns that you want to see in the results. Compared to understanding boolean algebra this bit is easy. You supply a comma-separated list of column headings after the SELECT (instead of the *). So if we only wanted the manufacturer and model we could write:

SELECT cars.manufacturer, cars.model
FROM cars
WHERE cars.year = 2008

You can think of querying like this as a process of hiding, knocking out, or covering up parts of the original table that you don’t want. So if WHERE knocks rows out from your results, SELECT knocks out columns from your results.

Down the track we’ll learn other things we can do in the SELECT part of the query (such as combining columns or doing math between columns).

Doing math in the SELECT clause

The SELECT part of the query let’s us specify the columns that we want, but we can also do some basic math to derive new values.

For example, imagine a table with data about a retail store, a row for each month and a column for expenses and a column for sales.

stores(id, month, expenses, sales)

If sales is our incoming money and expenses is our outgoing money, it would be useful to find the difference between them. We can do that in SQL using the minus sign -:

SELECT stores.sales, stores.expenses, stores.sales - stores.expenses
FROM stores

This math works on each row, subtracting the value for expenses in that row from the value in sales from that row. We use the values in the two columns to create a third column in the results.

Similarly you can add values using +, multiply values together using the * sign, and divide using the / slash character.

ORDER BY and LIMIT: Sorting results and getting top x rows

The three clauses above leave you with a results table, showing only the rows and columns that match your query. Now we’ll learn two things we can do with these results: sorting and limiting.

Let’s say that we wanted to find the more fuel efficient 4 cylinder cars for city driving. First we’d get all the 4 cylinder cars:

SELECT *
FROM cars
WHERE cars.cylinders = 4

Now we want to re-order the results table according to city_mpg (miles per gallon). Note that we’re reordering the results table, and not changing the original data in the database at all.

SELECT *
FROM cars
WHERE cars.cylinders = 4
ORDER BY city_mpg DESC

We add an ORDER BY clause saying that the rows should be ordered by the values in the city_mpg column for that row. Finally we say that we want those with the highest values first, thus our sorting order is descending which is abbreviated to DESC. The opposite is ascending, showing the smallest values at the top, shown with ASC. If you don’t put either then the results are sorted in Ascending order, but honestly it’s just clearest to always include either DESC or ASC.

Here are the results (selecting just manufacturer, model, fuel type (diesel/gas) and city_mpg):

We can see that the top three are made by Volkswagen, then a Toyota Corolla and a Honda Civic and so on right through the 81 rows that return.

After we’ve sorted we often want to see just the very top result. So rather than finding the more fuel efficient cars, we find the most fuel efficient car. This is done with LIMIT 1 which means only return the first row (after sorting).

SELECT *
FROM cars
WHERE cars.cylinders = 4
ORDER BY city_mpg DESC
LIMIT 1

I always put LIMIT on its own line, but you will frequently see it on the end of the ORDER BY line as well. I use its own line because its clearer. You could also get the top 5 using LIMIT 5:

SELECT *
FROM cars
WHERE cars.cylinders = 4
ORDER BY city_mpg DESC
LIMIT 5

Summary

The SELECT query allows you to select data out of the table in your database. It copies rows that match the conditions in the WHERE clause (using boolean algebra) and generates a results table. You can then sort and limit the results table. The figure below shows the parts of the basic select query:

Finally as usual w3cSchool has a useful SQL tutorial. It doesn’t go in exactly the same order as we are, but it’s pretty close. Here are their materials on SQL SELECT