Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

There are three queries beyond SELECT:

  1. INSERT (to add new rows)
  2. UPDATE (to change data)
  3. DELETE (to remove rows)

Each of them reuses syntax from SELECT.

INSERT

INSERT requires us to provide a list of columns and a list of data to go into each column.

INSERT INTO table_name(column1, column2) VALUE (data1, data2)

We do not have to provide data for all columns and we don’t have to use the same order as the columns were created.

Imagine a database with Book habtm Author so we have three tables (books, authors, and books_authors). To insert the data that “Emily Bronte” wrote “Wuthering Heights” (assuming neither were in the database already) we’d do:

INSERT INTO books(id, title) VALUE (7, 'Wuthering Heights')
INSERT INTO authors(id, name) VALUE (23, 'Emily Bronte')
INSERT INTO books_authors(book_id, author_id) VALUE (7, 23)

In these examples I’m providing a manual id (7 and 23 above). However the database can also provide an integer value for id automatically. To do that we tell the database that the id column should be AUTO_INCREMENT. In phpmyadmin we do that by checking the “A_I” checkbox for the id column and the full word appears in the “extras” column in phpmyadmin’s Structure tab:

If id is set to AUTO_INCREMENT then we do not need to provide a value for the id in an insert statement (and we can leave id from the list of columns). So we could have just done:

INSERT INTO books(title) VALUE ('Wuthering Heights')

This can be very useful, but if we want to use the generated id to describe relationships we have to obtain the id that the server provides. We will cover how to do this via python.

Inserting lots of data using this approach can be very slow. I examine quicker ways to insert data into MySQL in the “Advanced Tips” Module toward the end of the course. The relevant screencast is Loading Very Large CSV files.

UPDATE data already in the database

UPDATE allows us to change data already in the database. These queries work by first identifying rows to be changed (which re-uses what we’ve learned about the WHERE clause already) and then getting new values for particular columns.

For example, we could change the spelling of Emily Bronte’s name to include a diaersis over the final e. We first find the record we want to update using SELECT.

SELECT authors.id
FROM authors
WHERE name = "Emily Bronte"

The we use the id returned (23) to effect the change:

UPDATE authors
SET name = "Emily Brontë" -- Change spelling to fancy style
WHERE authors.id = 23

Here we’re being cautious to minimize and check our edits to the database by using the SELECT first and using id to identify the records to change.

While it is often sensible to use an id (or a group of ids) in the WHERE clause to be sure of the rows to be changed, any WHERE clause will work. For example, if we had a last name column we could change all the Bronte to Brontê with one query (the database will tell you how many rows were changed).

UPDATE authors
SET lastname = "Brontë"
WHERE lastname = "Bronte" -- Change all the Brontes at once.

We can also change more than one column at a time, by providing a comma separated list of fields and value pairs in the SET clause. For example we could change both the venue and the band for a performance (keeping the start and end times).

UPDATE performances
SET venue_id = 23, band_id = 7
WHERE id = 2

btw, I think the lack of consistency between INSERT (two comma separated lists of columns and values) and UPDATE SET (one comma separated list column = value) is confusing. I don’t know why it is that way.

The question often arises about whether we can do UPDATEs based on joins (adding a FROM/JOIN clause). In general yes, that is possible. However, it is confusing (am I updating the join table or the underlying tables?) and can be hard to read and to check. I recommend finding the ids of the rows that you want to update (using a SELECT), then updating those using that SELECT in a sub-query.

For example if we wanted to update the fee for all bands playing in the BMI venue, adding $100 to each band’s fee.

-- Find the list of band ids.
SELECT bands.id
FROM bands
  JOIN performances
    ON bands.id = performances.band_id
  JOIN venues
    ON performances.venue_id = venues.id
WHERE venues.name = "BMI"

-- UPDATE using a sub-query to provide the list of ids.
UPDATE bands
SET bands.fee = bands.fee + 100
WHERE bands.id IN (
    SELECT bands.id
    FROM bands
      JOIN performances
        ON bands.id = performances.band_id
      JOIN venues
        ON performances.venue_id = venues.id
    WHERE venues.name = "BMI"
)

DELETE

Finally we can delete data in the database. The DELETE statement is a SELECT statement (to find the rows you want to delete) but rather than specifying the columns you want, change the SELECT clause to DELETE. Using SELECT first helps you ensure that you will delete the right rows! No Undo button!

SELECT * FROM books WHERE id = 23
DELETE FROM books WHERE id = 23

In MySQL setup as we have done in this class we will have to manually delete references to this rows from tables with foreign keys.

DELETE FROM books_authors WHERE book_id = 23

Note that one would need to check if the author had written any other books before deleting the author. UPDATEs and DELETEs can’t be undone, so only delete using ids and do SELECTs first.

It is possible to automatically delete linked rows using “cascading deletes” but usually there are other things that need to be checked and changed, so largely the logic for that has moved out of the database and is implemented through python, ruby, or php. It is worth, though, thinking briefly about those situations.

Delete Emily Bronte?

The instruction “Delete Emily Bronte” seems straightforward.

SELECT id
FROM people
WHERE people.name = "Emily Bronte"

DELETE
FROM people
WHERE people.name = "Emily Bronte"

However, deleting Emily from the people table does not remove her id from the database entirely. For example it would leave her id in the book_roles table associated with books she had written (and potentially in other tables) Even if we delete all of her book_roles what ought to happen to books that she co-authored with others, should they now also be removed? (Are we erasing her from history?) If we don’t then we have erroneous information about those books.

Let’s say that we want to remove any books (and book_roles) that were associated with Emily Bronte. In these situations it is best to work backwards from the further location (ie the Book), back towards the Person. (It’s easier to find the books that Emily Bronte was involved in before Emily Bronte’s record in the person table is deleted!)

-- Find all the books associated with Emily Bronte in any form.
SELECT *
FROM people
WHERE people.name = "Emily Bronte"

-- Navigate over to books, this gets any books associated with Emily.
SELECT *
FROM people
  JOIN book_roles
    ON book_roles.person_id = people.id
  JOIN books
    ON book_roles.book_id = books.id
WHERE people.name = "Emily Bronte"

-- Just get books.id
-- This returns a list of book ids like (7, 23, 45)
SELECT books.id
FROM people
  JOIN book_roles
    ON book_roles.person_id = people.id
  JOIN books
    ON book_roles.book_id = books.id
WHERE people.name = "Emily Bronte"

-- Now use that list to do the delete from books.
-- Could write the results of last query down
-- (I just made these numbers up) and use like this:
SELECT *
FROM books
WHERE books.id IN (7, 23, 45)

-- But it's better to use a sub-query that generates the list.
-- Here I've just substituted the query two up for the list of ids.
SELECT *
FROM books
WHERE books.id IN (
                    SELECT books.id
                    FROM people
                      JOIN book_roles
                        ON book_roles.person_id = people.id
                      JOIN books
                        ON book_roles.book_id = books.id
                    WHERE people.name = "Emily Bronte"
                   )

-- Now convert the SELECT * to DELETE.
DELETE
FROM books
WHERE books.id IN (
                    SELECT books.id
                    FROM people
                      JOIN book_roles
                        ON book_roles.person_id = people.id
                      JOIN book
                        ON book_roles.book_id = books.id
                    WHERE people.name = "Emily Bronte"
                   )

-- Now delete all book_roles associated with those books
--
SELECT *
FROM people
  JOIN book_roles
    ON book_roles.person_id = people.id
WHERE people.name = "Emily Bronte"

-- Get book_ids for those.
SELECT book_roles.book_id
FROM people
  JOIN book_roles
    ON book_roles.person_id = people.id
WHERE people.name = "Emily Bronte"

-- Now find all remaining records associated with those books.
SELECT *
FROM book_roles
WHERE book_roles.book_id IN (
                            SELECT book_roles.book_id
                            FROM people
                              JOIN book_roles
                                ON book_roles.person_id = people.id
                            WHERE people.name = "Emily Bronte"
)
-- Note that this deletes a supplier role as well. Possibility that
-- we "orphan" a supplier person if they are no longer associated
-- with the book. See below.

-- And convert to a delete.
DELETE
FROM book_roles
WHERE book_roles.book_id IN (
                            SELECT book_roles.book_id
                            FROM people
                              JOIN book_roles
                                ON book_roles.person_id = people.id
                            WHERE people.name = "Emily Bronte"
)

-- Finally delete Emily.
DELETE
FROM people
WHERE people.name = "Emily Bronte"

An advance would be to find all orphaned people (could be Emily’s translator or supplier if that was their last book). That type of query requires a form of JOIN that we haven’t discussed, called an OUTER JOIN. See https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/