There are three queries beyond SELECT:
Each of them reuses syntax from SELECT.
INSERT requires us to provide a list of columns and a list of data to go into each column.
INSERT INTO table_name(column1, column2) VALUES (data1, data2)
We do not have to provide data for all columns and we don’t have to use the same order as the columns were created.
Imagine a database with
Book habtm Author so we have three tables (books, authors, and books_authors). A convenient syntax for quickly describing tables in text is `tablename(id_col,other_cols…):
books(id, title) authors_book(author_id,book_id) authors(id, name)
To create these tables we use the CREATE keyword, creating the database, switching to it then creating tables, naming each column and giving its datatype.
%LOAD postgresql host=localhost
CREATE DATABASE class_books;
%LOAD postgresql host=localhost dbname=class_books
Then we check that the tables we want to create aren’t already present (this will delete the table and all its data, so we would only run this at the start of the workflow)
CREATE TABLE books ( id SERIAL PRIMARY KEY, title TEXT )
CREATE TABLE authors_books ( author_id INTEGER, book_id INTEGER )
CREATE TABLE authors ( id SERIAL PRIMARY KEY, name TEXT )
To insert the data that “Emily Bronte” wrote “Wuthering Heights” (assuming neither were in the database already) we’d do:
INSERT INTO books(id, title) VALUES (7, 'Wuthering Heights')
INSERT INTO authors(id, name) VALUES (23, 'Emily Bronte')
INSERT INTO authors_books(book_id, author_id) VALUES (7, 23)
In these examples I’m providing a manual
23 above). However the database can also provide an integer value for
id automatically. To do that we tell the database that the
id column should be
SERIAL PRIMARY KEY.
id is set to
SERIAL PRIMARY KEY then we do not need to provide a value for the id in an insert statement (and we can leave id from the list of columns). So we could have just done:
INSERT INTO books(title) VALUES ('Wuthering Heights')
This can be very useful, but if we want to use the generated id to describe relationships we have to obtain the id that the server provides. We can get that in two ways.
INSERT INTO books(title) VALUES ('Wuthering Heights') RETURNING id
We will do this later in the course using python, as we read our CSV files, constructing the queries needed to insert the data from the csv files into the database design for our project.
Note that in the xsql notebooks we are currently using
RETURNING id does not show us the id, but we will be able to use
RETURNING when we use Python. In xsql we have to get it another way, running this query immediately after the
For example, imagine we have a csv file with these data:
author,title "Chimamanda Ngozi Adichie","We Should All Be Feminists" "Wu Cheng'en","Journey to the West" "Kim Man-jung","The Nine Cloud Dream"
For each row, we would have to do these queries (changing the text each time):
INSERT INTO books(title) VALUES ('Journey to the West');
then remember the id assigned (let’s say we got back
120) (Note that a doubled single quote is the way we tell postgres to include a single quote inside a text field)
INSERT INTO authors(name) VALUES ('Wu Cheng''en');
then remember the id assigned (let’s say we got back
Finally we can then make the connection:
INSERT INTO authors_books(book_id, author_id) VALUES (120, 1007);
So each row in that csv would require at least three queries adding data to three tables. so don’t think of a 1-to-1 mapping between a csv and a table in the database. We aren’t uploading csv files as we might to Box, we are going to read csv files, understand their meaning and transform them to express that meaning in the SQL tables.
Later in the course we will return to this, because it can take more than this number of queries, since we don’t want to create duplicate records (ie we create a book just once, we create an author just once). That means when we see the name again, we have to use SELECT queries with a WHERE clause to find the id that was assigned before.
We will cover these sequences of SELECT and INSERT queries when we are doing them via Python later in the course.
Note that inserting lots of data (like millions of rows) using this approach can be slow. I examine quicker ways to insert data into Postgress in the “Advanced Tips” Module toward the end of the course.
UPDATE allows us to change data already in the database. These queries work by first identifying rows to be changed (which re-uses what we’ve learned about the WHERE clause already) and then getting new values for particular columns.
For example, we could change the spelling of Emily Bronte’s name to include a diaersis over the final e. We first find the record we want to update using
SELECT authors.id FROM authors WHERE name = 'Emily Bronte'
The we use the
id returned (
23) to effect the change:
UPDATE authors SET name = 'Emily Brontë' -- Change spelling to fancy style WHERE authors.id = 23
Here we’re being cautious to minimize and check our edits to the database by using the
SELECT first and using
id to identify the records to change.
While it is often sensible to use an id (or a group of ids) in the WHERE clause to be sure of the rows to be changed, any WHERE clause will work. For example, if we had a last name column (which we don’t) we could change all the Bronte to Brontê with one query (the database will tell you how many rows were changed).
UPDATE authors SET lastname = 'Brontë' WHERE lastname = 'Bronte' -- Change all the Brontes at once.
We can also change more than one column at a time, by providing a comma separated list of fields and value pairs in the SET clause. For example we could change both the venue and the band for a performance (keeping the start and end times).
UPDATE authors SET firstname = 'Emile', lastname = 'Bronté' WHERE id = 2
btw, I think the lack of consistency between INSERT (two comma separated lists of columns and values) and UPDATE SET (one comma separated list column = value) is confusing. I don’t know why it is that way.
The question often arises about whether we can do
UPDATEs based on joins (adding a
JOIN clause). In general yes, that is possible. However, it is confusing (am I updating the temporary join table or the underlying tables?) and can be hard to read and to check. I recommend finding the ids of the rows that you want to update (using a
SELECT), then updating those using that SELECT in a sub-query.
For example if we wanted to update the fee for all bands playing in the BMI venue, adding $100 to each band’s fee.
-- Find the list of band ids. SELECT bands.id FROM bands JOIN performances ON bands.id = performances.band_id JOIN venues ON performances.venue_id = venues.id WHERE venues.name = 'BMI' -- UPDATE using a sub-query to provide the list of ids. UPDATE bands SET bands.fee = bands.fee + 100 WHERE bands.id IN ( SELECT bands.id FROM bands JOIN performances ON bands.id = performances.band_id JOIN venues ON performances.venue_id = venues.id WHERE venues.name = 'BMI' )
Finally we can delete data in the database. The DELETE statement is a SELECT statement (to find the rows you want to delete) but rather than specifying the columns you want, change the SELECT clause to DELETE. Using SELECT first helps you ensure that you will delete the right rows! No Undo button (at least not in the way we have SQL set up, undo is possible using a feature called “transactions”).
SELECT * FROM books WHERE id = 23 DELETE FROM books WHERE id = 23
In this class we will have to manually delete references to this rows from tables with foreign keys.
DELETE FROM books_authors WHERE book_id = 23
Note that one would need to check if the author had written any other books before deleting the author.
DELETEs can’t be undone, so only delete using
ids and do
It is possible to automatically delete linked rows using “cascading deletes” but usually there are other things that need to be checked and changed, so largely the logic for that has moved out of the database and is implemented through python, ruby, or php. It is worth, though, thinking briefly about those situations.
The instruction “Delete Emily Bronte” seems straightforward.
SELECT * FROM people WHERE people.name = 'Emily Bronte' -- Check that it all looks good and ok to delete, no unexpected rows. DELETE FROM people WHERE people.name = 'Emily Bronte'
However, deleting Emily from the people table does not remove her id from the database entirely. For example it would leave her id in the book_roles table associated with books she had written (and potentially in other tables) Even if we delete all of her book_roles what ought to happen to books that she co-authored with others, should they now also be removed? (Are we erasing her from history?) If we don’t then we have erroneous information about those books.
Let’s say that we want to remove any books (and book_roles) that were associated with Emily Bronte. In these situations it is best to work backwards from the further location (ie the Book), back towards the Person. (It’s easier to find the books that Emily Bronte was involved in before Emily Bronte’s record in the person table is deleted!)
-- Find all the books associated with Emily Bronte in any form. SELECT * FROM people WHERE people.name = 'Emily Bronte' -- Navigate over to books, this gets any books associated with Emily. SELECT * FROM people JOIN book_roles ON book_roles.person_id = people.id JOIN books ON book_roles.book_id = books.id WHERE people.name = 'Emily Bronte' -- Just get books.id -- This returns a list of book ids like (7, 23, 45) SELECT books.id FROM people JOIN book_roles ON book_roles.person_id = people.id JOIN books ON book_roles.book_id = books.id WHERE people.name = 'Emily Bronte' -- Now use that list to do the delete from books. -- Could write the results of last query down -- (I just made these numbers up) and use like this: SELECT * FROM books WHERE books.id IN (7, 23, 45) -- But it's better to use a sub-query that generates the list. -- Here I've just substituted the query two up for the list of ids. SELECT * FROM books WHERE books.id IN ( SELECT books.id FROM people JOIN book_roles ON book_roles.person_id = people.id JOIN books ON book_roles.book_id = books.id WHERE people.name = 'Emily Bronte' ) -- Now convert the SELECT * to DELETE. DELETE FROM books WHERE books.id IN ( SELECT books.id FROM people JOIN book_roles ON book_roles.person_id = people.id JOIN book ON book_roles.book_id = books.id WHERE people.name = 'Emily Bronte' ) -- Now delete all book_roles associated with those books -- SELECT * FROM people JOIN book_roles ON book_roles.person_id = people.id WHERE people.name = 'Emily Bronte' -- Get book_ids for those. SELECT book_roles.book_id FROM people JOIN book_roles ON book_roles.person_id = people.id WHERE people.name = 'Emily Bronte' -- Now find all remaining records associated with those books. SELECT * FROM book_roles WHERE book_roles.book_id IN ( SELECT book_roles.book_id FROM people JOIN book_roles ON book_roles.person_id = people.id WHERE people.name = 'Emily Bronte' ) -- Note that this deletes a supplier role as well. Possibility that -- we "orphan" a supplier person if they are no longer associated -- with the book. See below. -- And convert to a delete. DELETE FROM book_roles WHERE book_roles.book_id IN ( SELECT book_roles.book_id FROM people JOIN book_roles ON book_roles.person_id = people.id WHERE people.name = 'Emily Bronte' ) -- Finally delete Emily. DELETE FROM people WHERE people.name = 'Emily Bronte'
An advance would be to find all orphaned people (could be Emily’s translator or supplier if that was their last book). That type of query requires a form of JOIN that we haven’t discussed, called an OUTER JOIN. See https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/