15 Processing CSV files
For your project, you will all have to process CSV files, working through row by row and making a series of SQL INSERT statements. Many rows will require multiple INSERT statements. Very often you will have to convert names into ids, so that you can tell the database the right foreign keys.
Eventually we will learn to do this with Python. For this assignment, though, you will complete the statements required for a few rows in the CSV files we examine below.
15.2 The input files:
You have received three csv files:
- advertiser_details.csv
name,industries,revenue,employees
Nike,Athletics;Golf,"51,469,000,000","83,700"
- bid_data.csv
advertiser,tag,bid
Nike,running,99
Nike,golf,88
- posts_with_resources.csv
username,post_date,post_name,post_text,file_paths,links,tags
Frodo,2023-10-04,"I love running","Walking, walking, then I run. See this pic of Mt Doom, visit at mtdoom.com","path/to/pics/mtdoom.png","mtdoom.com",running;Doomy
15.3 The INSERT log
The assignment is to complete the log below. You can test your log by running the database creation code above, then running your query log. Since this is multiple queries, be sure to end each query with a semi-colon. If you have everything correct your code will create the tables as per the table sketch above.
The log starts out with full queries, then requires you to complete queries (shown with ...
), then at the bottom just provides comments and you have to write the whole sequence of queries. Currently you will manually enter ids (don’t use sub-queries or CTEs). Later we will learn how to use Python to store returned ids and use them in later queries.
%%sql-- Processing advertiser_details row:
-- name,industries,revenue,employees
-- Nike,Athletics;Golf,"51,469,000,000","83,700"
SELECT id
FROM advertisers
WHERE name = 'Nike'; -- from name column in advertiser_details.csv
-- returns 0 rows. Therefore need to create record.
INSERT INTO advertisers(name) VALUES ('Nike') RETURNING id;
--returns '60' store this.
-- Check to see if industries exist.
-- Athletics first.
SELECT id
FROM industries
WHERE industries.name = 'Athletics';
-- Returns no rows. Therefore have to create.
INSERT INTO industries(name) VALUES ('Athletics') RETURNING id;
-- Now we have to make the connection between 'Nike' id 60 and 'Athletics' id 70.
INSERT INTO advertisers_industries(industry_id, advertiser_id) VALUES (70, 60);
-- Now 'Golf'
SELECT id
FROM industries
WHERE industries.name = 'Golf';
-- Returns no rows. Therefore have to create.
INSERT INTO industries(name) VALUES ('Golf') RETURNING id;
--returns '71'
-- Now we have to make the connection between 'Nike' id 60 and 'Golf' id 71.
INSERT INTO advertisers_industries(industry_id, advertiser_id) VALUES (71, 60);
-- Processing bid_data
-- advertiser,tag,bid
-- Nike,running,99
-- Need ids for Advertiser: Nike and tag: running. We have the bid amount
SELECT id
FROM advertisers
WHERE name = 'Nike';
-- This time it returns one row with id: 60. Hurray, we don't have to create this.
SELECT id
FROM tags
WHERE name = 'running';
-- Zero rows. Have to create.
INSERT INTO tags(name) VALUES ('running') RETURNING id;
-- returns "40". Now we have the ids we need!
INSERT INTO bids(amount, tag_id, advertiser_id) VALUES (99, 40, 60);
-- And so on for all rows in bid_data. creating records for tags when needed.
-- Now process posts_with_resources.csv
-- Find or create user.
SELECT id FROM user_accounts ...
-- Create post associated with user, store id.
INSERT INTO posts ...
-- Find or create resources (for pic and for link)
SELECT id
FROM resources
WHERE type = 'image'
AND path = ...
-- Associate post with resource ids (one for pic, one for link)
INSERT INTO posts_resources(post_id,resource_id) VALUES () ...
INSERT INTO posts_resources(post_id,resource_id) VALUES () ...
-- Find or create tag "running"
SELECT id FROM tags WHERE tags.name = ...
-- associate post with tag
INSERT INTO posts_tags(post_id,tag_id) VALUES () ...
-- Now find or create tag "Doomy"