15  Processing CSV files

For your project, you will all have to process CSV files, working through row by row and making a series of SQL INSERT statements. Many rows will require multiple INSERT statements. Very often you will have to convert names into ids, so that you can tell the database the right foreign keys.

Eventually we will learn to do this with Python. For this assignment, though, you will complete the statements required for a few rows in the CSV files we examine below.

15.1 The social media database

For this assignment we are using the social media database that we discussed earlier.

The conceptual ER is thus:

The sample table sketch is:

The code to create the tables (in DuckDB) is:

Remember to use our DuckDB setup code:

!pip install jupysql --quiet
!pip install duckdb-engine --quiet
import duckdb

%reload_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
%sql duckdb:///duckdb-file.db
%%sql
DROP TABLE IF EXISTS user_accounts;
DROP SEQUENCE IF EXISTS seq_user_accounts_id;
CREATE SEQUENCE seq_user_accounts_id START 1;
CREATE TABLE user_accounts (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_user_accounts_id'),
    name TEXT
);

DROP TABLE IF EXISTS posts;
DROP SEQUENCE IF EXISTS seq_posts_id;
CREATE SEQUENCE seq_posts_id START 10;
CREATE TABLE posts (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_posts_id'),
    name TEXT,
    "text" TEXT,
    user_accounts_id INT
);

DROP TABLE IF EXISTS posts_resources;
CREATE TABLE posts_resources (
    post_id INT,
    resource_id INT
);

DROP TABLE IF EXISTS resources;
DROP SEQUENCE IF EXISTS seq_resources_id;
CREATE SEQUENCE seq_resources_id START 20;
CREATE TABLE resources (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_resources_id'),
    name TEXT,
    type TEXT,
    path TEXT
);

DROP TABLE IF EXISTS posts_tags;
CREATE TABLE posts_tags (
  post_id INT,
  tag_id INT
);

DROP TABLE IF EXISTS tags;
DROP SEQUENCE IF EXISTS seq_tags_id;
CREATE SEQUENCE seq_tags_id START 40;
CREATE TABLE tags (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_tags_id'),
    name TEXT
);

DROP TABLE IF EXISTS bids;
DROP SEQUENCE IF EXISTS seq_bids_id;
CREATE SEQUENCE seq_bids_id START 50;
CREATE TABLE bids (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_bids_id'),
    amount INT,
    tag_id INT,
    advertiser_id INT
);

DROP TABLE IF EXISTS advertisers;
DROP SEQUENCE IF EXISTS seq_advertiser_id;
CREATE SEQUENCE seq_advertiser_id START 60;
CREATE TABLE advertisers (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_advertiser_id'),
    name TEXT
);

DROP TABLE IF EXISTS advertisers_industries;
CREATE TABLE advertisers_industries (
  advertiser_id INT,
  industry_id INT
);

DROP TABLE IF EXISTS industries;
DROP SEQUENCE IF EXISTS seq_industry_id;
CREATE SEQUENCE seq_industry_id START 70;
CREATE TABLE industries (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_industry_id'),
    name TEXT
);

15.2 The input files:

You have received three csv files:

  1. advertiser_details.csv
name,industries,revenue,employees
Nike,Athletics;Golf,"51,469,000,000","83,700"
  1. bid_data.csv
advertiser,tag,bid
Nike,running,99
Nike,golf,88
  1. posts_with_resources.csv
username,post_date,post_name,post_text,file_paths,links,tags
Frodo,2023-10-04,"I love running","Walking, walking, then I run. See this pic of Mt Doom, visit at mtdoom.com","path/to/pics/mtdoom.png","mtdoom.com",running;Doomy

15.3 The INSERT log

The assignment is to complete the log below. You can test your log by running the database creation code above, then running your query log. Since this is multiple queries, be sure to end each query with a semi-colon. If you have everything correct your code will create the tables as per the table sketch above.

The log starts out with full queries, then requires you to complete queries (shown with ...), then at the bottom just provides comments and you have to write the whole sequence of queries. Currently you will manually enter ids (don’t use sub-queries or CTEs). Later we will learn how to use Python to store returned ids and use them in later queries.

%%sql
-- Processing advertiser_details row:
-- name,industries,revenue,employees
-- Nike,Athletics;Golf,"51,469,000,000","83,700"
SELECT id 
FROM advertisers
WHERE name = 'Nike'; -- from name column in advertiser_details.csv

-- returns 0 rows.  Therefore need to create record.
INSERT INTO advertisers(name) VALUES ('Nike') RETURNING id;

--returns '60' store this.

-- Check to see if industries exist.
-- Athletics first.
SELECT id
FROM industries
WHERE industries.name = 'Athletics';

-- Returns no rows.  Therefore have to create.
INSERT INTO industries(name) VALUES ('Athletics') RETURNING id;

-- Now we have to make the connection between 'Nike' id 60 and 'Athletics' id 70.
INSERT INTO advertisers_industries(industry_id, advertiser_id) VALUES (70, 60);

-- Now 'Golf'
SELECT id
FROM industries
WHERE industries.name = 'Golf';

-- Returns no rows.  Therefore have to create.
INSERT INTO industries(name) VALUES ('Golf') RETURNING id;
--returns '71'

-- Now we have to make the connection between 'Nike' id 60 and 'Golf' id 71.
INSERT INTO advertisers_industries(industry_id, advertiser_id) VALUES (71, 60);

-- Processing bid_data
-- advertiser,tag,bid
-- Nike,running,99

-- Need ids for Advertiser: Nike and tag: running.  We have the bid amount
SELECT id 
FROM advertisers
WHERE name = 'Nike'; 

-- This time it returns one row with id: 60.  Hurray, we don't have to create this.
SELECT id
FROM tags
WHERE name = 'running';

-- Zero rows.  Have to create.
INSERT INTO tags(name) VALUES ('running') RETURNING id;

-- returns "40". Now we have the ids we need!
INSERT INTO bids(amount, tag_id, advertiser_id) VALUES (99, 40, 60);

-- And so on for all rows in bid_data.  creating records for tags when needed.

-- Now process posts_with_resources.csv
-- Find or create user.
SELECT id FROM user_accounts ...

-- Create post associated with user, store id.
INSERT INTO posts ...

-- Find or create resources (for pic and for link)
SELECT id 
FROM resources 
WHERE type = 'image'
  AND path = ...


-- Associate post with resource ids (one for pic, one for link)
INSERT INTO posts_resources(post_id,resource_id) VALUES () ...
INSERT INTO posts_resources(post_id,resource_id) VALUES () ...

-- Find or create tag "running" 
SELECT id FROM tags WHERE tags.name = ...


-- associate post with tag
INSERT INTO posts_tags(post_id,tag_id) VALUES () ...

-- Now find or create tag "Doomy"