24  csv to sql via Python

In this class we’ll put all the elements we need for our projects together. For your project this will be combining the logical order from the SQL query sequence assignment together with parameterized queries, and loops to read each csv.

Do your notebook setup as at the top of Chapter 22.

We’ll read our csv files, split out data to relevant tables, manage foreign/primary keys and then run a query to export data for our analysis.

You should have your notebook set up as in

We were working to get this csv file

students_classes.csv
------------
class_name, student_name, semester
Divination, Hermione, Fall
Divination, Harry, Fall
Potions, Ron, Spring

into these tables:

students(id, name)
------------------
classes(id, name)
-----------------
enrollments(id, semester, student_id, class_id)
-----------------------------------------------

This code will create all three tables:

-- drop if they exist
DROP TABLE IF EXISTS students;
DROP SEQUENCE IF EXISTS seq_student_id;
-- create sequence then use it as default
CREATE SEQUENCE seq_student_id START 1;
CREATE TABLE students (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_student_id'),
    name TEXT
);

DROP TABLE IF EXISTS classes;
DROP SEQUENCE IF EXISTS seq_class_id;
-- create sequence then use it as default
CREATE SEQUENCE seq_class_id START 1;
CREATE TABLE classes (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_class_id'),
    name TEXT
);

DROP TABLE IF EXISTS enrollments;
DROP SEQUENCE IF EXISTS seq_enrollment_id;
-- create sequence then use it as default
CREATE SEQUENCE seq_enrollment_id START 1;
CREATE TABLE enrollments (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_enrollment_id'),
    semester TEXT,
    student_id INT,
    class_id INT,
);

As we read each row of the csv file we need to do an INSERT into the enrollments table. With placeholders it is going to look like:

INSERT INTO enrollments(semester, student_id, class_id) 
VALUES ($semester, $student_id, $class_id)

But as we read the csv in row we only have the student_name and the class_name. So we have to convert those to id. And this means that we have to find or create the student and the class.

To find an existing row, we use a SELECT query. To check if we found it we use an if statement. If we couldn’t find it, we use an INSERT ... RETURNING id

First we add only the find.

with duckdb.connect('duckdb-file.db') as con:

    sql_find_student = "SELECT id FROM students WHERE name = $new_student"
 
    with open('students_classes.csv') as csvfile:
        myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

        for row in myCSVReader:
            
            param_dict = {'new_student': row['student_name']}
            con.execute(sql_find_student, param_dict)
            id_check = con.fetchone() # returns None if no rows found
            if (id_check):
                student_id = id_check[0]
                print(f"Found {student_id} for {row['student_name']}")
            else:
                print(f"Did not find a row for {row['student_name']}.")

Now we can add the INSERT into the else block.

with duckdb.connect('duckdb-file.db') as con:

    sql_find_student = "SELECT id FROM students WHERE name = $new_student"
    sql_insert_student = "INSERT INTO students(name) VALUES ($new_student) RETURNING id"
 
    with open('students_classes.csv') as csvfile:
        myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

        for row in myCSVReader:
            
            param_dict = {'new_student': row['student_name']}
            con.execute(sql_find_student, param_dict)
            id_check = con.fetchone() # returns None if no rows found
            if (id_check):
                student_id = id_check[0]
            else:
                con.execute(sql_insert_student, param_dict)
                student_id = con.fetchone()[0]
                
            print(f"Found or created {student_id} for {row['student_name']}")
Easy typo

Check the variable name for the SQL you send to con.execute (e.g., sql_find_student vs sql_insert_student).

When copying and pasting parts of this code, it is easy to introduce issues by forgetting to change this.

So now we have translated the string student_name to an id whether we’ve seen this student before or not.

But we still need to do the same thing for class_name before we can execute the INSERT for enrollment. This code looks very much the same.

with duckdb.connect('duckdb-file.db') as con:

    sql_find_class = "SELECT id FROM classes WHERE name = $new_class"
    sql_insert_class = "INSERT INTO classes(name) VALUES ($new_class) RETURNING id"
 
    with open('students_classes.csv') as csvfile:
        myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

        for row in myCSVReader:
            
            param_dict = {'new_class': row['class_name']}
            con.execute(sql_find_class, param_dict)
            id_check = con.fetchone() # returns None if no rows found
            if (id_check):
                class_id = id_check[0]
            else:
                con.execute(sql_insert_class, param_dict)
                class_id = con.fetchone()[0]
                
            print(f"Found or created {class_id} for {row['class_name']}")

This should output:

Found or created 1 for Divination
Found or created 1 for Divination
Found or created 2 for Potions

which is good because the code created “Divination” on the first row, but found that id (1) on the second row.

Now that we have both student_id and class_id for this row, we can finally execute our INSERT for enrollments. When strings get longer, using the """ multi-line string is helpful.

The relevant lines are extracted below:


sql_insert_enrollments = """INSERT INTO enrollments(semester, student_id, class_id) 
                            VALUES ($semester, $student_id, $class_id)"""

...

        param_dict = {'class_id': class_id,
                      'student_id': student_id,
                      'semester: row['semester']}
        con.execute(sql_insert_enrollments, param_dict)

When we build up the param_dict for this we get the ids from the local variables we created (class_id and student_id) but they have to go into param_dict with keys that match the sql placeholder template. Then we get the semester directly from row.

Putting this all together gives us:

with duckdb.connect('duckdb-file.db') as con:

    sql_find_class = "SELECT id FROM classes WHERE name = $new_class"
    sql_insert_class = "INSERT INTO classes(name) VALUES ($new_class) RETURNING id"
    
    sql_find_student = "SELECT id FROM students WHERE name = $new_student"
    sql_insert_student = "INSERT INTO students(name) VALUES ($new_student) RETURNING id"
 
    sql_insert_enrollments = """INSERT INTO enrollments(semester, student_id, class_id) 
                                VALUES ($semester, $student_id, $class_id)"""
                                          
    with open('students_classes.csv') as csvfile:
        myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

        for row in myCSVReader:
            
            # find or create class_id
            param_dict = {'new_class': row['class_name']}
            con.execute(sql_find_class, param_dict)
            id_check = con.fetchone() # returns None if no rows found
            if (id_check):
                class_id = id_check[0]
            else:
                con.execute(sql_insert_class, param_dict)
                class_id = con.fetchone()[0]
                
            # find or create student_id
            param_dict = {'new_student': row['student_name']}
            con.execute(sql_find_student, param_dict)
            id_check = con.fetchone() # returns None if no rows found
            if (id_check):
                student_id = id_check[0]
            else:
                con.execute(sql_insert_student, param_dict)
                student_id = con.fetchone()[0]
                
            # Either way we now have what we need to do INSERT into enrollments
            param_dict = {'class_id': class_id,
                      'student_id': student_id,
                      'semester': row['semester']}
            con.execute(sql_insert_enrollments, param_dict)    

You can check that the data goes in correctly using SELECT * FROM classes or joining them all back up using:

%%sql
SELECT students.name AS student_name, classes.name AS class_name, semester
FROM students
  JOIN enrollments ON students.id = enrollments.student_id
  JOIN classes ON enrollments.class_id = classes.id

This is the basic approach that we will take in the course. We are using python to drive the logical process that we implemented when we created the INSERT logs.

A few additional points:

  1. Only data added to the param_dict goes to the database. So ff the CSV has data you don’t want to insert, you just never reference it (python reads it into row but then forgets it when it processes the next row)

  2. The csv file can have entirely different names than the database. The CSV gets mapped to the param_dict when param_dict is created, and the database tables are mapped through the placeholders in the INSERT statement.

Within this basic structure we can handle many issues in the data.

For example, we may have some extra information from outside the CSV file we want to add into a query. Imagine that our source CSV file did not have semester information, just student_name and class_name. Instead we just knew that this file was all “spring” semester then we can include this data in the param_dict directly:

            param_dict = {'class_id': class_id,
                          'student_id': student_id,
                          'semester: "spring"} # semester as a constant
            con.execute(sql_insert_enrollments, param_dict)    

Later we will use a similar approach to learn to process a directory of CSV files with a similar format, taking data from each of the the file names.

We will also learn how to fix data inside fields, such as removing unwanted pieces of a string, or parsing datetimes.

25 Using UPDATE to add columns from second csv file

If we already have data in a table, but we want to add data from another csv file then we use UPDATE and not INSERT.

Often we have two CSV files about the same entity. For example a csv with data on the population of a country, and a second csv file with data on the area of a country. These chunks of code will write these small CSV files to your notebook file area.

with open("country_population.csv", "w") as text_file:
    text_file.write("""
country,population_millions
Australia,18.9
USA,332.4
Indonesia,273.50
""".lstrip())

with open("country_area.csv.csv", "w") as text_file:
    text_file.write("""
country,area_sq_km
Australia,7692020
USA,9147420
Indonesia,1877519
""".lstrip())

And we want to load these into a single table, created with this SQL:

%%sql
DROP TABLE IF EXISTS countries;
DROP SEQUENCE IF EXISTS seq_country_id;
CREATE SEQUENCE seq_country_id START 1;
CREATE TABLE countries (
    id INT PRIMARY KEY DEFAULT NEXTVAL('seq_country_id'),
    name TEXT,
    population_millions DECIMAL,
    area_sq_km INTEGER
);

So that eventually we want to see rows like:

id country population_millions area_sq_km
1 Australia 18.9 7692020

We can read and INSERT the rows from the first csv we read, but the second (and subsequent) csvs we will need to use UPDATE.

with duckdb.connect('duckdb-file.db') as con:

        with open('country_population.csv') as csvfile:
            #country,population_millions
            #Australia,18.9
            
            myCSVReader = csv.DictReader(csvfile)
            for row in myCSVReader:
                con.execute("""
                    INSERT INTO countries (
                                    name, 
                                    population_millions
                                ) VALUES (
                                    $country,
                                    $population_millions
                                )
                    """,
                    row
                )

This creates a row for each country and fills out the population_millions column. However it leaves the area_sq_km column full of NULL values. (Note that we don’t use param_dict here and just use row directly in the execute call; this only works because row has all the keys in the placeholder SQL and only the keys in the placeholder SQL).

SELECT *
FROM countries

shows us

id country population_millions area_sq_km
1 Australia 18.9 NULL
2 USA 332.4 NULL
3 Indonesia 273.50 NULL

If we use an INSERT when loading the second file we get new rows with the area_sq_km filled out, but the population_millions will be all NULL. And we’ll have six rows instead of three.

id country population_millions area_sq_km
1 Australia 18.9 NULL
2 USA 332.4 NULL
3 Indonesia 273.50 NULL
4 Australia NULL 7692020
5 USA NULL 9147420
6 Indonesia NULL 1877519

To avoid this we use UPDATE when processing the subsequent csv files.

with duckdb.connect('duckdb-file.db') as con:

        with open('country_area.csv') as csvfile:
            #country,area_sq_km
            #Australia,7692020
            
            myCSVReader = csv.DictReader(csvfile)
            for row in myCSVReader:
                con.execute("""
                    UPDATE countries 
                    SET area_sq_km = $area_sq_km
                    WHERE name =  $country
                    """,
                    row
                )

And we end up with:

id country population_millions area_sq_km
1 Australia 18.9 7692020
2 USA 332.4 9147420
3 Indonesia 273.50 1877519

Note that in this example we aren’t handling variants of country names. If we had to do that, say if the two data files used different words for countries, we would need to look up the country_id in a country_names table, then use that in the INSERT and the UPDATE.

You may encounter this situation in your project. You will have to choose an order to process your CSV files. That order is up to whatever makes the most logical sense in your specific data, but if you’ve already created the relevant rows, then you’ll have to use UPDATE in this way to add more data to those rows.