24 csv to sql via Python
In this class we’ll put all the elements we need for our projects together. For your project this will be combining the logical order from the SQL query sequence assignment together with parameterized queries, and loops to read each csv.
Do your notebook setup as at the top of Chapter 22.
We’ll read our csv files, split out data to relevant tables, manage foreign/primary keys and then run a query to export data for our analysis.
You should have your notebook set up as in
We were working to get this csv file
students_classes.csv
------------
class_name, student_name, semester
Divination, Hermione, Fall
Divination, Harry, Fall
Potions, Ron, Spring
into these tables:
students(id, name)
------------------
classes(id, name)
-----------------
enrollments(id, semester, student_id, class_id)
-----------------------------------------------
This code will create all three tables:
-- drop if they exist
DROP TABLE IF EXISTS students;
DROP SEQUENCE IF EXISTS seq_student_id;
-- create sequence then use it as default
CREATE SEQUENCE seq_student_id START 1;
CREATE TABLE students (
id INT PRIMARY KEY DEFAULT NEXTVAL('seq_student_id'),
name TEXT
);
DROP TABLE IF EXISTS classes;
DROP SEQUENCE IF EXISTS seq_class_id;
-- create sequence then use it as default
CREATE SEQUENCE seq_class_id START 1;
CREATE TABLE classes (
id INT PRIMARY KEY DEFAULT NEXTVAL('seq_class_id'),
name TEXT
);
DROP TABLE IF EXISTS enrollments;
DROP SEQUENCE IF EXISTS seq_enrollment_id;
-- create sequence then use it as default
CREATE SEQUENCE seq_enrollment_id START 1;
CREATE TABLE enrollments (
id INT PRIMARY KEY DEFAULT NEXTVAL('seq_enrollment_id'),
semester TEXT,INT,
student_id INT,
class_id );
As we read each row of the csv file we need to do an INSERT into the enrollments table. With placeholders it is going to look like:
INSERT INTO enrollments(semester, student_id, class_id)
VALUES ($semester, $student_id, $class_id)
But as we read the csv in row
we only have the student_name and the class_name. So we have to convert those to id
. And this means that we have to find or create the student and the class.
To find an existing row, we use a SELECT
query. To check if we found it we use an if statement. If we couldn’t find it, we use an INSERT ... RETURNING id
First we add only the find.
with duckdb.connect('duckdb-file.db') as con:
= "SELECT id FROM students WHERE name = $new_student"
sql_find_student
with open('students_classes.csv') as csvfile:
= csv.DictReader(csvfile, delimiter=",", quotechar='"')
myCSVReader
for row in myCSVReader:
= {'new_student': row['student_name']}
param_dict
con.execute(sql_find_student, param_dict)= con.fetchone() # returns None if no rows found
id_check if (id_check):
= id_check[0]
student_id print(f"Found {student_id} for {row['student_name']}")
else:
print(f"Did not find a row for {row['student_name']}.")
Now we can add the INSERT
into the else block.
with duckdb.connect('duckdb-file.db') as con:
= "SELECT id FROM students WHERE name = $new_student"
sql_find_student = "INSERT INTO students(name) VALUES ($new_student) RETURNING id"
sql_insert_student
with open('students_classes.csv') as csvfile:
= csv.DictReader(csvfile, delimiter=",", quotechar='"')
myCSVReader
for row in myCSVReader:
= {'new_student': row['student_name']}
param_dict
con.execute(sql_find_student, param_dict)= con.fetchone() # returns None if no rows found
id_check if (id_check):
= id_check[0]
student_id else:
con.execute(sql_insert_student, param_dict)= con.fetchone()[0]
student_id
print(f"Found or created {student_id} for {row['student_name']}")
Check the variable name for the SQL you send to con.execute (e.g., sql_find_student
vs sql_insert_student
).
When copying and pasting parts of this code, it is easy to introduce issues by forgetting to change this.
So now we have translated the string student_name
to an id
whether we’ve seen this student before or not.
But we still need to do the same thing for class_name
before we can execute the INSERT
for enrollment. This code looks very much the same.
with duckdb.connect('duckdb-file.db') as con:
= "SELECT id FROM classes WHERE name = $new_class"
sql_find_class = "INSERT INTO classes(name) VALUES ($new_class) RETURNING id"
sql_insert_class
with open('students_classes.csv') as csvfile:
= csv.DictReader(csvfile, delimiter=",", quotechar='"')
myCSVReader
for row in myCSVReader:
= {'new_class': row['class_name']}
param_dict
con.execute(sql_find_class, param_dict)= con.fetchone() # returns None if no rows found
id_check if (id_check):
= id_check[0]
class_id else:
con.execute(sql_insert_class, param_dict)= con.fetchone()[0]
class_id
print(f"Found or created {class_id} for {row['class_name']}")
This should output:
Found or created 1 for Divination
Found or created 1 for Divination
Found or created 2 for Potions
which is good because the code created “Divination” on the first row, but found that id (1) on the second row.
Now that we have both student_id
and class_id
for this row, we can finally execute our INSERT
for enrollments. When strings get longer, using the """
multi-line string is helpful.
The relevant lines are extracted below:
= """INSERT INTO enrollments(semester, student_id, class_id)
sql_insert_enrollments VALUES ($semester, $student_id, $class_id)"""
...
= {'class_id': class_id,
param_dict 'student_id': student_id,
'semester: row['semester']}
con.execute(sql_insert_enrollments, param_dict)
When we build up the param_dict for this we get the ids from the local variables we created (class_id
and student_id
) but they have to go into param_dict with keys that match the sql placeholder template. Then we get the semester directly from row.
Putting this all together gives us:
with duckdb.connect('duckdb-file.db') as con:
= "SELECT id FROM classes WHERE name = $new_class"
sql_find_class = "INSERT INTO classes(name) VALUES ($new_class) RETURNING id"
sql_insert_class
= "SELECT id FROM students WHERE name = $new_student"
sql_find_student = "INSERT INTO students(name) VALUES ($new_student) RETURNING id"
sql_insert_student
= """INSERT INTO enrollments(semester, student_id, class_id)
sql_insert_enrollments VALUES ($semester, $student_id, $class_id)"""
with open('students_classes.csv') as csvfile:
= csv.DictReader(csvfile, delimiter=",", quotechar='"')
myCSVReader
for row in myCSVReader:
# find or create class_id
= {'new_class': row['class_name']}
param_dict
con.execute(sql_find_class, param_dict)= con.fetchone() # returns None if no rows found
id_check if (id_check):
= id_check[0]
class_id else:
con.execute(sql_insert_class, param_dict)= con.fetchone()[0]
class_id
# find or create student_id
= {'new_student': row['student_name']}
param_dict
con.execute(sql_find_student, param_dict)= con.fetchone() # returns None if no rows found
id_check if (id_check):
= id_check[0]
student_id else:
con.execute(sql_insert_student, param_dict)= con.fetchone()[0]
student_id
# Either way we now have what we need to do INSERT into enrollments
= {'class_id': class_id,
param_dict 'student_id': student_id,
'semester': row['semester']}
con.execute(sql_insert_enrollments, param_dict)
You can check that the data goes in correctly using SELECT * FROM classes
or joining them all back up using:
%%sql
SELECT students.name AS student_name, classes.name AS class_name, semester
FROM studentsid = enrollments.student_id
JOIN enrollments ON students.= classes.id JOIN classes ON enrollments.class_id
This is the basic approach that we will take in the course. We are using python to drive the logical process that we implemented when we created the INSERT
logs.
A few additional points:
Only data added to the
param_dict
goes to the database. So ff the CSV has data you don’t want to insert, you just never reference it (python reads it intorow
but then forgets it when it processes the next row)The csv file can have entirely different names than the database. The CSV gets mapped to the param_dict when param_dict is created, and the database tables are mapped through the placeholders in the INSERT statement.
Within this basic structure we can handle many issues in the data.
For example, we may have some extra information from outside the CSV file we want to add into a query. Imagine that our source CSV file did not have semester information, just student_name and class_name. Instead we just knew that this file was all “spring” semester then we can include this data in the param_dict directly:
= {'class_id': class_id,
param_dict 'student_id': student_id,
'semester: "spring"} # semester as a constant
con.execute(sql_insert_enrollments, param_dict)
Later we will use a similar approach to learn to process a directory of CSV files with a similar format, taking data from each of the the file names.
We will also learn how to fix data inside fields, such as removing unwanted pieces of a string, or parsing datetimes.
25 Using UPDATE to add columns from second csv file
If we already have data in a table, but we want to add data from another csv file then we use UPDATE
and not INSERT
.
Often we have two CSV files about the same entity. For example a csv with data on the population of a country, and a second csv file with data on the area of a country. These chunks of code will write these small CSV files to your notebook file area.
with open("country_population.csv", "w") as text_file:
"""
text_file.write(country,population_millions
Australia,18.9
USA,332.4
Indonesia,273.50
""".lstrip())
with open("country_area.csv.csv", "w") as text_file:
"""
text_file.write(country,area_sq_km
Australia,7692020
USA,9147420
Indonesia,1877519
""".lstrip())
And we want to load these into a single table, created with this SQL:
%%sqlDROP TABLE IF EXISTS countries;
DROP SEQUENCE IF EXISTS seq_country_id;
CREATE SEQUENCE seq_country_id START 1;
CREATE TABLE countries (
id INT PRIMARY KEY DEFAULT NEXTVAL('seq_country_id'),
name TEXT,DECIMAL,
population_millions INTEGER
area_sq_km );
So that eventually we want to see rows like:
id | country | population_millions | area_sq_km |
---|---|---|---|
1 | Australia | 18.9 | 7692020 |
We can read and INSERT the rows from the first csv we read, but the second (and subsequent) csvs we will need to use UPDATE.
with duckdb.connect('duckdb-file.db') as con:
with open('country_population.csv') as csvfile:
#country,population_millions
#Australia,18.9
= csv.DictReader(csvfile)
myCSVReader for row in myCSVReader:
"""
con.execute( INSERT INTO countries (
name,
population_millions
) VALUES (
$country,
$population_millions
)
""",
row )
This creates a row for each country and fills out the population_millions column. However it leaves the area_sq_km column full of NULL values. (Note that we don’t use param_dict here and just use row directly in the execute
call; this only works because row has all the keys in the placeholder SQL and only the keys in the placeholder SQL).
SELECT *
FROM countries
shows us
id | country | population_millions | area_sq_km |
---|---|---|---|
1 | Australia | 18.9 | NULL |
2 | USA | 332.4 | NULL |
3 | Indonesia | 273.50 | NULL |
If we use an INSERT when loading the second file we get new rows with the area_sq_km filled out, but the population_millions will be all NULL. And we’ll have six rows instead of three.
id | country | population_millions | area_sq_km |
---|---|---|---|
1 | Australia | 18.9 | NULL |
2 | USA | 332.4 | NULL |
3 | Indonesia | 273.50 | NULL |
4 | Australia | NULL | 7692020 |
5 | USA | NULL | 9147420 |
6 | Indonesia | NULL | 1877519 |
To avoid this we use UPDATE when processing the subsequent csv files.
with duckdb.connect('duckdb-file.db') as con:
with open('country_area.csv') as csvfile:
#country,area_sq_km
#Australia,7692020
= csv.DictReader(csvfile)
myCSVReader for row in myCSVReader:
"""
con.execute( UPDATE countries
SET area_sq_km = $area_sq_km
WHERE name = $country
""",
row )
And we end up with:
id | country | population_millions | area_sq_km |
---|---|---|---|
1 | Australia | 18.9 | 7692020 |
2 | USA | 332.4 | 9147420 |
3 | Indonesia | 273.50 | 1877519 |
Note that in this example we aren’t handling variants of country names. If we had to do that, say if the two data files used different words for countries, we would need to look up the country_id in a country_names table, then use that in the INSERT
and the UPDATE
.
You may encounter this situation in your project. You will have to choose an order to process your CSV files. That order is up to whatever makes the most logical sense in your specific data, but if you’ve already created the relevant rows, then you’ll have to use UPDATE
in this way to add more data to those rows.