My project is to find the correlation between population, college education, and salary by state from the statistics in 2021. The population might affect the education resources one state has. Hence, after comparing population to college education, I want to see if any columns correlate to a state's salary or how these three aspects affect each other.
The population statistic from the World Population Review organization contains the population rank of each state and the population growth since 2010. The college statistics have different state's school amounts, acceptance rates, SAT scores, tuition, student populations, graduate rates, and students to faculty rates from UNIVSTATS. The average salary by the state is from ZipRecruiter, telling us about the annual salary and the monthly, weekly, and hourly wage. The state table is from statetable.com, which provides downloads of US states, Canadian provinces, territories, and minor possessions.
The first three statistics are converted into CSV files using the online conversion site https://www.convertcsv.com/html-table-to-csv.htm by entering the sites' URLs. The state table provides a CSV file for direct download.
In the US States College data source, I want to import the student population and graduation rate to make the comparison. However, I need to transform the population from a string into an integer by clearing the comma. Also, the graduation rate encounters a more complicated problem: it needs to change the percentage string into a correct number of the rate.
In the US States Salary data source, I want to import the annual salary. In this case, I have to clean the $ and the comma to make it an integer.
First, I create a newproject database for my project. Then I made five tables: states, state_names, populations, colleges, and salaries for data importing. After that, I extracted data from the four CSV files and inserted the key columns into the tables. At the same time, I cleaned the data I needed to make it workable. After finishing data importing, I wrote SQL queries to select the columns I wanted to use and joined them into a more completed table. Besides, I wrote a query that grouped different regions in the states to analyze further. In the final stage, I tried to compare the two variables in my SQL queries using SciPy. The other analysis I conducted used maplotlib to show the graph for different regions.
State has_many StateName in this situation. First, I imported the state data source into the state table and extracted its id. Then, I used the returning id as the state_id foreign key. I put different state names like Texas and its abbreviation TX into the same column in the statename table. After that, I inserted the other data into tables and used queries to join other tables with state_id as the foreign key they share.
# create newproject database
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
conn = psycopg2.connect(host='localhost');
conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT);
cursor= conn.cursor();
cursor.execute(
'''
CREATE DATABASE newproject
'''
)
cursor.close()
conn.close()
# create tables
with psycopg2.connect(host='localhost', dbname='newproject') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute(
'''
DROP TABLE IF EXISTS states;
CREATE TABLE states (
id SERIAL PRIMARY KEY,
label TEXT,
census_region_name TEXT,
census_division_name TEXT
);
DROP TABLE IF EXISTS salaries;
CREATE TABLE salaries (
id SERIAL PRIMARY KEY,
name TEXT,
annual_salary INTEGER,
state_id INTEGER
);
DROP TABLE IF EXISTS populations;
CREATE TABLE populations (
id SERIAL PRIMARY KEY,
name TEXT,
population_num INTEGER,
growth FLOAT,
state_id INTEGER
);
DROP TABLE IF EXISTS colleges;
CREATE TABLE colleges (
id SERIAL PRIMARY KEY,
name TEXT,
student_pop INTEGER,
graduation_rate FLOAT,
state_id INTEGER
);
DROP TABLE IF EXISTS state_names;
CREATE TABLE state_names (
id SERIAL PRIMARY KEY,
name TEXT,
state_id INTEGER
);
'''
)
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
# insert states.csv data into the states table and create many statename in the state_names table
with psycopg2.connect(host='localhost', dbname='newproject') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
state_sql = """
INSERT INTO states(
label,
census_region_name,
census_division_name
) VALUES(
%(name)s,
%(census_region_name)s,
%(census_division_name)s
)
RETURNING id
"""
statename_sql = """
INSERT INTO state_names(
name,
state_id
) VALUES(
%(name)s,
%(state_id)s
)
RETURNING id
"""
with open('states.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile)
for row in myCSVReader:
cursor.execute(state_sql, row)
state_id = cursor.fetchone()['id']
param_dict = { "name": row["name"],
"state_id": state_id }
cursor.execute(statename_sql, param_dict)
param_dict = { "name": row["abbreviation"],
"state_id": state_id }
cursor.execute(statename_sql, param_dict)
# insert the salaries.csv data into the salaries table and clean the data
sql = """
SELECT state_id
FROM state_names
WHERE name = %(State)s
"""
with psycopg2.connect(host='localhost', dbname='newproject') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('salaries.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile)
for row in myCSVReader:
import re
cleaned_salary = re.sub('[\$,]','',row["Annual Salary"])
cursor.execute(sql, row)
if (cursor.rowcount > 0):
state_id = cursor.fetchone()['state_id']
param_dict = {"state_id": state_id,
"annual_salary": cleaned_salary,
"State": row["State"]}
cursor.execute("""
INSERT INTO salaries(
name,
annual_salary,
state_id
) VALUES (
%(State)s,
%(annual_salary)s,
%(state_id)s
)
""",
param_dict
)
# insert the populations.csv data into the populations table
sql = """
SELECT state_id
FROM state_names
WHERE name = %(State)s
"""
with psycopg2.connect(host='localhost', dbname='newproject') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('populations.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile)
for row in myCSVReader:
cursor.execute(sql, row)
if (cursor.rowcount > 0):
state_id = cursor.fetchone()['state_id']
param_dict = {"state_id": state_id,
"population_num": row["Pop"],
"growth": row["Growth"],
"State": row["State"]}
cursor.execute("""
INSERT INTO populations(
name,
population_num,
growth,
state_id
) VALUES (
%(State)s,
%(population_num)s,
%(growth)s,
%(state_id)s
)
""",
param_dict
)
# insert the colleges.csv data into the colleges table and clean the data
sql = """
SELECT state_id
FROM state_names
WHERE name = %(State)s
"""
with psycopg2.connect(host='localhost', dbname='newproject') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('colleges.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile)
for row in myCSVReader:
import re
cleaned_studentpop = re.sub('[\$,]','',row["Student Population"])
cleaned_gradrate = re.sub('[\%,]','',row["Graduation Rate"])
cleaned_gradrate2 = float(cleaned_gradrate) / 100
cursor.execute(sql, row)
if (cursor.rowcount > 0):
state_id = cursor.fetchone()['state_id']
param_dict = {'state_id': state_id,
'student_pop': cleaned_studentpop,
'graduation_rate': cleaned_gradrate2,
'State': row['State']}
cursor.execute("""
INSERT INTO colleges(
name,
student_pop,
graduation_rate,
state_id
) VALUES (
%(State)s,
%(student_pop)s,
%(graduation_rate)s,
%(state_id)s
)
""",
param_dict
)
# First analysis: compare population data to college data
import scipy.stats
sql = """
SELECT states.id AS id,
state_names.name AS name,
states.census_region_name AS region,
states.census_division_name AS division,
populations.population_num AS population,
populations.growth AS growth,
colleges.student_pop AS student_population,
populations.population_num / colleges.student_pop AS student_ratio,
colleges.graduation_rate AS graduation_rate,
salaries.annual_salary
FROM states
JOIN state_names
ON states.id = state_names.state_id
JOIN populations
ON states.id = populations.state_id
JOIN colleges
ON states.id = colleges.state_id
JOIN salaries
ON states.id = salaries.state_id
ORDER BY id
"""
with psycopg2.connect(host='localhost', dbname='newproject') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute(sql)
student_ratio = []
graduation_rate = []
for row in cursor:
student_ratio.append(row["student_ratio"])
graduation_rate.append(row["graduation_rate"])
print("scipy.stats.pearsonr:")
print(scipy.stats.pearsonr(student_ratio,graduation_rate)[0])
print("--------")
print("scipy.stats.spearmanr:")
print(scipy.stats.spearmanr(student_ratio,graduation_rate)[0])
scipy.stats.pearsonr: -0.3600290872795981 -------- scipy.stats.spearmanr: -0.3823016760956874
I created a variable called student_ratio, the population number of a state divided by the student population. Then, I compared the student_ratio with the graduation_rate to see the correlation between the two variables. I used SciPy in the code, and the correlation and confidence interval turned out to be pearsonr: -0.360 and spearmanr: -0.382. Both of them indicate a slightly negative correlation. When the student ratio goes up, the graduation rate might decrease a little.
# Second analysis: compare population data to salary data in different regions
%load_ext sql
%env DATABASE_URL=postgresql://localhost/newproject
env: DATABASE_URL=postgresql://localhost/newproject
%%sql
SELECT states.census_region_name AS region,
Sum(populations.population_num) AS population,
Sum(colleges.student_pop) AS student_population,
Sum(salaries.annual_salary) / count(states.census_region_name) AS salary
FROM states
JOIN state_names
ON states.id = state_names.state_id
JOIN populations
ON states.id = populations.state_id
JOIN colleges
ON states.id = colleges.state_id
JOIN salaries
ON states.id = salaries.state_id
GROUP BY states.census_region_name
ORDER BY states.census_region_name
4 rows affected.
region | population | student_population | salary |
---|---|---|---|
Midwest | 137027512 | 8361870 | 60993 |
Northeast | 111710338 | 6924086 | 63859 |
South | 253776650 | 13898568 | 58892 |
West | 158744328 | 10382710 | 62934 |
pip install matplotlib
! pip install pandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
width = .55
m1_t = pd.DataFrame({
'student_population' : [8361870, 6924086, 13898568, 10382710],
'salary' : [60993, 63859, 58892, 62934]})
m1_t[['student_population']].plot(kind='bar', width = width, color = 'lightgray')
m1_t['salary'].plot(secondary_y=True, color = 'salmon')
ax = plt.gca()
plt.xlim([-width, len(m1_t['student_population'])-width])
ax.set_xticklabels(('Midwest','Northeast','South','West'))
plt.title("Student Population and Salary in Different Regions")
plt.show()
I used SQL to group the data into four regions in the states. Then I queried the population, student population, and annual salary. After that, I put the outcome into the matplotlib to see if the graph indicates their correlation. The two bars are the populations, and the line is the annual salary. However, there seems to be no correlation between these three factors. The population amount doesn't affect a region's annual salary.
I'm glad that I learned a lot by solving the many problems I encountered during this project. In the beginning, I didn't know how to look up foreign keys while inserting data, so I inserted other data with no foreign keys first, then inserted the foreign key later, which resulted in duplicate records. After lots of trial and error, I started to understand how to use if to count rows and the logic behind fetching id. Also, the meeting with the professor helped a lot. I finally could insert the data in the right way. I also learned to use regex to clean the fields. At first, I didn't notice that I couldn't use my data because it's a string. After the queries, I discovered the issues and searched how to remove extra signs on the Internet and the course syllabus. I tried using re.sub and str.replace, and I chose regex afterward to make the code look clearer. Besides these, I had other challenges like not knowing how to create state has many statename in the tables and other minor errors. It's a long journey, but I feel a sense of fulfillment when seeing the code run successfully.
First, I learned the Scipy by looking into the course materials. After I understood what Scipy means, I searched and explored the Scipy documentation. In the document, I learned the difference between Pearson and Spearman correlations. Second, I learned how to use matplotlib to demonstrate the graph. I searched on the Internet and found much helpful information on matplotlib documentation. It shows the different kinds of charts it can produce. I tried the graph, but I wanted to add a line to the graph, so I searched on other Q&A platforms and found the solution to the issue. Both tools are handy in the analysis, and I learned a lot by running them with my data.