Anita Tsai's Project

Project Description

My project is to find the correlation between population, college education, and salary by state from the statistics in 2021. The population might affect the education resources one state has. Hence, after comparing population to college education, I want to see if any columns correlate to a state's salary or how these three aspects affect each other.

My project goals are:

  1. Compare the data between population and education resources by state.
  2. Compare the data between education resources and the average salary by state.
  3. Analyze the results from the above data sets to see if there's a special connection between them.
  4. Visualize the discovery from the research.

Data Sources

  1. US States - Ranked by Population - World Population Review
  2. College Statistics By State - UNIVSTATS
  3. What Is the Average Average Salary by State - ZipRecruiter
  4. State Table

The population statistic from the World Population Review organization contains the population rank of each state and the population growth since 2010. The college statistics have different state's school amounts, acceptance rates, SAT scores, tuition, student populations, graduate rates, and students to faculty rates from UNIVSTATS. The average salary by the state is from ZipRecruiter, telling us about the annual salary and the monthly, weekly, and hourly wage. The state table is from statetable.com, which provides downloads of US states, Canadian provinces, territories, and minor possessions.

The first three statistics are converted into CSV files using the online conversion site https://www.convertcsv.com/html-table-to-csv.htm by entering the sites' URLs. The state table provides a CSV file for direct download.

CSV sceenshot of US States Population

population

CSV screenshot of US States College

college

In the US States College data source, I want to import the student population and graduation rate to make the comparison. However, I need to transform the population from a string into an integer by clearing the comma. Also, the graduation rate encounters a more complicated problem: it needs to change the percentage string into a correct number of the rate.

CSV screenshot of US States Salary

salary

In the US States Salary data source, I want to import the annual salary. In this case, I have to clean the $ and the comma to make it an integer.

CSV screenshot of US States Table

state

Workflow

workflow

First, I create a newproject database for my project. Then I made five tables: states, state_names, populations, colleges, and salaries for data importing. After that, I extracted data from the four CSV files and inserted the key columns into the tables. At the same time, I cleaned the data I needed to make it workable. After finishing data importing, I wrote SQL queries to select the columns I wanted to use and joined them into a more completed table. Besides, I wrote a query that grouped different regions in the states to analyze further. In the final stage, I tried to compare the two variables in my SQL queries using SciPy. The other analysis I conducted used maplotlib to show the graph for different regions.

Database Design

ER Diagram

er

Relational Vocab

rv

Sample Tables

tables

Manage Synonyms

State has_many StateName in this situation. First, I imported the state data source into the state table and extracted its id. Then, I used the returning id as the state_id foreign key. I put different state names like Texas and its abbreviation TX into the same column in the statename table. After that, I inserted the other data into tables and used queries to join other tables with state_id as the foreign key they share.

The code used to load the data into the database

Analysis

I created a variable called student_ratio, the population number of a state divided by the student population. Then, I compared the student_ratio with the graduation_rate to see the correlation between the two variables. I used SciPy in the code, and the correlation and confidence interval turned out to be pearsonr: -0.360 and spearmanr: -0.382. Both of them indicate a slightly negative correlation. When the student ratio goes up, the graduation rate might decrease a little.

I used SQL to group the data into four regions in the states. Then I queried the population, student population, and annual salary. After that, I put the outcome into the matplotlib to see if the graph indicates their correlation. The two bars are the populations, and the line is the annual salary. However, there seems to be no correlation between these three factors. The population amount doesn't affect a region's annual salary.

Challenge

I'm glad that I learned a lot by solving the many problems I encountered during this project. In the beginning, I didn't know how to look up foreign keys while inserting data, so I inserted other data with no foreign keys first, then inserted the foreign key later, which resulted in duplicate records. After lots of trial and error, I started to understand how to use if to count rows and the logic behind fetching id. Also, the meeting with the professor helped a lot. I finally could insert the data in the right way. I also learned to use regex to clean the fields. At first, I didn't notice that I couldn't use my data because it's a string. After the queries, I discovered the issues and searched how to remove extra signs on the Internet and the course syllabus. I tried using re.sub and str.replace, and I chose regex afterward to make the code look clearer. Besides these, I had other challenges like not knowing how to create state has many statename in the tables and other minor errors. It's a long journey, but I feel a sense of fulfillment when seeing the code run successfully.

Analysis Tool Learing

First, I learned the Scipy by looking into the course materials. After I understood what Scipy means, I searched and explored the Scipy documentation. In the document, I learned the difference between Pearson and Spearman correlations. Second, I learned how to use matplotlib to demonstrate the graph. I searched on the Internet and found much helpful information on matplotlib documentation. It shows the different kinds of charts it can produce. I tried the graph, but I wanted to add a line to the graph, so I searched on other Q&A platforms and found the solution to the issue. Both tools are handy in the analysis, and I learned a lot by running them with my data.