For my project, I have chosen to explore team/stadium locations, player salaries, and match performance data for the 2021 season of U.S. Major League Soccer, in an attempt to visually analyze what correlations, if any, exist between geographic location or player compensation budget and overall team/match performance across one regular season of play.
To that end, I've gathered three primary data documents pulled from two separate online resources:
Sports stadium geographic coordinates (CSV) for the year 2021 compiled from individual Wikipedia records by Kaggle user Logan Donaldson -- containing all team names and home stadium latitude/longitude coordinates in Major League Baseball (MLB), National Football League (NFL), National Basketball Association (NBA), National Hockey League (NHL), and Major League Soccer (MLS) - https://www.kaggle.com/logandonaldson/sports-stadium-locations
Complete major league soccer match data (teams, location, scores, etc) from 1996 to early 2022, compiled in a CSV file by Kaggle user Joseph Mohr from public data published by Major League Soccer and ESPN - https://www.kaggle.com/josephvm/major-league-soccer-dataset?select=matches.csv
2021 player salaries published directly online (HTML table) by the MLS Players Association - https://mlsplayers.org/resources/salary-guid
From this data, I have produced a few simple charts to help visually explore the following correlations:
Compare cumulative team salaries based on latitude and longitude. Do salary budgets vary from north to south/east to west?
Compare each team's cumulative success rate (goals scored/goals conceded) in relation to its player salary budget. Does more money spent equal more success where it matters on the field?
Assess the apparent impact of home field advantage across the league. Where does being on home soil seem to have the most positive or negative impact on game performance?
As mentioned previously, my first two data sources (stadium locations and match data) were compiled from other sources and shared by users of Kaggle.com. As such, I was able to download both of them directly from Kaggle in CSV format, thereby requiring no additional processing before uploading to my JupyterHub - as demonstrated in these snapshots highlighting both the original raw data and the spreadsheet format displayed in JupyterLab after they were uploaded:
Despite being prepackaged and readily digestible in comma-delimited format, I nonetheless encountered my first setback with this data when trying to complete my homework assignment for week 11 - wherein we were asked to write a Python script to read one of our files and parse a few different data elements into meaningful sentences. While the .csv documents displayed perfectly through the JupyterLab csv viewer, my attempts to read through the file and call on pieces of data by their corresponding dictionary key kept failing. After much trial and error, I uncovered certain invisible characters such as '\ufeff' at key points in my document. After reading a few discussion boards, I eventually learned that this character is a byte-order mark, left behind by a discrepancy between how the text was originally encoded and the default reading parameters in Python's open() function. I was finally able to overcome it by simply adding an encoding="utf-8-sig" parameter to that function before reading the file into Python. From then on, I just went ahead and applied that parameter to each file I read as well, just in case. I suspect it isn't necessary in every case, but I haven't observed any other sneaky character problems that would prompt me to reconsider this decision.
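To make the fix concrete, here is a minimal sketch of the symptom and the cure, using the stadiums file from above (the exact header key shown is my assumption):

import csv

# With the default encoding, the first header key can come back as
# '\ufeffTeam' instead of 'Team' because the byte-order mark clings to it;
# encoding="utf-8-sig" strips the mark before the csv reader ever sees it.
with open('2021_Stadiums.csv', encoding="utf-8-sig") as csvfile:
    reader = csv.DictReader(csvfile)
    first_row = next(reader)
    print(first_row["Team"])  # raises KeyError without the encoding parameter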
The player salary data was not readily available in .csv format, but rather couched within a scrollable HTML table on the MLS Players Association website. I simply used an HTML-to-CSV table converter (https://www.convertcsv.com/html-table-to-csv.htm) to extract this data and export it into .csv format before uploading to JupyterHub - as seen in these raw vs. uploaded version screenshots:
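For what it's worth, this sort of table can sometimes be pulled down programmatically instead of through an online converter. A hedged sketch using pandas, assuming the MLSPA page serves the table as static HTML (a JavaScript-rendered table would not parse this way) and that an HTML parser like lxml is installed:

import pandas as pd

# read_html returns one DataFrame per <table> element found on the page
tables = pd.read_html("https://mlsplayers.org/resources/salary-guid")  # URL exactly as cited above
salaries = tables[0]  # assuming the salary table is the first on the page
salaries.to_csv("MLS_2021_Player_Salaries.csv", index=False)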
Data Collection
SQL Database Initialization
Parsing Data and Loading Into SQL Database
Export For Analysis
Visual Analysis
I've chosen to spread this data out across the essential components of gameplay: the player, the team, and the match.
Though my player and team location datasets are confined to 2021, I've arranged players and teams in a many-to-many ('has and belongs to many', or habtm) relationship with a shared relational table in between them, in order to acknowledge the temporal nature of any player/team assignment and compensation package. Were I to obtain new data for subsequent years, this would hopefully make it easier to track a player's location and pay rate over time without overly complicating the players table. Likewise, placing team_match_appearances in between teams and matches makes it simpler to parse out and analyze each individual team's performance within each game, while still having a dedicated repository for more basic match data like date/time, attendance, and location (via the home_team_id).
Were I to pull out all of the additional player data available in 'MLS_Matches.csv', I would almost certainly need to resolve quite a few player name aliases. However, I found enough unresolvable inconsistencies in the player information contained within that matches dataset to render it too unreliable to be worth such additional effort. Instead, the only attribute name variants I had to deal with came in the form of team names - i.e., some data sources simply list the city, others may choose to include or omit foreign-language accent marks or punctuation (ex. Washington DC vs. D.C.), and then - because it's American soccer, after all - there are any number of opinions as to whether or not a team should attach an FC or CF to its name.
To reconcile these differing names so that their corresponding data all become attached to the same team entity, I've created a separate team_names table specifically to contain all possible team name variants I may come across in my data. I've chosen to treat the teams table as the definitive record for each team - containing what I consider its "standard_name" along with its location, etc. As such, teams.id is the definitive identifier for each recognized team, and it is attached to each name variant in the team_names table as a way of ensuring that any data source referencing a given variant will draw a consistent team_id as we write its data across the various associated tables.
Originally, populating all the known team name variants into the team_names table was a rather manual affair. I copied each corresponding team column from each dataset into a separate Excel spreadsheet, eliminated all the duplicate entries, and whittled it down to a tidy little naming key of sorts, with each name variant in its own row alongside whatever its corresponding StandardizedName ought to be - even if that happens to be the same as the variant itself. That last part was necessary for the matching process to work correctly after this homemade name variant dataset was copied into the team_names table, where it would then serve as a key for future data reads. Essentially, when I read the name variant csv, I told Python to add each NameVariant to the team_names table, use the StandardizedName column to match up with the standard_name in the teams table, and attach its teams.id alongside the name variant in the team_names table. The manually compiled name variant spreadsheet and the code I used to populate it into my database looked like this:
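Since that original code appears only as a screenshot, here is a sketch of what the population step did, assuming the handmade csv used the NameVariant and StandardizedName column headers mentioned above (the filename is hypothetical):

import csv
import psycopg2
from psycopg2.extras import RealDictCursor

with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
    with conn.cursor(cursor_factory=RealDictCursor) as cursor:
        with open('Team_Name_Variants.csv', encoding="utf-8-sig") as csvfile:  # hypothetical filename
            for row in csv.DictReader(csvfile):
                # find the definitive teams.id via the StandardizedName column
                cursor.execute("""
                    SELECT id
                    FROM teams
                    WHERE standard_name = %(StandardizedName)s
                """, row)
                result = cursor.fetchone()
                if result is not None:
                    # attach the variant to that id in team_names
                    cursor.execute("""
                        INSERT INTO team_names(team_id, name)
                        VALUES (%(team_id)s, %(name)s)
                    """, {"team_id": result["id"], "name": row["NameVariant"]})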
Now then, why am I referring to this in the past tense? Well... While this strategy seemed to work just swell - and in fact was one of the strategies Dr. Howison had advised us to use - the somewhat awkward and potentially error-prone method by which I had derived that homemade variant list just didn't sit well with me over time. What if I had missed one of the variants somewhere? What if I copied in an updated dataset containing yet another variant I might have overlooked? For these reasons, I ended up spending WAY too much of my last day of work on this project throwing out my old name variant csv and wrestling with Python logic I only barely understand, trying to automate the process of catching any new variant that pops up in my data and populating the team_names table (or not) accordingly. The code is of course listed below, but essentially I've just incorporated a pre-read of each of my csv documents to review the team names before looking at any of the other data. As mentioned before, I'm treating the names used in the stadiums dataset as my standard, so that file gets read and written to the team_names table first - I do this at the same time I write the data to the teams table. Then, with each new document, I tell Python to read through each line and check the team name against the team_names table; if the name isn't there, I get a prompt telling me the name variant in question, displaying a list of acceptable standard_names along with their teams.id, and asking me which one to match the name variant up with. If the data is irrelevant in some way, there's also the option to pass and ignore that team name altogether. The version I landed on in the end surely isn't the best possible solution to this problem - and in fact it still involves a manual id assignment step - but it gets the job done and makes me feel more confident that I've actually caught everything and could continue to catch any new variants in each future data import.
The code, and the momentary interaction it produces, end up looking like this:
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
conn = psycopg2.connect(host='localhost')
conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
cursor = conn.cursor()
cursor.execute(
'''
CREATE DATABASE draft_mls_database
'''
)
cursor.close()
conn.close()
import psycopg2
from psycopg2.extras import RealDictCursor
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute('''
DROP TABLE IF EXISTS teams;
CREATE TABLE teams (id SERIAL PRIMARY KEY,
standard_name TEXT,
division TEXT,
latitude NUMERIC,
longitude NUMERIC);
DROP TABLE IF EXISTS matches;
CREATE TABLE matches (id SERIAL PRIMARY KEY,
date_time_utc TIMESTAMP,
home_team_id INTEGER,
away_team_id INTEGER,
home_score INTEGER,
away_score INTEGER,
tie_breaker TEXT,
attendance INTEGER);
DROP TABLE IF EXISTS team_match_appearances;
CREATE TABLE team_match_appearances (id SERIAL PRIMARY KEY,
match_id INTEGER,
team_id INTEGER,
season INTEGER,
comp_stage TEXT,
home_away TEXT,
goals_for INTEGER,
goals_against INTEGER,
yellow_cards NUMERIC,
red_cards NUMERIC,
possession NUMERIC);
DROP TABLE IF EXISTS players;
CREATE TABLE players (id SERIAL PRIMARY KEY,
first_name TEXT,
last_name TEXT);
DROP TABLE IF EXISTS players_teams;
CREATE TABLE players_teams (id SERIAL PRIMARY KEY,
player_id INTEGER,
team_id INTEGER,
season INTEGER,
compensation NUMERIC);
DROP TABLE IF EXISTS team_names;
CREATE TABLE team_names (id SERIAL PRIMARY KEY,
team_id INTEGER,
name TEXT);
''')
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('2021_Stadiums.csv', encoding="utf-8-sig") as csvfile:
myCSVReader = csv.DictReader(csvfile)
sql_select_teams = """
SELECT id
FROM teams
WHERE standard_name = %(Team)s
"""
sql_select_team_names = """
SELECT id
FROM team_names
WHERE name = %(Team)s
"""
sql_insert_team = """
INSERT INTO teams(standard_name,
division,
latitude,
longitude)
VALUES (%(Team)s,
%(Division)s,
%(Lat)s,
%(Long)s)
RETURNING id
"""
sql_insert_team_name = """
INSERT INTO team_names(team_id,
name)
VALUES (%(team_id)s,
%(name)s)
"""
for row in myCSVReader:
if row['League'] == 'MLS': #dataset contains other sports leagues, but I only want MLS.
cursor.execute(sql_select_teams, row)
results = cursor.fetchall()
if (len(results) == 0): #avoiding duplicate in teams table.
cursor.execute(sql_insert_team, row) #write data to teams table if not already present.
results = cursor.fetchall()
team_id = results[0]["id"] #return the new teams.id and assign to variable team_id
param_dict = {"team_id": team_id,
"name": row["Team"]}
cursor.execute(sql_select_team_names, row)
if (cursor.rowcount == 0):
cursor.execute(sql_insert_team_name, param_dict) #write team name variant to team_names along with corresponding team_id
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
sql_team_id_menu = """
SELECT id, standard_name
FROM teams
ORDER BY standard_name
"""
print("Current list of standard MLS team names:")
cursor.execute(sql_team_id_menu) #menu query takes no parameters
results = cursor.fetchall()
for menu_item in results:
print(menu_item["id"], menu_item["standard_name"]) #print current list of standard team names and ids to serve
#as a menu reference if any new variants need to be matched.
print("=======================================\n")
with open('MLS_2021_Player_Salaries.csv', encoding="utf-8-sig") as csvfile:
myCSVReader = csv.DictReader(csvfile)
print("_VERIFYING TEAM NAMES IN 'MLS_2021_Player_Salaries.csv'.\n")
sql_select_player_team_id = """
SELECT team_id
FROM team_names
WHERE name = %(Club)s
"""
sql_insert_team_name = """
INSERT INTO team_names(team_id,
name)
VALUES (%(team_id)s,
%(NameVariant)s)
"""
skiplist = [] #initializing a list to temporarily remember the names I pass on
#so I won't have to repeat it for every iteration.
for row in myCSVReader:
cursor.execute(sql_select_player_team_id, row)
results = cursor.fetchall()
if (len(results) == 0): #checking if team name is in team_names already.
if row["Club"] in skiplist: #if not, check if it's already been recorded as passed in my skiplist.
pass
else:
#ask user to compare new team name variant with standard team name list and select the appropriate matching id.
response = input("\n>>> " + "'" + row["Club"] + "'" + " is not in your list of standard team names.\n choose an appropriate matching team id from the list above \n or enter PASS to ignore this team: ").upper()
if response == "PASS":
skiplist.append(row["Club"]) #if user passes, add varaint to skiplist so it won't come up again.
else:
team_id = response
param_dict = {"team_id": team_id,
"NameVariant": row["Club"]}
cursor.execute(sql_insert_team_name, param_dict) #write new team name variant to team_names along with user selected team_id match.
print(">New team name variant added.\n")
else:
pass
print("\n_'MLS_2021_Player_Salaries.csv' TEAM NAME VERIFICATION COMPLETED.")
with open('MLS_Matches.csv', encoding="utf-8-sig") as csvfile:
myCSVReader = csv.DictReader(csvfile)
print("\n_VERIFYING TEAM NAMES IN 'MLS_Matches.csv'.\n")
sql_select_home_team_id = """
SELECT team_id
FROM team_names
WHERE name = %(home)s
"""
sql_select_away_team_id = """
SELECT team_id
FROM team_names
WHERE name = %(away)s
"""
sql_insert_team_name = """
INSERT INTO team_names(team_id,
name)
VALUES (%(team_id)s,
%(NameVariant)s)
"""
skiplist = [] #initializing a list to temporarily remember the names I pass on
#so I won't have to repeat it for every iteration.
for row in myCSVReader:
if row["league"] == "2021 Major League Soccer":
cursor.execute(sql_select_home_team_id, row)
results = cursor.fetchall()
if (len(results) == 0): #checking if team name is in team_names already.
if row["home"] in skiplist: #if not, check if it's already been recorded as passed in my skiplist.
pass
else:
#ask user to compare new team name variant with standard team name list and select the appropriate matching id.
response = input("\n>>> " + "'" + row["home"] + "'" + " is not in your list of standard team names.\n choose an appropriate matching team id from the list above \n or enter PASS to ignore this team: ").upper()
if response == "PASS":
skiplist.append(row["home"]) #if user passes, add varaint to skiplist so it won't come up again.
else:
team_id = response
param_dict = {"team_id": team_id,
"NameVariant": row["home"]}
cursor.execute(sql_insert_team_name, param_dict)
print("New team name variant added.\n")
else:
pass
cursor.execute(sql_select_away_team_id, row)
results = cursor.fetchall()
if (len(results) == 0):
if row["away"] in skiplist: #if not, check if it's already been recorded as passed in my skiplist.
pass
else:
#ask user to compare new team name variant with standard team name list and select the appropriate matching id.
response = input("\n>>> " + "'" + row["away"] + "'" + " is not in your list of standard team names.\n choose an appropriate matching team id from the list above \n or enter PASS to ignore this team: ").upper()
if response == "PASS":
skiplist.append(row["away"]) #if user passes, add varaint to skiplist so it won't come up again.
else:
team_id = response
param_dict = {"team_id": team_id,
"NameVariant": row["away"]}
cursor.execute(sql_insert_team_name, param_dict)
print("New team name variant added.\n")
else:
pass
print("_'MLS_Matches.csv' TEAM NAME VERIFIFICATION COMPLETED.")
Current list of standard MLS team names:
14 Atlanta United FC
19 Austin FC
22 CF Montréal
23 Chicago Fire FC
7 Colorado Rapids
12 Columbus Crew
2 D.C. United
26 FC Cincinnati
25 FC Dallas
17 Houston Dynamo FC
9 Inter Miami CF
8 LA Galaxy
3 Los Angeles FC
1 Minnesota United FC
15 Nashville SC
11 New England Revolution
27 New York City FC
20 New York Red Bulls
10 Orlando City SC
24 Philadelphia Union
18 Portland Timbers
21 Real Salt Lake
16 San Jose Earthquakes
13 Seattle Sounders FC
6 Sporting Kansas City
5 Toronto FC
4 Vancouver Whitecaps FC
=======================================

_VERIFYING TEAM NAMES IN 'MLS_2021_Player_Salaries.csv'.
>New team name variant added.
>New team name variant added.
>New team name variant added.
>New team name variant added.
>New team name variant added.
>New team name variant added.
>New team name variant added.
>New team name variant added.
>New team name variant added.
_'MLS_2021_Player_Salaries.csv' TEAM NAME VERIFICATION COMPLETED.

_VERIFYING TEAM NAMES IN 'MLS_Matches.csv'.

_'MLS_Matches.csv' TEAM NAME VERIFICATION COMPLETED.
It's working!
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
import re
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('MLS_2021_Player_Salaries.csv', encoding="utf-8-sig") as csvfile:
myCSVReader = csv.DictReader(csvfile)
sql_select_player_team_id = """
SELECT team_id from team_names
WHERE name = %(Club)s
"""
sql_insert_player = """
INSERT INTO players(first_name,
last_name)
VALUES (%(first_name)s,
%(last_name)s)
RETURNING id
"""
sql_insert_player_team = """
INSERT INTO players_teams(player_id,
team_id,
season,
compensation)
VALUES (%(player_id)s,
%(team_id)s,
%(season)s,
%(compensation)s)
"""
for row in myCSVReader:
cursor.execute(sql_select_player_team_id, row)
results = cursor.fetchall()
if (len(results) == 0): #verify player's team is in teams table and only proceed if so.
pass #player data includes players hired to not-yet-active teams or non-specific 'MLS' assignments.
else:
param_dict_player = {"first_name": row["First Name"],
"last_name": row["Last Name"]}
cursor.execute(sql_insert_player, param_dict_player)
new_player_id = cursor.fetchone()['id'] #write player name to players table and fetch the new id
cursor.execute(sql_select_player_team_id, row)
results = cursor.fetchall()
player_team_id = results[0]["team_id"] #fetch matching team_id from team_names
season = "2021" #current dataset only pertains to 2021. making that explicit in case future data comes along.
cleaned_compensation = re.sub(r'[$,]', '', row["Guaranteed Compensation"]) #remove dollar sign and commas from compensation data.
param_dict_playerteam = {"player_id": new_player_id,
"team_id": player_team_id,
"season": season,
"compensation": cleaned_compensation}
cursor.execute(sql_insert_player_team, param_dict_playerteam) #write player team assignment info to players_teams
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
import re
from datetime import datetime
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
sql_select_hometeam_id = """
SELECT team_id from team_names
WHERE name = %(home)s
"""
sql_select_awayteam_id = """
SELECT team_id from team_names
WHERE name = %(away)s
"""
sql_insert_matches = """
INSERT INTO matches(date_time_utc,
home_team_id,
away_team_id,
home_score,
away_score,
tie_breaker,
attendance)
VALUES (%(date_time_utc)s,
%(home_team_id)s,
%(away_team_id)s,
%(home_score)s,
%(away_score)s,
%(tie_breaker)s,
%(attendance)s)
RETURNING id
"""
sql_insert_team_match_apps = """
INSERT INTO team_match_appearances(season,
comp_stage,
match_id,
team_id,
home_away,
goals_for,
goals_against,
red_cards,
yellow_cards,
possession)
VALUES (%(season)s,
%(comp_stage)s,
%(match_id)s,
%(team_id)s,
%(home_away)s,
%(goals_for)s,
%(goals_against)s,
%(red_cards)s,
%(yellow_cards)s,
%(possession)s)
"""
with open('MLS_Matches.csv', encoding="utf-8-sig") as csvfile:
myCSVReader = csv.DictReader(csvfile)
for row in myCSVReader:
if row["league"] == "2021 Major League Soccer":
cursor.execute(sql_select_hometeam_id, row)
results = cursor.fetchall()
hometeam_id = results[0]["team_id"]
cursor.execute(sql_select_awayteam_id, row)
results = cursor.fetchall()
awayteam_id = results[0]["team_id"]
home_possession = float(row["home_possessionPct"][:-1]) #remove '%' from possession data so it can be counted as a number
away_possession = float(row["away_possessionPct"][:-1])
if len(row["attendance"]) == 0:
attendance = 0
else:
attendance = int(row["attendance"].replace(',', '')) #remove commas from attendance data so it can be counted as a number
if row["shootout"] == "True":
tiebreaker = "shootout"
else:
tiebreaker = "none"
temp_match_time = (row["date"] + ", " + row["year"] + " " + row["time (utc)"]) #merge date, year, and time fields into one string, then...
match_time = datetime.strptime(temp_match_time, "%A, %B %d, %Y %H:%M") #convert date/time string into standard timestamp format
param_dict_matches = {"date_time_utc": match_time,
"home_team_id": hometeam_id,
"away_team_id": awayteam_id,
"home_score": row["home_score"].strip(), #some fields seem to have extra leading/trailing spaces,
"away_score": row["away_score"].strip(), #using .strip() function as a precaution.
"tie_breaker": tiebreaker,
"attendance": attendance}
cursor.execute(sql_insert_matches, param_dict_matches)
new_match_id = cursor.fetchone()['id']
#print("New id for match: {}".format(new_match_id))
param_dict_home = {"season": row["year"],
"comp_stage": row["part_of_competition"].strip(),
"match_id": new_match_id,
"team_id": hometeam_id,
"home_away": "home",
"goals_for": row["home_score"].strip(),
"goals_against": row["away_score"].strip(),
"red_cards": row["home_redCards"].strip(),
"yellow_cards": row["home_yellowCards"].strip(),
"possession": home_possession}
param_dict_away = {"season": row["year"],
"comp_stage": row["part_of_competition"].strip(),
"match_id": new_match_id,
"team_id": awayteam_id,
"home_away": "away",
"goals_for": row["away_score"].strip(),
"goals_against": row["home_score"].strip(),
"red_cards": row["away_redCards"].strip(),
"yellow_cards": row["away_yellowCards"].strip(),
"possession": away_possession}
cursor.execute(sql_insert_team_match_apps, param_dict_home)
cursor.execute(sql_insert_team_match_apps, param_dict_away)
(team name / total player compensation by team / team latitude / team longitude)
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute("""
SELECT teams.standard_name AS team,
SUM(players_teams.compensation) AS total_compensation,
teams.latitude AS latitude,
teams.longitude AS longitude
FROM teams
JOIN players_teams
ON teams.id = players_teams.team_id
GROUP BY team, latitude, longitude
ORDER BY total_compensation DESC
"""
)
with open('OUT-Team_Compensation_Location.csv', 'w', newline='', encoding="utf-8-sig") as csvfile: #newline='' is the csv module's recommended setting for output files
column_names = ["team", "total_compensation", "latitude", "longitude"]
myCsvWriter = csv.DictWriter(csvfile, fieldnames=column_names)
myCsvWriter.writeheader()
for row in cursor:
myCsvWriter.writerow(row)
Done writing csv
(team name / total player compensation / total goals for / total goals against / goal differential)
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute("""
SELECT team,
total_compensation,
SUM(team_match_appearances.goals_for) AS goals_scored,
SUM(team_match_appearances.goals_against) AS goals_conceded,
(SUM(team_match_appearances.goals_for) - SUM(team_match_appearances.goals_against)) AS differential
FROM (
SELECT teams.id AS team_id,
teams.standard_name AS team,
SUM(players_teams.compensation) AS total_compensation
FROM teams
JOIN players_teams
ON teams.id = players_teams.team_id
GROUP BY teams.id, team
) subquery
JOIN team_match_appearances
ON subquery.team_id = team_match_appearances.team_id
WHERE team_match_appearances.comp_stage = 'Regular Season'
GROUP BY team, total_compensation
ORDER BY total_compensation DESC
"""
)
with open('OUT-Team_Compensation_Performance.csv', 'w', newline='', encoding="utf-8-sig") as csvfile:
column_names = ["team", "total_compensation", "goals_scored", "goals_conceded", "differential"]
myCsvWriter = csv.DictWriter(csvfile, fieldnames=column_names)
myCsvWriter.writeheader()
for row in cursor:
myCsvWriter.writerow(row)
Done writing csv
(host stadium team name / host latitude / host longitude / match count / home goal total / away goal total / combined goal total / goal differential)
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
with psycopg2.connect(host='localhost', dbname='draft_mls_database') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute("""
SELECT host_stadium,
latitude,
longitude,
COUNT(matches.home_score) AS match_count,
SUM(matches.home_score) AS total_home_goals,
SUM(matches.away_score) AS total_away_goals,
(SUM(matches.home_score) + SUM(matches.away_score)) AS all_goals,
(SUM(matches.home_score) - SUM(matches.away_score)) AS differential
FROM (
SELECT teams.id AS team_id,
teams.standard_name AS host_stadium,
teams.latitude AS latitude,
teams.longitude AS longitude
FROM teams
GROUP BY teams.id, host_stadium, latitude, longitude
) subquery
JOIN matches
ON subquery.team_id = matches.home_team_id
GROUP BY host_stadium, latitude, longitude
"""
)
with open('OUT-MatchLocation_GoalCounts.csv', 'w', newline='', encoding="utf-8-sig") as csvfile:
column_names = ["host_stadium", "latitude", "longitude", "match_count", "total_home_goals", "total_away_goals", "all_goals", "differential"]
myCsvWriter = csv.DictWriter(csvfile, fieldnames=column_names)
myCsvWriter.writeheader()
for row in cursor:
myCsvWriter.writerow(row)
Done writing csv
Having exported these three .csv files, I chose to try out Tableau for the data visualizations.
Unsurprisingly, the geographical comparisons of salary and goal output proved rather unenlightening. The relatively even spread of salary rankings and goal outputs across the full range of latitudes and longitudes suggests barely a hint of correlation there.
While less of a true analysis and more of a straightforward ranking of performance, I found the home stadium goal differential chart to be somewhat more interesting than the others. On a personal level, as a soccer fan, it's interesting to see just how much better or worse certain teams tend to perform in their own home stadiums. Anecdotally, Seattle is understood to be a fortress of impenetrable home team success. Yet, at least in 2021, they appeared to perform only half as well at home as New York City FC. Unlike the previous charts, this one provides a very clear answer as to which teams performed well at home last year and which did not.
I specifically chose Tableau because I was unfamiliar with it. I've encountered the name periodically over the last couple of years, always in a rather matter-of-fact way that suggests I should at least know something about it. So, rather than taking my freshly wrangled data straight to trusty ol' Excel, I dove head first into Tableau. While I was ultimately able to drag some things around and tweak some knobs here and there to make my charts look reasonably professional, I found the interface to be somewhat overwhelming and not very intuitive. Having very little time to delve too deeply into tutorial videos and whatnot, I really just adopted a fake-it-till-you-make-it approach, which turned out better than I would've expected. That said, my charts are very simple and probably could have been done just as well, if not better, in Excel. But then I still wouldn't have any idea what Tableau is all about.
Now that I realize I have free use of it while I'm a student, I'll no doubt do some more exploring to try and unlock its more advanced capabilities. For this project, though, I found the Python and SQL wrangling to be far more satisfying than the final visual results.
Though I've dabbled in only a little coding in my adult life, my experiences with this project seemed consistent with most of those other experiences -- which is to say, it's the little things that really add up and take a lot of time to sort out. After hours or days of head scratching, I all too often look back on my work and all I see is a handful of relatively simple lines of code that couldn't possibly have caused me that much grief! More often than not, those little things in this project involved invisible characters and/or renegade spaces in my string data -- both of which are just so hard to identify directly. They're like black holes: all I can do is detect what's going on around them and deduce that they're in there. There were so many instances where I would run a SELECT * query and see certain output right there in front of me, but then, if I added a WHERE clause specifying the very same data I had just seen, I would suddenly get a blank response. I wish I could say I've learned a specific tool or two for tracking down exactly what the invisible thing is, but the truth is it remains trial and error. That said, I've now been around that block enough times that I can at least spot rather quickly that I do in fact have an invisible problem, even if I can't quite make it visible at this point.
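If I had known it at the time, one blunt trick does make the invisible visible: repr(). A small sketch, using a made-up example value:

import unicodedata

value = "Seattle Sounders FC\u00a0"  # hypothetical: a non-breaking space hiding at the end
print(repr(value))                   # -> 'Seattle Sounders FC\xa0' - the culprit revealed
for ch in value:
    if not ch.isascii():
        print(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))  # -> 0xa0 NO-BREAK SPACE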
During the date importing stage, I rather enjoyed playing with the datetime module which allowed me to pull together data from three separate columns and convert it into one tidy timestamp. Likewise, I found the export stage provided some useful opportunities to practice and better understand some subqueries and grouping strategies that sometimes tie my brain in knots.
Aside from that, it's kind of hard for me to think of any one major eureka moment where a concept really clicked and I finally overcame an obstacle. It was more a steady stream of small challenges, trying to get the syntax right here and there or dialing in the right regex tweak. As I'm typing this, though, I realize there is one eureka I'm overlooking, which in fact had me feeling rather dumb just a couple of weeks ago; and that's dealing with name variations via my database design.
Early on in this development process, when this was first discussed in class and in a couple of private meetings with Dr. Howison, I just couldn't seem to wrap my head around how having an extra table in my database was going to help me bring together entities with totally different names. It made sense on an intuitive level, but I just couldn't connect the dots to put it into practice. I had such a conceptual blockage going on that my mind just kept trying to come up with other solutions that I COULD wrap my head around. Like, why not just build a Python dictionary and process the name variants at the point of .csv reading - before they even reach the database? Interestingly enough, I think it was precisely that sort of mental avoidance that finally helped me understand how it's supposed to work. Clearly I had wrapped my head around the utility of building a temporary Python dictionary to pair the variant names with the standardized names, so it was just a small step (or trip, really) that led me to realize that the extra team_names table basically IS my Python dictionary! It serves the exact same role I was envisioning Python should play instead. Like so many things, I really just needed to think my way around the problem to finally see what I was looking at. From there, the rest of my design really started to click into place.
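To illustrate the equivalence that finally clicked, here is the team_names idea rendered as the throwaway Python dictionary I kept imagining (the ids come from my standard name menu above; the variants are illustrative):

# every known variant maps to one definitive teams.id
name_to_team_id = {
    "D.C. United": 2,          # standard name
    "Washington DC": 2,        # city-only variant
    "Atlanta United FC": 14,   # standard name
    "Atlanta United": 14,      # hypothetical dropped-suffix variant
}
team_id = name_to_team_id["Washington DC"]  # any variant resolves to the same id

The team_names table does exactly this, except the mapping lives in the database, where every loader script can share it and add to it.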
While wrapping my head around that concept was perhaps my biggest (and somewhat embarrassing) challenge in this project, my favorite accomplishment was probably the name variant identification tool I ended up creating in order to populate that team_names table. It's really not all that elegant, and I'm sure the same effect could probably be accomplished with a fraction of the code were I more experienced. But it was a satisfying exercise taking what little Python knowledge I have and awkwardly hammering it into my problem until I arrived at a solution that does the essential things I need it to do. It identifies all the new name variants in my datasets without me having to look over them manually. It gives me the opportunity to match them directly with the appropriate standard name. It facilitates the discarding of non-useful team name variants. And it remembers what I've discarded so I don't have to do so repeatedly for every line of data a variant may appear in. It's about all I could ask for short of having Python analyze the text directly and select the best match for me -- which I wouldn't trust to begin with. For what I wanted to accomplish in this moment, that bit of code turned out perfect. And it feels good to make something useful with tools I barely understand.
All in all, I'd say this was a successful project for me. Though I'm not particularly proud of the somewhat uninspiring final data analysis portion, I found this to be a fruitful exercise that helped me better understand all that we've covered over the course of this semester.