INF 385T.9: Data Wrangling

Final Project Report: Geographic & Economic Correlations with Competitive Success in Major League Soccer

Anon

Spring 2022

__

Project Summary

For my project, I have chosen to explore team/stadium locations, player salaries, and match performance data for the 2021 season of U.S. Major League Soccer, in an attempt to visually analyze whatever correlations may exist between geographic location and player compensation budgets on one side and overall team/match performance on the other, across one regular season of play.

To that end, I've gathered three primary data documents pulled from two separate online resources:

  1. Sports stadium geographic coordinates (CSV) for the year 2021 compiled from individual Wikipedia records by Kaggle user Logan Donaldson -- containing all team names and home stadium latitude/longitude coordinates in Major League Baseball (MLB), National Football League (NFL), National Basketball Association (NBA), National Hockey League (NHL), and Major League Soccer (MLS) - https://www.kaggle.com/logandonaldson/sports-stadium-locations

  2. Complete Major League Soccer match data (teams, locations, scores, etc.) from 1996 to early 2022, compiled in a CSV file by Kaggle user Joseph Mohr from public data published by Major League Soccer and ESPN - https://www.kaggle.com/josephvm/major-league-soccer-dataset?select=matches.csv

  3. 2021 player salaries published directly online (HTML table) by the MLS Players Association - https://mlsplayers.org/resources/salary-guid

From this data, I have produced a few simple charts to help visually explore the following correlations:

  1. Compare cumulative team salaries based on latitude and longitude. Do salary budgets vary from north to south/east to west?

  2. Compare each team's cumulative success rate (goals scored/goals conceded) in relation to its player salary budget. Does more money spent equal more success where it matters on the field?

  3. Assess the apparent impact of home field advantage across the league. Where does being on home soil seem to have the most positive or negative impact on game performance?

Data Sources

As mentioned previously, my first two data sources (stadium locations and match data) were compiled from other sources and shared by users of Kaggle.com. As such, I was able to download both of them directly from Kaggle in CSV format, thereby requiring no additional processing before uploading to my JupyterHub - as demonstrated in these snapshots highlighting both the original raw data and the spreadsheet format displayed in JupyterLab after they were uploaded:

stadium_data_raw.jpg stadium_jupyter_view.jpg

match_data_raw.jpg match_data_jupyter_view.jpg

Despite being prepackaged and readily digestible in comma-delimited format, I nonetheless encountered my first setback with this data when trying to complete my homework assignment for week 11 - wherein we were asked to write a Python script to read one of our files and parse a few different data elements into meaningful sentences. While the .csv documents displayed perfectly through the JupyterLab CSV viewer, my attempts to read through the file and call on pieces of data by their corresponding dictionary keys kept failing. After much trial and error, I uncovered certain invisible characters such as '\ufeff' at key points in my document. After reading a few discussion boards, I eventually learned that this is due to a discrepancy between how the text was originally encoded and the default reading parameters in Python's open() function. I was finally able to overcome it by simply adding a utf-8 encoding parameter within the function before reading the file into Python. As a precaution, I went ahead and applied that parameter to each file I read thereafter as well. I suspect it isn't necessary in every case, but I haven't observed any other sneaky character problems that would prompt me to reconsider this decision.
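
For reference, the pattern ended up looking something like this minimal sketch (the file and column names here are stand-ins, not the exact headers; 'utf-8-sig' is the UTF-8 variant that explicitly strips the '\ufeff' byte order mark):

    import csv

    # 'utf-8-sig' strips the invisible '\ufeff' byte order mark from the start
    # of the file, so the first column header comes through clean.
    with open('2021_Stadiums.csv', encoding='utf-8-sig', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Without the encoding fix, the first key can arrive as '\ufeffTeam'
            # instead of 'Team', and row['Team'] fails with a KeyError.
            print(f"{row['Team']} plays its home games at ({row['Lat']}, {row['Long']}).")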

The player salary data was not readily available in .csv format, but rather couched within a scrollable HTML table on the MLS Players Association website. I simply used the HTML to CSV table converter (https://www.convertcsv.com/html-table-to-csv.htm) to extract this data and export it into .csv format before uploading to JupyterHub - as seen in these raw vs uploaded version screenshots:

player_data_raw.jpg player_data_jupyter_view.jpg

Workflow

Screenshot 2022-05-05 073340.jpg

Data Collection

SQL Database Initialization

Parsing Data and Loading Into SQL Database

Export For Analysis

Visual Analysis

ER Diagram

Screenshot 2022-05-04 031147.jpg Screenshot 2022-05-04 031351.jpg

I've chosen to spread this data out across the essential components of gameplay: the player, the team, and the match.
Though my player and team location datasets are confined to 2021, I've arranged players and teams in a has-and-belongs-to-many (habtm) relationship, with a shared relational table in between them, in order to acknowledge the temporal nature of any player/team assignment and compensation package. Were I to obtain new data for subsequent years, this would hopefully make it easier to track a player's location and pay rate over time without overly complicating the players table. Likewise, placing team_match_appearances in between teams and matches makes it simpler to parse out and analyze each individual team's performance within each game while still having a dedicated repository for more basic match data like date/time, attendance, and location (via the home_team_id).

Were I to pull out all of the additional player data available in 'MLS_Matches.csv', I would almost certainly need to resolve quite a few player name aliases. However, I found enough unresolvable inconsistencies in the player information contained within that matches dataset to render it not worth such additional effort. Instead, the only attribute name variants I had to deal with came in the form of team names - i.e. some data sources simply list the city, others may or may not include foreign language accent marks or punctuation (e.g. Washington DC vs D.C.), and then - because it's American soccer, after all - there are any number of opinions as to whether or not a team should attach an FC or CF to its name.

To reconcile these differing names so that their corresponding data all become attached to the same team entity, I've created a separate team_names table specifically to contain all possible team name variants I may come across in my data. I've chosen to treat the teams table as the definitive record for each team - containing what I consider its "standard_name" along with its location, etc. As such, teams.id is the definitive identifier for each recognized team, and it is attached to each team name variant in the team_names table as a way of ensuring that any data source that references a given name variant will draw a consistent team_id as we write its data across various associated tables.

Originally, populating all the known team name variants into the team_names table was a rather manual affair. I copied each corresponding team column from each dataset into a separate Excel spreadsheet, wherein I eliminated all the duplicate entries and whittled it down to a tidy little naming key of sorts, with each name variant represented in its own row along with whatever its corresponding StandardizedName ought to be - even if that happens to be the same as the variant itself. That last part was necessary for the matching process to work correctly after this homemade name variant dataset was copied into the team_names table, where it would then serve as a key for future data reads. Essentially, when I read the name variant csv, I told Python to add each NameVariant to the team_names table and then use the StandardizedName column to match up with the standard_name in the teams table and attach its teams.id alongside the name variant in the team_names table. The manually compiled name variant spreadsheet and the code I used to populate it into my database looked like this:

team_variant_csv_example.jpg
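
Since that spreadsheet and its loading code only survive here as a screenshot, a simplified sketch of the logic looks like this (assuming sqlite3, a 'team_name_variants.csv' export of the spreadsheet, and the NameVariant/StandardizedName headers described above; the database file name is a stand-in):

    import csv
    import sqlite3

    conn = sqlite3.connect('mls_2021.db')   # database file name is illustrative
    cur = conn.cursor()

    with open('team_name_variants.csv', encoding='utf-8-sig', newline='') as f:
        for row in csv.DictReader(f):
            # Look up the definitive team record by its standardized name...
            cur.execute("SELECT id FROM teams WHERE standard_name = ?",
                        (row['StandardizedName'],))
            team_id = cur.fetchone()[0]
            # ...then store the variant spelling alongside that teams.id so any
            # data source using this spelling resolves to the same team.
            cur.execute("INSERT INTO team_names (name_variant, team_id) VALUES (?, ?)",
                        (row['NameVariant'], team_id))

    conn.commit()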

Now then, why am I referring to this in the past tense? Well... While this strategy seemed to work just swell - and in fact was one of the strategies Dr. Howison had advised us to use - the somewhat awkward and potentially error-prone method by which I had derived that homemade variant list just didn't sit well with me over time. What if I had missed one of the variants somewhere? What if I copied in an updated dataset that contains yet another variant I might have overlooked? For these reasons, I ended up wasting WAY too much of my last day of work on this project throwing out my old name variant csv and wrestling with Python logic I only barely understand to try to automate the process of catching any new variant that pops up in my data and populating the team_names table (or not) accordingly. The code is of course listed below, but essentially I've just incorporated a pre-read of each of my csv documents to review the team names before looking at any of the other data. As mentioned before, I'm treating the names used in the stadiums dataset as my standard, so that gets read and written to the team_names table first - I do this at the same time I write the data to the teams table. Then, with each new document, I tell Python to read through each line and check the team name against the team_names table; if the name isn't there, I get a prompt telling me the name variant in question, displaying a list of acceptable standard_names along with their teams.id, and asking me which one to match the name variant up with. If the data is irrelevant in some way, there's also the option to pass and ignore that team name altogether. The version I landed on in the end surely isn't the best possible solution to this problem - and in fact it still involves a manual id assignment step - but it gets the job done and makes me feel more confident that I've actually caught everything and could continue to catch any new variants in each future data import.

The momentary interaction ends up looking like this:

Screenshot 2022-05-05 091348.jpg
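
Since the full script only appears here as a screenshot, the core of that interaction boils down to something like the following sketch (table and column names follow my ER diagram; the real version also persists the ignore list between runs):

    def resolve_team_id(cur, raw_name, ignored_names):
        """Return the teams.id for a raw team name, prompting me when it's new."""
        if raw_name in ignored_names:
            return None   # previously marked irrelevant; skip silently
        cur.execute("SELECT team_id FROM team_names WHERE name_variant = ?", (raw_name,))
        match = cur.fetchone()
        if match:
            return match[0]
        # Unknown variant: list the accepted standard names and ask which one it is.
        cur.execute("SELECT id, standard_name FROM teams ORDER BY id")
        for team_id, standard_name in cur.fetchall():
            print(f"{team_id}: {standard_name}")
        choice = input(f"New team name '{raw_name}' found. Enter a teams.id, or 'pass' to ignore: ")
        if choice.lower() == 'pass':
            ignored_names.add(raw_name)
            return None
        cur.execute("INSERT INTO team_names (name_variant, team_id) VALUES (?, ?)",
                    (raw_name, int(choice)))
        return int(choice)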

_

Loading Data Into My Database

>Create my database

>Create each of my tables according to my ER Diagram sketch
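
The exact column lists live in my notebooks, but a stripped-down version of the table definitions (SQLite syntax, limited to the columns discussed in this report, with a few guessed names) looks roughly like this:

    import sqlite3

    conn = sqlite3.connect('mls_2021.db')   # database file name is illustrative
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS teams (
        id INTEGER PRIMARY KEY,
        standard_name TEXT,
        latitude REAL,
        longitude REAL
    );
    CREATE TABLE IF NOT EXISTS team_names (
        name_variant TEXT,
        team_id INTEGER REFERENCES teams(id)
    );
    CREATE TABLE IF NOT EXISTS players (
        id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE IF NOT EXISTS players_teams (
        player_id INTEGER REFERENCES players(id),
        team_id INTEGER REFERENCES teams(id),
        season INTEGER,
        compensation REAL
    );
    CREATE TABLE IF NOT EXISTS matches (
        id INTEGER PRIMARY KEY,
        match_datetime TEXT,
        attendance INTEGER,
        home_team_id INTEGER REFERENCES teams(id)
    );
    CREATE TABLE IF NOT EXISTS team_match_appearances (
        match_id INTEGER REFERENCES matches(id),
        team_id INTEGER REFERENCES teams(id),
        is_home INTEGER,
        goals_for INTEGER,
        goals_against INTEGER
    );
    """)
    conn.commit()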

>Read '2021_Stadiums.csv' and write data to the teams table.

Also, add each standard name variant to the team_names table along with its corresponding teams.id
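
In sketch form, reusing the conn/cur from the earlier sketches (the Kaggle file's column headers are stand-ins, and since that file covers five leagues, only the MLS rows matter):

    with open('2021_Stadiums.csv', encoding='utf-8-sig', newline='') as f:
        for row in csv.DictReader(f):
            if row['League'] != 'MLS':    # skip MLB/NFL/NBA/NHL rows
                continue
            cur.execute("INSERT INTO teams (standard_name, latitude, longitude) VALUES (?, ?, ?)",
                        (row['Team'], row['Lat'], row['Long']))
            team_id = cur.lastrowid
            # The stadium file's spelling doubles as the standard name, so it is
            # also recorded as the first "variant" pointing at this teams.id.
            cur.execute("INSERT INTO team_names (name_variant, team_id) VALUES (?, ?)",
                        (row['Team'], team_id))
    conn.commit()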

Verifying data load with a SQL notebook:

Screenshot 2022-05-05 034645.jpg Screenshot 2022-05-05 034854.jpg

>Review team names in the rest of my datasets and add new team name variants to team_names when applicable.

This is the extra, arguably unnecessary side project I described previously, which is meant to replace the manually created name variant spreadsheet and allow for a more consistent, automated process.

One thing I failed to mention previously is that by implementing this and testing to make sure it was working correctly, I actually discovered an extra name variant which had somehow been omitted in my previous drafts. It seems my uneasiness with that previous strategy had at least some merit. Enough to justify the amount of time I wasted chipping away at this replacement when I should have been crafting more meaningful analyses? That I couldn't say.

Verify new team name variants are populating team_names via SQL notebook:

Screenshot 2022-05-05 035640.jpg

It's working!

>Populate players table with names from MLS_2021_Player_Salaries.csv. Add player id and team assignment data to players_teams table.
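
A condensed sketch of that step, reusing the resolve_team_id() helper and the conn/cur from the sketches above (the salary file's column headers here are guesses, not the exact ones):

    ignored_names = set()   # team names I've chosen to skip during this import

    with open('MLS_2021_Player_Salaries.csv', encoding='utf-8-sig', newline='') as f:
        for row in csv.DictReader(f):
            team_id = resolve_team_id(cur, row['Club'], ignored_names)
            if team_id is None:
                continue
            cur.execute("INSERT INTO players (name) VALUES (?)",
                        (f"{row['First Name']} {row['Last Name']}",))
            player_id = cur.lastrowid
            # The join table carries the 2021 assignment and compensation package.
            cur.execute("""INSERT INTO players_teams (player_id, team_id, season, compensation)
                           VALUES (?, ?, 2021, ?)""",
                        (player_id, team_id, row['Guaranteed Compensation']))
    conn.commit()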

Verify players and players_teams table population with SQL notebook:

Screenshot 2022-05-05 043257.jpg Screenshot 2022-05-05 043333.jpg

>Populate matches table and team_match_appearances table with data from matches.csv -

matching relevant team_id for each home and away team
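
Continuing the same sketch (the matches.csv column names are stand-ins; the real code also filters down to the 2021 regular season and builds the timestamp described under Challenges):

    with open('matches.csv', encoding='utf-8-sig', newline='') as f:
        for row in csv.DictReader(f):
            home_id = resolve_team_id(cur, row['home'], ignored_names)
            away_id = resolve_team_id(cur, row['away'], ignored_names)
            if home_id is None or away_id is None:
                continue   # fixture involves a team I've chosen to ignore
            cur.execute("INSERT INTO matches (match_datetime, attendance, home_team_id) VALUES (?, ?, ?)",
                        (row['date'], row['attendance'], home_id))
            match_id = cur.lastrowid
            # One appearance row per team, each from that team's point of view.
            cur.execute("""INSERT INTO team_match_appearances
                           (match_id, team_id, is_home, goals_for, goals_against)
                           VALUES (?, ?, 1, ?, ?)""",
                        (match_id, home_id, row['home_score'], row['away_score']))
            cur.execute("""INSERT INTO team_match_appearances
                           (match_id, team_id, is_home, goals_for, goals_against)
                           VALUES (?, ?, 0, ?, ?)""",
                        (match_id, away_id, row['away_score'], row['home_score']))
    conn.commit()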

Verify matches and team_match_appearances table population with SQL notebook

Screenshot 2022-05-05 044423.jpg Screenshot 2022-05-05 044559.jpg

__

Exporting for Analysis

CSV Export #1: Team Salaries vs Geographic Location

(team name / total player compensations by team / team latitude / team longitude)

CSV Export #2: Team Salaries vs Regular Season Match Performance

(team name / total player compensation / total goals for / total goals against / goal differential / total wins)

CSV Export #3: Goal Counts by Stadium Location

(host stadium team name / host latitude / host longitude / home goal total / away goal total / combined goal total / goal differential)
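
As an example of how these exports were shaped, Export #2 boils down to a grouped join with a salary subquery, roughly like the sketch below against my schema (the real query also restricts matches to the 2021 regular season, and the output file name is a stand-in):

    export_query = """
    SELECT t.standard_name                                                     AS team_name,
           (SELECT SUM(pt.compensation)
              FROM players_teams pt
             WHERE pt.team_id = t.id)                                          AS total_compensation,
           SUM(tma.goals_for)                                                  AS goals_for,
           SUM(tma.goals_against)                                              AS goals_against,
           SUM(tma.goals_for - tma.goals_against)                              AS goal_differential,
           SUM(CASE WHEN tma.goals_for > tma.goals_against THEN 1 ELSE 0 END)  AS wins
      FROM teams t
      JOIN team_match_appearances tma ON tma.team_id = t.id
     GROUP BY t.id
     ORDER BY total_compensation DESC
    """

    with open('export2_salaries_vs_performance.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['team_name', 'total_compensation', 'goals_for',
                         'goals_against', 'goal_differential', 'wins'])
        writer.writerows(cur.execute(export_query))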

Visual Analysis

Having exported these three .csv files, I chose to try out Tableau for the data visualizations.

Comparing total team compensation rates from east to west and south to north.

compensation_vs_longitude.jpg

compensation_vs_latitude.jpg

Comparing total team compensation with cumulative goal output and cumulative goals conceded.

performance_v_compensation.jpg

Evaluation of Homefield Advantage - Goals Scored vs Goals Conceded at Home

homefield_advantage.jpg

Takeaways from these charts:

Unsurprisingly, the geographical comparisons of salary and goal output proved rather unenlightening. The relatively even spread of salary rankings and goal outputs across the full spectrum of latitudes and longitudes suggests barely a hint of correlation there.

While less of a true analysis and more of a straightforward ranking of performance, I found the home stadium goal differential chart to be somewhat more interesting than the others. On a personal level as a soccer fan, it's interesting to see just how much better or worse certain teams tend to perform in their own home stadiums. Anecdotally, Seattle is understood to be a fortress of impenetrable home team success. Yet, at least in 2021, they appeared to perform only half as well at home as NY City. Unlike the previous charts, this one provides a very clear answer as to which teams performed well at home last year and which did not.

My Chosen Analysis Tool: Tableau

I specifically chose Tableau because I was unfamiliar with it. I've encountered the name periodically over the last couple of years, always in a rather matter-of-fact way that suggests I should at least know something about it. So, rather than taking my freshly wrangled data straight to trusty ol' Excel, I dove head first into Tableau. While I was ultimately able to drag some things around and tweak some knobs here and there to make my charts look reasonably professional, I found the interface to be somewhat overwhelming and not very intuitive. Having very little time to delve too deeply into tutorial videos and whatnot, I really just adopted a fake-it-till-you-make-it approach, which turned out better than I would've expected. That said, my charts are very simple and probably could have been done just as well, if not better, in Excel. But then I still wouldn't have any idea what Tableau is all about.

Now that I realize I have free usage of it while I'm a student, I'll no doubt do some more exploring to try to unlock its more advanced capabilities. For this project, though, I found the Python and SQL wrangling to be far more satisfying than the final visual results.

__

Challenges

Though I've dabbled in only a little coding in my adult life, my experiences with this project seemed consistent with most of those other experiences -- which is to say, it's the little things that really add up and take a lot of time to sort out. After hours or days of head scratching, I all too often look back on my work and all I see is a handful of relatively simple lines of code that couldn't possibly have caused me that much grief! More often than not, those little things in this project involved invisible characters and/or renegade spaces in my string data -- both of which are just so hard to identify directly. They're like black holes - all I can do is detect what's going on around them and deduce that they're in there. There were so many instances where I would run a SELECT * query and see certain output right there in front of me, but then if I added a WHERE clause specifying the very same data I had just seen, I would suddenly get a blank response. I wish I could say I've learned a specific tool or two for tracking down exactly what the invisible thing is, but the truth is it remains trial and error. That said, I've now at least been around that block enough times that I can spot rather quickly that I do in fact have an invisible problem, even if I can't quite make it visible at this point.
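
For what it's worth, one simple way to at least make such characters visible is Python's repr(), as in this made-up example (the value below is purely illustrative):

    value_from_csv = '\ufeffSeattle Sounders FC '   # made-up example of a dirty value

    print(value_from_csv)         # looks perfectly normal when printed
    print(repr(value_from_csv))   # reveals the BOM and the trailing space

    cleaned = value_from_csv.lstrip('\ufeff').strip()   # drop the BOM, then stray whitespace
    print(repr(cleaned))          # 'Seattle Sounders FC'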

During the data importing stage, I rather enjoyed playing with the datetime module, which allowed me to pull together data from three separate columns and convert it into one tidy timestamp. Likewise, I found the export stage provided some useful opportunities to practice and better understand some subqueries and grouping strategies that sometimes tie my brain in knots.
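
That datetime step looked roughly like the following sketch - the three column names and their formats are stand-ins, since the actual matches.csv layout isn't reproduced in this report:

    from datetime import datetime

    # Illustrative values standing in for three columns from one matches.csv row.
    row = {'date': 'Sunday, November 7', 'year': '2021', 'time': '3:30 PM'}

    combined = f"{row['date']} {row['year']} {row['time']}"
    kickoff = datetime.strptime(combined, '%A, %B %d %Y %I:%M %p')
    print(kickoff.isoformat())   # 2021-11-07T15:30:00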

Aside from that, it's kind of hard for me to think of any one major eureka moment where a concept really clicked and I finally overcame an obstacle. It was more of a steady stream of small challenges, trying to get the syntax right here and there or dialing in the right regex tweak. As I'm typing this, though, I realize there is one eureka I'm overlooking, which in fact had me feeling rather dumb just a couple of weeks ago: dealing with name variations via my database design.

Early on in this development process, when this was first discussed in class and in a couple of private meetings with Dr. Howison, I just couldn't seem to wrap my head around how having an extra table in my database was going to help me bring together entities with totally different names. It made sense on an intuitive level, but I just couldn't connect the dots to put it into practice. I had such a conceptual blockage going on that my mind just kept trying to come up with other solutions that I COULD wrap my head around. Like, why not just build a Python dictionary and process the name variants at the point of .csv reading - before they even reach the database? Interestingly enough, I think it was precisely that sort of mental avoidance that finally helped me to understand how it's supposed to work. Clearly I had wrapped my head around the utility of building a temporary Python dictionary in order to help pair the variant names with the standardized names, so it was just a small step (or trip, really) that led me to realize that that extra team_names table basically is my Python dictionary! It's serving the exact same role I was envisioning Python should fill instead. Like so many things, I really just needed to think my way around the problem to finally see what I was looking at. From there, the rest of my design really started to click into place.

While wrapping my head around that concept was perhaps my biggest (and somewhat embarrassing) challenge in this project, my favorite accomplishment was probably the name variant identification tool I ended up creating in order to implement that team_names table. It's really not all that elegant, and I'm sure the same effect could probably be accomplished with a fraction of the code were I more experienced. But it was a satisfying exercise, taking what little Python knowledge I have and awkwardly hammering it into my problem until I arrived at a solution that does the essential things I need it to do. It identifies all the new name variants in my datasets without me having to look over them manually. It gives me the opportunity to match them directly with the appropriate standard name. It facilitates the discarding of non-useful team name variants. And it remembers what I've discarded so I don't have to do so repeatedly for every line of data it may appear in. It's about all I could ask for short of having Python analyze the text directly and select the best match for me -- which I wouldn't trust to begin with. For what I wanted to accomplish in this moment, that bit of code turned out perfectly. And it feels good to make something useful with tools I barely understand.

All in all, I'd say this was a successful project for me. Though I'm not particularly proud of the somewhat uninspiring final data analysis portion, I found this to be a fruitful exercise that helped me to better understand all that we've covered over the course of this semester.