As part of my preparation for the Intro to MySQL class I decided to put together a dataset we’d be able to explore over the course of the class. While trying to think of an interesting dataset to use I remembered I had a script that scraped Yahoo’s fantasy football projections for the 2014 seasons that I used to prepare for my draft. The only issue was that the script generated a CSV file so I had to go through a series of steps to turn it into a clean, relational database. I thought it would be useful to share the commands below and provide some context for those interested in learning more about MySQL and the data import/cleanup process.
The first step is to create the table that we’ll be loading the CSV file into
Now we load the CSV file into the table making sure to specify the options properly. In my case this took a few attempts to deal with the line endings.
Next step is to create the tables we want to end up with. In my case I wanted to normalize the data which required designed a new set of tables. A big assumption made here was that a player will not get traded from one team to another. This is definitely not correct in the real world but it is good enough for this exercise. If we wanted to allow for trades we would have a separate table that would map a player to a team by week.
Now it’s on to the hard part. We want to take the data in the original stats table and convert into a properly normalized data set. The strategy here is to start with the simple tables and work our way up to the more complicated ones leveraging the normalized data we created at each step. The first two tables are teams and positions and we can derive them from the “position” field in the original stats table by splitting the position field into two and realizing that the left side is the team and the right side is the position of given the player.
To generate the players table, we get the player position and team from the stats table and then find the associated ids from the teams and positions tables. The key assumption here is that there are no two players with the same name, on the same team, and the same position.
We can figure out the schedule by getting a list of the unique matchups in the original stats table. Since the games are symmetric we only need to look at the rows that are home games.
Putting everything together we generate the new stats table by doing the relevant lookups in the tables we created. We can ignore the redundant fields (name, position, opponent) and the only thing we need to watch out for is duplicate players. In this case there are two names, Zach Miller and Alex Smith, that need to be made “unique” by also looking at their team.