I recently watched Miss Representation, a documentary about how the portrayal of women in the media affects women’s roles in society. It raised many interesting points and got me thinking. If you haven’t seen it already, you should check it out. One of its points was that there’s huge pressure to cast young actresses in female roles, whereas age doesn’t matter nearly as much for male roles. I was sure this was true, but I wanted to see how big of a deal it actually was, take a coding break, and play around with some data. The goal was to replicate the results as well as provide some tools for others to do similar analyses.
I took a quick look at the IMDB site and realized that they did not have an API available. I looked at a few open source alternatives, but they all seemed like overkill for what I wanted to do, so I decided to just write a quick Python script to scrape the pages I needed. I started by pulling the top 50 movies for each decade (via http://www.imdb.com/chart/1910s - http://www.imdb.com/chart/2010s) and then pulling the top 5 cast members for each movie (via pages such as http://www.imdb.com/title/tt1375666/fullcredits#cast). I also had to look at each actor’s or actress’s page to pull the birth date and sex. After loading this data into a database, the analysis was a very simple query, and I used Google Spreadsheets to clean up the results.
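To give a feel for what that "very simple query" might look like, here is a minimal sketch using Python's built-in sqlite3 module. It is not the actual script or schema: the table names (movies, people, cast_members), the column names, and the placeholder rows are all made up for illustration, and it assumes age is computed as release year minus birth year.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, sex TEXT,
                             birth_year INTEGER);
        CREATE TABLE cast_members (movie_id INTEGER, person_id INTEGER,
                                   billing INTEGER);
    """)
    # Placeholder data only; these are not real movies or people.
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
        (1, "Placeholder A", 1985),
        (2, "Placeholder B", 1992),
    ])
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", [
        (1, "Person One", "M", 1945),
        (2, "Person Two", "F", 1955),
        (3, "Person Three", "M", 1950),
        (4, "Person Four", "F", 1962),
    ])
    conn.executemany("INSERT INTO cast_members VALUES (?, ?, ?)", [
        (1, 1, 1), (1, 2, 2),
        (2, 3, 1), (2, 4, 2),
    ])
    return conn


def average_age_by_decade(conn):
    """Return (decade, sex, average age) rows, one per decade and sex."""
    query = """
        SELECT (m.year / 10) * 10 AS decade,       -- integer division buckets years
               p.sex,
               AVG(m.year - p.birth_year) AS avg_age  -- assumed definition of age
        FROM cast_members c
        JOIN movies m ON m.id = c.movie_id
        JOIN people p ON p.id = c.person_id
        GROUP BY decade, p.sex
        ORDER BY decade, p.sex
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for decade, sex, avg_age in average_age_by_decade(build_demo_db()):
        print(f"{decade}s {sex}: {avg_age:.1f}")
```

Grouping on the computed decade keeps the whole analysis in one query, so the spreadsheet step only has to tidy and chart the output.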
Not surprisingly, it turns out that over the past 11 decades, the average actor’s age is 41 while the average actress’s is 32. Interestingly, during the 1980s the two averages were almost the same, but the gap has been widening since then.
I may be a bit late to the “Don’t learn to code” debate, but I think this illustrates that coding is a pretty useful skill to have. It’s not about being able to develop enterprise applications but more about automating some work and being able to scratch a curiosity itch. If this data were publicly available and everyone had the tools and abilities to do these types of analyses, I believe we’d be in much better shape. Maybe someone can take the work I started and use it to discover something new.
Note: The code to scrape IMDB is posted on GitHub, but it’s definitely crude and hackish at times. My goal was to get the analysis done as quickly as possible, so I didn’t spend much time refactoring.