The tech world is conflicted about how much math a developer needs. Engineers working on quantitative systems or data science clearly require advanced math, yet there are also countless engineering roles where math is unnecessary. My experience is that even if you don’t use math directly, having a mathematical mindset makes you significantly more productive. You’re able to quickly estimate the complexity of various tasks and hone your intuition. You’re also able to quickly recognize patterns when refactoring, especially when working in a functional language. A basic understanding of probability and statistics is a great way to analyze the performance of your code, and it helps you model and understand your application’s behavior. I wanted to share a quick story of how a mathematical approach came in handy when working on Pressi.
First, a little bit of background. Pressi is a social media mashup page that takes the content a user posted across a variety of social media networks and creates a “Flipboard” style web page to showcase it. At launch, we had a simple cron job that would run every hour and pull new data for each of our users. Over time, we migrated to a task system that let us run these retrieval tasks in parallel and split them across multiple machines. Using this approach, we were able to scale well and handle the increased volume, but our hosting costs saw a big jump, so we went looking for a solution.
Luckily for us, we tracked the history of each social network data pull (containing user, network, datetime, and # of items pulled) and doing a quick query told us that close to 92% of our requests resulted in no data being retrieved. The intuition behind this is that most people will not be posting on every social network every hour. By eliminating these calls we’d be able to drastically cut our hosting costs.
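To make that query concrete, here is a minimal Python sketch of the calculation. The `PullRecord` type and its field names are hypothetical stand-ins for the history we tracked (user, network, datetime, and number of items pulled), not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class PullRecord:
    """One historical data pull (hypothetical field names)."""
    user: str
    network: str
    pulled_at: datetime
    items_pulled: int


def empty_pull_rate(records: Iterable[PullRecord]) -> float:
    """Return the fraction of pulls that retrieved no items."""
    total = 0
    empty = 0
    for record in records:
        total += 1
        if record.items_pulled == 0:
            empty += 1
    # With no history at all there is nothing to measure.
    return empty / total if total else 0.0
```

Running something like this over the full pull history is what would surface a number like the 92% above.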
This analysis got us thinking about the ideal case which is for us to pull a moment immediately after it’s posted. One way to achieve it was to leverage the push updates that some of the social networks supported but we wanted to find a more general way that could tell us when we should pull the data for a particular user/network pair.
To figure this out, we looked at another distribution: the average number of moments shared by a user on a network per day. This let us see the full range of users, from those who were extremely active on social media down to those who pretty much only had accounts. We then dumped this data into Excel to come up with ranges for segmenting our user/network pairs, which in turn determined how often we should attempt to pull their data. For example, a user who posted 20 updates a day on Facebook on average would have their Facebook data pulled every 4 hours, but a user who posted on Instagram less than once a day would have their data pulled once a day. This also gave us a way to estimate how many fewer calls we’d need to make compared to what we were currently doing, and therefore to approximate the cost savings. The result of this update was that we dropped the share of useless requests from ~92% to just over 40%. This was by no means perfect but gave us the improvement we needed. An additional update we modeled out but didn’t get a chance to implement was to look at day-of-week and hourly patterns in order to identify when users were actually posting, rather than treating every day and hour the same way. The data clearly showed that users had well-defined schedules, which would have led to another nice improvement.
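The bucketing itself can be sketched in a few lines of Python. Only two points below come from our actual setup: 20 posts a day maps to a pull every 4 hours, and less than one post a day maps to a daily pull. The intermediate thresholds and intervals, and the function names, are made up for illustration.

```python
from bisect import bisect_right
from typing import Iterable

# Bucket boundaries in average posts per day. The 1.0 and 20.0 boundaries
# match the examples in the text; 5.0 is an assumed middle boundary.
THRESHOLDS = [1.0, 5.0, 20.0]
# Pull interval in hours for each bucket. 24 and 4 come from the text;
# 12 and 8 are assumptions.
INTERVALS_HOURS = [24, 12, 8, 4]


def pull_interval_hours(avg_posts_per_day: float) -> int:
    """Map a user/network pair's average daily posts to a pull interval."""
    return INTERVALS_HOURS[bisect_right(THRESHOLDS, avg_posts_per_day)]


def daily_calls(avg_rates: Iterable[float]) -> int:
    """Estimate total pulls per day across all user/network pairs."""
    return sum(24 // pull_interval_hours(rate) for rate in avg_rates)


# Three hypothetical user/network pairs: quiet, moderate, and very active.
rates = [0.5, 3.0, 20.0]
bucketed = daily_calls(rates)   # 1 + 2 + 6 pulls per day
hourly = 24 * len(rates)        # the old "every hour" schedule
print(f"{bucketed} calls/day instead of {hourly}")
```

Comparing the two totals over the real distribution is how a savings estimate like ours can be approximated before changing anything in production.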
The key lesson here was that we started by leveraging the data we collected to identify the major cause of our cost increase and then identified the metric we wanted to optimize. In our case, it was to reduce the volume of empty requests we were making while making sure that we did not significantly increase the average number of moments retrieved per non-empty call. Otherwise, we could make one request a week for each user/network pair, which would pretty much drop the number of useless requests to zero but blow up the average number of moments retrieved per call. We could have chosen a variety of other metrics but went with this one since it was intuitive, easy to model, and easy to test. The other neat property is that it’s self-correcting: if a user changes their behavior on a particular network, we’d shift them into another bucket.
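To illustrate the trade-off between the two numbers, here is a small Python sketch that replays a list of post times against a fixed pull interval. It is not the model we built in Excel; the function name and the representation of posts as hour offsets are assumptions made for the example.

```python
from typing import Sequence, Tuple


def evaluate_schedule(
    post_hours: Sequence[float],
    interval_hours: float,
    horizon_hours: float,
) -> Tuple[float, float]:
    """Replay posts against a fixed pull interval.

    Returns (share of empty pulls, average moments per non-empty pull).
    A pull at time t picks up every post made strictly before t that an
    earlier pull has not already retrieved.
    """
    posts = sorted(post_hours)
    next_post = 0
    calls = empty = retrieved = 0
    pull_number = 1
    while pull_number * interval_hours <= horizon_hours:
        pull_time = pull_number * interval_hours
        count = 0
        while next_post < len(posts) and posts[next_post] < pull_time:
            next_post += 1
            count += 1
        calls += 1
        if count == 0:
            empty += 1
        else:
            retrieved += count
        pull_number += 1
    if calls == 0:
        return 0.0, 0.0
    non_empty = calls - empty
    avg_per_call = retrieved / non_empty if non_empty else 0.0
    return empty / calls, avg_per_call


# Three posts over two days, pulled hourly versus daily.
posts = [1.5, 2.5, 30.0]
print(evaluate_schedule(posts, 1, 48))   # mostly empty pulls, 1 moment each
print(evaluate_schedule(posts, 24, 48))  # no empty pulls, more moments each
```

Pulling less often drives the first number down and the second number up, which is why we watched both rather than just minimizing empty requests.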
None of the math we used was very complicated. Although we tried playing around with a few statistical distributions to model user posting behavior, we quickly abandoned them when we saw the impact we’d get from a simple approach. I’d bet that almost every code base has something that can be improved with a little bit of mathematical analysis.