Almost a year ago, AWS launched Athena, which lets you query data directly in S3. I loved the idea since it would allow us to simplify our workflow by reducing the need for Spark and Redshift while also cutting our costs. In theory, queries that were being run via Spark or Redshift could just be run on top of data stored in S3 without having to load it into any other system.
I messed around with a few toy examples and was able to get up and running pretty quickly. Yet I was left disappointed when I threw a real problem at it. We have a variety of jobs that require us to scan a month’s worth of filtered impression data, and I wasn’t able to get Athena to complete the query no matter how many variations I tried. Every time I thought I had a clever workaround, Athena still failed to execute the query.
I’m optimistic that Athena will work through its kinks, but at the moment we’re still sticking with our tried-and-true solutions in Spark and Redshift. Over the next few weeks I’m going to give Redshift Spectrum a shot, which seems to offer the best of both worlds.