Identifying duplicate bills across states

2013-02-05

    This past weekend I participated in the Bicoastal Datafest hackathon, which brought together journalists and hackers to analyze money’s influence in politics. I came in with the idea of tracing the evolution of a bill to see which politicians made which changes, and relating that to campaign contributions. I quickly discovered that this wouldn’t be easy, especially in two days, but I did meet Llewellyn, a journalist/hacker, who had a more practical idea: programmatically identifying bills across states that use the same language. The intuition is that bills sharing language are unlikely to have been written independently of one another and are likely to have been influenced by a third party.

    We ended up with the following approach, which we were able to code up over the weekend:

    1. Use the OpenStates API to get the URL of the bills
    2. Download the bills and convert each from PDF or HTML to raw text
    3. Extract 8-word phrases from each bill, excluding stopwords
    4. See which phrases were duplicated across states
    5. Examine the duplicate phrases to see which bills are most likely duplicates
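
    Steps 3 through 5 can be sketched roughly as follows. This is a minimal illustration and not the code we wrote at the hackathon: the function names (phrases, shared_phrases, rank_pairs), the tiny stopword list, and the regex tokenizer are all assumptions made for the example, and the input is assumed to be (state, bill_id, text) tuples produced by steps 1 and 2.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list. A real run would use a
# fuller list, for example one shipped with an NLP library.
STOPWORDS = {
    "a", "an", "and", "any", "are", "as", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "were",
    "which", "with",
}

PHRASE_LEN = 8  # phrase length from step 3


def phrases(text, n=PHRASE_LEN):
    """Return the set of n-word phrases in text, with stopwords dropped first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(bills):
    """Map each phrase seen in more than one state to the bills containing it.

    bills is an iterable of (state, bill_id, text) tuples.
    """
    seen = defaultdict(set)
    for state, bill_id, text in bills:
        for phrase in phrases(text):
            seen[phrase].add((state, bill_id))
    return {
        phrase: owners
        for phrase, owners in seen.items()
        if len({state for state, _ in owners}) > 1
    }


def rank_pairs(shared):
    """Rank cross-state bill pairs by how many phrases they share."""
    counts = defaultdict(int)
    for owners in shared.values():
        ordered = sorted(owners)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                if first[0] != second[0]:  # only count pairs from different states
                    counts[(first, second)] += 1
    return sorted(counts.items(), key=lambda item: -item[1])
```

    Pairs that share many phrases are the ones worth reading by hand first. A single shared phrase can be coincidence or standard legal boilerplate, so the ranking is only a way to prioritize step 5, not a verdict.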

    Somewhat surprisingly, this approach led us to discover the following duplicate bills:

    Firearms Freedom Acts

    Shared the phrase: manufactured without inclusion significant parts imported another state

    Prohibit US government officials from enforcing firearm-related acts

    Shared the phrase: accessory ammunition owned manufactured commercially privately state remains

    Prevent pharmaceutical substitution of opioid drugs

    Shared the phrase: bear labeling claim respect reduction tampering abuse abuse

    The code’s up on GitHub, so if you have any ideas or improvements, please contribute and help out. In two days we were able to get something useful done, and it’s exciting to think about what we could discover if we stick with it.