Super Simple Storage for Social Web Data with MongoDB (Computing Twitter Influence, Part 4)

In the last few posts in this series on computing Twitter influence, we’ve reviewed some of the considerations in calculating a base metric for influence and how to acquire the necessary data to begin analysis. This post finishes up the prerequisite machinery before the real data science fun begins by introducing MongoDB as a staple in your social web mining toolkit and showing how to employ it for storing social data such as Twitter API responses.

As Easy As It Should Be

MongoDB is an excellent option to consider if you need a quick and easy fix for your data science experiments, and if you like Python, there’s a good chance you’ll enjoy MongoDB as well. Much like Python, MongoDB is easy to pick up along the way: it scales up fairly well as the size of your data grows without too much fuss, the online documentation is excellent, the community is robust, language bindings are plentiful, and it’s generally just as easy as it should be to shuttle data to and from Python.


MongoDB is document-oriented, which (for our purposes) basically means that it stores JSON data, enabling you to easily archive the responses that you get back from most social web APIs. It’s easy enough to query the data with the standard find() operator, but a more powerful aggregation framework is also available for constructing more nuanced data pipelines.
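For example, here’s a minimal sketch of what working with Twitter-style JSON looks like from Python via pymongo. It assumes a local mongod running on the default port and a reasonably recent pymongo (3.x); the database, collection, and field names are purely illustrative.

import pymongo # pip install pymongo

# Connect to a local MongoDB instance (assumes the default host/port)
client = pymongo.MongoClient('localhost', 27017)
db = client['twitter_sandbox']

# A MongoDB document is just a JSON (BSON) object, so API responses can be
# stored more or less as-is
db['statuses'].insert_one({'id_str': '1234', 'text': 'Hello, world', 'retweet_count': 3})

# Simple querying with find()
for doc in db['statuses'].find({'retweet_count': {'$gte': 1}}):
    print doc['text']

# The aggregation framework handles more nuanced pipelines, e.g. counting
# statuses grouped by retweet_count
pipeline = [{'$group': {'_id': '$retweet_count', 'count': {'$sum': 1}}}]
print list(db['statuses'].aggregate(pipeline))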

A full MongoDB primer is beyond the scope of this post, but if you have a copy of the book on hand, Chapter 6 (Mining Mailboxes) introduces MongoDB as a sort of surrogate API for mail data. (The first half of that chapter focuses on normalizing arbitrarily sourced mail data so that it can be ingested into MongoDB for standardized analysis.)

Saving and accessing JSON data with MongoDB (Example 9-7 from the Twitter Cookbook) introduces two functions for storing and retrieving Twitter API data from MongoDB that we’ll adapt in the next section for our immediate needs. Take a moment to review this recipe if you haven’t previously encountered it. The functions that it provides are little more than load/store convenience wrappers.
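If you don’t have the recipe in front of you, the sketch below approximates what those wrappers boil down to; the actual Example 9-7 code differs in its details, but the call signature used later in this post, save_to_mongo(data, database, collection_name), is the part that matters.

import pymongo # pip install pymongo

def save_to_mongo(data, mongo_db, mongo_db_coll, host='localhost', port=27017):
    # Store a JSON-serializable document in the given database and collection
    client = pymongo.MongoClient(host, port)
    coll = client[mongo_db][mongo_db_coll]
    return coll.insert_one(data).inserted_id

def load_from_mongo(mongo_db, mongo_db_coll, criteria=None, projection=None,
                    host='localhost', port=27017):
    # Retrieve documents matching optional criteria/projection as a plain list
    client = pymongo.MongoClient(host, port)
    coll = client[mongo_db][mongo_db_coll]
    return list(coll.find(criteria or {}, projection))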

Storing Millions of Twitter Followers

Recall from the last post in this series that a recipe like Getting all friends or followers for a user (Example 9-19 from the Twitter Cookbook) is fundamentally limited by the amount of memory that’s available. It buffers API responses in memory and accumulates 75,000 long integer values every 15 minutes, and although this is fine for a user with a “reasonable” number of followers, it won’t work at all for celebrity users with millions of followers. Even if we did have unlimited heap space, we’d still want to strive for a low memory profile as well as maintain a persistent archive for more convenient analysis that’s unconstrained by rate limits and network latency. After all, once you have the data, you won’t want to go to the trouble of fetching it again unless absolutely necessary since this process can be quite time consuming.
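To get a feel for the timescales involved, here’s a back-of-the-envelope estimate based on the rate limit mentioned above (15 cursored requests of 5,000 ids per 15-minute window, or roughly 75,000 ids per window); the follower counts are purely illustrative.

# Rough harvest-time estimate given Twitter's cursoring rate limits:
# 15 requests per 15-minute window x 5,000 ids per request = 75,000 ids/window
IDS_PER_WINDOW = 15 * 5000
WINDOW_MINUTES = 15

def estimated_harvest_hours(num_followers):
    windows = -(-num_followers // IDS_PER_WINDOW) # ceiling division
    return windows * WINDOW_MINUTES / 60.0

for count in [75000, 1000000, 30000000]:
    print '%d followers: ~%.1f hours' % (count, estimated_harvest_hours(count))

An account with 30 million followers works out to roughly 400 rate-limit windows, or about 100 hours of continuous polling, which is exactly why persisting each response as it arrives matters.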

To illustrate just how easy it is to adapt a recipe from the cookbook like Example 9-19, take a look at this revised version of get_friends_followers_ids, renamed store_friends_followers_ids, and compare it to the original version. The primary substance of the change is simply the introduction of a save_to_mongo call for persisting each API response (along with a few tweaks to make this possible).

import sys
from sys import maxint
from functools import partial

# oauth_login, make_twitter_request, and save_to_mongo are defined in other
# Twitter Cookbook recipes (save_to_mongo comes from Example 9-7, above)

def store_friends_followers_ids(twitter_api, screen_name=None, user_id=None,
                                friends_limit=maxint, followers_limit=maxint,
                                database=None):

    # Must have either screen_name or user_id (logical xor)
    assert (screen_name != None) != (user_id != None), "Must have screen_name or user_id, but not both"

    # See https://dev.twitter.com/docs/api/1.1/get/friends/ids and
    # https://dev.twitter.com/docs/api/1.1/get/followers/ids
    # for details on API parameters

    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, count=5000)
    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, count=5000)

    for twitter_api_func, limit, label in [
                                 [get_friends_ids, friends_limit, "friends"],
                                 [get_followers_ids, followers_limit, "followers"]
                             ]:

        if limit == 0: continue

        total_ids = 0
        cursor = -1
        while cursor != 0:

            # Use make_twitter_request via the partially bound callable...
            if screen_name:
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id
                response = twitter_api_func(user_id=user_id, cursor=cursor)

            if response is not None:
                ids = response['ids']
                total_ids += len(ids)
                # Persist each page of ids as its own document in a collection
                # named "friends_ids" or "followers_ids" within the given database
                save_to_mongo({"ids" : ids}, database, label + "_ids")
                cursor = response['next_cursor']

            print >> sys.stderr, 'Fetched {0} total {1} ids for {2}'.format(total_ids, label, (user_id or screen_name))
            sys.stderr.flush()

            # Each page of ids has already been persisted to MongoDB above,
            # which provides an additional layer of protection from
            # exceptional circumstances

            if response is None or total_ids >= limit:
                print >> sys.stderr, 'Last cursor', cursor
                print >> sys.stderr, 'Last response', response
                break

# Sample usage follows...

screen_names = ['SocialWebMining', 'LadyGaga']

twitter_api = oauth_login()

for screen_name in screen_names:

    # friends_limit=0 skips friend ids entirely, and each user's follower ids
    # land in a MongoDB database named after the screen name
    store_friends_followers_ids(twitter_api, screen_name=screen_name,
                                friends_limit=0, database=screen_name)

print "Done"

That’s really all there is to it. We’re now at the point where we can reliably harvest and store arbitrary volumes of Twitter data.
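As a quick sanity check, you can read the data right back out with pymongo (or the load_from_mongo wrapper). The sketch below assumes the naming scheme used by store_friends_followers_ids above: a database per screen name containing a followers_ids collection, where each document holds one page of ids.

import pymongo # pip install pymongo

client = pymongo.MongoClient('localhost', 27017)

screen_name = 'SocialWebMining'
coll = client[screen_name]['followers_ids']

# Each stored document holds one page of up to 5,000 ids, so flatten the
# pages and count the distinct follower ids
follower_ids = set()
for doc in coll.find():
    follower_ids.update(doc['ids'])

print '%s has %d follower ids stored' % (screen_name, len(follower_ids))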

It may be worthwhile to review the prior posts in this series as a reminder of just how far we’ve come. With all of the necessary machinery and prerequisite discussion in place, we’ll return to the original proposition of computing Twitter influence in the next post with an initial review of data for a few well-known Twitter accounts.


3 Comments on “Super Simple Storage for Social Web Data with MongoDB (Computing Twitter Influence, Part 4)”

  1. Matthew, is it possible to make tutorials on how to integrate Django and mining to show the tweets I mined in a web interface?
    The problem here is that Django is highly relational in nature, and even if I were to use http://django-mongodb-engine.readthedocs.org/en/latest/tutorial.html or http://mongoengine-odm.readthedocs.org/en/latest/tutorial.html, I would still need to define model classes with relations (which ruins the advantages of such an approach).
    What I want is to store tweets in MongoDB as JSON and then (without writing models) show them in the admin and user interfaces (maybe only screen names, date, retweet status) – but I still need to save the whole tweet, not only certain fields.
    What’s your advice?

    • To be honest, I probably don’t understand enough about your situation to offer too much prescriptive advice here, and a lot of my initial questions back would be directed at figuring out if you *really* need Django+MongoDB specifically, or if there are existing admin UIs for MongoDB that might work, or other approaches altogether.

      Not sure if this link provides any administration UIs that might be useful to your situation, but I thought I’d go ahead and pass it on in case you hadn’t run across it: http://docs.mongodb.org/ecosystem/tools/administration-interfaces/

      More to your point, though, I don’t have any great suggestions on how to do what you are asking apart from finding an existing admin UI for MongoDB or writing some custom model code.
