Writing Paranoid Code (Computing Twitter Influence, Part 2)
In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.
Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts may not have enough followers for the total number of requests to either API resource to present rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcements, which manifest as HTTP errors in RESTful APIs.
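The back-of-the-envelope arithmetic can be sketched as follows (a minimal illustration in modern Python; the batch sizes of 5,000 and 100 come from the API documentation cited above):

```python
import math

def requests_needed(num_followers):
    """Estimate the API calls needed to fetch all follower profiles:
    GET /followers/ids returns up to 5,000 IDs per response, and
    GET /users/lookup resolves up to 100 profiles per response."""
    ids_calls = math.ceil(num_followers / 5000)
    lookup_calls = math.ceil(num_followers / 100)
    return ids_calls, lookup_calls

# An account with 1,000,000 followers requires 200 + 10,000 requests,
# which quickly runs into the per-15-minute rate-limit windows.
print(requests_needed(1000000))  # (200, 10000)
```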
Although it seems more satisfying to have all of the data you could ever want, you really should ask yourself whether you need every follower profile for an account of interest, or whether a sufficiently large random sample will do. Be advised, however, that in order to truly collect a random sample of followers for an account, you must sample from the full population of follower IDs, as opposed to just taking the first N follower IDs. The reason is that Twitter’s API docs state that IDs are currently returned with “the most recent following first,” but that the order may change with little to no notice. In either case, there’s no expectation or guarantee of randomness. We’ll revisit this topic in the next post, in which we begin harvesting profiles.
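To make the sampling point concrete, here’s a minimal sketch in modern Python (the function name and the stand-in list of follower IDs are hypothetical, for illustration only) of drawing a uniform random sample from the full ID population rather than slicing off the first N:

```python
import random

def sample_follower_ids(all_follower_ids, n, seed=None):
    """Draw a uniform random sample of n follower IDs from the *full*
    population. Taking all_follower_ids[:n] instead would bias the
    sample toward the most recent followers, since the API currently
    returns IDs with the most recent following first."""
    rng = random.Random(seed)
    if n >= len(all_follower_ids):
        return list(all_follower_ids)
    return rng.sample(all_follower_ids, n)

# e.g. sample 1,000 IDs out of the full population for profile lookups
ids = list(range(100000))  # stand-in for the harvested follower IDs
sample = sample_follower_ids(ids, 1000, seed=42)
```

Passing a seed makes the sample reproducible across runs, which is handy if a harvesting job has to be restarted.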
Write Paranoid Code
Only a few things are guaranteed in life: taxes, death, and that you will encounter inconvenient HTTP error codes when trying to acquire remote data. It’s never quite as simple as assuming that there won’t be any “unexpected” errors associated with code that makes network requests, because the very nature of making calls to a remote web server inherently introduces the possibility of failure.
In order to successfully harvest nontrivial amounts of remote data, you must employ robust code that expects errors to happen as a normal occurrence, as opposed to treating them as exceptional cases that “probably won’t happen.” Write code that expects a mysterious network error to crop up somewhere deep in the guts of the underlying HTTP library you are using, be prepared for service disruptions such as Twitter’s “fail whale,” and by all means, ensure that your code accounts for rate limiting and all of the other well-documented HTTP error codes in the API documentation.
Finally, ensure that you don’t experience data loss if your code fails despite your best efforts: persist the data returned from each request, so that your code doesn’t run for an extended duration only to fail and leave you with nothing at all to show for it. That way, you can easily recover by restarting from the point of failure as opposed to starting over from scratch. For what it’s worth, I’ve found that consistently thinking about writing code that behaves this way is a little easier said than done, but like anything else, it gets easier with a little bit of practice.
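One lightweight way to get that kind of crash-resilience, sketched below in modern Python under the assumption that each response is a JSON-serializable batch of records (the filename and helper functions are hypothetical, not part of the book’s code), is to append every batch to a file on disk as soon as it arrives, so that a failed run can resume from the last persisted batch rather than from scratch:

```python
import json
import os

def persist_batch(batch, path='follower_profiles.json'):
    """Append one API response's worth of records to a JSON-lines file
    immediately, so nothing already downloaded is lost on a crash."""
    with open(path, 'a') as f:
        for record in batch:
            f.write(json.dumps(record) + '\n')

def load_persisted(path='follower_profiles.json'):
    """Reload everything persisted so far, e.g. to resume a failed run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

With something like this in place, a harvesting loop can call load_persisted() on startup and skip any IDs it has already resolved.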
Making Paranoid Twitter API Requests
Example 9-16 [viewable IPython Notebook link from Mining the Social Web’s GitHub repository] presents a pattern for making paranoid Twitter API requests and is reproduced below. It accounts for the HTTP errors in Twitter’s API documentation as well as a couple of other errors (such as urllib2’s infamous BadStatusLine exception) that sometimes appear, seemingly without rhyme or reason. Take a moment to study the code to see how it works.
```python
import sys
import time
from urllib2 import URLError
from httplib import BadStatusLine
import json
import twitter

def oauth_login():
    # XXX: Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information
    # on Twitter's OAuth implementation.

    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw):

    # A nested helper function that handles common HTTPErrors. Return an updated
    # value for wait_period if the problem is a 500 level error. Block until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which requires special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):

        if wait_period > 3600: # Seconds
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e

        # See https://dev.twitter.com/docs/error-codes-responses for common codes

        if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        elif e.e.code == 429:
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying in 15 minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function

    wait_period = 2
    error_count = 0

    while True:
        try:
            return twitter_api_func(*args, **kw)
        except twitter.api.TwitterHTTPError, e:
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError, e:
            error_count += 1
            print >> sys.stderr, "URLError encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise
        except BadStatusLine, e:
            error_count += 1
            print >> sys.stderr, "BadStatusLine encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise

# Sample usage

twitter_api = oauth_login()

# See https://dev.twitter.com/docs/api/1.1/get/users/lookup for
# twitter_api.users.lookup

response = make_twitter_request(twitter_api.users.lookup,
                                screen_name="SocialWebMining")

print json.dumps(response, indent=1)
```
In the next post, we’ll continue the conversation by using make_twitter_request to acquire account profiles so that the data science/mining can begin. Stay tuned!
If you missed the first post in this series (Computing Twitter Influence, Part 1: Arriving at a Base Metric), you can find it here.
Read more about the journey of authoring Mining the Social Web, 2nd Edition and how I tried to apply lean practices to make it the best possible product for you in Reflections on Authoring a Minimum Viable Book.