How To Harvest Millions of Twitter Profiles Without Violating the ToS (Computing Twitter Influence, Part 3)
In the last post in this continuing series on computing Twitter influence, we developed a wrapper function called make_twitter_request that handles the various sorts of HTTP error codes and network failures that you are likely to experience as you aspire to acquire non-trivial amounts of data from Twitter’s API. Although you are somewhat unlikely to need a wrapper function like make_twitter_request if you are just making a few ad-hoc API requests, you’re guaranteed to experience HTTP error codes when making non-trivial numbers of requests, if for no other reason than exceeding the notorious Twitter API rate limits that allot you a fixed number of requests per rate-limit time interval (currently defined as 15 minutes).
Although it may have seemed like an unnecessary detour, the beauty of make_twitter_request will soon start to shine, because it allows us to write code, walk away, and rest assured that the computer is still hard at work accumulating the data we desire. Without its benefit, you are much more likely to come back to your console only to discover a stack trace in place of the data that you would much rather have seen. It’s not fun experiencing these types of errors when they happen half-way into harvesting many millions of followers, because there’s not always a good way to recover and pick back up from the point of failure.
Harvesting Account IDs
In terms of computing Twitter influence, we previously determined that the problem can be framed as a data mining exercise against a collection of followers for an account, so let’s think about how to start harvesting potentially massive numbers of followers. The first step is enumerating the list of follower IDs for a screen name of interest, and Twitter’s GET /followers/ids API does a nice job of taking care of this for you. Given a screen name, it returns up to 5,000 follower IDs per API request, and you are allotted 15 requests per rate-limit window.
When you do the math, you’ll find that you can pull down 75,000 IDs per 15-minute window, 300,000 IDs per hour, and ultimately accrue about 7.2 million user IDs per day. The most popular Twitter users such as @LadyGaGa or @BarackObama have upwards of 40 million followers, so as an upper bound you’d spend the better part of a week pulling down all of the follower IDs for one of those accounts.
But do you really need to pull down all of the followers, or can you just request, say, the first N accounts? It depends on the assumptions that you can make about the sample that you’d get by requesting only the first N accounts. As it turns out, the account IDs are currently documented to be returned in the order in which the follow interaction occurred, which means that they are not necessarily in random order.
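The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the defaults simply restate the limits quoted in this post (5,000 IDs per request, 15 requests per 15-minute window), which Twitter may change.

```python
def ids_harvest_rate(ids_per_request=5000, requests_per_window=15, window_minutes=15):
    """Return follower-ID throughput as (per_window, per_hour, per_day).

    The defaults are the limits quoted in the post, not values read from the API.
    """
    per_window = ids_per_request * requests_per_window
    per_hour = per_window * (60 // window_minutes)
    per_day = per_hour * 24
    return per_window, per_hour, per_day


def days_to_harvest(total_followers, per_day):
    """Estimate how many days it takes to enumerate every follower ID."""
    return total_followers / per_day


per_window, per_hour, per_day = ids_harvest_rate()
print(per_window, per_hour, per_day)

# An account with roughly 40 million followers
print(round(days_to_harvest(40_000_000, per_day), 1))
```

For a 40-million-follower account the estimate comes out to a little over five and a half days, which is where "the better part of a week" comes from.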
If you are planning to do some rigorous statistical analysis that is predicated upon random sampling assumptions, you might find that fetching only the first N accounts just isn’t good enough, because nothing guarantees that those accounts form a random sample. If you need guarantees about randomness, you’ll probably want to go ahead and pay the price for harvesting all of an account’s follower IDs so that you can randomly sample from them in the next step, which is using the user ID to fetch an account profile. (All that said, bear in mind that you probably shouldn’t make any rigorous assumptions about the order in which follower IDs are returned either, since the API docs state that it may change at a moment’s notice.)
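Once the full list of follower IDs is in hand, drawing the random sample is the easy part. Here is a minimal sketch using Python’s standard random module; sample_follower_ids is a hypothetical helper (it is not part of the Twitter Cookbook), and the stand-in list of integers takes the place of real follower IDs.

```python
import random


def sample_follower_ids(follower_ids, sample_size, seed=None):
    """Draw a simple random sample (without replacement) of follower IDs.

    Hypothetical helper for illustration. Passing a seed makes the sample
    reproducible, which is handy when you want to rerun an analysis.
    """
    ids = list(follower_ids)
    rng = random.Random(seed)
    k = min(sample_size, len(ids))  # never ask for more IDs than exist
    return rng.sample(ids, k)


# Stand-in data: pretend these integers are harvested follower IDs
all_ids = list(range(1_000_000))
sample = sample_follower_ids(all_ids, 400, seed=42)
print(len(sample))
```

Because the sample is drawn uniformly from the complete list, it does not depend on the order in which the API happened to return the IDs.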
Harvesting Account Profiles
Given a collection of account IDs, Twitter’s GET /users/lookup API returns up to 100 profiles per request with an allotted 180 user profile requests per rate limit interval. When you do the math, that works out to be 18,000 profiles per 15-minute interval, which means that you can ultimately collect 72,000 profiles per hour or up to 1,728,000 account profiles per day.
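To make the batching concrete, here is a rough sketch of how a list of IDs breaks down into lookup requests and rate-limit windows. Both chunk_ids and profile_lookup_cost are hypothetical helpers written for this illustration, and the defaults (100 IDs per request, 180 requests per 15-minute window) just restate the figures above.

```python
import math


def chunk_ids(user_ids, batch_size=100):
    """Yield successive batches of at most batch_size IDs."""
    for start in range(0, len(user_ids), batch_size):
        yield user_ids[start:start + batch_size]


def profile_lookup_cost(num_ids, batch_size=100, requests_per_window=180, window_minutes=15):
    """Return (requests, windows, minutes) needed to look up num_ids profiles.

    The minutes figure is a rough upper bound that counts every window in full.
    """
    requests = math.ceil(num_ids / batch_size)
    windows = math.ceil(requests / requests_per_window)
    return requests, windows, windows * window_minutes


# 250 IDs split into batches of 100, 100, and 50
print([len(batch) for batch in chunk_ids(list(range(250)))])

# One day's worth of profiles at the quoted limits
print(profile_lookup_cost(1_728_000))
```

Each batch corresponds to one lookup request, so the number of batches divided by 180 tells you how many 15-minute windows you should expect to wait through.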
Let’s take a moment to think about what this means: for the vast majority of Twitter users, you’ll be able to collect all of the profile data that you need in minutes or hours. Many experiments that involve random samples require little more than 400 items in the sample, but you could easily work with 4,000 or even 40,000 items in the sample without encountering too many problems so far as wait times are concerned so long as you aren’t analyzing ultra-popular users.
Even a popular tech leader such as @timoreilly has right at 1.7 million followers, so it would only require a day or so to collect the totality of his followers’ profiles. The most popular Twitter users such as @LadyGaGa or @BarackObama, however, have upwards of 40 million followers, so you probably want to start a background process on a server or desktop machine that will have a reliable and constant Internet connection, or rely on random sampling to pull full profiles from a collection of account IDs.
Conceptually, pulling all of the follower IDs or profiles for an account is just a couple of tight loops around make_twitter_request as previously described. Examples 9-19 and 9-17 from Mining the Social Web introduce the get_user_profile and get_friends_followers_ids functions that take care of the heavy lifting for these tasks as part of a “Twitter Cookbook.” Sample invocations that illustrate how to use these functions follow. (Take a look at the full source code for the Twitter Cookbook for all of the details.)
    # Create an API connection
    twitter_api = oauth_login()

    # Pull all of the friend and follower IDs for an account
    friends_ids, followers_ids = get_friends_followers_ids(twitter_api, screen_name="ptwobrussell")

    # XXX: Store the ids...

    # Pull all of the profiles for friends/followers
    friends_profiles = get_user_profile(twitter_api, user_ids=friends_ids)
    followers_profiles = get_user_profile(twitter_api, user_ids=followers_ids)

    # XXX: Store the profiles...
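If you are curious what one of those “tight loops” looks like, the sketch below shows the cursor-paging pattern in isolation. It is not the Twitter Cookbook implementation: harvest_ids and make_fake_fetcher are hypothetical names, and fetch_page stands in for a rate-limit-aware call to the followers/ids endpoint (for example, one routed through make_twitter_request). The only assumptions about the API are that paging starts with a cursor of -1 and that a next_cursor of 0 marks the last page.

```python
def harvest_ids(fetch_page, limit=None):
    """Collect IDs by following cursors until the API reports no more pages.

    fetch_page is any callable that takes a cursor and returns a dict with
    "ids" and "next_cursor" keys (a stand-in for the real API call).
    """
    ids = []
    cursor = -1  # -1 requests the first page
    while cursor != 0:  # 0 means there are no further pages
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if limit is not None and len(ids) >= limit:
            return ids[:limit]
    return ids


def make_fake_fetcher(all_ids, page_size=5000):
    """Build an in-memory fetch_page for demonstration; no network involved."""
    def fetch_page(cursor):
        start = 0 if cursor == -1 else cursor
        end = start + page_size
        next_cursor = end if end < len(all_ids) else 0
        return {"ids": all_ids[start:end], "next_cursor": next_cursor}
    return fetch_page


fake_fetch = make_fake_fetcher(list(range(12_000)))
print(len(harvest_ids(fake_fetch)))
print(len(harvest_ids(fake_fetch, limit=7_000)))
```

Keeping the API call behind a plain callable makes the loop easy to exercise without touching the network, and the optional limit covers the “first N accounts” case discussed earlier.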
Did you notice that the sample invocations define variables like friends_ids or followers_profiles that could potentially contain far too much data to hold in memory and blow the heap? In the next post, we’ll wrap up the data collection process by introducing MongoDB, a document-oriented database that’s ideal for storing the kind of JSON data that’s returned by Twitter’s API, and use it to ensure that memory requirements for our data collection process remain modest. We’ll also package up all of the code that’s been introduced up to that point into a convenient general-purpose utility that you can easily invoke to harvest data with little more than a few keystrokes.
Having then aspired to compute influence and acquired the necessary data, we’ll be able to analyze and summarize our findings as part of our 4-step general-purpose framework for mining social web data like a pro.