Getting Started with Twitter’s API: From Zero to Firehose in ~2.5 Minutes

Mining the Social Web‘s goal is to teach you how to transform curiosity into insight, and its virtual machine features two IPython Notebooks that are designed to get you up and running with Twitter’s API as quickly as possible. The following ~2.5 minute screencast shows how to generate OAuth credentials, establish a Twitter API connection, and make API requests for all sorts of things. By the end of the video, you’ll be able to tap into Twitter’s Streaming API to create filters for @mentions, #hashtags, stock symbols, and more.


This short screencast teaches you how to access Twitter’s API. In less than ~2.5 minutes, you’ll be tapping into the Streaming API to query for screen names, hashtags, stock tickers, and more.

 

The Chapter 1 (Mining Twitter) notebook provides an orientation and gentle introduction to Twitter’s API while the Chapter 9 (Twitter Cookbook) notebook includes a collection of more than two dozen recipes that are designed to solve recurring problems that typically come up as part of any data analysis. First, follow along with the first notebook and learn the fundamentals. Then, copy, paste, and massage the code in the second notebook to create data processing pipelines as part of your own data science experiments and analyses.
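To give you a sense of just how little code is involved, here’s a minimal sketch of establishing a connection and making a first request. It assumes the Python twitter package used throughout the book and OAuth credentials from an app you’ve created at dev.twitter.com; the search query itself is arbitrary.

import twitter

# XXX: Fill in the credentials for the app you created at dev.twitter.com
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

# A first API request: search for recent tweets matching an arbitrary query
search_results = twitter_api.search.tweets(q='#MiningTheSocialWeb', count=10)
for status in search_results['statuses']:
    print status['text']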

 

Twitter Could Be So Much Better Than An Advertising Company

If you’re a business with enough users, you can probably make some money by placing advertisements. Advertising drives commerce, and commerce is fundamental to a healthy economy. It’s a great and wonderful thing that profits are earned, jobs are created, taxes are paid, and a virtuous cycle develops around the commerce that results from advertising. However, just because Twitter could make a lot of money in advertising doesn’t mean that advertising is where it should concentrate the majority of its efforts or where its most fundamental value proposition lies.

For example, one can now gather from Twitter’s IPO that it’s fundamentally postured as an advertising company, but its real value isn’t in advertising. Twitter’s most fundamental value rests squarely within data analytics. More specifically, Twitter’s most fundamental value is in the overall collective intelligence of its user base when interpreted as an interest graph. Think of an interest graph as a mapping of people to their interests. In other words, if you follow an account on Twitter, what you’re really saying is that you’re interested in that account. Even though there’s lots to be gleaned in all of the little 140 character quips associated with a particular account, there’s a good bit you can tell about a person by solely examining the accounts that the person follows.

Twitter’s most fundamental value lies within the overall collective intelligence of its user base when interpreted as an interest graph.

Remember the old adage “birds of a feather flock together?” It’s true in life, and it’s just as true in the virtual realm. Twitter makes it easy to tap into that data in a  (mostly) developer-friendly manner that lends itself to innovation because the data and terms of use are so liberal. It’s a prime frontier for creating predictive analytics for just about any domain, because the cross-section of that interest graph is so wide and vast. Other popular social web properties have comparable data, but the nice thing about Twitter is that everything occurs (mostly) all out in public, there’s real power in the openness of that kind of data, and a significant fraction of the world’s population uses Twitter.

While it’s certainly the case that Twitter is littered with plenty of low-value content (both human and spambot generated), plenty enough signal can be teased out of the interest graph derived from Twitter’s ~900 million registered accounts (with ~232 million accounts being active monthly) to create fundamental value of mammoth proportions. Twitter can tap into that collective intelligence to bootstrap recommender services and place more effective ads while people banter back and forth, but ads would be a pretty base end game when you consider the nobler purposes that you can achieve with that kind of data.

Twitter should not be an advertising company; ads should just be the means to an end. Twitter should embrace its nobler identity as a data analytics company and focus on the stuff that really matters.

Mining Social Web APIs with IPython Notebook [Slides]

Thanks so much to everyone who attended the Mining Social Web APIs with IPython Notebook workshop. It was really inspiring to see so many of you get your hands dirty hacking on data (as opposed to just talking or thinking about it.) It’s a lot of work to design a 3 hour workshop for such a large and diverse audience, but it was all totally worth it, and I’m grateful for the time that we had together.

If you attended the workshop, I hope that you’ll continue hacking on the code and stay in touch. I’ll do anything that I possibly can to help you be successful. If you couldn’t make it, the slides are available on SlideShare, and I’ll post an update once the video from the session is available.

Gratefully yours,

-MAR

Now Serving: Full-Text Sampler in IPython Notebook Format

The 2nd Edition of Mining the Social Web has officially soft-launched (the “hard launch” is at my Strata workshop next week), and as of late last week you could download either a PDF file or view an ebook excerpt of the first chapter that introduces data mining with Twitter’s API. Additionally, as of just a few hours ago, the full text of the first chapter is now also available as an IPython Notebook (ipynb) file!


Enjoy the full-text of Chapter 1 (Mining Twitter) as an IPython Notebook!

If you’ve been following this project at all, you already know that there’s a turn-key virtual machine that provides all of the sample code from the book in a convenient and interactive IPython Notebook format, so that you can get right to business without so much fuss around Python configuration management and Python development tools. With most of that foundation in place, I then began wondering why the entire text of the book couldn’t be offered in that same convenient format. What makes an IPython Notebook distribution particularly exciting is that it’s such a natural experience to read and interactively execute code examples within the same user interface. While perhaps not everyone would want to consume the book as a collection of ipynb files, there’s certainly a lot of value to be had by offering it as an additional option beyond the standard ebook formats.

What makes an IPython Notebook distribution particularly exciting is that it’s such a natural experience to read and interactively execute code examples within the same user interface.

After a few late nights of XML hacking and a series of conversations with some innovative folks at O’Reilly, we’ve decided to offer the sampler in IPython Notebook format to find out whether there really is demand for an IPython Notebook distribution. If the feedback is substantially positive, then there’s a reasonable chance we’ll find a way to offer the full text in IPython Notebook format. You should definitely leave a comment or tweet your support with the sidebar widget (and mention @OReillyMedia) if this is something you’d like to see happen with this book or any others.

Enjoy, and know that I’m always here coveting your honest feedback, especially in the form of thoughtful book reviews.

How To Harvest Millions of Twitter Profiles Without Violating the ToS (Computing Twitter Influence, Part 3)

In the last post in this continuing series on computing Twitter influence, we developed a wrapper function called make_twitter_request that handles the various sorts of HTTP error codes and network failures that you are likely to experience as you aspire to acquire non-trivial amounts of data from Twitter’s API. Although you are somewhat unlikely to need a wrapper function like make_twitter_request if you are just making a few ad-hoc API requests, you’re guaranteed to experience HTTP error codes when making non-trivial numbers of requests, if for no other reason than exceeding the notorious Twitter API rate limits that allot you a fixed number of requests per rate-limit time interval (currently defined as 15 minutes).

Although it may have seemed like an unnecessary detour, the beauty of make_twitter_request will soon start to shine, because it allows us to write code, walk away, and rest assured that the computer is still hard at work accumulating the data we desire. Without its benefit, you are much more likely to come back to your console only to discover a stack trace that prevented you from getting the data that you would much rather have seen. It’s not fun experiencing these types of errors when they happen halfway into harvesting many millions of followers, because there’s not always a good way to recover and pick back up from the point of failure.

Harvesting Account IDs

In terms of computing Twitter influence, we previously determined that the problem can be framed as a data mining exercise against a collection of followers for an account, so let’s think about how to start harvesting what might be potentially massive numbers of followers. The first step is enumerating the list of follower IDs for a screen name of interest, and Twitter’s GET /followers/ids API does a nice job of taking care of this for you. Given a screen name, it returns up to 5,000 follower IDs per API request, and you are allotted 15 requests per rate-limit window.

When you do the math, you’ll find that you can pull down 75,000 IDs per 15-minute window, 300,000 IDs per hour, and ultimately accrue about 7.2 million user IDs per day. The most popular Twitter users such as @LadyGaGa or @BarackObama have upwards of 40 million followers, so as an upper bound, you’d spend the better part of a week pulling down all of the ID data for one of those accounts.

But do you really need to pull down all of the followers, or can you just request, say, the first N accounts? It depends on the assumptions that you can make about the sample that you’d get by requesting only the first N accounts. As it turns out, the account IDs are currently documented to be returned in the order in which the follow interaction occurred, which means that they are not necessarily in random order.

If you are planning to do some rigorous statistical analysis that is predicated upon random sampling assumptions, you might find that the lack of guarantee in randomness by fetching only the first N accounts just isn’t good enough. If you need guarantees about randomness, you’ll probably want to go ahead and pay the price for harvesting all of an account’s follower IDs so that you can randomly sample from it in the next step, which is using the user ID to fetch an account profile. (All that said, bear in mind that you probably shouldn’t make any rigorous assumptions about the order in which follower IDs are returned either, since the API docs state that it may change at a moment’s notice.)
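If you do pay that price and harvest the full list of follower IDs, the random sampling itself is trivial with Python’s standard library. Here’s a sketch in which the list of IDs and the sample size are just placeholders:

import random

# Placeholder standing in for a full list of follower IDs harvested from the API
followers_ids = range(100000)

SAMPLE_SIZE = 400  # arbitrary; use whatever your analysis calls for

# Sample without replacement from the full population so that the usual
# random sampling assumptions hold for downstream statistics
sampled_ids = random.sample(followers_ids, min(SAMPLE_SIZE, len(followers_ids)))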

Harvesting Account Profiles

Given a collection of account IDs, Twitter’s GET /users/lookup API returns up to 100 profiles per request with an allotted 180 user profile requests per rate limit interval. When you do the math, that works out to be 18,000 profiles per 15-minute interval, which means that you can ultimately collect 72,000 profiles per hour or  up to 1,728,000 account profiles per day.

Let’s take a moment to think about what this means: for the vast majority of Twitter users, you’ll be able to collect all of the profile data that you need in minutes or hours.  Many experiments that involve random samples require little more than 400 items in the sample, but you could easily work with 4,000 or even 40,000 items in the sample without encountering too many problems so far as wait times are concerned so long as you aren’t analyzing ultra-popular users.

Even a popular tech leader such as @timoreilly has right at 1.7 million followers, so it would only require a day or so to collect the totality of his followers’ profiles. The most popular Twitter users such as @LadyGaGa or @BarackObama, however, have upwards of 40 million followers, so you probably want to start a background process on a server or desktop machine that will have a reliable and constant Internet connection, or rely on random sampling to pull full profiles from a collection of account IDs.
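If you’d like to estimate the wait for a particular account before kicking off a harvest, the back-of-the-envelope arithmetic is easy enough to capture in a few lines. The sketch below just encodes the rate limits described above, which are of course subject to change:

import math

def estimate_harvest_days(num_followers):
    # GET /followers/ids: 5,000 IDs per request, 15 requests per 15-minute window
    id_windows = math.ceil(num_followers / (5000.0 * 15))
    # GET /users/lookup: 100 profiles per request, 180 requests per 15-minute window
    profile_windows = math.ceil(num_followers / (100.0 * 180))
    total_minutes = (id_windows + profile_windows) * 15
    return total_minutes / (60 * 24.0)

print estimate_harvest_days(1700000)   # roughly a day for an account like @timoreilly
print estimate_harvest_days(40000000)  # several weeks for the most popular accounts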

Sample Code

Conceptually, pulling all of the follower IDs or profiles for an account is just a couple of tight loops around make_twitter_request as previously described. Examples 9-19 and 9-17 from Mining the Social Web introduce the get_friends_followers_ids and get_user_profile functions that take care of the heavy lifting for these tasks as part of a “Twitter Cookbook.” Sample invocations that illustrate how to use these functions follow. (Take a look at the full source code for the Twitter Cookbook for all of the details.)

# Create an API connection
twitter_api = oauth_login()

# Pull all of the friend and follower IDs for an account
friends_ids, followers_ids = get_friends_followers_ids(twitter_api, screen_name="ptwobrussell")

# XXX: Store the ids...

# Pull all of the profiles for friends/followers
friends_profiles = get_user_profile(twitter_api, user_ids=friends_ids)
followers_profiles = get_user_profile(twitter_api, user_ids=followers_ids)

# XXX: Store the profiles...

Next Time

Did you notice that the sample invocations define variables like friends_ids or followers_profiles that could potentially contain far too much data to hold in memory and blow the heap? In the next post, we’ll wrap up the data collection process by introducing MongoDB, a document-oriented database that’s ideal for storing the kind of JSON data that’s returned by Twitter’s API, and use it to ensure that memory requirements for our data collection process remain modest. We’ll also package up all of the code that’s been introduced to that point into a convenient general-purpose utility that you can easily invoke to harvest data with little more than a few keystrokes.

Having then aspired to compute influence and acquired the necessary data, we’ll be able to analyze and summarize our findings as part of our 4-step general-purpose framework for mining social web data like a pro.

Why Is Twitter All the Rage?

Next week, I’ll be presenting a short webcast entitled Why Twitter Is All the Rage: A Data Miner’s Perspective that is loosely adapted from material that appears early in Mining the Social Web (2nd Ed). Given that the webcast is now less than a week away, I wanted to share out the content that inspired the topic. The remainder of this post is a slightly abridged reproduction of a section that appears early in Chapter 1. If you enjoy it, you can download all of Chapter 1 as a free PDF to learn more about mining Twitter data.

Why Is Twitter All the Rage?

How would you define Twitter?

There are many ways to answer this question, but let’s consider it from an overarching angle that addresses some fundamental aspects of our shared humanity that any technology needs to account for in order to be useful and successful. After all, the purpose of technology is to enhance our human experience.

As humans, what are some things that we want that technology might help us to get?

  • We want to be heard.
  • We want to satisfy our curiosity.
  • We want it easy.
  • We want it now.

In the context of the current discussion, these are just a few observations that are generally true of humanity. We have a deeply rooted need to share our ideas and experiences, which gives us the ability to connect with other people, to be heard, and to feel a sense of worth and importance. We are curious about the world around us and how to organize and manipulate it, and we use communication to share our observations, ask questions, and engage with other people in meaningful dialogues about our quandaries.

The last two bullet points highlight our inherent intolerance to friction. Ideally, we don’t want to have to work any harder than is absolutely necessary to satisfy our curiosity or get any particular job done; we’d rather be doing “something else” or moving on to the next thing because our time on this planet is so precious and short. Along similar lines, we want things now and tend to be impatient when actual progress doesn’t happen at the speed of our own thought.

One way to describe Twitter is as a microblogging service that allows people to communicate with short, 140-character messages that roughly correspond to thoughts or ideas. In that regard, you could think of Twitter as being akin to a free, high-speed, global text-messaging service. In other words, it’s a glorified piece of valuable infrastructure that enables rapid and easy communication. However, that’s not all of the story. It doesn’t adequately address our inherent curiosity and the value proposition that emerges when you have over 500 million curious people registered, with over 100 million of them actively engaging their curiosity on a regular monthly basis.

Besides the macro-level possibilities for marketing and advertising—which are always lucrative with a user base of that size—it’s the underlying network dynamics that created the gravity for such a user base to emerge that are truly interesting, and that’s why Twitter is all the rage. While the communication bus that enables users to share short quips at the speed of thought may be a necessary condition for viral adoption and sustained engagement on the Twitter platform, it’s not a sufficient condition. The extra ingredient that makes it sufficient is that Twitter’s asymmetric following model satisfies our curiosity. It is the asymmetric following model that casts Twitter as more of an interest graph than a social network, and the APIs that provide just enough of a framework for structure and self-organizing behavior to emerge from the chaos.

In other words, whereas some social websites like Facebook and LinkedIn require the mutual acceptance of a connection between users (which usually implies a real-world connection of some kind), Twitter’s relationship model allows you to keep up with the latest happenings of any other user, even though that other user may not choose to follow you back or even know that you exist. Twitter’s following model is simple but exploits a fundamental aspect of what makes us human: our curiosity. Whether it be an infatuation with celebrity gossip, an urge to keep up with a favorite sports team, a keen interest in a particular political topic, or a desire to connect with someone new, Twitter provides you with boundless opportunities to satisfy your curiosity.

Think of an interest graph as a way of modeling connections between people and their arbitrary interests. Interest graphs provide a profound number of possibilities in the data mining realm that primarily involve measuring correlations between things for the objective of making intelligent recommendations and other applications in machine learning. For example, you could use an interest graph to measure correlations and make recommendations ranging from whom to follow on Twitter to what to purchase online to whom you should date. To illustrate the notion of Twitter as an interest graph, consider that a Twitter user need not be a real person; it very well could be a person, but it could also be an inanimate object, a company, a musical group, an imaginary persona, an impersonation of someone (living or dead), or just about anything else.

For example, the @HomerJSimpson account is the official account for Homer Simpson, a popular character from The Simpsons television show. Although Homer Simpson isn’t a real person, he’s a well-known personality throughout the world, and the @HomerJSimpson Twitter persona acts as a conduit for him (or his creators, actually) to engage his fans. Likewise, although this book will probably never reach the popularity of Homer Simpson, @SocialWebMining is its official Twitter account and provides a means for a community that’s interested in its content to connect and engage on various levels. When you realize that Twitter enables you to create, connect, and explore a community of interest for an arbitrary topic of interest, the power of Twitter and the insights you can gain from mining its data become much more obvious.

There is very little governance of what a Twitter account can be aside from the badges on some accounts that identify celebrities and public figures as “verified accounts” and basic restrictions in Twitter’s Terms of Service agreement, which is required for using the service. It may seem very subtle, but it’s an important distinction from some social websites in which accounts must correspond to real, living people, businesses, or entities of a similar nature that fit into a particular taxonomy. Twitter places no particular restrictions on the persona of an account and relies on self-organizing behavior such as following relationships and folksonomies that emerge from the use of hashtags to create a certain kind of order within the system.

==

If you found this content interesting and want to learn more about how to mine Twitter and other social media data, you can download all of Chapter 1 as a free PDF.

All source code for the book is available at GitHub and screencasts are available to help get you started as part of the book’s turn-key virtual machine experience.

Writing Paranoid Code (Computing Twitter Influence, Part 2)

In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some relevant variables to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.

Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts may not have enough followers that the total number of requests to each API resource presents rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcements that manifest as HTTP errors in RESTful APIs.
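A quick sketch of that arithmetic, with nothing Twitter-specific beyond the batch sizes just mentioned:

from math import ceil

def api_calls_needed(num_followers):
    # ceiling(X/5000) calls to GET /followers/ids (5,000 IDs per response)
    ids_calls = int(ceil(num_followers / 5000.0))
    # ceiling(X/100) calls to GET /users/lookup (100 profiles per response)
    lookup_calls = int(ceil(num_followers / 100.0))
    return ids_calls, lookup_calls

print api_calls_needed(10000)    # (2, 100) -- no rate-limiting worries here
print api_calls_needed(2000000)  # (400, 20000) -- plan on being rate limited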

Although it seems more satisfying to have all of the data you could ever want, you really should ask yourself if you really need every follower profile for an account of interest, or if a sufficiently large random sample will do. However, be advised that in order to truly collect a random sample of followers for an account, you must sample from the full population of all follower IDs as opposed to just taking the first N follower IDs. The reason is that Twitter’s API docs state that IDs are currently returned with “the most recent following first” but the order may change with little to no notice. Even in the latter case, there’s no expectation or guarantee of randomness. We’ll revisit this topic in the next post in which we begin harvesting profiles.

Write Paranoid Code

Only a few things are guaranteed in life: taxes, death, and that you will encounter inconvenient HTTP error codes when trying to acquire remote data. It’s never quite as simple as assuming that there won’t be any “unexpected” errors associated with code that makes network requests, because the very nature of making calls to a remote web server inherently introduces the possibility of failure.

Only a few things are guaranteed in life: taxes, death, and that you will encounter inconvenient HTTP error codes when trying to acquire remote data.

In order to successfully harvest non-trivial amounts of remote data, you must employ robust code that expects errors to happen as a normal occurrence as opposed to being an exceptional case that “probably won’t happen.” Write code that expects a mysterious kind of network error to crop up somewhere deep in the guts of the underlying HTTP library that you are using, be prepared for service disruptions such as Twitter’s “fail whale,” and by all means, ensure that your code accounts for rate limiting and all of the other well-documented HTTP error codes that the API documentation provides.

Finally, ensure that you don’t lose data if your code fails despite your best efforts: persist the data returned from each request as it arrives so that your code doesn’t run for an extended duration only to fail and leave you with nothing at all to show for it. Persisting as you go also means that you can often recover by restarting from the point of failure instead of starting from scratch. For what it’s worth, I’ve found that consistently writing code that behaves this way is a little easier said than done, but like anything else, it gets easier with a bit of practice.
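Here’s a sketch of what that persist-as-you-go discipline might look like. It appends each response to a flat file purely for illustration (later posts in this series use MongoDB instead), and fetch_next_batch is a hypothetical placeholder for whatever function actually makes the request:

import json

def harvest_and_persist(fetch_next_batch, filename='followers.json'):
    # fetch_next_batch is any zero-argument function that makes one API request
    # (for example, a call wrapped with make_twitter_request) and returns a
    # JSON-serializable response, or None when there is nothing left to fetch
    with open(filename, 'a') as f:
        while True:
            batch = fetch_next_batch()
            if batch is None:
                break
            # Persist immediately so that a crash never costs more than the
            # single in-flight request
            f.write(json.dumps(batch) + '\n')
            f.flush()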

Making Paranoid Twitter API Requests

Example 9-16 [viewable IPython Notebook link from Mining the Social Web’s GitHub repository] presents a pattern for making paranoid Twitter API requests and is reproduced below. It accounts for the HTTP errors in Twitter’s API documentation as well as a couple of other errors (such as urllib2’s infamous BadStatusLine exception) that sometimes appear, seemingly without rhyme or reason. Take a moment to study the code to see how it works.

import sys
import time
from urllib2 import URLError
from httplib import BadStatusLine
import json
import twitter

def oauth_login():
    # XXX: Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information
    # on Twitter's OAuth implementation.

    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw):

    # A nested helper function that handles common HTTPErrors. Return an updated
    # value for wait_period if the problem is a 500 level error. Block until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which requires special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):

        if wait_period > 3600: # Seconds
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e

        # See https://dev.twitter.com/docs/error-codes-responses for common codes

        if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        elif e.e.code == 429:
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying in 15 minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function

    wait_period = 2
    error_count = 0

    while True:
        try:
            return twitter_api_func(*args, **kw)
        except twitter.api.TwitterHTTPError, e:
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError, e:
            error_count += 1
            print >> sys.stderr, "URLError encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise
        except BadStatusLine, e:
            error_count += 1
            print >> sys.stderr, "BadStatusLine encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise

# Sample usage

twitter_api = oauth_login()

# See https://dev.twitter.com/docs/api/1.1/get/users/lookup for
# twitter_api.users.lookup

response = make_twitter_request(twitter_api.users.lookup,
                                screen_name="SocialWebMining")

print json.dumps(response, indent=1)

In the next post, we’ll continue the conversation by using make_twitter_request to acquire account profiles so that the data science/mining can begin. Stay tuned!

===

If you missed the first post in this series (Computing Twitter Influence, Part 1: Arriving at a Base Metric), you can find it here.

Read more about the journey of authoring Mining the Social Web, 2nd Edition and how I tried to apply lean practices to make it the best possible product for you in Reflections on Authoring a Minimum Viable Book.

Mining the Social Web Like a Pro: Four Steps to Success [Slides]

[Update – 8 October 2013: The data journalism team at La Nación expanded upon the analysis presented in the slides and put together a really nice article that tells a story about the data. Definitely check it out, and if you don’t read Spanish, try translating with Chrome or paste the URL into Google Translate.]

I had the pleasure of presenting Mining the Social Web Like a Pro: Four Steps to Success [slides link] to a group of incredibly talented Latin American investigative journalists from nearly a dozen countries these past two days in Quito, Ecuador as part of an annual GDA seminar. The entire experience has been remarkable in so many ways. I attended the seminar to present as an invited speaker, but am definitely leaving with a lot of new ideas and fresh perspective on data, investigative journalism, and so much more.

In addition to presenting and learning an incredible amount of new and exciting information about data-oriented and investigative journalism, I was also able to move an emerging project forward as a developer/coach during a brief hack-a-thon, I dusted off my Español (it’s amazing what two days of full immersion will do for you), and I made a lot of new friends.

If we met at the seminar, let’s please stay in touch. It was a humbling experience to meet you all and be part of such a terrific community.

You can download my slides at SlideShare; however, keep in mind that there’s a bit of context missing since the slides were designed as a guide for the presentation as opposed to being a comprehensive reference. Leave a comment or send a tweet using the sidebar widget if you have questions.

Gracias.

Arriving at a Base Influence Metric (Computing Twitter Influence, Part 1)

This post introduces a series that explores the problem of approximating a Twitter account’s influence. With the ubiquity of social media and its effects on everything from how we shop to how we vote at the polls, it’s critical that we be able to employ reasonably accurate and well-understood measurements for approximating influence from social media signals.

[ 24 Sept 2013 – Made a few light edits in preparation for a cross-post on the O’Reilly Programming Blog]

Unlike social networks such as LinkedIn and Facebook in which connections between entities are symmetric and typically correspond to a real world connection, Twitter’s underlying data model is fundamentally predicated upon asymmetric following relationships. Another way of thinking about a following relationship is to consider that it’s little more than a subscription to a feed about some content of interest. In other words, when you follow another Twitter user, you are expressing interest in that other user and are opting in to whatever content that user would like to place in your home timeline. As such, Twitter’s underlying network structure can be interpreted as an interest graph and mined for insights about the relative popularity of one user when compared to another.

…Twitter’s underlying network structure can be interpreted as an interest graph…

There is tremendous value in being able to apply competitive metrics for identifying key influencers, and there’s no better time to get started than right now since you can’t improve something until after you’re able to measure it. Before we can put some accounts under the microscope and start measuring influence, however, we’ll need to think through the problem of arriving at a base metric.

Subtle Variables Affecting a Base Metric

The natural starting point for approximating a Twitter account’s influence is to simply consider its number of followers. After all, it’s reasonable to think that the more followers an account has accumulated, the more popular it must be in comparison to some other account. On the surface, this seems fine, but it doesn’t account for a few subtle variables that turn out to be critical once you begin to really understand the data. Consider the following subtle variables (amongst many others) that affect “number of followers” as a base metric:

  1. Spam bot accounts that effectively are zombies and can’t be harnessed for any utility at all
  2. Inactive or abandoned accounts that can’t influence or be influenced since they are not in use
  3. Accounts that follow so many other accounts that the likelihood of getting noticed (and thus influencing) is practically zero
  4. The network effects of retweets by accounts that are active and can be influenced to spread a message

Even though some non-trivial caveats exist, the good news is that we can take all of these variables into account and still arrive at a reasonable set of features from the data that could be implemented, measured, and improved as an influence metric. Let’s consider each of these issues and think about how to appropriately handle them.

Forging a Base Metric

The cases of (1) and (2) present what is effectively the same challenge with regard to computing an influence score, and although there’s not a single API that we can use to detect whether or not an account is a spam bot or inactive, we can use some simple heuristics that turn out to work remarkably well for determining if an account is effectively irrelevant. For example, if an account is following fewer than X accounts, hasn’t tweeted in Y days, or hasn’t retweeted any other account more than Z times (or some combination thereof), then it’s probably not an account of relevance in predicting influence. Reasonable initial values for parameterizing the heuristics might be some weighting of X=10, Y=30, and Z=2; however, it will take some data science experiments to arrive at optimal values.
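To make that concrete, here’s a sketch of what such a heuristic filter might look like. The profile fields follow Twitter’s standard user object, while num_retweets_of_others is a hypothetical value that you’d compute separately from the account’s recent timeline, since it isn’t part of the profile itself:

from datetime import datetime

X, Y, Z = 10, 30, 2  # initial guesses; tune these with your own experiments

def is_probably_irrelevant(profile, num_retweets_of_others):
    # (1)/(2): accounts following almost no one, or silent for more than Y days
    if profile['friends_count'] < X:
        return True
    last_tweet = profile.get('status')
    if last_tweet is None:
        return True
    created_at = datetime.strptime(last_tweet['created_at'],
                                   '%a %b %d %H:%M:%S +0000 %Y')
    if (datetime.utcnow() - created_at).days > Y:
        return True
    # Accounts that essentially never retweet anyone else
    if num_retweets_of_others < Z:
        return True
    return False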

In the case of (3), we can also take into account the total number of retweets associated with the account and even home in on whether it has ever retweeted the other account in question. For example, if a very popular account is following you, but it’s also following tens of thousands of other people (or more) and seldom (or never) retweets anyone (especially you), then you probably shouldn’t count on influencing it with any reasonable probability.

By the way, this shouldn’t surprise you; it’s just not humanly possible to do much with Twitter’s chronologically-oriented view of tweets as displayed in a home timeline. However, despite the sheer lack of the home timeline’s usability for following more than trivial numbers of users, Twitter does offer a coping mechanism: you can organize users of interest into lists and monitor the lists as opposed to the home timeline. The number of times a user is “listed” is certainly an important variable worth keeping in mind during data science experiments to arrive at an influence metric. (However, be advised that spam bots are increasingly using it as well these days as a means of getting noticed.)

In the case of (4), it would be remiss not to consider network effects such as what happens when you get retweeted, because this can completely change the dynamics of the situation. For example, even though an account of interest might have relatively few followers of its own, all it takes is for one of those followers to be popular enough for a retweet to light the initial spark and reach a larger audience. Consider the case in which an account has fewer than 100 followers, but one or more of those followers have tens of thousands of their own followers and opts to retweet as a case in point.

…even though an account of interest might have relatively few followers of its own, all it takes is for one of those followers to be popular enough for a retweet to light the initial spark and reach a larger audience…

As a final consideration, let’s just go ahead and acknowledge the serendipity of Twitter. The percentage of “active” followers who will probably even see any particular tweet for someone that they’re not very intentionally keeping up with is generally going to be a small fraction of what is theoretically possible. After all, most people have a lot more to do in life than carefully and thoughtfully monitor Twitter feeds. Furthermore, the popular users that would create the most significant network effects from a retweet must have done something to earn their “popular” status, which probably means that they’re quite busy and are unlikely to notice any given tweet on any given day.

To make matters worse, even if they do notice your tweet, they may opt to mark it as a “favorite” instead of retweeting it, which is another variable that we should consider in arriving at a base metric. Getting “favorited” is certainly a compliment, is useful data to consider for certain analytics, and serves a purpose of validation; however, its secondary effects don’t compare to those of a retweet because favorites receive comparatively little visibility.

Next Time

In the next post, we’ll introduce some turn-key example code for making robust Twitter requests in preparation to acquire and store all of the follower profiles for one or more users of interest so that we can eventually mine the profiles and try out some variations of our follower metric. Stay tuned…

Surprising Stats From Mining One Million Tweets About #Syria

I’ve been filtering Twitter’s firehose for tweets about “#Syria” for about the past week in order to accumulate a sizable volume of data about an important current event. As of Friday, I noticed that the tally has surpassed one million tweets, so it seemed to be a good time to apply some techniques from Mining the Social Web and explore the data.

While some of the findings from a preliminary analysis confirm common intuition, others are a bit surprising. The remainder of this post explores the tweets with a cursory analysis addressing the “Who?, What?, Where?, and When?” of what’s in the data.

If you haven’t been keeping up with the news about what’s happening in Syria, you might benefit from a piece by the Washington Post entitled 9 questions about Syria you were too embarrassed to ask as helpful background knowledge.

Filtering the Firehose

In addition to an introduction for mining Twitter data that’s presented in Chapter 1 (Mining Twitter) of Mining the Social Web, 2nd Edition, a cookbook of more than two dozen recipes for mining Twitter data is featured in Chapter 9 (Twitter Cookbook.) The recipes are fairly atomic and designed to be composed as simple building blocks that can be copied, pasted, and minimally massaged in order to get you on your way. (In a nutshell, that’s actually the purpose of the entire book for the broader social web: to give you the tools that you need to transform curiosity into insight as quickly and easily as possible.)

You can adapt concepts from three primary recipes to filter and archive tweets from Twitter’s firehose:

  • Accessing Twitter’s API for Development Purposes (Example 9-1)
  • Saving and Accessing JSON Data with MongoDB (Example 9-7)
  • Sampling the Twitter Firehose with the Streaming API (Example 9-8)

Although there is a little bit of extra robustness you may want to add to the code for certain exceptional circumstances, the essence of the combined recipes is quite simple as expressed in the following Python code example:

import twitter
import pymongo

# Our query of interest
q = '#Syria'

# See Example 9-1 (Accessing Twitter's API...) for API access
twitter_stream = twitter.TwitterStream(auth=twitter.oauth.OAuth(...))

# See https://dev.twitter.com/docs/streaming-apis for more options
# in filtering the firehose
stream = twitter_stream.statuses.filter(track=q)

# Connect to the database
client = pymongo.MongoClient()
db = client["StreamingTweets"]

# Read tweets from the stream and store them to the database
for tweet in stream:
    db["Syria"].insert(tweet)

In other words, you just request authorization, filter the public stream for a search term, and stash away the results to a convenient medium like MongoDB. It really is that easy! In just over a week, I’ve collected in excess of one million tweets (and counting) using exactly this approach, and you could do the very same thing for any particular topic that interests you. (See the IPython Notebook featuring the Twitter Cookbook for all of the finer details.)

As discussed at length in Chapter 1 of Mining the Social Web, there is roughly 5KB of metadata that accompanies those 140 characters that you commonly think of as a tweet! A few of the metadata fields that we’ll leverage as part of exploring the data in this post include:

  • Who: The author’s screen name and language
  • What: Tweet entities such as #hashtags, @mentions, and URLs
  • Where: Geo-Coordinates for where the tweet was authored
  • When: The date and time the tweet was authored

Of course, there are some other details tucked away in the 5KB of metadata that could also be useful, but we’ll limit ourselves to just using these fields for this post. The goal is just to do some initial exploration and compute some basic statistics as opposed to computing anything scholarly or definitive.
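As a concrete sketch of what that looks like in code, here’s how those four fields can be plucked out of tweets stored in MongoDB; the field names follow Twitter’s v1.1 tweet object, and the database and collection names match the earlier example:

import pymongo

client = pymongo.MongoClient()
db = client["StreamingTweets"]

for tweet in db["Syria"].find().limit(5):
    who = (tweet['user']['screen_name'], tweet['user']['lang'])  # Who
    what = tweet['entities']          # What: #hashtags, @mentions, and URLs
    where = tweet.get('coordinates')  # Where: a GeoJSON point, or None
    when = tweet['created_at']        # When: e.g. "Tue Sep 03 13:08:45 +0000 2013"
    print who, when, where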

…there is roughly 5KB of metadata that accompanies those 140 characters that you commonly think of as a tweet…

The remainder of this section presents some of the initial findings from mining the data with Python and techniques from the social web mining toolbox.

Who?

The underlying frequency distribution for the authors of the tweets reveals that just over 305,000 accounts contributed the 1.1 million tweets in the data set. The distribution, shown below, reveals a long tail, with certain accounts at the head of the curve contributing highly disproportionate numbers of tweets to the overall aggregate.


The frequency distribution for authors contributing tweets about #Syria reveals a long tail with certain accounts  contributing highly disproportionate numbers of tweets. The x-axis is the rank of each screen name in descending order of frequency, and the y-axis is the number of tweets authored by that screen name.

A closer inspection of the accounts contributing disproportionate numbers of tweets reveals that the top accounts (such as RT3Syria as shown below) appear to be bots that are retweeting anything and everything about Syria. This finding makes sense given that it is unlikely that any human being could author hundreds of meaningful tweets a day.

On the chart, notice that at around the 100,000th rank, the number of tweets per author reaches one, which means that 200,000 of the 1.1 million tweets are accounted for by 200,000 unique accounts while the other 900,000 tweets are shared amongst the remaining 100,000 accounts. In aggregate, this means that about 80% of the content is accounted for by one-third of the contributing accounts!
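For the curious, the frequency distribution itself boils down to little more than a Counter over the authors’ screen names. A sketch, again assuming the MongoDB collection from the earlier example:

from collections import Counter
import pymongo

client = pymongo.MongoClient()
db = client["StreamingTweets"]

# Tally up tweets per author across the whole collection
authors = Counter(tweet['user']['screen_name'] for tweet in db["Syria"].find())

# Accounts contributing more than 1,000 tweets, most prolific first
for screen_name, freq in authors.most_common():
    if freq < 1000:
        break
    print screen_name, freq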

A table below displays the frequency information for any screen name contributing more than 1,000 tweets in case you’d like to further investigate these accounts that sit at the head of the curve.

Screen Name Frequency
RT3Syria 4068
SyriaTweetEn 3546
irane_Azad 3146
Neda30 2339
AzadiIran92 2164
FreeIran9292 2123
IraneAzad_92 2062
Logunov_Daniil 2053
Abdirizak2327 1940
tintin1957 1801
SyrianRevo 1657
kokoyxx_xxx2903 1646
RT3Iraq 1644
4VictoryInIran 1592
shiiraaryare 1572
ILA_2013 1537
FreeMyIran 1487
dictatorpost 1434
AajelSyria 1422
Mojahedineng 1354
NewIranFree 1314
TheVoiceArtist 1303
EqlF07 1302
17febpage 1288
YallaSouriya 1256
mog7546 1246
KalamoonNews 1239
Iran1392Azad 1225
USRadioNews 1191
Opshy 1181
RobotsforObama 1175
Victory92Iran1 1143
ErwinFilbert 1103
FamousDraft 1086
SyriaTwitte 1079
Iran1392Victory 1067
AnonAlgeria 1064
monitor_view 1015
HistoryWhite 1012
watchman_A9 1009

Another interesting aspect of exploring who is contributing to #Syria tweets is to examine the language of the person who is tweeting. The following chart shows that the vast majority of the tweets are written in English and Arabic. However, a separate breakdown excluding English and Arabic is also provided to give a sense of the representation from other languages.

The majority of tweets about #Syria are written in English and Arabic

The vast majority of tweets about #Syria are written in English with Arabic coming in second place with around 8.5%.


A breakdown of the ~33,000 tweets that were not written in English or Arabic.

For curiosity’s sake, the following table conveys frequency information for any language that appeared more than 100 times across the 1.1 million tweets.

Language Frequency
English 985437
Arabic 94777
Spanish 6733
German 5247
Indonesian 4500
French 3342
Turkish 2400
Slovak 2313
Italian 2228
Japanese 1036
Vietnamese 967
Russian 869
Polish 801
Dutch 643
Greek 469
Slovenian 421
Danish 419
Urdu 357
Persian 289
Norwegian 282
Portuguese 277
Hindi 214
Tagalog 205
Swedish 173
Bulgarian 165
Estonian 140

Given the nature of world news and that the spotlight has very much been on the United States this past week, it is not surprising at all to see that English is by far the dominant language. What is useful about this exercise, however, is being able to provide a quantitative comparison between the number of tweets authored in English and Arabic. It is also a bit surprising to see that other widely spoken languages such as Spanish appear with such low frequency. A closer investigation of the English tweets might be worthwhile to look for traces of “Spanglish” or other mixed-language characteristics.

What?

One of the more useful pieces of metadata that is tucked away in a tweet is a field with the tweet entities such as hashtags, user mentions, and URLs nicely parsed out for easy analysis. In all, there were approximately 44,000 unique hashtags (after normalizing to lowercase) with a combined frequency of 2.7 million mentions, including #Syria (and common variations) itself. There were over 64,000 unique screen names and more than 130,000 unique URLs appearing in the tweets. (It is possible that there are actually fewer unique URLs, since many of the URLs are short links that might resolve to the same address. Additional analysis would be required to make this determination.)
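Tallying the entities is as straightforward as you’d hope, since they come pre-parsed in each tweet. A sketch, using the same MongoDB collection as before; the expanded_url field is used here so that short links are counted by their familiar form:

from collections import Counter
import pymongo

client = pymongo.MongoClient()
db = client["StreamingTweets"]

hashtags, screen_names, urls = Counter(), Counter(), Counter()

for tweet in db["Syria"].find():
    entities = tweet['entities']
    hashtags.update(h['text'] for h in entities['hashtags'])
    screen_names.update(m['screen_name'] for m in entities['user_mentions'])
    urls.update(u.get('expanded_url') or u['url'] for u in entities['urls'])

print hashtags.most_common(10)
print screen_names.most_common(10)
print urls.most_common(10)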

This chart conveys frequencies for the top 100 tweet entities for each category to show the similarity in the characteristics of the distributions.

Co-occurring hashtags, screen names, and URLs with #Syria

Frequencies for  the top 100 co-occurring hashtags, screen names, and URLs with #Syria

Additionally, the following column-oriented table presents a compact view of the top 50 tweet entities for each category that you can review to confirm and challenge intuition about what you’d suspect to be the most frequently occurring tweet entities. (Note that there is no correlation for the items grouped in each row besides the row number itself, which corresponds to overall rank. The format of this table is purely to provide a compact view of the data.)

Hashtag Hashtag Freq Screen Name Screen Name Freq URL URL Freq
Syria 911625 BarackObama 21766 http://bit.ly/16u5nsX 9742
syria 114129 RT_com 12625 http://on.rt.com/0yh5ju 3579
tcot 42097 repjustinamash 11502 http://twitpic.com/adfcos 2624
Obama 41284 IAmWW3 8565 http://www.avaaz.org/en/solution_for_syria_loc/?twi 2591
سوريا 34454 trutherbot 8333 http://buff.ly/18xkbaT 2138
US 26108 SenJohnMcCain 7629 http://bit.ly/182SnLh 1874
Assad 21734 YourAnonNews 6715 http://equalforce.net 1335
Iraq 21518 Refugees 6039 http://bit.ly/17oFmzn 1190
health 20975 StateDept 5246 http://is.gd/joaNyV 1174
egypt 18916 FoxNews 5220 http://is.gd/SQP0KE 1138
SYRIA 18463 WhiteHouse 5061 http://is.gd/UlnFPr 1137
world 18457 SpeakerBoehner 4496 http://www.washingtonpost.com/blogs/post-politics-live/the-senates-syria-hearing-live-updates/?id=ed01ca14-222b-4a23-b12c-c0b0d9d4fe0a 1072
Iran 18317 politico 4237 http://fxn.ws/15r0K1n 1033
politics 18126 SenRandPaul 3776 http://www.hazteoir.org/alerta/53080-se-or-obama-no-otra-matanza-siria 1027
News 17432 AlArabiya_Eng 3731 http://is.gd/czuRAd 989
UN 16796 AJELive 3592 http://ara.tv/52jq9 890
Egypt 16191 RevolutionSyria 3486 http://unhcr.org/522484fc9.html 867
Russia 15847 truthstreamnews 3429 http://is.gd/3196CO 856
USA 15772 JohnKerry 3275 http://is.gd/JZZrwU 803
G20 14685 UN 3182 http://dontattacksyria.com 785
FOX 13039 iyad_elbaghdadi 3143 http://on.rt.com/04i6h3 780
Benghazi 12779 CNN 3138 http://ow.ly/i/32Sr1 779
news 12507 Partisangirl 3127 http://dld.bz/cNYSs 747
Euronews 11941 YoungCons 2933 http://is.gd/RHySNU 710
Headline 11833 AbbyMartin 2817 http://is.gd/0pFjAP 680
Breaking 11643 YouTube 2775 http://is.gd/TUI0Ql 666
fail 11160 ChristiChat 2706 http://is.gd/lKi3WP 632
LONDON 10859 AmbassadorPower 2647 http://pccc.me/17coyWv 604
newsfeed 10102 UNICEF 2618 https://17q.org/7vw77p 576
middleeast 9804 msnbc 2469 http://1.usa.gov/17bAAmd 574
Congress 9414 BBCWorld 2441 http://twitter.com/rx 572
Israel 9312 AnonOpsLegion 2436 http://is.gd/uMgUJo 569
ww3 9085 Politics_PR 2401 http://on.rt.com/ztbwir 537
p2 9043 LouisFarrakhan 2380 http://scriptonitedaily.wordpress.com/2013/09/03/why-is-the-bbc-banging-the-war-drum-on-syria-just-look-who-runs-it/ 522
Kerry 8998 Reuters 2364 http://youtu.be/ODegqpM7usw 519
CNN 8628 guardian 2301 http://www.youtube.com/watch?v=dWVdXuTYlH8 512
HandsOffSyria 8606 BBCBreaking 2219 http://uni.cf/17F4UEI 511
nukes 8599 RickWarren 2165 http://on.rt.com/3j9p5o 510
Lebanon 8537 charlespgarcia 2144 http://www.livetradingnews.com/un-official-syrian-rebels-used-sarin-nerve-gas-assads-army-6636.htm 488
Act2EndAssadsWar 8363 AP 2127 http://www.washingtonpost.com/blogs/worldviews/wp/2013/08/29/9-questions-about-syria-you-were-too-embarrassed-to-ask/ 483
BBC 8329 ChakerKhazaal 2111 http://is.gd/CnDZsT 469
Damascus 8124 David_Cameron 2092 http://is.gd/pysWu3 467
war 8044 tintin1957 2072 http://www.jpost.com/Experts/When-will-the-Muslim-world-stop-blaming-Jews-for-its-problems-325177 465
NoWarWithSyria 7817 JohnFugelsang 1951 http://aje.me/KL34vQ 465
AP 7742 WebsterGTarpley 1945 http://is.gd/mH7tEK 465
Putin 7663 ABC 1901 http://j.mp/14VZneq 462
ABC 7358 GOP 1899 http://act.boldprogressives.org/survey/syria_survey 457
TCOT 7048 piersmorgan 1895 http://is.gd/32VAgs 454
reuters 6858 nycjim 1863 http://is.gd/Hu7B4A 454
Turkey 6477 IngrahamAngle 1827 http://youtu.be/EV5q1gHJcWA 451

It isn’t surprising to see variations of the hashtag “Syria” (including the Arabic translation سوريا) and screen names corresponding to President Obama along with other well-known politicians such as Speaker John Boehner, Senators John McCain and Rand Paul, and Secretary of State John Kerry at the top of the list. In fact, the appearance of these entities is one of the most compelling things about this analysis: it was generated purely from machine readable data with very little effort and could have been completely automated.

…the appearance of these entities is one of the most compelling things about this analysis: it was generated purely from machine readable data with very little effort and could have been completely automated…

The tweet entities, including the URLs, are remarkably fascinating and well worth some extra attention. Although we won’t do it here, a worthwhile followup exercise would be to summarize the content from the webpages and mine it separately by applying Example 24 (Summarizing Link Targets) from Chapter 9 (Twitter Cookbook). These web pages are likely to be the best sources for sentiment analysis, which is one of the holy grails of Twitter analytics and can be quite tricky to detect from the 140 characters of the tweets themselves.

Where?

Of the 1.1 million tweets collected, only approximately 0.5% (~6,000) of them contained geo-coordinates that could be converted to GeoJSON and plotted on a map. The image below links to an interactive map that you can navigate to zoom in on the clusters and view tweet content by geography. You can read more about how the interactive visualization was constructed in a previous post entitled What Are People Tweeting About Syria in Your Neck of the Woods?

Tweet Map

The approximately 0.5% (~6,000) of the 1.1 million tweets that included geo-coordinates are available to explore as an interactive map visualization

Although it may sound a bit low that only 0.5% of all tweets contain geo-coordinates, bear in mind the previous finding that certain “retweet bot” and news accounts are generating massive amounts of content and will probably not include geo-coordinates. (Somewhere on the order of 1% of tweets containing geo-coordinates is also consistent with other analysis I’ve done on Twitter data.)
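Extracting that geo-enabled subset and shaping it into GeoJSON for the map is only a few lines. A sketch, using the same MongoDB collection as before:

import json
import pymongo

client = pymongo.MongoClient()
db = client["StreamingTweets"]

features = []
for tweet in db["Syria"].find({"coordinates": {"$ne": None}}):
    features.append({
        "type": "Feature",
        "geometry": tweet["coordinates"],  # already a GeoJSON point
        "properties": {"text": tweet["text"],
                       "screen_name": tweet["user"]["screen_name"]}
    })

geojson = {"type": "FeatureCollection", "features": features}
print json.dumps(geojson, indent=1)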

What is a bit surprising about the geo-coordinates once you take a closer look (though it then starts to make some amount of sense) is that a small number of geo-enabled accounts generates a disproportionate amount of content, just as we observed earlier with the larger aggregate.

Try zooming in on the area around Berkeley, California, for example, and you’ll notice that there is one particular account, @epaulnet, that is mobile and generated virtually all of the content for the cluster in that region as shown below.

Berkeley, CA

The pattern of some accounts generating disproportionate amounts of content also appears to hold true for tweets that include geo-coordinates. Virtually all of these coordinates around Berkeley, CA are generated by the same account.

A chart displaying the proportionality of each geo-enabled account relative to the frequency of tweets that it produces is shown below and is consistent with previous findings. It is a bit surprising that the amount of content generated by accounts with geo-coordinates enabled is so highly skewed. However, it starts to make some sense once you consider that it conforms to the larger aggregate population.

Geo-Enabled Accounts Frequency

Select users generate highly disproportionate amounts of tweets with geocoordinates. The user @epaulnet as shown in the screenshot of Berkeley, CA above is at the head of this particular curve and generated over 800 tweets in less than one week!

Although omitted for brevity, it is worth noting that the timestamps in which the geo-enabled content was produced also correlated to the larger population in which all other content was produced.

When?

As a final consideration, let’s briefly explore the time of day in which tweets are being authored. The following chart displays the number of tweets authored by hour and is standardized to UTC (London) time. In mentally adjusting the time, recall that London is 5 hours ahead of the East Coast (EST) and 8 hours ahead of the West Coast (PST).

Tweets by time of day

Tweets bucketed by hour of day and standardized to UTC (London) time.

For convenience, the same chart is duplicated below but framed in terms of U.S. East Coast time so that it’s easier to think about it as a continuous 24-hour period without having to “wrap around.”

Tweets by Time

Tweets bucketed by hour of day and standardized to EST (U.S. East Coast) time.
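For reference, here’s a sketch of how this kind of hourly bucketing can be computed from each tweet’s created_at timestamp (UTC hours, same MongoDB collection as before); shifting to another time zone is just an offset on the hour:

from collections import Counter
from datetime import datetime
import pymongo

client = pymongo.MongoClient()
db = client["StreamingTweets"]

# Bucket tweets by the (UTC) hour of day in which they were authored
hours = Counter()
for tweet in db["Syria"].find():
    created_at = datetime.strptime(tweet['created_at'],
                                   '%a %b %d %H:%M:%S +0000 %Y')
    hours[created_at.hour] += 1

for hour in sorted(hours):
    print hour, hours[hour]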

There is clearly an ebb and flow of when tweets are authored with a spread that is well beyond twice the minimum value in the chart. It would appear that most of the tweeting is happening during and after the evening news in London and western Europe, which roughly corresponds to lunchtime across the United States. However, it does seem a bit surprising that there isn’t a similar spike in tweeting after the evening news in the United States.

Closing Thoughts

With just a little bit of pre-planning, you can filter Twitter’s firehose for content pertaining to any topic that interests you and conduct a preliminary analysis just like this one and much more. A key part of making the whole process as easy as it should be is being equipped with the technical know-how and a toolbox that contains the right combination of templates. The GitHub repository for Mining the Social Web, 2nd Edition is jam-packed with useful starting points for mining Twitter as well as Facebook, LinkedIn, GitHub, and more.

Although this post was just an exploratory effort that initially sized up a non-trivial data set involving more than one million tweets, we also learned a few things along the way and discovered a few anomalies that are worth further investigation. If nothing else, you hopefully enjoyed this content and now have some ideas as to how you could run your own data science experiment.

===

If you enjoyed this post that featured a sliver of what you can begin to do with Twitter data, you may also enjoy the broader story of social web mining as chronicled in a 400+ page book that’s designed to be the “premium support” for the open source project that’s on GitHub. You can purchase the DRM-free ebook directly from O’Reilly and receive free updates for life.

Read more about the journey of authoring Mining the Social Web, 2nd Edition and how I tried to apply lean practices to make it the best possible resource for mainstream data mining in Reflections on Authoring a Minimum Viable Book.
