Arriving at a Base Influence Metric (Computing Twitter Influence, Part 1)
This post introduces a series that explores the problem of approximating a Twitter account’s influence. With the ubiquity of social media and its effects on everything from how we shop to how we vote at the polls, it’s critical that we be able to employ reasonably accurate and well-understood measurements for approximating influence from social media signals.
[ 24 Sept 2013 – Made a few light edits in preparation for a cross-post on the O’Reilly Programming Blog]
Unlike social networks such as LinkedIn and Facebook, in which connections between entities are symmetric and typically correspond to a real-world connection, Twitter’s underlying data model is fundamentally predicated upon asymmetric following relationships. Another way of thinking about a following relationship is to consider that it’s little more than a subscription to a feed about some content of interest. In other words, when you follow another Twitter user, you are expressing interest in that user and opting in to whatever content that user would like to place in your home timeline. As such, Twitter’s underlying network structure can be interpreted as an interest graph and mined for insights about the relative popularity of one user when compared to another.
There is tremendous value in being able to apply competitive metrics for identifying key influencers, and there’s no better time to get started than right now since you can’t improve something until after you’re able to measure it. Before we can put some accounts under the microscope and start measuring influence, however, we’ll need to think through the problem of arriving at a base metric.
Subtle Variables Affecting a Base Metric
The natural starting point for approximating a Twitter account’s influence is simply to consider its number of followers. After all, it’s reasonable to think that the more followers an account has accumulated, the more popular it must be in comparison to some other account. On the surface this seems fine, but it doesn’t account for a few subtle variables that turn out to be critical once you begin to really understand the data. Consider the following variables (among many others) that affect “number of followers” as a base metric:
1. Spam bot accounts that effectively are zombies and can’t be harnessed for any utility at all
2. Inactive or abandoned accounts that can’t influence or be influenced since they are not in use
3. Accounts that follow so many other accounts that the likelihood of getting noticed (and thus influencing) is practically zero
4. The network effects of retweets by accounts that are active and can be influenced to spread a message
Even though some non-trivial caveats exist, the good news is that we can take all of these variables into account and still arrive at a reasonable set of features from the data that could be implemented, measured, and improved as an influence metric. Let’s consider each of these issues and think about how to appropriately handle them.
Forging a Base Metric
The cases of (1) and (2) present what is effectively the same challenge with regard to computing an influence score, and although there’s not a single API that we can use to detect whether or not an account is a spam bot or inactive, we can use some simple heuristics that turn out to work remarkably well for determining if an account is effectively irrelevant. For example, if an account is following fewer than X accounts, hasn’t tweeted in Y days, or hasn’t retweeted any other account more than Z times (or some combination thereof), then it’s probably not an account of relevance in predicting influence. Reasonable initial values for parameterizing the heuristics might be some weighting of X=10, Y=30, and Z=2; however, it will take some data science experiments to arrive at optimal values.
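To make the heuristic concrete, here is a minimal sketch in Python. The `FollowerProfile` fields and the `is_probably_irrelevant` helper are hypothetical names chosen for illustration (they are not the Twitter API’s schema), and the sketch assumes the simplest combination of the three tests, a plain "or". The defaults are the initial X, Y, and Z values suggested above, not tuned ones.

```python
from dataclasses import dataclass


@dataclass
class FollowerProfile:
    # Hypothetical fields for illustration; a real pipeline would derive
    # them from whatever profile and timeline data it has collected.
    following_count: int        # accounts this follower follows
    days_since_last_tweet: int  # days since its most recent tweet
    retweets_made: int          # times it has retweeted other accounts


def is_probably_irrelevant(profile, x=10, y=30, z=2):
    """Flag a follower as a likely spam bot or inactive account.

    Assumption: any single failed test is enough to flag the account.
    A weighted combination is an equally valid choice to experiment with.
    """
    return (
        profile.following_count < x            # follows fewer than X accounts
        or profile.days_since_last_tweet >= y  # hasn't tweeted in Y days
        or profile.retweets_made <= z          # no more than Z retweets
    )


followers = [
    FollowerProfile(following_count=3, days_since_last_tweet=400, retweets_made=0),
    FollowerProfile(following_count=250, days_since_last_tweet=1, retweets_made=40),
]

# Keep only the followers that pass all three tests.
relevant = [f for f in followers if not is_probably_irrelevant(f)]
print(len(relevant), "of", len(followers), "followers look relevant")
```

Counting only the followers that survive this filter gives a first adjusted follower count that can be compared against the raw number.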
In the case of (3), we can also take into account the total number of retweets associated with the account and even home in on whether it has ever retweeted the other account in question. For example, if a very popular account is following you, but it’s also following tens of thousands of other people (or more) and seldom (or never) retweets anyone (especially you), then you probably shouldn’t count on influencing it with any reasonable probability.
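As a rough sketch of that reasoning, the check below flags a follower that follows a very large number of accounts, seldom retweets anyone, and has never retweeted you. The function name, its parameters, and both thresholds (`max_following` and `min_retweets`) are assumptions made for illustration. The post does not prescribe specific cutoffs, so treat them as starting points for experiments.

```python
def unlikely_to_be_influenced(following_count, total_retweets, retweets_of_you,
                              max_following=10_000, min_retweets=5):
    """Return True if a follower is unlikely to notice or amplify your tweets.

    Both thresholds are illustrative assumptions, not values from the post.
    """
    follows_too_many = following_count >= max_following
    seldom_retweets = total_retweets < min_retweets
    never_retweeted_you = retweets_of_you == 0
    return follows_too_many and seldom_retweets and never_retweeted_you


# A popular account following 50,000 others that never retweets anyone.
print(unlikely_to_be_influenced(50_000, 0, 0))

# The same account, but it retweets often and has retweeted you twice.
print(unlikely_to_be_influenced(50_000, 300, 2))
```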
By the way, this shouldn’t surprise you; it’s just not humanly possible to do much with Twitter’s chronologically ordered view of tweets in a home timeline once you follow more than a trivial number of users. Twitter does offer a coping mechanism, however: you can organize users of interest into lists and monitor the lists instead of the home timeline. The number of times a user is “listed” is certainly an important variable worth keeping in mind during data science experiments to arrive at an influence metric. (Be advised, however, that spam bots are increasingly using lists these days as a means of getting noticed.)
In the case of (4), it would be remiss not to consider network effects such as what happens when you get retweeted, because this can completely change the dynamics of the situation. For example, even though an account of interest might have relatively few followers of its own, all it takes is for one of those followers to be popular enough for a retweet to light the initial spark and reach a larger audience. As a case in point, consider an account that has fewer than 100 followers, one or more of whom have tens of thousands of followers of their own and opt to retweet.
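A back-of-the-envelope calculation shows how quickly a single retweet changes the picture. The `potential_reach` helper below is a hypothetical name, and its result is only an upper bound: it assumes no overlap between audiences and that every follower could see the tweet, which later considerations in this post suggest is optimistic. The follower counts are made-up numbers for illustration.

```python
def potential_reach(own_followers, retweeter_follower_counts):
    """Upper bound on the accounts that could see a tweet.

    Adds the author's followers to the followers of each account that
    retweets. Overlapping audiences are counted more than once.
    """
    return own_followers + sum(retweeter_follower_counts)


# An account with 80 followers and no retweets.
print(potential_reach(80, []))

# The same account after one follower with 25,000 followers retweets.
print(potential_reach(80, [25_000]))
```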
As a final consideration, let’s just go ahead and acknowledge the serendipity of Twitter. The percentage of “active” followers who will even see any particular tweet from someone they’re not intentionally keeping up with is generally going to be a small fraction of what is theoretically possible. After all, most people have a lot more to do in life than carefully and thoughtfully monitor Twitter feeds. Furthermore, the popular users who would create the most significant network effects from a retweet must have done something to earn their “popular” status, which probably means that they’re quite busy and are unlikely to notice any given tweet on any given day.
To make matters worse, even if they do notice your tweet, they may opt to mark it as a “favorite” instead of retweeting it, which is another variable that we should consider in arriving at a base metric. Getting “favorited” is certainly a compliment, is useful data to consider for certain analytics, and serves a purpose of validation; however, its secondary effects don’t compare to those of a retweet because favorites get comparatively little visibility.
In the next post, we’ll introduce some turn-key example code for making robust Twitter requests in preparation for acquiring and storing all of the follower profiles for one or more users of interest so that we can eventually mine the profiles and try out some variations of our follower metric. Stay tuned…