How To Mine Your GMail with Google Takeout and MongoDB


Google has really been on the up-and-up lately with a service called Google Takeout that allows you to export your data from its cloud. For the thoughtful cloud user who is becoming increasingly concerned about privacy, accidental data loss, or data ownership, this is a product that’s sure to please. Likewise, for the data mining enthusiast, quantified-self number cruncher, or hacker looking for a fun weekend project, Google Takeout is also a great option that enables some good fun.

In a world filled with Twitter, Facebook, and other popular social networks, it’s easy enough to overlook mail data as mundane; however, your mailbox is without a doubt one of the places where you have probably accrued some of the most interesting data over the years. The opening paragraph of Chapter 6 from Mining the Social Web, 2nd Edition is quick to highlight the interestingness of mailbox data and some of the possibilities:

Mail archives are arguably the ultimate kind of social web data and the basis of the earliest online social networks. Mail data is ubiquitous, and each message is inherently social, involving conversations and interactions among two or more people. Furthermore, each message consists of human language data that’s inherently expressive, and is laced with structured metadata fields that anchor the human language data in particular timespans and unambiguous identities.

Although social media sites are racking up petabytes of near-real-time social data, there is still the significant drawback that social networking data is centrally managed by a service provider that gets to create the rules about exactly how you can access it and what you can and can’t do with it. Mail archives, on the other hand, are decentralized and scattered across the Web in the form of rich mailing list discussions about a litany of topics, as well as the many thousands of messages that people have tucked away in their own accounts. When you take a moment to think about it, it seems as though being able to effectively mine mail archives could be one of the most essential capabilities in your data mining toolbox.

The remainder of Chapter 6 goes on to provide a fairly standalone soup-to-nuts primer on the nature of mail data, how to munge it into a convenient mbox format (regardless of its original source), and how to use a document-oriented database like MongoDB to facilitate running analytics and extracting some meaningful insights. The text itself leverages the well-known public Enron corpus as a realistic source of open data, but the code works just as well with any other kind of mail data that can be exported (or munged) into an mbox format.

As it turns out, Google Takeout can export your entire mailbox or any subset of it as defined by labels and other organizational options you can implement through the standard GMail user interface. After a couple of relatively minor enhancements, it became easy enough to forget all about Enron, pick up right at Example 6-3, and work through the remainder of the chapter on your own mailbox data. Likewise, many popular mail clients allow you to export in mbox format and accomplish the very same thing.

The basic flow of the examples as presented in the IPython Notebook involves the following steps:

  • Arrive at an mbox formatted export of your mail
  • Convert the mbox export into JSON
  • Load the JSONified data into MongoDB
  • Use MongoDB’s powerful aggregation framework to query and analyze the mailbox
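As a rough sketch of the first three steps above, the snippet below uses Python’s standard-library mailbox module plus the pymongo driver against a locally running MongoDB instance; the path, database, and collection names are placeholders of my choosing, not values from the chapter:

```python
import mailbox


def message_to_doc(msg):
    """Flatten one mbox message into a JSON-serializable dict."""
    doc = {k.lower(): v for (k, v) in msg.items()}  # headers -> fields
    if msg.is_multipart():
        # Keep only the text/plain parts of multipart messages.
        parts = [p.get_payload(decode=True) for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        body = b"\n".join(p for p in parts if p)
    else:
        body = msg.get_payload(decode=True) or b""
    doc["body"] = body.decode("utf-8", errors="replace")
    return doc


def load_mbox_into_mongo(mbox_path, db="mail", collection="messages"):
    """Read the export, JSONify each message, and load the batch into MongoDB.

    Requires pymongo and a running mongod; the db/collection names are
    illustrative assumptions.
    """
    from pymongo import MongoClient
    docs = [message_to_doc(m) for m in mailbox.mbox(mbox_path)]
    coll = MongoClient()[db][collection]
    coll.insert_many(docs)
    return coll
```

From there, step four is just a matter of running queries or aggregation pipelines against the resulting collection.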

As is the case with all other chapters from Mining the Social Web, all of the source code examples for Chapter 6 are available online in a convenient IPython Notebook format and easy enough to follow along with even if you don’t have a copy of the text. Furthermore, the turn-key virtual machine that’s provided takes care of the initial installation/configuration pains of IPython Notebook, MongoDB, and some of the other dependencies so that you can get right to the good stuff!

If you haven’t yet installed the virtual machine, this quick start guide that features a step-by-step video may be of great help, and as always, I’m just a tweet, Facebook message, GitHub ticket, or email away if you need any assistance along the way.

Enjoy.

Twitter Data Mining Round Up


Since the release of Mining the Social Web, 2E in late October of last year, I have mostly focused on creating supplemental content about Twitter data. This seemed like a natural starting point given that the first chapter of the book is a gentle introduction to data mining with Twitter’s API, coupled with the inherent openness of accessing and analyzing Twitter data (in comparison to other data sources that are a little more restrictive). Twitter’s IPO late last year also focused the spotlight a bit on Twitter, which provided some good opportunities to opine on Twitter’s underlying data model, which can be interpreted as an interest graph.

Throughout the remainder of this coming year, I hope to spend much less time on Twitter and systematically work through the myriad other topics in the book: Facebook, LinkedIn, Google+, email, web pages, etc. However, before changing course, it seemed useful to provide a consolidated reference of the existing Twitter-related content.

Blog Posts

IPython Notebooks

Excerpts & Presentations

Videos

While there are plenty of other great links out there on the web about data mining with Twitter, these are a few that I am particularly proud to have produced. I hope you enjoy them.

Stay in touch and feel free to reach out with any suggestions or requests about future content.

Mining Social Web APIs with IPython Notebook [Data Day Texas Workshop Slides]


Thanks to everyone who attended the Mining Social Web APIs with IPython Notebook workshop at Data Day Texas. I’m really glad that I made the trip down to Austin and could share some of my work with you. The data truly is bigger in Texas, Austin was a fantastic city to visit, and everyone I had the pleasure of speaking with at the conference was really friendly and motivated to learn.

In case you missed the workshop, you can download the workshop slides from Slideshare. Everything you need to follow along should be in the deck, but as always, please don’t hesitate to contact me if there’s anything at all that I can do to help you in any way.

Gratefully,

-MAR

5 Questions for Aspiring Author-Entrepreneurs


For most of 2013, most of my nights and weekends have been consumed with writing (and selling) a book entitled Mining the Social Web (2nd Edition). This makes the fifth tech book that I’ve written in approximately five years, and one thing I’ve come to learn over the course of my book writing adventures is that book writing is a skill in and of itself. Like anything else, the more of it that you do, the more that you learn and can share back with others.

This post presents the following questions (along with some anecdotal advice) that I’d recommend mulling over if you are an aspiring tech book writer.

What are your motives?

Writing a quality tech book of reasonable length is not for the faint of heart. Like any other long-lived effort in an age of waning attention spans and instant gratification, some of the pains involved will push you to a point where you’ll seriously reconsider whether or not this book-writing idea was worthwhile in the first place. On more than one occasion, you’ll contemplate the other things that you could be doing with your time. In the end, if you don’t have a good reason as to why you’re writing the book, you’ll probably quit and be just another publishing casualty along the way.

To be perfectly clear, your motives certainly don’t have to be altruistic or selfless. You just need to be honest with yourself, clearly articulate them in writing somewhere, and review them from time to time. A few of the possible reasons you might consider writing a tech book could include:

  • Rigorously learning a new topic
  • Building your reputation
  • Earning extra income
  • Altruistically fulfilling a need in the market

Of all the reasons to write a book, earning extra income is the one that I’d admonish you to consider the most carefully. The difference between doing something for fun versus doing it for profit can dramatically change the dynamics and relative enjoyment of the activity. The two motivations certainly don’t need to be mutually exclusive, but just make sure that your goal for the book is achievable by your own standards and that you really believe it’s worth a significant portion of your time to accomplish.


If money is your primary interest in writing a tech book, you may find that you don’t have very many problems at all.

How long will it take?

I’d recommend thinking about the amount of effort that it takes to write a quality tech book in terms of both overall effort involved as well as calendar time. The former is based upon estimates that you’ll derive from your outline of the book and can be used to comparatively think about the “opportunity costs” of not doing something else with your time. The latter partitions that overall amount of time into a schedule that fits onto the calendar and helps you to better understand the ramifications of those opportunity costs.

Just a few of the opportunity costs that you should consider:

  • Missed consulting revenue
  • Volunteer work
  • Exercise
  • Social relationships
  • Entertainment

After writing 5 books myself, the base metric I’ve settled upon from my own personal experience is that it takes about 2 hours per page after all of the details are worked out. That figure includes the earliest stages of brainstorming, the amortization of time diverted into research activities that inevitably happens along the way, and everything else that leads into the final round of proofreading in which I (re-)read every single word of the final manuscript that’s about to go to the printer. That number may seem high, and your own mileage may vary, but you might at least consider it as a starting point or as an upper bound if you think you’re considerably more efficient.


As with software projects, estimating the effort required to write a tech book can be quite difficult. Lots of early peer review on your outline and other efforts to ferret out the “unknown unknowns” that could adversely affect your project schedule is essential.

To illustrate, let’s assume that you’ve produced a solid outline that suggests you’ll be writing a book that’s estimated to be around 350 pages. Using a heuristic of 2 hours per page, that translates to about 700 hours of effort, and unless you’ve enjoyed a recent windfall or other special circumstances that allow you to approach this endeavor as a full-time job, you’ll inevitably be sacrificing a substantial portion of your nights and weekends for the better part of a year to get it done if you’re moonlighting at the rate of 15-20 hours a week.

One other consideration that you should always take into account with any activity involving estimation is Hofstadter’s Law, which is defined as follows: It always takes longer than you expect, even when you take into account Hofstadter’s Law.

Seriously, estimation is not easy, and you’ll find that there are gaps in your outline that you’ll need to fill along the way. Those detours can really start to add up. The bottom line is that it will almost certainly take longer than you anticipate to write a book that you’ll be proud of writing. Be sure to regularly reassess your original estimates and update them along the way.

To self-publish or not to self-publish?

Besides making that initial mental commitment to write a book, determining whether or not to work with a publisher and choosing a particular publisher is probably the biggest decision that you’ll make. I’d recommend approaching this very important decision with standard cost-benefit analysis as well as from the basis of whether or not you need a partner to achieve your goals for the book or if you can do it alone.

As with any other relationship in life, open and honest communication is key, and it’s imperative that you manage expectations properly. Your relationship with your publisher (and even more specifically, with your editor) is no different. A good way to think about “manage my expectations” is “don’t surprise me”. From the standpoint of working with a publisher, that typically translates into staying on schedule and adhering to the agreed upon outline for the content.

However, you’re the one who will be staying up late and making lots of sacrifices to produce the book as a moonlighting activity, so you should be sure that the publisher can meet your own expectations before engaging in a (legally binding) partnership with them. A few questions to consider during your initial conversations with a publisher:

  • What type and frequency of feedback will you provide?
  • How much grace is extended for missing deadlines or potentially extenuating circumstances?
  • How much can I deviate from the original outline without renegotiating the contract?
  • Will I ever be able to renegotiate any key financial metrics like royalty rates or advances?
  • How much “production support” are you providing for professional illustrations, proofreading, copyediting, etc.?
  • What will you do to market/sell the book once it’s complete?

There’s a real value that you can estimate and place on those factors. Sure, you could do it all yourself, but that would take up even more of your time and translate into even higher opportunity cost.


You could self-publish, but you could also do lots of other things with the time that it would take to produce a truly professional work. Carefully consider your motives and goals for producing the book before deciding that self-publishing is right for you.

In an era of self-publishing, ebooks, and print-on-demand services, I’d recommend that you hold the publisher to very high standards on at least the following fronts:

  • The shaping and refinement of your initial ideas
    • Don’t underestimate the importance of writing a book that the market needs as opposed to just writing a book that you want to write.
  • Constructive criticism about your manuscript as it evolves
    • You need the feedback, no matter how good you think that you are. You want your product to be the best that it possibly can be.
  • The application of quality production processes to the final manuscript
    • This is seriously tedious work that you really don’t want to do yourself and are paying a huge premium for by working with a publisher. Make them earn it!
  • A solid distribution channel with ample sales/marketing
    • Once your book is complete, it’s a product. At that point, it’s not about writing; it’s about addressing the market and selling it.

In my recent book-as-a-startup experiences with Mining the Social Web (2nd Edition), it’s the application of production processes and the distribution channel that have provided the most value. Multiple rounds of proofreading, copyediting, professional illustrations, and the creation of cover art are all things that I’d rather not have had to do myself, and they certainly took the professionalism of the book to a whole new level. In terms of distribution, suffice it to say that it is certainly in the publisher’s interest to see your work succeed, but you are only one of scores of authors that they are probably working with, so temper your expectations.

One expectation that you should certainly not misunderstand is that your publisher is not your primary source of sales and marketing. You as the author are your primary source of sales and marketing. Once you have a final product in a distribution channel, there will probably be some momentum from a small PR campaign around your book that the publisher takes care of, but that’s really just to set off a spark. The real sales and marketing is up to you, and you’ll have to be enterprising to figure out what’s working and what’s not working. I highly recommend the application of Lean Startup principles, which is a good segue into the next topic.

Is it a project or a product?

Trick question! It’s both — though not quite at the same time. The distinction that I’m making between project and product can be illustrated with the following two pieces of advice:

  • The process of writing a book is a project
  • A book is a product that you sell

The takeaway here is that if you only think about your book as a project, then the project basically ends once you have a product in the publisher’s distribution channels. At that point, the project is “complete” aside from some ad-hoc work you might occasionally do to promote it. By the time the book publishes, you’re probably frazzled, exhausted, and just want to regain some balance in your life, so it’s a very natural reaction to feel a sense of accomplishment, breathe a sigh of relief, and trust that the publisher will sell it for you. After all, if it’s any good, it’ll just “sell itself”, right?


A project and a product are two entirely different things. Think carefully about them as your manuscript approaches final form and strongly consider treating your book as a startup. The experiential benefits can be tremendous and of incalculable value regardless of the success of the book itself.

I’m confident that you’ll make a few bucks with your book while you momentarily decompress from the surge to get it across the finish line, but I’d strongly admonish you to reengage and treat it like a product from that point forward. The decision to think of your book as a startup and yourself as the CEO of this tiny little startup is a lot more work compared to performing ad-hoc work whenever you feel like it, but it unlocks an entirely new perspective on life.

With a product and distribution channel in hand, you’ll be forced to think about things that you’ve always taken for granted (or thought of as unimportant/easy work) in other professional engagements. A few examples of the hats you’ll wear as an author-entrepreneur with your book-as-a-startup business to get you thinking:

  • As CEO, what should you be doing to maximally promote the book? Blogging? Speaking engagements? Book tour? Should you spend money on various sources of online ads? Should the book just be a prop for consulting?
  • As CMO, can you accurately estimate the size of your addressable market? Determine if your messaging is as effective as it needs to be?
  • As COO, can you explain the prior month’s revenue? Forecast the next month’s revenue?
  • As CTO, is there a way that you can simplify the user’s experience to try out the code? Perhaps a VM or a web app that’s trivial to install?
  • As the SVP of Customer Service, can you institute a system to respond to unhappy readers? Before they leave you a bad review?

At the end of the month, it really all boils down to a single number: revenue earned. The arithmetic and accounting reports (as provided by the publisher or online publishing system) are pretty simple. As the author-entrepreneur, it’s your job to do something about them.

What is holding you back from selling more books? Is it a flawed product, or is it a marketing issue?

Writing a book is one thing. Selling a book is a different beast entirely.

Marketing is hard.

The following video is a short ~5 minute Ignite talk that provides some (hopefully motivational and entertaining) information on the notion of treating a book as a startup.

What is its expected shelf life?

Last but certainly not least is the longevity of your book, regardless of whether you prefer to think of it as a project or a product. In either case, you’ve invested non-trivial effort into making it a reality, and you probably won’t look forward to the maintenance involved in keeping it up to date, or the day that you have to rewrite significant portions to reflect changes in the underlying technology that backs the dialogue and example code.

As much as you need to understand your addressable market, you need to understand the technology that you are including in your book, the community that backs it, and any roadmaps that may (or may not) exist. Take it from someone who has written a book that was affected by fairly major changes to the social web landscape (short-notice Twitter API changes, the retirement of Google Buzz and the birthing of Google Plus, OAuth 2.0 evolution, etc.) that it’s not enough to just write about what exists right now.

You need to craft your written message so that it’s as evergreen as possible. In the words of a famous Canadian hockey player, you want to “skate where the puck’s going, not where it’s been”. Be as prescient as possible in making the right bets in terms of what you introduce in written form (the book) versus what you can provide as an online supplement that will be much easier to maintain. As with (successful) software projects, the majority of the effort required is usually during the maintenance of the product after it’s been operationalized. Why should a successful tech book be any different?


How does the shelf-life of your book compare to the shelf-life of these Twinkies?

Revenue is trust. If your customers trusted you enough to pay for a product with your name on the front of it, you can either take care of them and show yourself worthy of that trust, or you can inevitably tarnish your reputation. And that’s not good for business.

Closing Remarks

Writing a successful tech book is an incredibly daunting endeavor, and if you really want to maximize the revenue opportunities associated with it, you’d be wise to think of it in terms of a tiny startup business, apply some Lean Startup principles, and treat yourself to the entrepreneurial education that only real world experience can bring. It will require more sacrifice than you think that it will, it will take more time than you estimate that it will, things will go wrong, and the whole process will truly test you. However, you will come out the other side stronger, wiser, and with “street smarts” that you can’t get by just sitting around and talking about things.

Talk is cheap. Don’t be cheap. Get to work on that book, and let me know if there’s anything I can ever do to help you. I hope to share some more book-as-a-startup posts in early 2014.

Understanding the Reaction to Amazon Prime Air (Or: Tapping Twitter’s Firehose for Fun and Profit with pandas)


On Cyber Monday eve, Jeff Bezos appeared in a 60 Minutes segment and revealed to the world that he’s been working on an experimental effort called Amazon Prime Air. The general idea behind Amazon Prime Air is that Amazon may one day deliver relatively lightweight items directly to your doorstep in less than 30 minutes after you order via a fleet of small unmanned aerial vehicles. The following short video summarizes the concept in case you’ve somehow missed it.

Within moments of the announcement, I tapped Twitter’s firehose for the keyword query “Amazon” by employing a couple of recipes from the Twitter Cookbook, because this seemed like an ideal opportunity to capture a relatively large volume of tweets laden with emotional reaction. Over the course of the next few hours, I collected ~125,000 tweets, analyzed them in IPython Notebook with pandas, and later presented these findings as an online mini-workshop. (A video archive of the entire workshop is now available in case you missed it last week.)

Rather than rehashing the results here, I’d rather invite you to spend a few minutes reviewing the notebook. It’s easy to follow along with, features lots of narrative, and includes output from running the code. The analysis techniques range from basic time-series analysis with pandas to rudimentary natural language processing toward the end, so there should be a little something in there for everyone.
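To give a flavor of the time-series side of that analysis, here is a minimal pandas sketch that buckets tweet volume by minute. The timestamps below are made up for illustration; in the notebook they would come from each tweet’s created_at field:

```python
import pandas as pd

# Hypothetical tweet creation times (in practice, parsed from 'created_at').
times = pd.to_datetime([
    "2013-12-02 00:00:05", "2013-12-02 00:00:40",
    "2013-12-02 00:01:10", "2013-12-02 00:03:30",
])
tweets = pd.DataFrame({"n": 1}, index=times)

# Tweet volume per minute; empty minutes show up as zero counts.
per_minute = tweets["n"].resample("1min").sum()
print(per_minute)
```

From a series like this it is one short step to plots of reaction volume over time, which is exactly the kind of view that makes a breaking-news dataset interesting.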

As always, questions and comments are welcome. Enjoy.

Confessions of a Prolific Moonlighter (with a Chronic Writing Disorder)

A ~5 minute Ignite talk (20 slides, 15 seconds per slide) that provides some advice on writing tech books — and life.

The fundamental takeaway is that a book is a startup! (If you want it to be…)

  • It’s a product (and/or services.)
    • But it’s especially a product
  • Tech writing is a skill
    • It’s story-telling
  • Moonlighting is a skill
    • Maintain work/life balance
  • You can have a startup
    • Write a book!

Download the slides on SlideShare.

Enjoy!

What Do Tim O’Reilly, Lady Gaga, and Marissa Mayer All Have In Common?


This post examines the followers of some popular Twitter users as the final installment of a multi-part series about exploring Twitter influence by asking the (Freakonomics-inspired) question, What do Tim O’Reilly, Lady Gaga, and Marissa Mayer all have in common? Although it may initially seem like an obnoxious question to ask, some of the answers may intrigue you once you begin to take a closer look at the data. (Although dashingly good looks might be one thing that they all have in common, we’ll let the data do the talking and stick with Twitter followers as the basis of computing similarity for this post.)


Which two of these three accomplished entrepreneurs are most alike? It all depends on the features that you’re comparing!

Goals

The initial idea behind this entire series on Twitter influence is that it would be an interesting and educational experiment in data science to put Tim O’Reilly‘s ~1.7 million followers under the microscope and explore the correlation between popularity (based upon number of followers) and Twitter influence. 

In order to draw some meaningful comparisons, however, we’ll need to consider at least one other account. Marissa Mayer seems like a fine selection for comparison since her Twitter account is similar to yet different from Tim’s account. For example, she’s also a “tech celebrity” and business executive. However, her particular expertise is not quite the same, and she only has about one-fourth as many followers. (Or so it would initially appear…)

Just to make this interesting, let’s further mix things up a bit by introducing a wildcard. Lady Gaga seems as good a choice as any to introduce a bit of unexpected fun into the situation. She is one of the ten most popular Twitter users based upon number of followers, an accomplished entrepreneur, and someone who surely draws interest from a broad cross-section of the population. The introduction of a third account also provides the opportunity to draw some additional comparisons, so let’s compute the Jaccard index for the various combinations of these three accounts and see what turns up. The Jaccard index measures similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets, or, more plainly, the amount of overlap between the sets divided by the total size of the combined set. This is a simple way to measure and compare the overlap in followers.
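A minimal sketch of that definition in Python; the follower-ID sets here are toy values for illustration, not real data:

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0  # two empty sets: define similarity as 0
    return len(a & b) / len(a | b)


# Toy follower-ID sets (made up for illustration):
tim = {1, 2, 3, 4}
gaga = {3, 4, 5, 6}
print(jaccard_index(tim, gaga))  # 2 common IDs / 6 total -> 0.333...
```

Applied to real accounts, the sets would be the follower IDs fetched for each user, and the index gives a number between 0 (no overlap) and 1 (identical audiences).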

Results

The full results (example code, notes, and the results from executing each cell) are available as an IPython Notebook, and you are encouraged to review it in depth. For convenience, a summary of the key results that you’ll see computed in the notebook follows:

  • Approximately 50% of Tim O’Reilly’s ~1.7 million followers are “suspect” in the sense that they may be inactive accounts or spam bots. In comparison, only about 15% of Marissa Mayer’s ~460k followers are suspect according to the same criteria.
    • Although mostly speculative, this difference might be explainable by a massive wave of spam-bots targeting popular users back in 2009 when Twitter experienced some unprecedented growth in its number of users. (For example, a closer look at the data reveals that ~66% of Tim O’Reilly’s followers joined Twitter in 2009.)

A histogram of Tim O’Reilly’s followers who have fewer than 10 followers of their own. Approximately 50% of these followers are “suspect” in that they may be spam-bots or inactive accounts; decreasing the threshold to 5 decreases the number to just under 40%.

  • Approximately 25% of Tim O’Reilly’s (“non-suspect”) followers also follow Lady Gaga as compared to only about 18% for Marissa Mayer.
    • In other words, there appears to be a slightly stronger affinity between Tim O’Reilly and Lady Gaga than between Marissa Mayer and Lady Gaga.
  • Lady Gaga has a higher Jaccard similarity to Tim O’Reilly than to Marissa Mayer. (However, Tim O’Reilly and Marissa Mayer have a much higher Jaccard similarity to one another than either one of them has to Lady Gaga, as might have been reasonably expected from their strong technology backgrounds.)
    • Tim O’Reilly and Marissa Mayer have ~100k followers in common, and even once this number is adjusted for suspect followers, there are still ~95k followers in common. This is a high number but doesn’t seem all that surprising.
    • What may seem a bit unexpected is that once you introduce Lady Gaga, this number only drops to ~25k. In other words, the total number of followers that Tim O’Reilly, Marissa Mayer, and Lady Gaga all have in common amongst the three of them is still about 25k accounts.
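The “suspect” filter behind those percentages can be sketched as a simple threshold over follower profiles. The field name below mirrors the followers_count field in Twitter user objects, but the threshold and the toy profiles are illustrative assumptions, not the notebook’s actual data:

```python
def suspect_fraction(profiles, min_followers=10):
    """Fraction of follower profiles with fewer than `min_followers`
    followers of their own -- a coarse proxy for inactive/spam accounts."""
    if not profiles:
        return 0.0
    suspect = [p for p in profiles if p["followers_count"] < min_followers]
    return len(suspect) / len(profiles)


# Toy profiles (illustrative only):
profiles = [{"followers_count": n} for n in (0, 2, 8, 50, 120)]
print(suspect_fraction(profiles))      # 3 of 5 under the threshold -> 0.6
print(suspect_fraction(profiles, 5))   # tightening the threshold -> 0.4
```

Varying min_followers is exactly the kind of sensitivity check described in the histogram caption above: small changes to the threshold move the “suspect” percentage noticeably.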

Perhaps the broad takeaway that addresses our initial inquiry about using popularity as an indicator of clout is that “number of followers” is not as clear-cut a heuristic as it may have first seemed. After all, the actual gap between Tim O’Reilly and Marissa Mayer appears to be considerably smaller than it once did after making a simple adjustment for so-called “suspect” followers.

But what do Tim O’Reilly, Lady Gaga, and Marissa Mayer have in common? At least one way of answering the question is that there appear to be at least 25k common fans who are interested in all three of them. After all, Twitter is an interest graph. A closer analysis of these common account profiles could prove quite interesting and is a recommended exercise.

Although nothing definitive was proven, it seems quite likely that a coarse filter on an account’s followers is a good starting point. It wouldn’t be too difficult to perform some additional filtering to increase the precision of identifying abandoned accounts or spam bots that cannot be influenced in order to home in more accurately on a base metric for computing Twitter influence. You now have the tools and a good starting point to do just that — and a lot of other fun stuff.

By the way, you may have noticed that we didn’t tell you how many of Lady Gaga’s followers appear to be spambots or inactive. That is the topic for another post to follow. (Unless, of course, you beat me to the punch!)

Enjoy!

Updates

23 Nov 13 @ 1900UTC – Like Tim O’Reilly, approximately 50% of Lady Gaga’s followers are also “suspect” when applying the same “minimum follower” filter. She joined Twitter around the same time as Tim O’Reilly back in March 2008.

More analysis to follow soon with a closer look at ‘suspect’ followers with the goal of identifying the inactive/spambot accounts with very high probability. Thoughts on criteria to use are welcome; leave a comment!

Resources

Super Simple Storage for Social Web Data with MongoDB (Computing Twitter Influence, Part 4)


In the last few posts for this series on computing Twitter influence, we’ve reviewed some of the considerations in calculating a base metric for influence and how to acquire the necessary data to begin analysis. This post finishes up all of the prerequisite machinery before the real data science fun begins by introducing MongoDB as a staple in your social web mining toolkit and showing how to employ it for storing social data such as Twitter API responses.

As Easy As It Should Be

MongoDB is an excellent option to consider if you need a quick and easy fix for your data science experiments, and if you like Python, there’s a good chance you’ll enjoy MongoDB as well. Much like Python, MongoDB is easy to pick up along the way: it scales up fairly well as the size of your data grows without too much fuss, the online documentation is excellent, the community is robust, language bindings are plentiful, and it’s generally just as easy as it should be to move data to and from Python.

MongoDB is document-oriented, which (for our purposes) basically means that it stores JSON data, enabling you to easily archive the responses that you get back from most social web APIs. It’s easy enough to query the data with the standard find() operator, and a more powerful aggregation framework is available for constructing more nuanced data pipelines.
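For example, a pipeline that counts the distinct follower ids stored across chunked API responses might look like the following. The pipeline is just a list of plain dictionaries; the document shape matches how ids get stored later in this post, and you’d run it against a live MongoDB with something like db.followers_ids.aggregate(pipeline).

```python
# Each stored document looks like {"ids": [1, 2, 3, ...]}, one per API response
pipeline = [
    {"$unwind": "$ids"},                            # one document per follower id
    {"$group": {"_id": "$ids"}},                    # de-duplicate the ids
    {"$group": {"_id": None, "count": {"$sum": 1}}} # count the distinct ids
]
```

De-duplication matters here because Twitter’s cursored responses can occasionally overlap if the follower list changes while you’re paging through it.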

A full primer on MongoDB is beyond the scope of this post, but if you have a copy of the book on hand, Chapter 6 (Mining Mailboxes) introduces MongoDB as a sort of surrogate API for mail data. (The first half of that chapter focuses on normalizing arbitrarily sourced mail data so that it can be ingested into MongoDB for standardized analysis.)

Saving and accessing JSON data with MongoDB (Example 9-7 from the Twitter Cookbook) introduces two functions for storing and retrieving Twitter API data from MongoDB that we’ll adapt in the next section for our immediate needs. Take a moment to review this recipe if you haven’t previously encountered it. The functions that it provides are little more than load/store convenience wrappers.
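If you don’t have the recipe handy, the wrappers amount to something like the following sketch. The real recipe manages its own MongoClient connection and supports a few more options; here a pymongo Database handle is passed in explicitly (and the modern insert_one API is used) to keep the sketch self-contained.

```python
def save_to_mongo(data, db, coll_name):
    # Insert a single document into db[coll_name] and return its _id;
    # db is assumed to be a pymongo Database handle
    return db[coll_name].insert_one(data).inserted_id

def load_from_mongo(db, coll_name, criteria=None, projection=None):
    # Return matching documents as a list; criteria and projection follow
    # pymongo's find() conventions
    return list(db[coll_name].find(criteria or {}, projection))
```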

Storing Millions of Twitter Followers

Recall from the last post in this series that a recipe like Getting all friends or followers for a user (Example 9-19 from the Twitter Cookbook) is fundamentally limited by the amount of memory that’s available. It buffers API responses in memory and accumulates 75,000 long integer values every 15 minutes, and although this is fine for a user with a “reasonable” number of followers, it won’t work at all for celebrity users with millions of followers. Even if we did have unlimited heap space, we’d still want to strive for a low memory profile as well as maintain a persistent archive for more convenient analysis that’s unconstrained by rate limits and network latency. After all, once you have the data, you won’t want to go to the trouble of fetching it again unless absolutely necessary since this process can be quite time consuming.
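To put that scale in perspective, a back-of-the-envelope calculation of wall-clock fetch time under the rate limit (75,000 ids per 15-minute window, as noted above) is telling. The 40 million follower figure below is just a rough illustrative number for a major celebrity account.

```python
import math

def hours_to_fetch(num_followers, ids_per_window=75000, window_minutes=15):
    # Number of 15-minute rate-limit windows needed, times the window length
    windows = math.ceil(num_followers / float(ids_per_window))
    return windows * window_minutes / 60.0

# A "reasonable" account takes a single window...
hours_to_fetch(50000)      # 0.25 hours
# ...but tens of millions of followers takes days of continuous fetching
hours_to_fetch(40000000)   # 133.5 hours, roughly five and a half days
```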

To illustrate just how easy it is to adapt a recipe from the cookbook like Example 9-19, take a look at this revised version of get_friends_followers_ids that’s been renamed to store_friends_followers_ids, and compare it back to the original version. The primary substance of the change is simply the introduction of a save_to_mongo call for persisting each API response (along with a few tweaks to make this possible).

import sys
from sys import maxint
from functools import partial

# Note: make_twitter_request, save_to_mongo, and oauth_login are helper
# functions from the Twitter Cookbook recipes referenced above

def store_friends_followers_ids(twitter_api, screen_name=None, user_id=None,
                                friends_limit=maxint, followers_limit=maxint,
                                database=None):

    # Must have either screen_name or user_id (logical xor)
    assert (screen_name != None) != (user_id != None), "Must have screen_name or user_id, but not both"

    # See https://dev.twitter.com/docs/api/1.1/get/friends/ids  and
    # See https://dev.twitter.com/docs/api/1.1/get/followers/ids for details on API parameters

    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, count=5000)
    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, count=5000)

    for twitter_api_func, limit, label in [
                                 [get_friends_ids, friends_limit, "friends"],
                                 [get_followers_ids, followers_limit, "followers"]
                             ]:

        if limit == 0: continue

        total_ids = 0
        cursor = -1
        while cursor != 0:

            # Use make_twitter_request via the partially bound callable...
            if screen_name:
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id
                response = twitter_api_func(user_id=user_id, cursor=cursor)

            if response is not None:
                ids = response['ids']
                total_ids += len(ids)
                save_to_mongo({"ids" : [_id for _id in ids ]}, database, label + "_ids")
                cursor = response['next_cursor']

            print >> sys.stderr, 'Fetched {0} total {1} ids for {2}'.format(total_ids, label, (user_id or screen_name))
            sys.stderr.flush()

            # Consider also writing the ids to disk during each iteration to
            # provide an additional layer of protection from exceptional
            # circumstances

            if response is None or total_ids >= limit:
                print >> sys.stderr, 'Last cursor', cursor
                print >> sys.stderr, 'Last response', response
                break

# Sample usage follows...

screen_names = ['SocialWebMining', 'LadyGaga']

twitter_api = oauth_login()

for screen_name in screen_names:

    store_friends_followers_ids(twitter_api, screen_name=screen_name,
                                friends_limit=0, database=screen_name)

print "Done"

That’s really all there is to it. We’re now at the point where we can reliably harvest and store arbitrary volumes of Twitter data.

It may be worthwhile to review the prior posts in this series as a reminder of just how far we’ve come. With all of the necessary machinery and prerequisite discussion now in place, the next post in this series will return to the original proposition of computing Twitter influence with an initial review of data for a few well-known Twitter accounts.

How to Deliver a Successful Tech Workshop with Vagrant and AWS

the-cloud-dropbox

At Strata, I delivered a workshop called Mining the Social Web with IPython Notebook, and in order to ensure that the workshop would meet its objectives and be a smashing success, I knew that a few constraints had to be considered:

  1. Everyone must be able to follow along with the examples. (The goal of the workshop is actually doing something with data as opposed to just talking about it.)
  2. You need a development environment to follow along with the examples. (You can’t do anything with data unless you have a development environment.)
  3. Most people wouldn’t have prepared a development environment. (Inevitable.)
  4. Preparing a development environment isn’t possible to do on site at the workshop. (It’s far too time-consuming, and the wireless would probably buckle even if it weren’t.)

Those constraints are actually pretty challenging to satisfy, but there are some approaches that you can consider:

  • Do nothing; if people didn’t prepare, then it’s too bad for them. (Unacceptable if you want people to enjoy your workshop. Even if it’s not your fault that they didn’t prepare, it’s still your problem.)
  • Pass out media such as CDs or USB drives with the necessary software on site. (Cumbersome at the very least for a non-trivial number of attendees, and still fairly time consuming.)
  • Provide pre-configured cloud-based machines for everyone. (Check.)

Running a Vagrant Box on AWS

Powering Mining the Social Web’s virtual machine experience with Vagrant has turned out to be a remarkably good decision. It trivializes the process of bootstrapping a virtual machine and applying a configuration management template, which is perfect for creating a repeatable development environment. Just follow along with the quick start guide, watch the screencasts, and you’ll be up and running in no time.

But that’s the Vagrant you know and not the Vagrant that would do much to help with the workshop.

The Vagrant that you (probably) don’t know is the Vagrant that can just as trivially launch that very same virtual machine on the AWS cloud. In short, you just need an AWS account and the vagrant-aws plugin. In a little more detail, here are the basic steps involved once you’ve already been able to follow along with the quick start guide and launch your virtual machine locally:

  1. Sign up for an AWS account. (Right now, there’s even a “free tier” that will work just fine for Mining the Social Web, so it won’t even cost you anything. However, you are required to have a credit card on file.)
  2. Install the vagrant-aws plugin. Type this in a terminal: vagrant plugin install vagrant-aws
  3. Install a “dummy” Vagrant box, which just creates a shell for Vagrant to use for some local bookkeeping. Type this in a terminal: vagrant box add dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box
  4. Define the four environment variables starting with MTSW_ that are referenced in Mining the Social Web’s Vagrantfile. These environment variables define your AWS access key, AWS secret access key, the name of the keypair used to start an EC2 instance, and the path to the private key for that keypair.
  5. Start your virtual machine in the cloud. Type this in a terminal: vagrant up --provider=aws and keep an eye on the console.
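For step 4, the exports look something like the following. The variable names here are hypothetical placeholders for illustration; check the Vagrantfile itself for the exact MTSW_-prefixed names it references.

```shell
# Hypothetical variable names for illustration only -- the Vagrantfile
# defines the exact MTSW_-prefixed names it expects
export MTSW_AWS_ACCESS_KEY="AKIA..."                        # AWS access key id
export MTSW_AWS_SECRET_ACCESS_KEY="..."                     # AWS secret access key
export MTSW_AWS_KEYPAIR_NAME="mtsw"                         # name of the EC2 keypair
export MTSW_AWS_SSH_PRIVATE_KEY_PATH="$HOME/.ssh/mtsw.pem"  # private key for that keypair
```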

As with anything else, there may be a few configuration details that you’ll need to tweak, but that’s the gist. If you’re interested in launching a virtual machine with the AWS provider, you should definitely learn some EC2 fundamentals and read up enough about Vagrant to understand the contents of the Vagrantfile. In particular, bone up on the details associated with launching EC2 instances (so that you have a better idea of some of the other settings you can configure, such as the region in which your AWS machines will start), and invest a little bit of time learning more about Vagrantfile settings.

Going from One Instance to Sixty

The instructions in the previous section are exactly what you’d do to bootstrap a single AWS instance in the cloud, and unless you’re doing some fairly heavy duty work, you should be able to employ a micro-instance that costs less than $0.02 per hour! However, recall that for my workshop, I didn’t need just a single instance. I needed to launch ~60 machines so that everyone in the workshop would have their own virtual machine.

As it turns out, it’s not so difficult to go from one to sixty. Here’s how:

  1. Log in to your AWS management console to view running EC2 instances. (You’ll see the instance that Vagrant started on your behalf.)
  2. Create an AMI from your running EC2 instance. (An AMI is an “Amazon Machine Image”; think of it as a template for an EC2 instance.)
  3. Launch new EC2 instances from the “AMIs” item in the navigation menu.

Once Vagrant has configured your EC2 instance, you can create an AMI from the running instance. Then, you only need access to the AWS management console in order to launch new instances from the AMI. Launching from the AMI usually takes less than a minute since all of the configuration management has already been applied.

Again, you will need to gain a little comfort with AWS along the way, but that’s pretty much all that’s required for a basic setup. At this point, you technically don’t even need Vagrant anymore since you can launch fully pre-configured EC2 instances as needed. (However, do keep in mind how much work Vagrant did to make it this easy to create the AMI that you can now so easily employ.)

One other consideration worth pointing out is that you may want to secure your IPython Notebook server with a password, since it’s in the cloud and could be accessible to anyone in the world if you haven’t locked down the range of IP addresses that can access it. (Even then, a password would still probably be a good idea.)

Finally, note that your account will only be able to launch 20 EC2 instances by default, but that AWS customer service is ready and willing to help you if you need more. (Read more about my terrific encounter with AWS customer support.)

Now, go out and deliver a successful technical workshop!

An Approximate Solution for TL;DR [~50 Year Old Text Summarization Hack Presented as a ~1.7MB Animated GIF]

hp-luhn

Suffering from information overload? Too much TL;DR happening in your life? Attention span just isn’t what it used to be?

Watch this short ~30 second screencast (a ~1.7MB animated GIF) that demonstrates a 50+ year old hack for summarizing news articles and other types of online content. After all, it seemed fitting that the presentation of a text summarization algorithm would be as compressed and summarized as possible, right?

The text summarization code itself is taken from Mining the Social Web.
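The gist of that 50+ year old hack (H. P. Luhn’s 1958 auto-abstracting algorithm) is to rank sentences by how densely they contain the document’s most frequent content words. The book’s implementation is considerably more careful; what follows is only a toy sketch of the idea.

```python
import re
from collections import Counter

def summarize(text, num_sentences=2, num_top_words=5):
    # Split into sentences and score each one by how many of the document's
    # most frequent words it contains (a crude stand-in for Luhn's
    # significance-factor scoring)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    top_words = {w for w, _ in Counter(words).most_common(num_top_words)}
    def score(s):
        return len(top_words & set(re.findall(r'[a-z]+', s.lower())))
    return sorted(sentences, key=score, reverse=True)[:num_sentences]
```

A real implementation would also drop stopwords and score clusters of significant words within each sentence, which is where most of Luhn’s cleverness lives.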

preview

Click on the image above to watch a higher resolution version of this ~30 second animated GIF screencast. This preview version is ~360KB while the higher resolution version is still only 1.7MB. (WordPress wouldn’t render the full version of the GIF containing the animation inline because of the constraints imposed by this site’s theme.)

For those who prefer it, the video version of this “screencast” is also available.
