How To Mine Your Gmail with Google Takeout and MongoDB

Google has really been on the up-and-up lately with a service called Google Takeout that allows you to export your data from its cloud. For the thoughtful cloud user who is becoming increasingly concerned about privacy, accidental data loss, or data ownership, this is a product that’s sure to please. Likewise, for the data mining enthusiast, quantified-self number cruncher, or hacker looking for a fun weekend project, Google Takeout is also a great option that enables some good fun.

In a world filled with Twitter, Facebook, and other popular social networks, it’s easy enough to dismiss mail data as mundane; however, your mailbox is without a doubt one of the places where you have probably accrued some of the most interesting data over the years. The opening paragraph of Chapter 6 from Mining the Social Web, 2nd Edition is quick to highlight the interestingness of mailbox data and some of the possibilities:

Mail archives are arguably the ultimate kind of social web data and the basis of the earliest online social networks. Mail data is ubiquitous, and each message is inherently social, involving conversations and interactions among two or more people. Furthermore, each message consists of human language data that’s inherently expressive, and is laced with structured metadata fields that anchor the human language data in particular timespans and unambiguous identities.

Although social media sites are racking up petabytes of near-real-time social data, there is still the significant drawback that social networking data is centrally managed by a service provider that gets to create the rules about exactly how you can access it and what you can and can’t do with it. Mail archives, on the other hand, are decentralized and scattered across the Web in the form of rich mailing list discussions about a litany of topics, as well as the many thousands of messages that people have tucked away in their own accounts. When you take a moment to think about it, it seems as though being able to effectively mine mail archives could be one of the most essential capabilities in your data mining toolbox.

The remainder of Chapter 6 goes on to provide a fairly standalone soup-to-nuts primer on the nature of mail data, how to munge it into a convenient mbox format (regardless of its original source), and how to use a document-oriented database like MongoDB to facilitate running analytics and extracting some meaningful insights. The text itself leverages the well-known public Enron corpus as a realistic source of open data, but the code works just as well with any other kind of mail data that can be exported (or munged) into an mbox format.

As it turns out, Google Takeout can export your entire mailbox, or any subset of it defined by the labels and other organizational options available through the standard Gmail user interface. After a couple of relatively minor enhancements, it is now easy enough to forget all about Enron, pick up right at Example 6-3, and work through the remainder of the chapter on your own mailbox data. Likewise, many popular mail clients can export in mbox format, which lets you accomplish the very same thing.
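Before committing to the full workflow, it can help to peek inside an export. Here is a minimal sketch (not code from the book) that uses the mailbox module from Python’s standard library to tally messages per label. It assumes the export records labels in an X-Gmail-Labels header as a comma-separated list, which is worth verifying against your own file; the sample messages and the label_counts helper are made up for illustration.

```python
import mailbox
import os
import tempfile
from collections import Counter

# Made-up messages standing in for a real export, which is one large .mbox file.
# ASSUMPTION: labels live in an "X-Gmail-Labels" header, separated by commas.
SAMPLE_MBOX = """\
From alice@example.com Mon Jan  6 10:00:00 2014
From: alice@example.com
To: bob@example.com
Subject: Lunch?
X-Gmail-Labels: Inbox,Friends

Are you free at noon?

From bob@example.com Mon Jan  6 10:05:00 2014
From: bob@example.com
To: alice@example.com
Subject: Re: Lunch?
X-Gmail-Labels: Sent

Sure, see you then.
"""


def label_counts(mbox_path):
    """Count how many messages carry each label (hypothetical helper)."""
    counts = Counter()
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            raw_labels = str(message.get("X-Gmail-Labels", ""))
            for label in raw_labels.split(","):
                label = label.strip()
                if label:
                    counts[label] += 1
    finally:
        box.close()
    return counts


# Write the sample to a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.mbox")
    with open(sample_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    sample_counts = label_counts(sample_path)

print(sample_counts)
```

To try it on real data, point label_counts at the path of your own .mbox file instead of the temporary sample.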

The basic flow of the examples as presented in the IPython Notebook involves the following steps:

  • Obtain an mbox-formatted export of your mail
  • Convert the mbox export into JSON
  • Load the JSONified data into MongoDB
  • Use MongoDB’s powerful aggregation framework to query and analyze the mailbox
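To make those steps a bit more concrete, here is a rough sketch of the whole flow. It is not the notebook’s code: the field names (sender, recipients, subject, date, body), the helper functions, the database and collection names, and the sample messages are all assumptions chosen for illustration. The conversion uses only the standard library, while the MongoDB step assumes the third-party pymongo package and a server on localhost, so it is wrapped in a function that the sketch defines but never calls.

```python
import json
import mailbox
import os
import tempfile
from email.utils import getaddresses, parseaddr, parsedate_to_datetime

# Made-up messages standing in for a real export.
SAMPLE_MBOX = """\
From alice@example.com Mon Jan  6 10:00:00 2014
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Subject: Lunch?
Date: Mon, 06 Jan 2014 10:00:00 +0000

Are you free at noon?

From bob@example.com Mon Jan  6 10:05:00 2014
From: Bob <bob@example.com>
To: Alice <alice@example.com>
Subject: Re: Lunch?
Date: Mon, 06 Jan 2014 10:05:00 +0000

Sure, see you then.

From alice@example.com Mon Jan  6 10:06:00 2014
From: Alice <alice@example.com>
To: Bob <bob@example.com>, carol@example.com
Subject: Re: Lunch?
Date: Mon, 06 Jan 2014 10:06:00 +0000

Great. Carol, want to join?
"""


def plain_text_body(message):
    """Return the first text/plain part of a message, or an empty string."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                try:
                    return payload.decode(charset, errors="replace").strip()
                except LookupError:  # unknown charset name in the header
                    return payload.decode("utf-8", errors="replace").strip()
    return ""


def message_to_doc(message):
    """Step 2: flatten one message into a JSON-friendly dict."""
    try:
        sent = parsedate_to_datetime(str(message.get("Date", ""))).isoformat()
    except (TypeError, ValueError):  # missing or malformed Date header
        sent = None
    recipients = getaddresses([str(message.get("To", ""))])
    return {
        "sender": parseaddr(str(message.get("From", "")))[1].lower(),
        "recipients": [addr.lower() for _, addr in recipients if addr],
        "subject": str(message.get("Subject", "")),
        "date": sent,
        "body": plain_text_body(message),
    }


def mbox_to_docs(mbox_path):
    """Steps 1 and 2: read an mbox file and convert every message."""
    box = mailbox.mbox(mbox_path)
    try:
        return [message_to_doc(message) for message in box]
    finally:
        box.close()


def top_senders_pipeline(limit=5):
    """Step 4: an aggregation pipeline that ranks senders by message count."""
    return [
        {"$group": {"_id": "$sender", "count": {"$sum": 1}}},
        {"$sort": {"count": -1, "_id": 1}},
        {"$limit": limit},
    ]


def load_and_query(docs, uri="mongodb://localhost:27017"):
    """Steps 3 and 4 against a live server (requires pymongo; not called here)."""
    from pymongo import MongoClient
    collection = MongoClient(uri)["mail_demo"]["mbox"]  # hypothetical names
    if docs:
        collection.insert_many(docs)
    return list(collection.aggregate(top_senders_pipeline()))


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.mbox")
    with open(sample_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    docs = mbox_to_docs(sample_path)

as_json = json.dumps(docs, indent=2)
print(as_json)
```

Dates are kept as ISO 8601 strings here so that the documents survive a round trip through JSON; if you want date-range queries in MongoDB, converting them to native datetime values before inserting is probably the better choice.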

As is the case with all other chapters from Mining the Social Web, all of the source code examples for Chapter 6 are available online in a convenient IPython Notebook format and easy enough to follow along with even if you don’t have a copy of the text. Furthermore, the turn-key virtual machine that’s provided takes care of the initial installation/configuration pains of IPython Notebook, MongoDB, and some of the other dependencies so that you can get right to the good stuff!

If you haven’t yet installed the virtual machine, this quick start guide that features a step-by-step video may be of great help, and as always, I’m just a tweet, Facebook message, GitHub ticket, or email away if you need any assistance along the way.

Enjoy.
