Rebooting Mining the Social Web for a Rapidly Changing World
— New Co-Author: Mikhail Klassen —
Over the last two years, Matthew and I have been overhauling Mining the Social Web, preparing to release this technical manual in its third edition. I was brought on to help with the project, which ended up taking some interesting turns.
The project started (in a way) at PyCon 2016 in Portland, Oregon. It was late May and my first time in Oregon. I had flown down from Calgary, Alberta, where I was living at the time. A few months earlier I had defended a PhD in astrophysics and I was pretty burned out.
While there was still work remaining to get the thesis manuscript into its final form, I was ready to think about other projects. My interests had begun to shift away from simulating star formation and towards machine learning. The term “data scientist” was still very new, but it held great appeal for me. And having done mostly data analysis and writing for the last two years of my PhD, this seemed like a natural fit.
PyCon is this beautiful annual confluence of geeks who love Python. I felt at home. It was there that I met an editor from O’Reilly Media who thought a “data scientist” with my scientific background could be a good fit for this project. She invited me to follow up after the conference.
The danger with writing technical books is that technology moves faster than the publishing cycle. The proposed project was to join Matthew Russell in overhauling Mining the Social Web for the 3rd Edition. I was told this would mostly involve modernizing the code to Python 3, and testing everything to make sure the code still ran. That didn’t sound too difficult and I had done some data mining work with Twitter before, so I agreed.
Security and Scandal
As you may recall, many things happened in 2016. Some of them involved social media.
Over the last two years, in the wake of the Cambridge Analytica scandal and some major data breaches, social media has come under much more scrutiny, and all the major platforms have gone on the defensive.
APIs were changed. Access to data was severely curtailed. Certain privileges required approval from the platform’s developers. Mining the Social Web was full of examples designed to teach data mining techniques and provide the reader with tools for building interesting applications. Suddenly a lot of the code no longer worked.
As an author, I also had to consider some moral questions around data mining. Was it ethical to be teaching others how to programmatically pull and sift data from Facebook, Instagram, Twitter, and elsewhere?
As I wrote in the Preface to the 3rd Edition, there are many positive uses for data mining, even when the data comes from social media. There are many examples of data mining and data analysis being used for social good (see, for example, the DSSG Fellowship). I also wanted people to understand just how much metadata is attached to the things they post online, especially on public platforms like Twitter and Instagram. This metadata is mostly invisible to the user logged into these apps, but accessible over the API.
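To make the point about invisible metadata concrete, here is a minimal Python sketch (not taken from the book) that lists every field present in a post payload. The `SAMPLE_POST` dictionary is hand-written for illustration only; its field names are assumptions loosely modeled on the kind of JSON a social media API returns, and `metadata_paths` is a hypothetical helper name.

```python
# Hypothetical, hand-written payload: NOT real data and not an exact API schema.
SAMPLE_POST = {
    "text": "Great coffee this morning!",
    "created_at": "Mon May 30 16:05:00 +0000 2016",
    "lang": "en",
    "source": "Example Mobile App",
    "coordinates": {"type": "Point", "coordinates": [-122.68, 45.52]},
    "user": {
        "screen_name": "example_user",
        "location": "Portland, OR",
        "followers_count": 120,
    },
}


def metadata_paths(obj, prefix=""):
    """Return dotted paths to every leaf value in a nested JSON-like object."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.extend(metadata_paths(value, path))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            paths.extend(metadata_paths(value, f"{prefix}[{index}]"))
    else:
        # A scalar (string, number, bool, None): record where it lives.
        paths.append(prefix)
    return paths


if __name__ == "__main__":
    # The user typed only "text"; everything else travels along with it.
    for path in metadata_paths(SAMPLE_POST):
        print(path)
```

Only the `text` field corresponds to what the person actually typed; the remaining paths are the sort of accompanying metadata the paragraph above describes.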
And so over the course of many months, I wrote new code examples, rewrote some of the old ones, updated the API calls, updated the manuscript, and modernized the Python code.
From IG to AI
Then Matthew and I realized that the book really needed a chapter on Instagram. Since the 2nd Edition, Instagram had exploded in popularity. There are currently about 1 billion monthly active users on the platform and the book did not have a chapter on it. This needed to change.
Instagram is different from the other platforms we covered because Instagram is a visual platform. Mining text or metadata is one thing, but analyzing images requires computer vision. I introduced basic artificial neural networks in the chapter, but we were not about to roll our own deep convolutional network and train it on ImageNet. That’s a topic for a whole other book. Instead, we made use of some free Google Vision APIs and wrote code to have them “look” at Instagram photos and describe what they contained.
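As a rough illustration of what asking a vision service to describe a photo can look like, here is a short Python sketch. It is not the code from the book. The request and response shapes follow my understanding of the Google Cloud Vision REST `images:annotate` interface and should be checked against the current documentation; the helper names `build_label_request` and `extract_labels` and the `min_score` threshold are assumptions made for this example.

```python
import base64

# Assumed endpoint for the Cloud Vision REST API; sending the request
# (with an API key and an HTTP client) is deliberately left out of this sketch.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes, max_results=5):
    """Build a JSON-serializable body asking for label detection on one image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [
                    {"type": "LABEL_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }


def extract_labels(response_json, min_score=0.5):
    """Pull (description, score) pairs out of a label-detection response."""
    labels = []
    for result in response_json.get("responses", []):
        for annotation in result.get("labelAnnotations", []):
            if annotation.get("score", 0.0) >= min_score:
                labels.append((annotation["description"], annotation["score"]))
    return labels


if __name__ == "__main__":
    # A made-up response, used here only to exercise the parser.
    fake_response = {
        "responses": [
            {
                "labelAnnotations": [
                    {"description": "dog", "score": 0.97},
                    {"description": "grass", "score": 0.41},
                ]
            }
        ]
    }
    print(extract_labels(fake_response))
```

Keeping the payload builder and the response parser separate from the network call makes both easy to test without credentials.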
As the finishing touches were being put on the book, another announcement was made that made all of us groan. Google was going to be sunsetting Google+. Mining the Social Web was going to look immediately dated if we had an entire chapter devoted to a social network that was about to disappear. So Matthew heroically rewrote the chapter, keeping many of the great examples around mining text data, which are universal, and making sure that our book would have a better shelf life.
So while the publishing date was pushed back several times, we’re proud of how far the book has come. Mining the Social Web has undergone a thorough refresh and we plan to continue supporting the community through bug fixes and updates to the GitHub repository.