Recollect Engineering

Visualizing the Twitter social graph, Part 1: Getting the data

This series is a technical writeup of how we created visualizations of Twitter social graphs like this:

@bertrandom's Twitter Social Graph visualized on Recollect

Explore on Recollect

I will attempt to cover the entire process, from prototyping to release. This first part covers retrieving the initial data needed to build a prototype.

Let’s get started

Technically, the first thing to do when designing a visualization is to ask yourself, “What questions am I trying to answer about this data?” Asking yourself this question will shape everything you do on the project, so it’s important that you take it seriously - otherwise you’re just creating visualizations for the sake of creating visualizations. That process is outside the scope of this article, but I highly recommend you check out Visualize This by Nathan Yau and Mining the Social Web by Matthew A. Russell. We’ll assume you have already decided what you’re building - a graph of your Twitter network.

First, I created an application with Twitter. This is a good idea even if you already have an application key with them, because you can dedicate all of the new application’s REST API calls to retrieving the data you need for your prototype. It’s also a good way to secure the name for your project/website/startup, as Twitter application names must be unique.

Next, I defined what data I wanted to get. In this case, I wanted a list of all my friends and followers and I wanted lists of all their friends and followers. In my opinion, it’s better to start off ambitious about what data you want and slowly refine that to what data you actually need.

Perusing the Twitter API docs, we come across these two API methods:

GET friends/ids

GET followers/ids

These API methods return user IDs of the friends and followers of a specific user. Perfect, right?

The next step is to actually test what we get back from these API methods. One of the great things about the Twitter API is that Twitter has one of the best API consoles I’ve ever seen; it’s hidden in the Twitter for Mac client (check out the Developer tab in Preferences):

Developer Console

First, I use GET users/show to figure out my Twitter user ID from my screen_name. Then I pass that into GET friends/ids and take a look at the response:

{ 
    "previous_cursor": 0, 
    "next_cursor_str": "0", 
    "ids": [ 
        20015311, 
        35097545, 
        30923, 
        .
        .
        .
        14572071, 
        755936, 
        1058011 
    ], 
    "previous_cursor_str": "0", 
    "next_cursor": 0 
}

I do the same for GET followers/ids and we’re already half-way to our goal - we now have a list of all my friends and followers.
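
By the way, if you’d rather script that first screen_name-to-user-ID step instead of clicking through the console, it only takes a few lines of PHP. This is a minimal sketch against the v1 REST endpoint of the era, unauthenticated and with error handling omitted:

// Resolve a screen_name to a user ID via GET users/show
$screen_name = 'bertrandom';
$url = 'https://api.twitter.com/1/users/show.json?screen_name=' . urlencode($screen_name);

$user = json_decode(file_get_contents($url));
$twitter_user_id = $user->id; // 6588972 in my case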

It’s worth mentioning that, as of this writing, I only have 207 friends and 167 followers on Twitter. Astute readers will note that these numbers are considerably smaller than the maximum that either of these API calls can return in a single response - let’s talk about what we would need to do if we had a much larger number of friends and/or followers.

If we re-read the API documentation for GET followers/ids, we will see that the maximum number of user IDs returned by a single call is 5000. This means that if we have more than 5000 followers, we will need to execute multiple API calls to get them all. Let’s see how this works in pseudo-code:

  1. Call GET followers/ids with our user_id and no additional params.
  2. Store all the user_ids somewhere.
  3. Check the response; if next_cursor is greater than 0, call GET followers/ids again with our user_id and the cursor set to next_cursor.
  4. Store all the user_ids somewhere.
  5. Keep doing this until next_cursor is 0.

In practice, my actual PHP code does not look that different:

$twitter_user_id = 6588972;
$cursor = -1; // -1 requests the first page of results
$followers = array();

do {

    try {

        $ret = $twitter->getFollowers(array(
            'user_id' => $twitter_user_id,
            'cursor' => $cursor,
        ));

    } catch (TwitterException $e) {

        return $this->handleTwitterException($e); 

    }

    if ($ret['success']) {

        foreach ($ret['data']->ids as $id) {
            $followers[] = $id;
        }

        $cursor = $ret['data']->next_cursor_str; // "0" once we've reached the last page

    }

} while (!$ret['success'] || $cursor > 0);

In this case $twitter is my homegrown Twitter class that talks to the Twitter API, but even without seeing it, you can tell there are two weird things going on here. First, we’re potentially throwing exceptions as a result of fetching followers; second, the same API call can get repeated over and over again. These reflect two important caveats of the Twitter API that are being abstracted here.

Twitter Caveats

Believe it or not, the Twitter API will not always return you a successful response. Sometimes it will tell you that the data you’re requesting does not exist (HTTP code 404), sometimes you don’t have permission to receive it (HTTP code 401 or 403), and sometimes it’s just not in the mood to give you what you want (HTTP code 503). Every time you get a response from Twitter, you must check the HTTP code and handle the situation appropriately. In the case of my Twitter class, I consider codes 401, 403, and 404 to be fatal exceptions, but I consider code 503 to be part of the cost of doing business with Twitter. I also set a threshold for the number of times we can fail to fetch content, so we don’t get stuck in an endless loop of fail whale - when the failure count exceeds that threshold, the client throws an exception.
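
My actual client code isn’t shown here, but a minimal sketch of that policy might look like this ($url is the request in question, and http_get is a hypothetical helper that returns the HTTP status code of the response):

$max_failures = 5; // how many 503s we tolerate before giving up
$failures = 0;

do {

    $response = http_get($url); // hypothetical helper: array('code' => ..., 'body' => ...)

    if (in_array($response['code'], array(401, 403, 404))) {
        // Fatal - the data doesn't exist or we're not allowed to see it
        throw new TwitterException('Fatal HTTP code ' . $response['code']);
    }

    if ($response['code'] == 503) {
        // The cost of doing business with Twitter - back off and retry
        $failures++;
        if ($failures >= $max_failures) {
            throw new TwitterException('Too many 503s, giving up');
        }
        sleep(5);
    }

} while ($response['code'] != 200);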

The other thing to consider about the Twitter REST API is rate-limiting. Here is their documentation about rate-limiting, but a quick summary: un-authenticated API calls are limited to 150 requests per hour, per IP address, while OAuth calls get 350 requests per hour, per access token. So when you move a prototype to production, you should never make un-authenticated API calls - your application simply won’t scale, unless you have an amazing pool of free public IP addresses. For the prototype, though, you can actually combine these two to get 500 API calls per hour. In terms of your code, this means you should never let your client hit the Twitter API after you’ve been rate-limited. That is an abuse of Twitter’s API and goodwill - you must track your usage on your side. Fortunately, Twitter makes this easy: they return headers that tell you exactly how many API calls you have left, so it’s simply a matter of monitoring these headers after each call and updating the count on your side. The ones of particular importance to you are:

X-FeatureRateLimit-Remaining - This tells you how many API calls you have remaining before you get rate-limited.

X-FeatureRateLimit-Reset - This tells you when the rate limit will be reset.

The way I do it is to set a threshold at which I rate-limit myself. For example, let’s say the threshold is 2 API calls remaining. After every call, I check X-FeatureRateLimit-Remaining against my threshold, and if it is at or below the threshold, I have the program sleep until the time specified in X-FeatureRateLimit-Reset. This ensures that we never send Twitter a request after we’ve been rate-limited.
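
As a sketch, assuming a hypothetical get_response_header() helper that reads headers off the last response (the reset header carries a Unix timestamp):

$threshold = 2; // self-imposed limit: stop with this many calls to spare

$remaining = (int) get_response_header('X-FeatureRateLimit-Remaining');
$reset_at = (int) get_response_header('X-FeatureRateLimit-Reset');

if ($remaining <= $threshold) {
    $sleep_for = $reset_at - time();
    if ($sleep_for > 0) {
        sleep($sleep_for); // wait out the window rather than hammering Twitter
    }
}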

One last thing to mention about rate-limiting before we move on - if there is an equivalent way to get the data you need through the Streaming API, you should use it, as the Streaming API has practically no limits. Also, if you’re using User Streams and you’re planning on moving from prototype to production, you should take a look at the Site Streams documentation. User Streams are fine for prototyping, but in production you must move to Site Streams, which requires applying for beta access.

What’s the conclusion that we should draw from these two caveats? Either ensure the Twitter library you’re using handles these issues, or write the client yourself.

Grabbing the data

Now that we understand how the Twitter API works and how to properly talk to it, we can get back to the problem at hand - how do we get all the data? Assuming that we’re starting with my user (207 friends, 167 followers), let’s do some quick napkin math. We have to loop over approximately 400 users and fetch friends and followers for each of them. Most of those users will have fewer than 5000 friends and followers each, but let’s assume a handful will be Internet celebrities with lots of followers. Let’s say 5% will have more than 5000 followers but fewer than 25000 followers. If we assume the worst-case scenario:

400 * 0.95 * 2 = 760
400 * 0.05 * (5 + 1) = 120
760 + 120 = 880
880 / 350 = 2.5

What does this mean? Well, for 95% of the users, we have to make two API calls each, one to fetch followers and one to fetch friends. For 5% of the users, we have to make 6 API calls each, 5 to fetch up to 25000 followers, and one to fetch friends. This totals 880 API calls. We can only make 350 OAuth calls an hour, so it will take us 2.5 hours to fetch all this data.

This napkin math is all well and good, but we have to consider what I like to call the Barack Obama issue.

When I worked at Flickr and was writing code that had to scale, I would often sanity-check my code by asking: what happens when Barack Obama gets involved? For many tasks, I would test my code against one specific photograph:

Situation Room Photo

http://www.flickr.com/photos/whitehouse/5680724572/

The Situation Room photo has over 2.6 million views on Flickr and over 6,000 favorites, well out of the league of most Flickr photos. If you were writing code that had to paginate those favorites, or loop over them checking for something, you could test it against that photo and it would tell you how well your algorithm would hold up in the worst-case scenario.

In the case of Twitter, I use Barack Obama himself @BarackObama (680,337 friends, 13,712,222 followers).

Following @BarackObama is not an edge case - in fact, it’s perfectly reasonable to assume that I might. So your code has to deal with this, but perhaps not in the way that you think.

In order to fetch the roughly 14,400,000 Twitter user IDs that make up his friends and followers, we would have to execute nearly 2900 API calls, which would take over 8 hours. Yeah - that’s not going to happen. The obvious thing to do is to exclude @BarackObama from the social graph. How useful could his network be anyway, if he follows that many people or that many people follow him? But here lies the question - how do we programmatically know to exclude him?

GET users/show will return the extended information of a user, including their friend and follower counts, but GET users/lookup will return the same extended information for up to 100 users per call.

Thus, if we have 400 users, we will only need to make 4 additional API calls to determine which users not to fetch friend and follower IDs for. So our rough estimate is 884 calls and it will take approximately 2.5 hours to fetch the data.

The code for this should be pretty obvious, but let’s say that you have three functions:

get_followers($twitter_user_id)
get_friends($twitter_user_id)
lookup_users($user_ids)
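
The first two are just the cursoring loop from earlier wrapped in a function. The third might be sketched like this, assuming the homegrown $twitter client exposes a lookupUsers() wrapper (hypothetical) around GET users/lookup:

function lookup_users($user_ids) {
    global $twitter;

    // GET users/lookup accepts a comma-separated list of up to 100 user IDs
    $ret = $twitter->lookupUsers(array(
        'user_id' => implode(',', $user_ids),
    ));

    return $ret['success'] ? $ret['data'] : array();
}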

Here’s some PHP that would bring it all together:

<?php

    $twitter_user_id = 6588972;

    $followers = get_followers($twitter_user_id);
    $friends = get_friends($twitter_user_id);

    file_put_contents($twitter_user_id . '_followers.csv', implode(',', $followers));
    file_put_contents($twitter_user_id . '_friends.csv', implode(',', $friends));

    // There's probably a better way to do this, but this abuses PHP's hash table abilities and is relatively fast
    $users = array();

    foreach ($followers as $user_id) {
        $users[$user_id] = 1;
    }

    foreach ($friends as $user_id) {
        if (!isset($users[$user_id])) {
            $users[$user_id] = 1;
        }
    }

    $user_ids = array_keys($users);

    if (count($user_ids) > 0) {

        do {

            $set = array_splice($user_ids, 0, 100);
            $extended_users = lookup_users($set);

            foreach ($extended_users as $extended_user) {

                if ($extended_user->followers_count > 25000 || $extended_user->friends_count > 25000) {
                    unset($users[$extended_user->id]);
                }

            }

        } while (count($user_ids) > 0);

        foreach ($users as $user_id => $nothing) {

            $followers = get_followers($user_id);
            $friends = get_friends($user_id);

            file_put_contents($user_id . '_followers.csv', implode(',', $followers));
            file_put_contents($user_id . '_friends.csv', implode(',', $friends));

        }

    }

?>

Now that we have all the data in CSV files, we must process the data so that it will be accepted by our graphing program. We will cover that in the next part of this series, Visualizing the Twitter social graph, Part 2: Processing the data.


Bertrand Fan, co-founder
