Tumblr Engineering

Golang and The Tumblr API

You’ve been asking for an official Golang wrapper for the Tumblr API. The wait is over! We are thrilled to unveil two new repositories on our GitHub page which can be the gateway to the Tumblr API in your Go project.

Why Two Repos

We’ve tried to make the wrapper as flexible as possible. The meat of the library lives in one repo: it contains the code for creating requests and parsing responses, and it talks to the network only through an interface that defines methods for making basic REST requests.

The second repo is an implementation of that interface with external dependencies used to sign requests using OAuth. If you do not wish to include these dependencies, you may write your own implementation of the ClientInterface and have the wrapper library use that client instead.
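To make the split concrete, here is a minimal sketch of the idea. It is not the wrapper’s actual API: the `RESTClient` interface, the `GetBlogInfo` helper, the endpoint path, and the flattened JSON payload are all assumptions made up for illustration. The point is only that the request/parsing code depends on an interface, so any transport (signed with OAuth or not) can be swapped in.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RESTClient is a hypothetical stand-in for the wrapper's ClientInterface.
// The real interface may expose different methods and signatures.
type RESTClient interface {
	Get(endpoint string) ([]byte, error)
}

// BlogInfo holds the two fields this sketch reads (simplified payload).
type BlogInfo struct {
	Name  string `json:"name"`
	Title string `json:"title"`
}

// GetBlogInfo builds the request, delegates transport to the client,
// and parses the response. It never touches HTTP or OAuth directly.
func GetBlogInfo(c RESTClient, blog string) (*BlogInfo, error) {
	body, err := c.Get("/blog/" + blog + "/info") // hypothetical path
	if err != nil {
		return nil, err
	}
	var info BlogInfo
	if err := json.Unmarshal(body, &info); err != nil {
		return nil, err
	}
	return &info, nil
}

// stubClient is a dependency-free implementation that returns canned JSON.
type stubClient struct{ body string }

func (s stubClient) Get(endpoint string) ([]byte, error) {
	return []byte(s.body), nil
}

func main() {
	c := stubClient{body: `{"name":"engineering","title":"Tumblr Engineering"}`}
	info, err := GetBlogInfo(c, "engineering")
	if err != nil {
		panic(err)
	}
	fmt.Println(info.Name, info.Title)
}
```

A stub like this is also handy in unit tests, since no network access or credentials are needed.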

Handling Dynamic Response Types

Go is a statically typed language, and that extends to the data structures you unmarshal JSON responses into. The library could have surfaced response data as a map[string]interface{}, but that would require the engineer to type-assert each value into an int, a string, another map[string]interface{}, and so on. The API Team decided to make it more convenient for you by providing typed response values from the various endpoints.

If you have used the Tumblr API, you’ll know that our Post object varies widely in which properties and types are returned, depending on the post type. This proved to be a challenge in codifying the response data. In Go, you’d hope to simply be able to define a dashboard response as an array of posts:

type Dashboard struct {
  // ... other properties
  Posts []Post `json:"posts"`
}

However, this would mean we’d need a general Post struct type with the union of all possible properties on a Post across all post types. Further complicating this approach, we found that some properties with the same name have different types across post types. The highest-profile example: an Audio post’s player property is a string of HTML, while a Video post’s player property is an array of embed strings. Of course we could type any property with such conflicts as interface{}, but then we’re back to the same problem as before, where the engineer has to type-assert values to use them effectively.

Doing Work So You Don’t Have To

Instead, we decided any array of posts could in fact be represented as an array of PostInterfaces. When decoding a response, we scan through each post in the response, create a correspondingly typed instance for it, and return those instances as an array of PostInterfaces. Then, when unmarshalling the JSON into the array, the data fills in the proper places with the proper types. The end user can then interact with the array of PostInterface instances by accessing universal properties (those that exist on any post type) with ease. If they wish to use a type-specific property, they can type-assert an instance to a specific post type once, and use all the typed properties afterward.
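The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library’s code: `BasePost`, `AudioPost`, `VideoPost`, and `decodePosts` are hypothetical names, the post structs are cut down to a couple of fields, and the video player is modeled as a plain array of strings. Each element is first kept as raw JSON, its type field is peeked at, and the same bytes are then unmarshalled into the matching struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PostInterface is a hypothetical, minimal version of the shared interface.
type PostInterface interface {
	GetType() string
}

// BasePost carries the properties every post type has (only Type here).
type BasePost struct {
	Type string `json:"type"`
}

func (b BasePost) GetType() string { return b.Type }

// AudioPost's player is a single string of HTML.
type AudioPost struct {
	BasePost
	Player string `json:"player"`
}

// VideoPost's player is an array of embeds (simplified to strings).
type VideoPost struct {
	BasePost
	Player []string `json:"player"`
}

// decodePosts peeks at each post's type, then unmarshals the same raw
// JSON into the concrete struct for that type.
func decodePosts(data []byte) ([]PostInterface, error) {
	var raws []json.RawMessage
	if err := json.Unmarshal(data, &raws); err != nil {
		return nil, err
	}
	posts := make([]PostInterface, 0, len(raws))
	for _, raw := range raws {
		var head BasePost
		if err := json.Unmarshal(raw, &head); err != nil {
			return nil, err
		}
		var p PostInterface
		switch head.Type {
		case "audio":
			p = &AudioPost{}
		case "video":
			p = &VideoPost{}
		default:
			p = &BasePost{} // unknown types keep only universal fields
		}
		if err := json.Unmarshal(raw, p); err != nil {
			return nil, err
		}
		posts = append(posts, p)
	}
	return posts, nil
}

func main() {
	data := []byte(`[
		{"type":"audio","player":"<iframe></iframe>"},
		{"type":"video","player":["<embed-small>","<embed-large>"]}
	]`)
	posts, err := decodePosts(data)
	if err != nil {
		panic(err)
	}
	for _, p := range posts {
		switch v := p.(type) { // one type switch, then typed access
		case *AudioPost:
			fmt.Println(v.GetType(), "player HTML:", v.Player)
		case *VideoPost:
			fmt.Println(v.GetType(), "embed count:", len(v.Player))
		}
	}
}
```

Once a post has been asserted to its concrete type, its typed fields are available without any further conversion.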

This can be especially convenient when paired with Go’s HTML templating system:

snippet.go

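// previously, we have some `var response http.ResponseWriter`
client := tumblrclient.NewClientWithToken(
    // ... auth data
)

// Parse the named blocks ("text", "photo", ...) defined in post.tmpl.
if t, err := template.New("posts").ParseFiles("post.tmpl"); err == nil {
    if dash, err := client.GetDashboard(); err == nil {
        for _, p := range dash.Posts {
            // Render the block whose name matches the post's type.
            if err := t.ExecuteTemplate(response, p.GetSelf().Type, p.GetSelf()); err != nil {
                log.Println(err) // requires the standard "log" package
            }
        }
    }
}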

post.tmpl

{{define "text"}}
<div>
    {{.Body | html}}
</div>
{{end}}
{{define "photo"}}
<div>
    Post: {{.Type}}
</div>
{{end}}
{{define "video"}}
<div>
    Post: {{.Type}}
</div>
{{end}}
{{define "audio"}}
<div>
    Post: {{.Type}}
</div>
{{end}}
{{define "quote"}}
<div>
    Post: {{.Type}}
</div>
{{end}}
{{define "chat"}}
<div>
    Post: {{.Type}}
</div>
{{end}}
{{define "answer"}}
<div>
    Post: {{.Type}}
</div>
{{end}}
{{define "link"}}
<div>
    Post: {{.Type}}
</div>
{{end}}

This is a rudimentary example, but the convenience and utility are fairly evident. You can define blocks to be rendered, named by the post’s type value. Those blocks can then assume the object in their named scope is a specific post struct and access the typed values directly.

Wrapping Up

This is a v1.0 release, and our goal was to ship a limited-scope but flexible utility for developers to use. We plan on implementing plenty of new features and improvements in the future, and on making sure that improvements to the API are brought into the wrapper. Hope you enjoy using it!

golang api

Command Line Tumblr

A Totally New Interface for Tumblr?

Today, Tumblr is accessible via mobile, web, or API, but what if you’re a Linux enthusiast? Nerds like you can now access Tumblr entirely from the command line.

“What about images?” you ask. Displaying an image on the command line is nothing new. A bunch of existing libraries already do this, namely aalib, libcaca, and the super-low-level ncurses. The most interesting project built on those, a p2p video chat, came out of a hackathon.

I picked a much higher-level library called blessed to get the best-looking interface with the least effort. As you may have seen, blessed is JavaScript-based and very fancy. It provides almost every widget you might need to build an awesome dashboard.

With the right library figured out, most of the work is already done. To show Tumblr on the command line, we just need to:

Connect to the API to fetch image URLs.
Do some front-end design to show a Tumblrish dashboard.

What? You still need code?…

var post = blessed.box({
    parent: dashboard,
    top: '15%',
    left: 'center',
    width: '40%',
    height: '80%',
    draggable: true,
    border: {
        type: 'line'
    },
    style: {
        fg: 'white',
        bg: 'white',
        border: {
            fg: '#f0f0f0'
        }
    },
});

var load_post = function() {
    if (index < 0 || index >= posts.length)
        return;

    post.free();
    var post_data = posts[index];
    /** avatar (the mock data spells the field `avator`) */
    blessed.ANSIImage({
        parent: post,
        top: 0,
        left: '-30%',
        width: '20%',
        height: '20%',
        file: post_data.avator,
    });

    /** posts */
    var count = post_data.count;
    // TODO: switch all sizes
    for (var i = 0; i < count; i++) {
        var offset = 100/count * i;
        var width = 100/count;
        blessed.ANSIImage({
            parent: post,
            left: offset + '%',
            width: width + '%',
            height: '98%',
            file: post_data.data[i]
        });
    }

    screen.render();
}

Blessed already provides lots of high-level APIs. For example, to display a post as an image, all you need is an image URL (or a local file) passed to:

blessed.ANSIImage({
    ...
    file: image_url  // or a local file path
})

It supports PNG and GIF, and if you’d like to show a video, blessed provides video as well. Hypothetically speaking, we could use this library to build almost every component of today’s Tumblr dashboard. Note that this demo is not connected to the real API, but I suppose that would be pretty easy. There’s also a memory optimization issue that might need to be addressed if we really want to use this library for something.

command line

PHP 7 at Tumblr

At Tumblr, we’re always looking for new ways to improve the performance of the site. This means things like adding caching to heavily used codepaths, testing out new CDN configurations, or upgrading underlying software.

Recently, in a cross-team effort, we upgraded our full web server fleet from PHP 5 to PHP 7. The whole upgrade was a fun project with some very cool results, so we wanted to share it with you.

Timeline

It all started as a hackday project in the fall of 2015. @oli and @trav got Tumblr running on one of the PHP 7 release candidates. At this point in time, quite a few PHP extensions did not have support for version 7 yet, but there were unofficial forks floating around with (very) experimental support. Nevertheless, it actually ran!

This spring, things were starting to get more stable, and we decided it was time to look more closely into upgrading. One of the first things we did was package the new version up so that installation would be easy and consistent. In parallel, we ported our in-house PHP extensions to the new version so everything would be ready and available from the get-go.

A small script was written that would upgrade (or downgrade) a developer’s server. Then, during the late spring and the summer, tests were run (more on this below), PHP package builds were iterated on, and performance was measured and evaluated. As things stabilized, we started roping in more developers to do their day-to-day work on PHP 7-enabled machines.

Finally, at the end of August, we felt confident in our testing and rolled PHP 7 out to a small percentage of our production servers. Two weeks later, after incrementally ramping up, every server responding to user requests was updated!

Testing

When doing upgrades like this it’s of course very important to test everything to make sure that the code behaves in the same way, and we had a couple of approaches to this.

Phan, a static analyzer for PHP. In this project, we used it to find code in our codebase that would be incompatible with PHP 7. It made it very easy to find the low-hanging fruit and fix those issues.

We also have a suite of unit and integration tests that helped a lot in identifying what wasn’t working the way it used to. And since normal development continued alongside this project, we needed to make sure no new code was added that wasn’t PHP 7-proof, so we set up our CI tasks to run all tests on both PHP 5 and PHP 7.

Results

So at the end of this rollout, what were the final results? Well, two things stand out as big improvements for us: performance and language features.

Performance

When we rolled PHP 7 out to the first batch of servers, we obviously kept a very close eye on the various graphs we have to make sure things were running smoothly. We were looking for performance improvements, but the real-world result was striking. Almost immediately, we saw the latency drop by half and the CPU load on the servers decrease by at least 50%, often more. Not only were our servers serving pages twice as fast, they were doing it using half the amount of CPU resources.

These are graphs from one of the servers that handle our API. As you can see, the latency dropped to less than half, and the load average at peak is now lower than its previous lowest point!

Language features

PHP 7 also brings a lot of fun new features that can make the life of the developers at Tumblr a bit easier. Some highlights are:

Scalar type hints: PHP has historically been fairly poor at type safety. PHP 7 introduces scalar type hints, which ensure that values passed around conform to specific types (string, bool, int, float, etc).
Return type declarations: Now, with PHP 7, functions can have explicit return types that the language will enforce. This reduces the need for some boilerplate code and manually checking the return values from functions.
Anonymous classes: Much like anonymous functions (closures), anonymous classes are constructed at runtime and can simulate a class, conforming to interfaces and even extending other classes. These are great for utility objects like logging classes and useful in unit tests.
Various security & performance enhancements across the board.

Summary

PHP 7 is pretty rad!

tumblr engineering php php7

Juggling Databases Between Datacenters

Recently we went through an exercise where we moved all of our database masters between data centers. We planned on doing this online with minimal user impact. Obviously when performing this sort of action there are a variety of considerations such as cache consistency and other pieces of shared state in stores like HBase, but the focus of this post will be primarily on MySQL.

During this move we had a number of constraints. As mentioned above, this was to be done online while serving production traffic, with minimal user impact. In aggregate, we service hundreds of thousands of database queries per second. Additionally, we needed to encrypt all data transferring between data centers. MySQL replication supports encryption, but connections to the servers themselves present several challenges. Specifically, from a performance standpoint, the handshake to establish a connection across a WAN can impact latency if there is significant connection churn. Additionally, servicing read queries across a backhaul link adds latency, which is never desirable.

We decided to tackle these issues in several ways. We were able to leverage a number of existing features of our applications and infrastructure, and we developed new automation to fill gaps in functionality. Our configuration and applications, in various runtimes, were able to support a read/write split (which may seem obvious to some, but isn’t always easy to accomplish in every scenario). We used the read/write split, along with encrypted replication, to provide a local read replica. Some runtimes can set up a persistent encrypted connection to a remote master, which serviced read requests in those cases, as the per-connection latency was amortized over a large number of queries. For runtimes which have a high churn rate, such as PHP, we used a MySQL proxy, ProxySQL, which provided persistent, encrypted connections, as well as meeting our performance requirements. We built automation to deploy proxies for numerous database pools, servicing thousands of requests per second, per pool.

When performing the cutover, our workflow was as follows. In each data center, there was a config which pointed to a local read slave, a remote master, and a local proxy with the master (remote or local) as a backend. When moving masters between datacenters, our database automation, Jetpants (new release coming soon!), reparented all replicas, and our automation updated the proxy backend to point to the new master. This resulted in seconds of read-only state per database pool and minimal user impact.
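As a rough illustration of that config shape, here is a small sketch. It is not our actual configuration format or automation: the `PoolConfig` and `Proxy` types, the host names, and the rule that only SELECT statements count as reads are all assumptions for the example. It shows why a cutover can leave the application-side config untouched: the app keeps pointing at the local replica and the local proxy, and only the proxy’s backend changes.

```go
package main

import (
	"fmt"
	"strings"
)

// Proxy is a hypothetical local proxy; Backend is the current master.
type Proxy struct {
	Addr    string
	Backend string
}

// PoolConfig is a hypothetical per-datacenter view of one database pool.
type PoolConfig struct {
	LocalReplica string // local read replica
	MasterProxy  *Proxy // local proxy in front of the (local or remote) master
}

// Target picks where a statement goes: reads stay local, everything
// else goes through the proxy. The SELECT check is a simplification.
func (c PoolConfig) Target(query string) string {
	fields := strings.Fields(query)
	if len(fields) > 0 && strings.EqualFold(fields[0], "SELECT") {
		return c.LocalReplica
	}
	return c.MasterProxy.Addr
}

func main() {
	proxy := &Proxy{Addr: "127.0.0.1:6033", Backend: "master.dc1.example:3306"}
	cfg := PoolConfig{LocalReplica: "replica.dc2.example:3306", MasterProxy: proxy}

	fmt.Println("read  ->", cfg.Target("SELECT * FROM posts"))
	fmt.Println("write ->", cfg.Target("UPDATE posts SET title = 'x'"))

	// Cutover: only the proxy backend changes; cfg is untouched.
	proxy.Backend = "master.dc2.example:3306"
	fmt.Println("write ->", cfg.Target("UPDATE posts SET title = 'x'"), "(backend:", proxy.Backend+")")
}
```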

More coming soon!

databases mysql proxysql jetpants datacenters

Introducing Laphs

The Core Web team at Tumblr is proud to announce the release of Laphs (Live Anywhere Photos - LAPhs; get it?), an open source JavaScript library for implementing Apple’s Live Photos on the web.

We use Laphs to support Live Photos on the web at Tumblr and now you can too! Check it out on GitHub and npm and let us know what you think.

Happy coding!

open source javascript live photos apple

Categorizing Posts on Tumblr

Millions of posts are published on Tumblr every day. Understanding the topical structure of this massive collection of data is a fundamental step to connect users with the content they love, as well as to answer important philosophical questions, such as “cats vs. dogs: who rules on social networks?”

As a first step in this direction, we recently developed a post-categorization workflow that aims at associating posts with broad-interest categories, where the list of categories is defined by Tumblr’s on-boarding topics.

Methodology

Posts are heterogeneous in form (video, images, audio, text) and consist of semi-structured data (e.g. a textual post has a title and a body, but the actual textual content is unstructured). Luckily enough, our users do a great job at summarizing the content of their posts with tags. As the distribution below shows, more than 50% of the posts are published with at least one tag.

However, tags define micro-interest segments that are too fine-grained for our goal. Hence, we editorially aggregate tags into semantically coherent topics: our on-boarding categories.

We also compute a score that represents the strength of the affiliation (tag, topic), which is based on approximate string matching and semantic relationships.

Given this input, we can compute a score for each pair (post,topic) as:

where

w(f,t) is the score (tag,topic), or zero if the pair (f,t) does not belong in the dictionary W.
tag-features(p) contains features extracted from the tags associated to the post: raw tag, “normalized” tag, n-grams.
q(f,p) is a weight [0,1] that takes into account the source of the feature (f) in the post (p).
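One way to read the definitions above is as a weighted sum over the post’s tag features. The sketch below assumes exactly that, score(p,t) = sum of q(f,p) * w(f,t) for f in tag-features(p); the aggregation itself, the `Feature` and `Dictionary` types, and the sample numbers are assumptions for illustration rather than the production implementation.

```go
package main

import "fmt"

// Feature is one element of tag-features(p), with its weight q(f,p) in [0,1].
type Feature struct {
	Text string
	Q    float64
}

// Dictionary plays the role of W: feature -> topic -> score w(f,t).
type Dictionary map[string]map[string]float64

// Score adds up q(f,p) * w(f,t) over the post's features. Pairs that
// are not in the dictionary contribute zero, because looking up a
// missing key in a Go map yields the zero value.
func Score(features []Feature, topic string, w Dictionary) float64 {
	total := 0.0
	for _, f := range features {
		total += f.Q * w[f.Text][topic]
	}
	return total
}

func main() {
	// Made-up dictionary entries and feature weights.
	w := Dictionary{
		"cats":      {"Animals": 0.5},
		"cute cats": {"Animals": 0.5},
	}
	features := []Feature{
		{Text: "cats", Q: 1.0},      // raw tag
		{Text: "cute cats", Q: 0.5}, // n-gram, weighted lower
		{Text: "monday", Q: 1.0},    // not in W, contributes nothing
	}
	fmt.Println("Animals:", Score(features, "Animals", w))
	fmt.Println("Fashion:", Score(features, "Fashion", w))
}
```

In this reading, a post whose tags are all missing from W scores zero for every topic.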

The drawback of this approach is that it relies heavily on the dictionary W, which is far from being complete.

To address this issue we exploit another source of data: RelatedTags, an index that provides a list of similar tags by exploiting co-occurrence patterns. For each pair (tag,topic) in W, we propagate the affiliation with the topic to its top related tags, smoothing the affiliation score w to reflect the fact that these entries (tag,topic) could be noisy.
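The propagation step could look something like the sketch below. It makes several assumptions that the text does not spell out: smoothing is modeled as a single constant multiplier, propagation goes one hop only, editorial entries are never overwritten, and when several tags point at the same related tag the strongest propagated score wins. `Propagate` and the sample data are hypothetical.

```go
package main

import "fmt"

// Dictionary plays the role of W: tag -> topic -> affiliation score w.
type Dictionary map[string]map[string]float64

// Propagate returns a copy of w extended with (relatedTag, topic)
// entries whose scores are the source score times the smoothing factor.
func Propagate(w Dictionary, related map[string][]string, smoothing float64) Dictionary {
	out := Dictionary{}
	// Start from a copy of the editorial entries so w is not modified.
	for tag, topics := range w {
		out[tag] = map[string]float64{}
		for topic, score := range topics {
			out[tag][topic] = score
		}
	}
	for tag, topics := range w {
		for _, rel := range related[tag] {
			for topic, score := range topics {
				if _, editorial := w[rel][topic]; editorial {
					continue // never overwrite an editorial score
				}
				if out[rel] == nil {
					out[rel] = map[string]float64{}
				}
				// Keep the strongest score if several tags point here.
				if s := score * smoothing; s > out[rel][topic] {
					out[rel][topic] = s
				}
			}
		}
	}
	return out
}

func main() {
	w := Dictionary{"cats": {"Animals": 0.5}}
	related := map[string][]string{"cats": {"kittens", "caturday"}}
	fmt.Println(Propagate(w, related, 0.5))
}
```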

This computation is followed by a filtering phase to remove entries (post,topic) with a low confidence score. Finally, the category with the highest score is associated with the post.
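The last two steps, filtering and picking the winner, can be sketched as a single pass over a post’s topic scores. The threshold value, the `BestTopic` name, and the alphabetical tie-break are assumptions for the example.

```go
package main

import "fmt"

// BestTopic drops topics scoring below minScore and returns the
// highest-scoring survivor. The boolean is false when nothing passes
// the filter, meaning the post stays uncategorized. Ties are broken
// alphabetically so the result does not depend on map iteration order.
func BestTopic(scores map[string]float64, minScore float64) (string, bool) {
	best, bestScore, found := "", 0.0, false
	for topic, s := range scores {
		if s < minScore {
			continue // low-confidence entry, filtered out
		}
		if !found || s > bestScore || (s == bestScore && topic < best) {
			best, bestScore, found = topic, s, true
		}
	}
	return best, found
}

func main() {
	// Made-up scores for a single post.
	scores := map[string]float64{"Fashion": 0.75, "Art": 0.5, "Television": 0.125}
	if topic, ok := BestTopic(scores, 0.25); ok {
		fmt.Println("category:", topic)
	} else {
		fmt.Println("no confident category")
	}
}
```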

Evaluation

This unsupervised approach to post categorization runs daily on posts created the day before. The next step is to assess the alignment between the predicted category and the most appropriate one.

The results of an editorial evaluation show that our framework is able to identify a relevant category in most cases, but they also highlight some limitations, such as limited robustness to polysemy.

We are currently looking into improving the overall performance by exploiting NLP techniques for word embedding and by integrating the extraction and analysis of visual features into the processing pipeline.

Some fun with data

What is the distribution of posts published on Tumblr? Which categories drive the most engagement? To answer these and other questions, we analyzed the categorized posts over a period of 30 days.

Almost 7% of categorized posts belong to Fashion, with Art as runner up.

The category that drives the most engagement is Television, which accounts for over 8% of the reblogs on categorized posts.

However, normalizing by the number of posts published, the category with the highest average engagement per post is Gif Art, followed by Astrology.

Last but not least, here are the stats you all have been waiting for!! Cats are winning on Tumblr… for now…

tags cats vs dogs post categorization data science

javascript

Flux and React in Data Lasso

javascript:

TL;DR
Flux helped bring the complexity of Data Lasso down, replacing a messy event bus structure. React helped make the UI more manageable and reduce code duplication. More below on our experience.


javascript javascript data lasso react flux

cocoa

cocoa:

WWDC 2016 has come and gone, but we wanted to take the time to call out the new ideas Apple unveiled that we think are important to us as developers, as well as things to make our product teams aware of for future launches.
WWDC has been slowly returning to a software- and developer-focused event over these past few years, and this year was no exception. Many new technologies, tools, and ideas were introduced for developers to plug into to enrich both their applications and the Apple ecosystem in general. So let’s get into what we saw and enjoyed.


cocoa

javascript

tumblr.js update

javascript:

We just published v1.1.0 of the tumblr.js API client. We didn’t make too much of a fuss when we released a bigger update in May, but here’s a quick run-down of the bigger updates you may have missed if you haven’t looked at the JS client in a while:
Method names on the API are named more consistently. For example, blogInfo and blogPosts and blogFollowers rather than blogInfo and posts and followers.
Customizable API baseUrl. We use this internally when we’re testing new API features during development, and it’s super convenient.
data64 support, which is handy for those times when you have a base64-encoded image just lying around and you want to post it to Tumblr.
Support for Promise objects. It’s way more convenient, if you ask me. Regular callbacks are still supported too.
Linting! We’ve been using eslint internally for a while, so we decided to go for it here too. We’re linting in addition to running mocha tests on pull requests.
Check it out on GitHub and/or npm and star it, if you feel so inclined.
tumblr.js REPL
When we were updating the API client, we were pleasantly surprised to discover a REPL in the codebase. If you don’t know, that’s basically a command-line console that you can use to make API requests and examine the responses. We dusted it off and decided to give it its own repository. It’s also on npm.
If you’re interested in exploring the Tumblr API, but don’t have a particular project in mind yet, it’s a great way to get your feet wet. Try it out!

javascript javascript tumblr api

cyle

Some Themed Posts Updates

cyle:

New feature: Your theme’s accent/link color changes your post’s like/reblog/reply/etc colors! Whoa!
Big bug fix: The colors used to theme your posts now make more sense. Your title color is your text color (used to be your link color, which doesn’t really make sense), and your background color is still your background color.
Bug fix: “Keep reading” links on reblogs are now the right color.
Bug fix: Ask/answer posts are now themed better.
Bug fix: Added a line under your “contributed content” for a reblog, so that the space there doesn’t look as weird. Maybe it still does, I dunno.
Bug fix: The “follow” text above reblogged content should now be the right color.
I’m still working on lots of bigger changes, too. More info on those when they’re ready for release. As always, feel free to message me if you have any questions or suggestions!

cyle tumblr labs themed posts

U2F with Yubikeys

During our recent hackday we wanted to explore new ways to log in to Tumblr and play with some cool toys. The following is not an announcement of any kind, other than that U2F is awesome and everyone should buy a Yubikey (they aren’t paying us to say this, we swear).

Authenticating your online identity

If you’ve ever logged into any website on the internet, chances are you’ve been through an authentication flow. You provide the site with a username you use to identify yourself on that platform, followed by a password that (in theory) only you know to prove that you are you. If all that matches what the site has in their database, you’re authenticated! However, that particular flow only represents a single factor of authentication, the “knowledge factor” (because you know your password). But even if you have a highly complex password, unique to that one site, that probably won’t be enough to really secure your account from unauthorized access. That’s why we provide the ability (and highly encourage users) to enable Two-Factor Authentication (2FA).

Traditionally, 2FA is done either via SMS or through an authenticator app (e.g. Duo, Authy, Google Authenticator). But what happens if you don’t have reception? How will you receive a text message? What if there’s an issue with the authenticator service and you don’t have a fallback? Surely there has to be a realistic and practical option beyond what the industry has been relying on that can help mitigate some of these issues.


hackday yubikey 2fa u2f
