Computing Joy

Data Longevity and Data-Centric Software

Dmitri Zagidulin — Thu, 17 Nov 2016 15:53:00 GMT

Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.

Zawinski's Law of Software Envelopment

I wonder if there's a similar law having to do with APIs, or backup/export functionality, or open standards. If there is not, there should be. Something like "Law of Software Superiority".

Rules of Software Superiority

It comes down to this. If you have a choice of which software (or web app or service) to use:

Choose the one with an API over the one that doesn't
If choosing between two APIs, pick one that uses standard protocols and data formats, or is an open standard itself
Choose one that has an Export/backup functionality
Choose one that stores its data in a standard (spec'd out or well understood) format
Choose one that is open-source over one that isn't. Hell, go wild, and choose free software (in the Stallman sense) over merely open source (and of course over proprietary)
That said, if the choice is between free software that uses an obscure/undocumented data format (that is, it's not clear how you'd get your data out) and proprietary software that has a clear backup/export strategy to an understandable format... choose the proprietary one. The safety and longevity of your data is of paramount importance.

These preferences (libre > open-source > proprietary, api > backup/export > neither, open data format > proprietary/undocumented format) are not about idealism (unless that's your thing, in which case, by all means). They are about reducing risk. Chances are incredibly high that:

If you're using an online service or a commercial app, the company will go out of business, or even more likely, will get acquired or merge with another (which will not care about product continuity).
If the company doesn't change hands, the project (or app or service) will get discontinued, or pivot to a direction you don't like.
Your account can be locked out or revoked at any time, due to over-zealous content filtering, accounting errors, etc.
Even for open-source projects, the project leadership may abandon it and move on, or again, take it in a direction that makes it unusable to you.

This is not meant to discourage you, or sound overly pessimistic. This is just a simple reality of living with software. My point is that if you consistently choose software and services that make it possible to get your data out, you can be at peace with all these eventualities.

Ok, so, the API thing... The important thing here is to be able to get the data out of the software. (Scripting, extending it and building mashups are useful too, but only secondarily.) If you can get to the data (because it lives on your computer's hard drive) but can't use it (it's in an undocumented/proprietary format), it's no good. Similarly, if it's in a simple or well-understood format (say, blog posts, or pictures or whatever) but you can't get to it (ahem, I'm looking at you, various social media sites), again, useless. At best, you will have to resort to the nightmare of screen-scraping.

Data-Centric Software

There is a type of software that, by its very nature, lends itself well to preserving the longevity of your data, and compares favorably in terms of the above rules. And that is data-centric software, where the user has control over data in a well-understood and documented format, and thus has a choice of which apps to use to work with it. Chances are you're very familiar with this sort of model, just from using popular desktop applications. Examples include text editing and word processing, spreadsheets, presentations, image editing. Personal accounting software (QuickBooks and the like) is actually almost there, although there are occasional subtle incompatibilities with formats.

The situation gets a lot more difficult (in terms of users being able to control their data, and be able to have a choice of interchangeable/competing applications) once you get into online services and applications, and mobile apps. In the world of traditional desktop software, data lives on generic storage (hard drives, etc), and most applications have equal access to it. In the mobile app world, aside from a small handful of standardized shared resources (basically, your camera roll), storage mechanisms are a lot more opaque, and each app is encouraged to use its own individual slices of storage.

And with online services and social media apps, the situation is even worse. Unless a service is progressive enough to expose a comprehensive API, your data is pretty much locked in there, and good luck using third-party services with it.

It doesn't have to be this way, though. There is nothing about web apps or mobile apps that inherently forbids them from being data-centric, or makes interoperability impossible. These days, browser applications can actually have access to your hard drive (if you let them), and cheap cloud service providers can serve the role of generic data stores (just like your computer's hard drive). Your data (whether it's documents or images or even social media type things like blog posts and status updates) could be under your control, and you could have the choice of interoperable competing software with which to edit it.

This goal of enabling interoperable data-centric web applications that use generic storage (local or cloud-based) is one of our main motivations at the Solid project (project repo | specs). (There are other goals, like making decentralized app development easier, encouraging the use of linked data, breaking the deadlock of various social media monopolies, and so on.) It's an ambitious project, and involves a lot of interesting engineering and research challenges (as does all large-scale decentralized software). I'll get into some of the technical challenges (and our solutions) in subsequent posts.

Understanding Linked Data

Dmitri Zagidulin — Mon, 26 Sep 2016 18:27:35 GMT

What does the concept of Linked Data mean to you as a developer? It means that you have datasets that have the following properties:

Globally unique IDs (since they use URIs for IDs). Also, you can almost always dereference those IDs and get more detailed useful data from them.
Globally unique, collision free, reusable property names (or column names, if you're coming from an RDBMS world).
The datasets are self-documenting and self-describing. If you dereference each of the unique property names, you get comments, context, data types, and if you're lucky, validatable schemas.

Linked Data from First Principles

The easiest way to understand the benefits and challenges of Linked Data is to start with something familiar to most developers -- data in the CSV format. Let's say we want to store some user records:

id,name,birth_date  
1,"Alice","1990-01-01"  
2,"Bob","1995-02-02"  
3,"Cindy","1999-01-01"

We have data, we have property names on the first line, but there are several challenges here. For one, although the meanings of the example property names are fairly easy to guess, anybody who's worked with CSV datasets knows that this is not always the case. It would be of immense help to be able to have some sort of explanations or comments alongside that first line, to understand what the properties are and how to process them. Along the same lines, the schema of the properties is far from clear (such as their data types and validation logic). Lastly, this dataset is not very portable, in terms of its IDs. They appear to be the usual sort of auto-incrementing integer, but it's not easy to add them to an existing dataset (say, a Users table), since those IDs could already be taken up by existing users. To put it another way, those IDs are not very collision-resistant.
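To make the collision problem concrete, here is a minimal Python sketch. The second dataset (`existing_users`) and its contents are made up for illustration; the point is only that auto-incrementing integers from two independent sources overlap almost immediately.

```python
import csv
import io

# The CSV from above, plus a hypothetical second dataset that was
# generated elsewhere with its own auto-incrementing ids.
new_users_csv = """id,name,birth_date
1,"Alice","1990-01-01"
2,"Bob","1995-02-02"
3,"Cindy","1999-01-01"
"""
existing_users = {"1": "Dave", "2": "Erin"}  # made-up example data

rows = list(csv.DictReader(io.StringIO(new_users_csv)))

# Any id present in both datasets now refers to two different people.
collisions = sorted(row["id"] for row in rows if row["id"] in existing_users)
print(collisions)  # ['1', '2']
```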

Let's put that same dataset into JSON format, to make it slightly easier for developers to understand (and use in their code).

[
  { "id": 1, "name": "Alice", "birth_date": "1990-01-01" },
  { "id": 2, "name": "Bob", "birth_date": "1995-02-02" },
  { "id": 3, "name": "Cindy", "birth_date": "1999-01-01" }
]

A little better -- we can now refer to a property from a parsed row by name (say, user.name) instead of by index (user[1]).

Now, imagine if we could give each of those users a globally unique id. Maybe each of them has their own domain name. Or failing that, an account on some service provider. Then we would have:

[
  { "id": "http://www.alice.com#me", "name": "Alice", "birth_date": "1990-01-01" },
  { "id": "http://bob.provider.com#about", "name": "Bob", "birth_date": "1995-02-02" },
  { "id": "http://cindy.provider.com#about", "name": "Cindy", "birth_date": "1999-01-01" }
]

Now the dataset becomes much more portable. We can merge it into existing datasets with no fear of id collisions. Not only that, but now we can dereference those IDs and hopefully be able to get more useful data, such as a public user profile.
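As a rough sketch of what that merge could look like in Python (the `existing` dataset and Dave's URI are hypothetical, invented here to stand in for "an existing dataset"):

```python
# Two datasets keyed by URI. The second one is made up for illustration.
incoming = [
    {"id": "http://www.alice.com#me", "name": "Alice", "birth_date": "1990-01-01"},
    {"id": "http://bob.provider.com#about", "name": "Bob", "birth_date": "1995-02-02"},
]
existing = [
    {"id": "http://dave.example.org#me", "name": "Dave", "birth_date": "1988-03-03"},
]

merged = {}
for record in existing + incoming:
    # Records sharing an id describe the same entity, so their fields
    # are combined (later values win on conflict).
    merged.setdefault(record["id"], {}).update(record)

print(len(merged))  # 3
```

No renumbering step and no id-mapping table is needed, which is exactly the portability gain described above.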

Incidentally, HTTP URIs are not the only way to have globally unique identifiers. Other schemes have been used, such as XRI. The benefit of HTTP URIs should be obvious, however -- the tooling and infrastructure and developer familiarity with those is considerable.

The property names are still a bit ambiguous though. Does name mean full name, or just the given name? To address this, we could do the same thing with property names as we did with the IDs, and just use URIs:

[
  {
    "id": "http://www.alice.com#me",
    "http://schema.org/givenName": "Alice",
    "http://schema.org/birthDate": "1990-01-01"
  },
  {
    "id": "http://bob.provider.com#about",
    "http://schema.org/givenName": "Bob",
    "http://schema.org/birthDate": "1995-02-02"
  },
  {
    "id": "http://cindy.provider.com#about",
    "http://schema.org/givenName": "Cindy",
    "http://schema.org/birthDate": "1999-01-01"
  }
]

Now, all of a sudden, we have reusable, unambiguous properties. As an added benefit, we can resolve those properties as HTTP URIs and get a human-readable comment explaining their semantics, along with the data format and validation constraints for the values (for example, the fact that birthDate is in ISO 8601 date format). And they're reusable in the sense that app developers are now encouraged to simply use http://schema.org/birthDate for the birth date property name, instead of various incompatible combinations of birthdate, birth_date, bd, and so on.

Of course, repeating the full URL for the property name for each record gets a little verbose, and not very DRY. Let's factor out the property name URIs, and put them in a lookup dictionary, in their own context section.

{
  "context": {
    "givenName": "http://schema.org/givenName",
    "birthDate": "http://schema.org/birthDate"
  },
  "data": [
    {
      "id": "http://www.alice.com#me",
      "givenName": "Alice",
      "birthDate": "1990-01-01"
    },
    {
      "id": "http://bob.provider.com#about",
      "givenName": "Bob",
      "birthDate": "1995-02-02"
    },
    {
      "id": "http://cindy.provider.com#about",
      "givenName": "Cindy",
      "birthDate": "1999-01-01"
    }
  ]
}

And now we have the best of both worlds. We have compact property names (so that we can once again refer to a parsed user's property as user.givenName instead of something horrible like user['http://schema.org/givenName']). And we still retain the benefit of globally unique unambiguous dereferenceable property names (they just get short, readable local aliases). (By the way, a thematic grouping of properties, such as http://schema.org/, is referred to as a vocabulary or ontology in the Linked Data community.)
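Here is a small Python sketch of how the lookup dictionary can be used to get back from the short aliases to the full URIs. The `expand` helper is my own illustrative name, not part of any standard or library.

```python
doc = {
    "context": {
        "givenName": "http://schema.org/givenName",
        "birthDate": "http://schema.org/birthDate",
    },
    "data": [
        {"id": "http://www.alice.com#me", "givenName": "Alice", "birthDate": "1990-01-01"},
    ],
}

def expand(record, context):
    """Replace each short alias with its full URI; leave unknown keys (like "id") alone."""
    return {context.get(key, key): value for key, value in record.items()}

expanded = [expand(record, doc["context"]) for record in doc["data"]]

print(doc["data"][0]["givenName"])                 # Alice (compact form)
print(expanded[0]["http://schema.org/givenName"])  # Alice (expanded form)
```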

Congratulations, we have just created a proper Linked Data document. And with a few tweaks (we'll use the reserved properties @id and @context, and @graph instead of data), we can turn this into full-fledged RDF based linked data, using the JSON-LD serialization format. (You can have Linked Data without using any of the RDF formats, it's just that by using them, you get access to a rich ecosystem of tools, standards, databases, validators, reasoners and deduction engines, a standardized query syntax, and so on.)

{
  "@context": {
    "givenName": "http://schema.org/givenName",
    "birthDate": "http://schema.org/birthDate"
  },
  "@graph": [
    {
      "@id": "http://www.alice.com#me",
      "givenName": "Alice",
      "birthDate": "1990-01-01"
    },
    {
      "@id": "http://bob.provider.com#about",
      "givenName": "Bob",
      "birthDate": "1995-02-02"
    },
    {
      "@id": "http://cindy.provider.com#about",
      "givenName": "Cindy",
      "birthDate": "1999-01-01"
    }
  ]
}
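To show what "RDF based" means in practice, here is a deliberately simplified Python sketch that turns a flat document like the one above into subject/predicate/object triples. It is not a real JSON-LD processor (it only handles a flat `@graph` whose properties all appear in `@context`); a real application would use an existing JSON-LD or RDF library instead of the hand-rolled `to_triples` helper.

```python
import json

jsonld_text = """
{
  "@context": {
    "givenName": "http://schema.org/givenName",
    "birthDate": "http://schema.org/birthDate"
  },
  "@graph": [
    {"@id": "http://www.alice.com#me", "givenName": "Alice", "birthDate": "1990-01-01"},
    {"@id": "http://bob.provider.com#about", "givenName": "Bob", "birthDate": "1995-02-02"}
  ]
}
"""

def to_triples(doc):
    """Yield (subject, predicate, object) for a flat document like the one above."""
    context = doc["@context"]
    for node in doc["@graph"]:
        subject = node["@id"]
        for key, value in node.items():
            if key.startswith("@"):
                continue  # skip JSON-LD keywords such as @id
            yield (subject, context[key], value)

triples = list(to_triples(json.loads(jsonld_text)))
print(len(triples))  # 4
```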

Linked Data Benefits

So what did all of that get us? The benefits of the Linked Data approach are several.

Easier data merging and schema migration. Mashups (combining data from heterogeneous sources) become much easier, due to unique IDs and unambiguous properties. The flat graph based structure of most linked data formats (composite structures are expressed via local links, instead of nested documents) also makes merging datasets and schema migration much easier.

Data is discoverable (from IDs), self-describing and self-documenting.

Reuse and interop. Unique property names (that are published on the net, well described and schema-specified) encourage interop and reuse, and help cut down on constant wheel reinventing.

Better searches. If you embed linked data in your web pages (in JSON-LD format, or embedded in HTML attributes using the RDFa format), Google will actually parse it and use it to enhance search results. See Google's Introduction to Structured Data for further discussion.

Rich toolset/ecosystem. By using an RDF based format, you get a lot of extra tooling and infrastructure for free (in addition to the existing JSON-based toolsets, for example).
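For the "Better searches" point above, embedding JSON-LD in a page means putting it inside a script element with the `application/ld+json` type. A minimal Python sketch of generating that tag follows; the `jsonld_script_tag` helper and the sample `person` record are illustrative assumptions, not taken from any particular framework.

```python
import json

def jsonld_script_tag(data):
    """Wrap a JSON-LD document in the script element used for embedding it in HTML."""
    body = json.dumps(data, indent=2)
    # Keep a literal "</script>" inside a string value from closing the tag early.
    # "\/" is a valid JSON escape for "/", so the data itself is unchanged.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"

person = {
    "@context": "http://schema.org",
    "@type": "Person",
    "@id": "http://www.alice.com#me",
    "givenName": "Alice",
    "birthDate": "1990-01-01",
}
print(jsonld_script_tag(person))
```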

Linked Data Challenges

In going from something like CSV to RDF-based linked data, you've probably picked up on a few implications and challenges to organizing your data in this fashion. Let's go over a few of those.

URIs. Giving URIs to things is not always easy.

Schema discovery. Discovering, choosing or creating schemas (vocabularies) that fit your use case is sometimes challenging. But at least with linked data, you have the option to browse and study listings/directories of such schemas (unlike with database table schemas, for example). Schema.org/schemas is a good place to start.

Availability. Linking to things on the net means you are depending on the uptime of other systems. (See also Leslie Lamport's quote, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable".) Fortunately, actually dereferencing the links of properties or IDs is not mission critical, but is more helpful during development and design phases. To put it another way, you can still use http://www.alice.com#me as a good user ID even if Alice's website happens to be down during a given day -- browsing linked data is an additional benefit and possibility, instead of a core operation.
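One way to keep dereferencing optional in code is to treat it as a best-effort lookup that never fails the caller. The sketch below assumes a `fetch` callable (any HTTP client wrapper you like) and a plain dict as a cache; both are hypothetical stand-ins.

```python
def describe(uri, fetch, cache):
    """Return extra data about `uri` if available, without ever failing.

    `fetch` is any callable that takes a URI and returns a dict (or raises
    OSError on network problems); `cache` is a plain dict of earlier results.
    """
    try:
        cache[uri] = fetch(uri)
    except OSError:
        pass  # remote site is down: fall back to whatever we already have
    return cache.get(uri, {})

def offline_fetch(uri):
    # Stand-in for a real HTTP client, simulating an unreachable host.
    raise ConnectionError("host unreachable")

user_id = "http://www.alice.com#me"
profile = describe(user_id, offline_fetch, cache={})
print(user_id, profile)  # the id is still usable; the profile is just empty
```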

Setting up Nginx with LetsEncrypt certificates

Dmitri Zagidulin — Sun, 01 May 2016 18:07:15 GMT

If you're setting up a web application (or even a testing/staging server for one), sooner or later you're going to have to bite the bullet and do it properly. Assuming that you're deploying a non-PHP type application (Ruby, Node.js, Smalltalk, Go, and the like), you will need to do the following:

Get yourself a proper SSL certificate (self-generated certs won't cut it). Hopefully I don't need to convince you that you shouldn't be running any sort of web app involving users over a plain-text HTTP connection.
Put your app behind a real front end/webserver/proxy, and not just run your app as root via sudo node app.js on port 80/443 like a hobo. This means: Nginx, Apache, or HAProxy (or, in some obscure cases, some combination of the 3).

Incidentally, you'd be surprised at how hard some developers argue and drag their feet about that second point. Sure, you can run your app as root on port 443 for a few minutes, just to test everything is working. But don't even think about leaving it deployed like that, not even on a testing server. No, not even if the app is trivial and there isn't sensitive data at stake. There is a reason that front-ending apps with Apache/Nginx/whathaveyou is an industry best practice.

Also, it's not that difficult to do, so you have no excuse. Let's walk through the procedure.

Step 0: Pre-Requisites

This post assumes that you've spun up a server and set up its domain name. Specifically:

You have a domain name (here, we'll use example.com).
You have access to a server (an inexpensive VPS instance from Digital Ocean, Scaleway or Amazon AWS works great). I'll be using Ubuntu here, but the instructions are almost identical for CentOS and others.
You've pointed your domain registrar's DNS records to your VPS host's NS servers (so, if you're using DO, your Custom DNS Server entries at the registrar will point to NS1.DIGITALOCEAN.COM, NS2. ... and so on)
You've set up the proper DNS records (A record, and a CNAME record for any subdomain) on your VPS host's Networking/DNS tab. In this example, we'll be setting up Nginx to point to test.example.com, so we at least need a * CNAME record added to support that subdomain.

Step 1: Obtain an SSL Certificate with LetsEncrypt

SSL Certificates from recognized Certificate Authorities used to be quite expensive. For example, that's how Mark Shuttleworth, of Ubuntu/space tourism fame, partly got his fortune -- by selling SSL certificates back in the day. Over the years, they have come down in price, but even now, if you want to get a Wildcard certificate (so that it covers arbitrary subdomains), you're looking at anywhere from $85 USD to $500+ per year.

Fortunately, there's also LetsEncrypt.org. LetsEncrypt is a remarkable service -- a legit Certificate Authority (CA) that gives you SSL certificates for free, and gives you a command-line client that lets you do this programmatically.

As of this writing, LetsEncrypt doesn't offer wildcard certs, but they do let you include multiple subdomains in a single certificate, and offer very reasonable rate limits. Also, chances are good that you don't need a wildcard certificate anyway, since you're probably not running a user-facing hosting service.

Docs: see the LetsEncrypt.org Getting Started Guide and the Full Docs for more information.

Installing the LetsEncrypt Client

Pre-requisites: Make sure you have openssl installed. (Also, if you're going to install from the certbot repo, make sure you've also installed git.)

Installation via apt-get or similar: The main page has OS-specific installation instructions (you just have to select your OS from the pulldown, as well as what webserver (Apache, Nginx, etc) you'll be using with it). For example, here are their Ubuntu 16 + Nginx installation docs. Assuming you're on Ubuntu 16.04 (xenial):

sudo apt-get install letsencrypt

Installation from git repo + script: Alternatively, you can just install the certbot-auto wrapper script directly from its repo (which is what I did). See the Installing Client Software section of the getting started guide.

Understanding LetsEncrypt Plugins

It took me some confusion and experimentation to understand the various letsencrypt plugins. Did I need an "authenticator" or an "installer"? Since I wanted to use the certificate with Nginx, did I need the Nginx plugin? (Answer: the Nginx plugin is either not available or completely undocumented, which is the same thing. So no, you don't need it.)

If not the Nginx plugin, did I need to go the standalone or the webroot route? Or maybe manual?

Eventually, I sorted it out. Since LetsEncrypt.org is a Certificate Authority, their main goal is to verify that you control the domain for which they are issuing a certificate. To do that, the LetsEncrypt client needs to do a back-and-forth call and response dance with their servers. Which means that you have only a few options.

Simplest route: --standalone If you can afford to stop your webserver (and let the client take over the HTTPS port for a second), you can just use the --standalone plugin to verify your domain (to generate your certificate). This is perfect for when you're first setting up your server, or cases when momentary downtime is ok (if it's not a user-facing production service).

In the following example, I'm using the letsencrypt-auto script from the repo, but if you installed the client from an OS package (like via apt-get), the command line parameters should be the same.

Here is how you would generate a certificate (that's the certonly command) using the --standalone plugin (which requires you to stop Nginx or whatever else service is using port 80 and 443) for two different subdomains (example.com and test.example.com):

./letsencrypt-auto certonly --standalone -v \
  --email your@email.com  -d example.com \
  -d test.example.com

Several things to notice here:

This generates the certificates in /etc/letsencrypt/live/example.com/ (the live directory actually contains symlinks to the latest generated certificates)
The link to the latest certs becomes relevant later, since you'll need to renew your certs every 90 days.
This command actually generates a single certificate for both subdomains (or however many you listed using the -d flags). This means that if you have a finite number of subdomains (as opposed to an arbitrary number of user-created subdomains), you can easily list them as one entry (plus aliases) in the Nginx setup, and use just one certificate path (as you'll see in the example below).
The --email is optional but helpful (LetsEncrypt will send you reminder emails that your certs are about to expire).
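Since the 90-day lifetime drives how you schedule renewals, here is a small date-arithmetic sketch in Python. The 30-day safety margin (`RENEW_MARGIN`) is my own assumption for illustration; pick whatever margin suits your setup.

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)  # LetsEncrypt certificate validity
RENEW_MARGIN = timedelta(days=30)   # assumption: renew a month before expiry

def renewal_due(issued_on, today):
    """True once `today` is within RENEW_MARGIN of the expiry date."""
    expires_on = issued_on + CERT_LIFETIME
    return today >= expires_on - RENEW_MARGIN

issued = date(2016, 5, 1)
print(issued + CERT_LIFETIME)                  # 2016-07-30
print(renewal_due(issued, date(2016, 6, 15)))  # False
print(renewal_due(issued, date(2016, 7, 1)))   # True
```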

If you have an existing Nginx that you cannot stop/start: use --webroot. The --webroot plugin allows the LetsEncrypt client to verify your domain without stopping your existing server and taking over ports 80 and 443. It does this by placing some files in a directory you specify (which, again, lets their servers know that you actually control your domain).

This route is slightly trickier, since you have to create the --webroot-path (or just -w) directory, make sure Nginx has read/write access to it, make sure that there's an entry for it in sites-available and so on. But if you have an existing Nginx installation, and cannot afford a moment of downtime, you don't have many other choices.

The general idea is the same: use the certonly command with the --webroot plugin, list the domain and subdomains you want a certificate for with the -d flag, and specify a webroot path directory (-w) which the client can use to create the /.well-known/acme-challenge directory it needs for verification.

For example (assuming you have the /var/www/example.com directory created and set up in the Nginx config):

./letsencrypt-auto certonly --webroot -w /var/www/example.com -d example.com -d test.example.com --email your@email.com
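If it helps to see what "placing some files in a directory" amounts to, here is a simplified Python illustration of the idea. This is not the client's actual code, and the token and file contents below are placeholders rather than real ACME values; `place_challenge` is a made-up helper name.

```python
import tempfile
from pathlib import Path

def place_challenge(webroot, token, key_authorization):
    """Write the challenge response where the CA will look for it over HTTP."""
    challenge_dir = Path(webroot) / ".well-known" / "acme-challenge"
    challenge_dir.mkdir(parents=True, exist_ok=True)
    challenge_file = challenge_dir / token
    challenge_file.write_text(key_authorization)
    return challenge_file

# Demo in a throwaway directory standing in for /var/www/example.com.
with tempfile.TemporaryDirectory() as webroot:
    path = place_challenge(webroot, "example-token", "example-token.example-thumbprint")
    relative_url_path = "/" + path.relative_to(webroot).as_posix()
    print(relative_url_path)  # /.well-known/acme-challenge/example-token
```

The CA's servers then request that URL path on your domain, which is why Nginx has to be configured to serve the webroot directory.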

Step 2: Set up Nginx

Once you have your certificates generated, it's time to set up Nginx to use them (and to serve as a front end / reverse proxy for your web application).

(Optionally) Generate a strong Diffie-Hellman group:

openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048

Edit the Nginx config file for your site (for example, edit /etc/nginx/sites-available/example.com):

server {  
        root /usr/share/nginx/html;
        index index.html index.htm;

        listen 443 ssl;
        server_name example.com test.example.com;

        ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_dhparam /etc/ssl/certs/dhparam.pem;
        ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';
        ssl_session_timeout 1d;
        ssl_session_cache shared:SSL:50m;
        ssl_stapling on;
        ssl_stapling_verify on;
        add_header Strict-Transport-Security max-age=15768000;

  # Reverse proxy to Connect
  location / {
    proxy_buffering off;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    # untested, but taken from https://gist.github.com/nikmartin/5902176#file-nginx-ssl-conf-L25
    # and seems useful
    proxy_set_header X-NginX-Proxy true;
    proxy_read_timeout 5m;
    proxy_connect_timeout 5m;

    proxy_pass http://localhost:3000;
    proxy_redirect off;

    # Static files
    location ~* .+\.(ico|jpe?g|gif|css|js|flv|png|swf)$ {
      # Assumes a cache zone named "backcache" has been defined with
      # proxy_cache_path in the http context
      proxy_cache backcache;
      proxy_buffering on;
      proxy_cache_min_uses 1;
      proxy_ignore_headers Cache-Control;
      proxy_cache_use_stale updating;
      proxy_cache_key "$scheme$request_method$host$request_uri$is_args$args";
      proxy_cache_valid 200 302 60m;
      proxy_cache_valid 404 1m;

      proxy_pass http://localhost:3000;
    }
  }
}

Note the proxy_pass line -- it assumes that your web app (Node.js or whatever) will be listening on http://localhost:3000.
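For completeness, here is a minimal sketch of an app that could sit behind this config, written in Python with only the standard library. It listens on the loopback interface and reads the `X-Real-IP` and `X-Forwarded-Proto` headers that the config above sets. The `client_info`, `AppHandler` and `make_server` names are my own; treat it as an illustration rather than a production server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def client_info(headers):
    """Recover the original client details from the headers Nginx adds."""
    return {
        "ip": headers.get("X-Real-IP", "unknown"),
        "scheme": headers.get("X-Forwarded-Proto", "http"),
    }

class AppHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        info = client_info(self.headers)
        body = "Hello, {ip} (via {scheme})\n".format(**info).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port=3000):
    # Bind to loopback only: the app should be reachable through Nginx,
    # not directly from the outside.
    return HTTPServer(("127.0.0.1", port), AppHandler)

# To run: make_server().serve_forever()
```

Because the app only binds to 127.0.0.1 on an unprivileged port, it has no need to run as root.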

Extra Credit

Start up Nginx, fire up your web app, and use the SSL Server Test page to make sure the SSL/cert part of your app is set up properly.
Set up a firewall (UFW for Ubuntu makes for an extremely easy to use firewall package), and close off the ports you don't need.
Make sure your app is running as a service, using either your OS's startup daemons (such as upstart or better yet supervisord for Ubuntu), or a language-specific service runner (such as the excellent pm2 for Node.js).
Set up logging
Set up monitoring


Joining the Solid team

Dmitri Zagidulin — Sat, 30 Apr 2016 21:01:47 GMT

Shortly after New Year's, I said goodbye to my colleagues at Basho, and joined the Solid project at MIT CSAIL (specifically, the DIG / IPRI groups). I loved working at Basho, learned a mind-boggling amount, liked and respected my colleagues, and still remain a huge fan of Riak. I'm also really proud of helping the Riak Explorer project get off the ground (and provide an Admin GUI for Riak clusters). When faced with an opportunity of a lifetime to join a radical, ambitious project aiming to decentralize data ownership and social web applications, however, how could I pass it up?

So that's what I'm doing these days. Writing libraries, frameworks and reference apps that will enable an ecosystem of user- and data-centric social applications. Diving into the current Semantic Web stack (which so far has been a whirlwind of new concepts, standards and libs), writing specs, doing a bit of community relations, and helping develop new W3C standards.

Reification In Social Media: More Please

Dmitri Zagidulin — Sat, 21 Jun 2014 17:36:00 GMT

reification:

The consideration of an abstract thing as if it were concrete, or of an inanimate object as if it were living.
The consideration of a human being as an impersonal object.
(programming) Process that makes out of a non-computable/addressable object a computable/addressable one.

(see also the Wikipedia disambiguation page)

I find the concept of reification fascinating and useful (according to Five Dollar Words for Programmers, it's from the Latin res facere, "thing making"). I actually didn't know about meaning #1 (especially in the fallacy sense, as in "we need to be careful not to reify the economy"), nor about meaning #2 (apparently it's a thing in Marxism and in critical theory). The Computer Science/programming meaning, however, is something that I think about constantly, and I would like to explain why it's relevant to you.

Let's start with an example of what reification is in social media.

Consider a plain blog post, either on WordPress or on LiveJournal. It's a generic, untyped container, and it can hold pretty much anything expressible in HTML -- text, links, images, videos, and so on. The fact that posts are general-purpose is a good thing. But the downside is: the blogging engine cannot readily determine what kind of post it is.

Now, take a look at what Tumblr does, when you go to make a new post. It gives you choices:

You can make a Text post (very much like a regular blog post, with links and paragraphs and so on). But also, you can instead post a single standalone Photo. Or just a URL (Link) you want to share, with commentary or without. Or a link to a (YouTube, say) Video. (I didn't actually know what the Chat type is for, but apparently it's for "overheard" conversation snippets). You get the idea. Now, why do they have those choices? Can't you just make a plain blog post (using text or HTML), and have it consist of a single YouTube video link, and have it serve the same purpose? I'm going to explain the advantages shortly. The key thing to note is what's happening here, behind the scenes.

If you made a regular blog post, and all it contained was a quote that you liked, in quotation marks and everything, you would know that the post contained a quote. And your readers would know that the quote was the whole point of the post. (You could also add proper HTML markup, and actually put it into blockquote tags). But the blogging engine, the system, wouldn't know that it was a quote.

But if, instead, you had a way to explicitly mark your post as a Quote type (by clicking on the new Quote button in Tumblr, for example), that would be an example of reification. The blogging software would then "know" that your entry was about a quotation, in the sense of, you could test for the type in the backend logic, and display the quote differently, file it in its own category, and offer new functionality based on that knowledge.

To put it simply, reification (in the context of social media software) means that when you go to make a new entry, you can choose what kind of thing it is (whether you're creating a long-form post, a short Twitter-style note, or just sharing a link, or uploading a photo).
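In code, the difference is roughly this: once the type is an explicit field, the backend can branch on it. The Python sketch below uses made-up `PostType`, `Post` and `render` names; it is not how Tumblr or any particular engine implements it.

```python
from dataclasses import dataclass
from enum import Enum

class PostType(Enum):
    TEXT = "text"
    PHOTO = "photo"
    QUOTE = "quote"
    LINK = "link"
    VIDEO = "video"

@dataclass
class Post:
    type: PostType
    body: str
    source: str = ""  # e.g. the author of a quote

def render(post):
    """Because the type is explicit, the backend can test for it."""
    if post.type is PostType.QUOTE:
        return '"{}" -- {}'.format(post.body, post.source)
    if post.type is PostType.LINK:
        return '<a href="{0}">{0}</a>'.format(post.body)
    return post.body  # generic fallback, like a plain blog post

quote = Post(PostType.QUOTE, "Every program attempts to expand until it can read mail.", "Zawinski")
print(render(quote))
```

With an untyped post, the same quote would fall through to the generic branch, because nothing tells the system what it is looking at.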

So, why force users to make extra decisions, why complicate your code, risk confusion, and so on? Reification enables you to do the following things.

Display/Formatting

You can display entries of different types, well, differently. If it's a quote, you can center it, put really bitchin' giant quotes around it, format the author/source of the quote correctly, and so on. If it's a YouTube video, the user can just post a URL, and behind the scenes, you can actually post an embedded YouTube player already showing the video, and so on. This is a minor advantage, relatively speaking, since technically, the user could apply those special styles manually (if they knew HTML and cared about doing that).

Filtering

Instead of a single generic stream of posts (which is the only option for a system like LiveJournal), you can now give readers additional choices. Do they not feel like reading at the moment, and just want to look at pictures their friends posted? They can click on the Photos stream, and only view those. Or the inverse -- don't care to be spammed by people posting videos or pictures, at the moment? Turn those off, for this session. FetLife does this quite well, by the way -- you can either read your entire update stream, or just view people's pictures, or videos, or only their posts, etc. In this context, you can think of post types as agreed-upon tags/categories, that are built right into the user interface, to make the reader's experience easier.
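With typed posts, both kinds of filtering are one-liners. A quick Python sketch follows; the sample `stream` and the `only`/`without` helpers are invented for illustration.

```python
stream = [
    {"author": "alice", "type": "photo", "body": "sunset.jpg"},
    {"author": "bob", "type": "text", "body": "Long-form thoughts..."},
    {"author": "cindy", "type": "video", "body": "https://www.youtube.com/watch?v=VIDEO_ID"},
    {"author": "alice", "type": "text", "body": "A short note"},
]

def only(posts, *types):
    """Keep just the given post types (e.g. the Photos stream)."""
    return [post for post in posts if post["type"] in types]

def without(posts, *types):
    """Hide the given post types for this session."""
    return [post for post in posts if post["type"] not in types]

print([p["body"] for p in only(stream, "photo")])  # ['sunset.jpg']
print(len(without(stream, "photo", "video")))      # 2
```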

Aggregation and New Functionality

Once you start going down this path of having explicit entry types, you can really get creative with your functionality.

For example, having a separate standalone Photo type means that you can now easily integrate with third-party photo-specific services (and vice versa). So now, when you upload a photo, you can select a checkbox and also have it post to your Instagram account (or Flickr, or Pinterest, or whatever comes along). Similarly, you can now extend Instagram/Tumblr/whatever clients to also cross-post to your blog engine, correctly typed and formatted. And if you look at it another way, having Photos integrated into your blogging platform in a first-class way can replace specialized services like Instagram. You can now offer the same functionality as Instagram, for example (in the sense of, a dedicated image feed from your friends), with the advantage of being able to reuse users' existing friends lists, filters/circles, and other such security and trust mechanisms.

Even if there are no other third-party services to integrate with, for a particular post type, this separation means that you're gaining the functionality of standalone apps. Consider Links, for example. Why use a standalone bookmarking service (do you remember Del.icio.us?) in addition to sharing those links with your friends, when you can click a tab and view all of the links (and just the links) you've ever posted? Same with Quotes. I love quotes, and keep a simple quote text file. I would much rather have it integrated into my blogging system/social network, so I can see quotes that other people posted, so I can see just mine, so they can be linked off of my user profile.

You can now have all sorts of fun with stats. Viewing a user's profile, you could now view a pie chart, "This user's activity is 80% videos, 10% photos, and 10% text posts".
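The numbers behind a pie chart like that are a few lines of code once the types exist. A small sketch, where the `activity_breakdown` name and the sample history are assumptions for illustration:

```python
from collections import Counter


def activity_breakdown(post_types):
    """Map each post type to its share of the user's posts, as a whole-number percentage."""
    counts = Counter(post_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: round(100 * n / total) for kind, n in counts.items()}


# Hypothetical history: 8 videos, 1 photo, 1 text post.
history = ["video"] * 8 + ["photo"] + ["text"]
print(activity_breakdown(history))  # {'video': 80, 'photo': 10, 'text': 10}
```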

That's just with a small handful of existing types that Tumblr offers. And those types only denote what kind of media a post primarily contains. You could actually get even more specific, and start marking more abstract conceptual categories, denoting an author's intentions:

Book/movie/game reviews - What if you had a way to explicitly mark "this post is a review"? Now, you could say "I really like how Cat writes about movies! Let's see what else she's watched and reviewed lately." Or "Danielle wrote some intriguing book reviews last month, let's see if I can find that one book she raved about." Review types are especially powerful when combined with widely-recognized unique IDs. If you can not only select "This post is a book review", but also input its ISBN number? From there, given an open enough API, it's a skip away from being able to install a browser plugin, so that when you're looking at a particular book in your favorite online bookstore/library/whatever, you can see at a glance what your friends, from your contact list, had to say about the book.

Recipes - I would love to be able to view the recipes my friends list posted, or to use my own feed as a personal cookbook.

Quantified Self type entries - Things like RunKeeper entries, exercise program updates, diet progress, steps walked, all that stuff which you now track through separate apps, and keep separate contact books (and filters, if the apps have any) for, why not centralize those channels, and reuse one set of friendslist / filters / whatever?

So, to summarize: reification, in the context of social media posts, offers all sorts of useful new features and capabilities, and I would like to see more of it.

Standards and Protocols are Holy

Dmitri Zagidulin — Fri, 20 Jun 2014 17:46:00 GMT

In biology, the individual organism is ephemeral. Each individual is incredibly fragile, and has a built-in limited lifespan, not to mention a good chance of meeting with accidental death at any moment.

The only things that matter, that have at least a shot at longevity, are genes (and species, really).

With humans, again, the individual, while certainly extremely important, is very fragile and limited in lifespan (just ask Tolkien's Ringwraiths, "Nine for Mortal Men, doomed to die" and all that). Genes certainly matter (the importance of family). But in addition, memes (in the Dawkins sense, as ideas and units of cultural propagation) become incredibly important, and offer another venue for longevity.

To put it another way, the effect of an individual on the world is threefold: their actions during their lifetime, their propagated genes (arguably the least important), and their cultural legacy (their propagated memes, the ideas and memories they leave behind). If "cultural legacy" sounds too grand or abstract, remember that it operates on the tiniest of scales. If you have a child, how you raise it and the things that you teach it (this becomes a part of your cultural legacy, the memes that you propagate) is much more important than the genes it inherits. On a smaller scale, that book you lent to your friend's kid? You know the one, that sparked off their lifelong love of scifi and fantasy, or of problem solving, or poetry, or riding horses, or whatever? That's a part of your cultural legacy. Obviously, if you wrote that book, your legacy is even greater. But in the world of memes and ideas, curation, rebroadcasting, analysis and commentary, are almost as important as actual idea creation.

Sidenote: if one wanted to rank those three factors (actions, genes, memes) in the order of importance (a questionably useful endeavour), I would argue that an individual's cultural legacy, the ideas and memes, are of the highest import, and leave the largest influence. Certainly more important than genes (unless you're Magneto, and your kids can somehow inherit a never-before-seen genetic mutation that can save the world). And more important than individual actions. Think of history's high-impact individuals, like Margaret Thatcher, Alexander the Great, Martin Luther King or Eleanor of Aquitaine. These were no slouches, when it came to individual actions and their effect on the world. But consider how much greater the impact of their cultural legacy is. Of the ideas, laws and concepts that came into the world as a result of their actions.

Now consider the world of technology, specifically software.

As engineers and developers, we spend so much of our time involved in specific tech stacks. Endlessly debating the merits of a particular tool, programming language, framework, infrastructure component or application.

Here's my point: individual systems, applications, and frameworks don't matter that much. But formats, standards and protocols are holy (whether de-facto or de-jure). (Similarly, the ability to export and import is the single most important feature that your software can have.) This is related to the concept of living software, though different in emphasis.

In terms of impact and longevity, the importance of standards and protocols over individual applications is equivalent to the importance of genes and species over individual organisms, or to the importance of cultural legacy and memes over a person's genes or individual actions.

MySQL and Postgres? Yes, they're important, sort of. But only because of the ANSI SQL standard. (And because of the CSV format for importing/exporting data between them).

The Mosaic browser, Netscape, IE, Firefox or Chrome? Again, also important. But only insofar as they implement the HTTP protocol and the HTML/CSS/ECMAScript formats. (And especially important in the battle for those protocols and formats, using the usual weapons of embrace-and-extend, nonstandard features and so on).

Pay attention to these. Formats, standards and protocols are an important battleground, one that does not receive enough attention in our open source culture. Identify and stake out the crucial ones, and protect them. Get involved in their formation and implementation (again, you can have an enormous influence by just writing a useful library such as Markdown, even if you don't have a seat on a W3C standards workgroup).

Github Page Views / Analytics

Dmitri Zagidulin — Sat, 31 May 2014 17:48:00 GMT

Something I didn't know about, and was excited to find out yesterday.

Github finally has traffic analytics built-in! (Well, finally as in this past January. But still, I didn't realize!) And all thanks to Ilya Grigorik and his excellent ga-beacon repo.

See, for the longest time, if you had a Github repository, you could only get a sense of how many times your code has been forked, or how many people "starred" it or were watching it. But to answer a question as simple as "How many page views did my repository get?", you had to arrange your own tracking.

Since you couldn't put your own JavaScript snippets into a README file, using Google Analytics was right out. So the only other recourse was to use a "beacon" image (this is how email views are tracked, also, by the way). If you included an image (usually a clear 1x1 pixel image) that lived on a server you controlled, you could track how many times the image was requested, and so you could track page views (and the usual analytics stats).
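To make the beacon idea concrete, here is a bare-bones sketch of such a server as a WSGI app. This is not how ga-beacon is implemented; the `beacon_app` name and the in-memory `views` counter are assumptions for illustration:

```python
import base64
from collections import Counter

# A commonly used 1x1 transparent GIF, base64-encoded.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

# In-memory view counts keyed by request path (a real tracker would persist these).
views = Counter()


def beacon_app(environ, start_response):
    """WSGI app: count the request, then answer with the invisible pixel."""
    views[environ.get("PATH_INFO", "/")] += 1
    start_response("200 OK", [
        ("Content-Type", "image/gif"),
        ("Content-Length", str(len(PIXEL))),
        ("Cache-Control", "no-cache"),  # ask clients to re-request on every view
    ])
    return [PIXEL]
```

You could serve this with `wsgiref.simple_server.make_server` from the standard library, point an image tag in your README at it, and read the totals out of `views`.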

Ilya took the next step, with the ga-beacon repo. He used the beacon image idea, and hooked it up to Google Analytics, so you could get all those nice graphs and tracking for free.

Once the repository became popular enough, it sounds like Github decided to just include this functionality natively. (So, if you own the repo, go to its Graphs > Traffic, to see the page views and visits).

Though you can still use his code, since Google Analytics provides more info and better graphs than the simple reporting included in Github.

So, this made me happy for two reasons:

Github has native pageview analytics now!
If your favorite site is lacking features, sometimes you can embarrass it into supporting them by coding them yourself, and having your code become sufficiently widespread.

The Flattening of Design

Dmitri Zagidulin — Sat, 31 May 2014 17:47:00 GMT

I'm still not sure how I feel about the "All the things must be flat" design trend. (Well, that's not true, I certainly hate Windows 8's Xbox-like desktop, the way that the Xbox UI was changed to match it, and the iOS 7 UI update.) Also, I did notice, the other day, how LJ's own user interface was certainly, er, flattenized.

But, I thought this article was interesting:

The Flattening of Design

"[...] companies aren't simply following Microsoft's lead in the quest for flat. There are cultural and technological reasons for this new look and feel."

Also, unrelated (well, related in that the article mentioned that the flat design is reminiscent of these):

Gallery of Russian Propaganda Posters

Explaining MapReduce to My Distant Relatives

Dmitri Zagidulin — Wed, 11 Jul 2012 17:49:00 GMT

You need to understand what MapReduce is.

If you've never heard the term and you don't work in the tech sector, stick around. It's easy to understand, and it's important.

A one-line answer

Q: What is MapReduce? A: MapReduce is a counter-intuitive but very powerful way to answer questions and perform calculations.

Ok. But how does it work, and why does it matter? Here's one way to think about it.

Ants and Elephants: a quick analogy

Databases are behind most of the software that you encounter and care about. And the dominant life form in the world of databases, for the last several years, has been relational databases. Your local library catalog system? Runs on a relational database. The software controlling the grocery store checkout lines (the ones that list the items that you bought and their prices) uses a relational database to keep track of everything. Wikipedia is a giant relational database (of HTML pages and edits to them).

Let's try the analogy again. Imagine that database related tasks (asking questions, counting, calculating) are similar to... carrying huge, heavy sacks of rice across a field.

Given that image, relational databases are smart, hard working elephants that can lift entire sacks and carry them across. They are well-trained and good at what they do. You point them to the sack of rice, give them a few commands, they lift the sack and carry it across the field. (If you're worried about animal rights... imagine that they're robot elephants).

MapReduce-based systems, on the other hand, are like having a giant army of obedient ants at your command. If you need a sack of rice carried across, you compose a set of individual instructions (take a single grain of rice out of the sack, drag it across the field, and put it in a sack on the other side) and send it out to all the ants simultaneously. They surge into action, and although each ant is much weaker and simpler than an elephant, when the dust settles, the sack of rice still ends up transported across the field.

You can probably see what the drawbacks of working with ants are. It's much easier and more intuitive for a programmer (in charge of transporting rice across a field) to learn how to point a smart elephant to a large bag and tell it where to carry it. The elephant knows how to lift -- it's done it before, it knows how to walk and how to keep its eyes on the goal. And the rice stays together, in an intuitive logical grouping. The ants, on the other hand, have to be micro-managed. You have to direct them carefully on how to unload the rice, how to carry it across the field without bumping into other ants, and how to load it back into a sack. And if you're not careful, you'll end up with rice scattered all over the field.

So why is MapReduce so important? Well, there are several reasons, which we'll discuss in a bit. But I'll give you the first hint.

Ants are cheap and interchangeable. If the elephant falls sick one day, what are you going to do? The rice still has to get hauled across the field. Sure, you can keep a backup elephant around, to take up the slack while the first one recovers from an elephant cold. Except, now you have to buy (and feed) two elephants. And what if they both fall sick at the same time? If a single ant gets sick (or stepped on, or eaten)... it's much easier to replace, you order another bagful of ants, and off they go.

Keep that image, of ants and elephants, in the back of your mind. Meanwhile... what have we really explained here? Carrying grains of rice is easy to imagine, but it's a bit too abstract. How do you actually perform calculations with MapReduce? And what, again, are its advantages over relational databases?

Students versus Museum Directors, Fight!

Imagine that there is a book museum, with an extensive rare book collection.

And you wake up in the middle of the night, in a cold sweat, and you absolutely must know: How many poetry books with red covers are there in the rare book collection? The success of your business depends on it.

Here's the traditional way to get the answer to that question:

Answering questions, the relational database edition

You call up the museum's Director. And you pose that question to him - How many poetry books with red covers are there in your collection?

(Here's what you must know about the Director. He's trained all of his life to answer these kinds of questions. He's really fast at counting. He has a great memory, and a complete map of the library and all of its shelves in his head. He reads at a Guinness World Record level speed, and his movements are efficient and precise.)

The Director frowns. He's usually well prepared for questions like these. For example, he has some common questions already researched and pre-computed; if you merely asked him "How many poetry books are in your collection?", all he would have to do is to consult his ledger, with neat totals of all the books by section -- he wouldn't even have to leave his office to answer. Color, however? That's not in the ledgers.

But no matter. This is what he does best. He swiftly moves to the Poetry section, and walking the shelves methodically, he scans all of the poetry books in his museum, one by one, counting the red books on display. He can count really high without losing his place, he does not stumble or miss a book. Pretty soon, he comes back to his office with an answer, and phones you with an exact total of red poetry books in his museum.

Not only that, but, assuming that you'll ask that question again, he can add a 'Cover Color' column to his ledgers, and order his assistants to start keeping track of red and green and yellow cover totals, tabulated by section and by author and everything. The next time you ask, he won't even have to leave his office to answer.
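In database terms, the Director's walk is a single query, and the new ledger column is roughly an index. A small sketch using Python's built-in `sqlite3` module, where the table layout and the sample books (including their cover colors) are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, section TEXT, cover_color TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Leaves of Grass", "Poetry", "Green"),
        ("The Waste Land", "Poetry", "Red"),
        ("Sonnets", "Poetry", "Red"),
        ("Moby-Dick", "Fiction", "Red"),
    ],
)

# The Director walking the shelves: you ask one question, the database does the counting.
query = "SELECT COUNT(*) FROM books WHERE section = ? AND cover_color = ?"
(red_poetry,) = conn.execute(query, ("Poetry", "Red")).fetchone()
print(red_poetry)  # 2

# The new 'Cover Color' ledger column: an index, so the next lookup can skip the full scan.
conn.execute("CREATE INDEX idx_section_color ON books (section, cover_color)")
```

Notice that you never say how to count. You only describe what you want, and the Director figures out the rest.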

This is how the world of traditional relational databases works. They are good at what they do. There is a powerful, smart, precise Director of whom you can ask questions (if you know how to speak his language). Traditional databases were an astoundingly useful technology, and still are. They shape much of the modern world.

But there is another way to get the answer to the question about book covers. This is MapReduce:

Answering questions, the MapReduce edition

Imagine for a second, that you had access to a nearly limitless, very inexpensive labor pool, that was not skilled in anything in particular, but highly trained to follow directions. Like, say, teenagers fresh out of high school. (If you're worried about teenager rights... imagine that they're robot high school students).

You hire a large group of them. Maybe twice as many people as there are poetry books in the museum, plus a little more on top of that. You divide them into two teams.

The first is the Map team. (You can give them red armbands, to tell them apart from the second team).

You arm the Map team with some very simple tools. Each one of them gets a blank paper index card, and a pen. And you give each of them an identical set of directions.

Their directions are: at the appointed hour, every member of the Map team streams into the museum and heads to the Poetry section. Each team member lines up in front of one book. (Pretend for a second that the museum is spacious enough to accommodate them all). Once in position, each member simply writes down the color of the cover of the book they're standing in front of, and puts a 1 next to it. Like so:

(Index card of Map team member #1 contains) Red: 1
(Index card of Map team member #2 contains) Black: 1
and so on, until each book is recorded, one index card per book.

After all the books have been recorded, the Map team heads out of the museum, each carrying their index card, each with a single color and the count: 1.

That's it! That's all the directions given to each member of the Map team. Go in. Find one book (no overlapping, no duplicates, no book left behind). Record the color of its cover. Put a count of 1 next to it. Get out.
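In code, the Map team's directions boil down to one tiny function applied to every book independently. A sketch, where `map_book` and the sample shelf are hypothetical names and data:

```python
def map_book(book):
    """One Map team member: look at a single book, write one index card."""
    return (book["cover_color"], 1)


# Hypothetical poetry shelf; each dict stands in for one physical book.
poetry_shelf = [
    {"title": "A", "cover_color": "Red"},
    {"title": "B", "cover_color": "Black"},
    {"title": "C", "cover_color": "Red"},
]

# No call depends on any other, so in principle they could all run at the same time.
cards = [map_book(book) for book in poetry_shelf]
print(cards)  # [('Red', 1), ('Black', 1), ('Red', 1)]
```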

Now comes the handoff to the Reduce team (who is also armed with pens, and blank index cards).

The Reduce team (you can give them blue armbands) stands outside the museum and collects the Map team's index cards. They stand in rows, like a pyramid, a long front line ready to greet the incoming Map team. Then a smaller row behind that. The rows reduce in size, until there is one final member of the Reduce team all the way in the back, standing holding a phone.

The directions given to the Reduce team are a tiny bit more complex, but still easy for an individual member to understand:

Take the index card that's being passed to you (by the Map team coming out of the museum, or by a Reduce team member before you). Look at the color written on it, and throw away anything that's not Red. Go through all the filled-out cards in your possession, and for every color, add up the totals for it. (Though since you threw away all the other colors, you'll only have Red totals, in this example). Write down each color and its sum on your own card, and pass it back one row.

Eventually, the totals start adding up as the cards move through the rows. Finally, the last Reduce student receives the two semi-final cards with the two subtotals, adds them up, and calls you with the answer.

The work that each student does is easy. They just throw away unneeded colors, add two numbers together, and pass on their work down the line. Most importantly, they don't have to keep a long-running total in their head, they can't lose count (all the relevant information is written down before them), and they don't care what the other students are doing.

This is how you answer the question "How many red poetry books are there in the museum?", MapReduce style.
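Here is a rough sketch of the Reduce team in code, assuming the Map team's index cards arrive as (color, 1) pairs. The `reduce_cards` and `pyramid` names, and the choice to have each student combine the cards of two students in front of them, are simplifications of mine and not how any particular MapReduce system does it:

```python
from collections import Counter


def reduce_cards(cards, keep="Red"):
    """One Reduce team member: drop unwanted colors, sum the rest onto a new card."""
    totals = Counter()
    for color, count in cards:
        if color == keep:
            totals[color] += count
    return list(totals.items())


def pyramid(cards, keep="Red"):
    """Pass cards back row by row until a single student holds the final total."""
    row = [[card] for card in cards]  # front row: one incoming Map card per student
    while len(row) > 1:
        # Each student in the next row combines the cards of two students in front.
        row = [reduce_cards(sum(row[i:i + 2], []), keep) for i in range(0, len(row), 2)]
    final = reduce_cards(row[0], keep) if row else []
    return dict(final).get(keep, 0)


cards = [("Red", 1), ("Black", 1), ("Red", 1), ("Green", 1), ("Red", 1)]
print(pyramid(cards))  # 3
```

No student ever holds more than a couple of cards, and none of them needs to know what the others are doing.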

Now you may be wondering: Who does that? Why would you do ridiculous things like write down a single color and put the numeral 1 next to it? Can't you get the students to start counting books right away, or something?

This is what I mean by counter-intuitive. MapReduce requires a slightly different mindset that may seem strange and intimidating at first. But after a couple of examples, you get the hang of it, I promise. You catch on.

But the initial confusion is worth it. MapReduce is a skill and a mindset worth learning. (And if you're not a programmer, it's a skill worth teaching to somebody in your organization).

Why is it worth learning? Why do I keep saying that MapReduce is important? I'll tackle that in another post.

Until then, I'll leave those answers as an exercise to the reader. Just keep in mind the parameters set down in this artificial analogy. The difference between museum directors and generic high school students. The amount of training each one gets, the ease of recruitment, and the amount of salary each one demands. Keep in mind the ease of talking to an intelligent Director, versus the hassle of hammering out foolproof individual instructions, and wrangling hordes of students yourself, but also the opportunity that the second method presents.

Three Unrelated Thoughts on Tech Schools

Dmitri Zagidulin — Wed, 11 Apr 2012 17:51:00 GMT

A few months ago, I came across a mention of a CS high school opening in NYC: New York City gets a Software Engineering High School.
I think this is a brilliant idea. High school (and, honestly, even earlier, in middle school) is the ideal time to learn programming. (I lucked into an AP Computer Science class early on in high school, and it changed my life. I've always been interested in computers, but I changed my college plans from pre-med to comp sci right then and there).

I wonder how they'll structure their curriculum?

(My next thought, of course, was: Now we need one of those here in Portland, ME!)

Last weekend, at an Easter party (with a small horde of actual kids! hunting for eggs!), I met a guy who studied learning psychology (and apparently helped translate a book on the subject by a famous Russian academic. I lost the title, but he said he'd lend me the book).

Anyways, he pointed me towards a very interesting project: the Baxter Academy for Technology and Science, a tech charter school opening right here in Portland ME (down the street from the ferry, actually)!

This is very interesting and promising, and I hope they include a healthy dose of computer science in their curriculum.

The other day, on a Ruby user group mailing list, somebody mentioned that they have a bunch of tech books to donate, and they'd prefer to give them to a charity or an educational organization. There were several suggestions (a public library, an underfunded tech college).
But one I thought was very interesting: the MOUSE.org project.

MOUSE seems to be a set of programs to help high school students learn leadership and tech skills, centering around a student-run Help Desk (which, in addition, helps the school save on technology costs).

Again, very interesting. I'm going to file this here in case I have a chance to introduce something similar here in Portland.