Protocol Buffers: Google's Data Interchange Format
Posted: Monday, July 7, 2008
At Google, our mission is organizing all of the world's information. We use literally thousands of different data formats to represent networked messages between servers, index records in repositories, geospatial datasets, and more. Most of these formats are structured, not flat. This raises an important question: How do we encode it all?
XML? No, that wouldn't work. As nice as XML is, it isn't going to be efficient enough for this scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy.
Do we just write the raw bytes of our in-memory data structures to the wire? No, that's not going to work either. When we roll out a new version of a server, it almost always has to start out talking to older servers. New servers need to be able to read the data produced by old servers, and vice versa, even if individual fields have been added or removed. When data on disk is involved, this is even more important. Also, some of our code is written in Java or Python, so we need a portable solution.
Do we write hand-coded parsing and serialization routines for each data structure? Well, we used to. Needless to say, that didn't last long. When you have tens of thousands of different structures in your code base that need their own serialization formats, you simply cannot write them all by hand.
Instead, we developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple "get" and "set" methods, and once you're ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.
OK, I know what you're thinking: "Yet another IDL?" Yes, you could call it that. But, IDLs in general have earned a reputation for being hopelessly complicated. On the other hand, one of Protocol Buffers' major design goals is simplicity. By sticking to a simple lists-and-records model that solves the majority of problems and resisting the desire to chase diminishing returns, we believe we have created something that is powerful without being bloated. And, yes, it is very fast – at least an order of magnitude faster than XML.
And now, we're making Protocol Buffers available to the Open Source community. We have seen how effective a solution they can be to certain tasks, and wanted more people to be able to take advantage of and build on this work. Take a look at the documentation, download the code and let us know what you think.

Well... first, it sounds good!
This is awesome news!!
This does sound good. It's great to see that Google has released this into the open, which is helping open source projects instead of keeping it as a trade secret, which would be in their best interest. Now we won't have to duplicate the work.
wow.. that's very good, I never thought about this before, very nice..
Well it's easy to be faster and smaller than XML, of course. But what about ASN.1 DER?
It's a (binary, efficient, deterministic) standard in wide use, you're already forced to learn it if you work e.g. with cryptography, and there are some quite decent libraries out there (e.g. asn1c for C, for Java there is much in BouncyCastle).
What does "faster than xml" mean? Considering you do not offer any context I really wonder.
Sounds perfect.
Let us try to use it.
That's fantastic news. Great to see. Well done Google.
I think your comparison of PB to DOM-based XML access is misleading. A more correct comparison would be between PB and XML data binding. An XML data binding toolkit generates statically-typed object model from XML schema (XSD, DTD, etc.) and it is as easy to use as PB-generated classes. XML data binding has been around for a while and there are normally several options to choose from for each programming language.
Some data binding toolkits also offer fast and compact binary parsing/serialization (e.g., to XDR) in addition to XML. This makes it possible, for example, to use binary encoding for internal communications and storage while using XML for third-party integration.
And how about JSON? It's not binary, but it's not XML either... And it's human-readable.
Not to mention it is already the lingua franca of AJAX.
Great! I dived into it straight away and tried to use it with the Google App Engine. That doesn't seem to work yet though.
http://ur1.ca/6p
Curse you Google!!! ;) Why didn't you release this a year ago before I started writing an app that needs precisely this!? Sob...
Interesting. I wonder how it compares to HDF5?
http://en.wikipedia.org/wiki/Hierarchical_Data_Format
Is it simpler? Different goals?
If you had troubles with XML, you should have switched already to an alternative like ASN1, JSON, AMF and others.
It really depends on what you are coding. Why would Google format be better than those? Most are also open...
Everything old is new again: Google just reinvented AOL's SNAC and TLV:
http://iserverd.khstu.ru/oscar/basic.html
Is Google's FLAP implementation far behind? Who knows.
It's nice to see Google validating the technological innovation that AOL had, what, 15 years earlier.
What is inefficient about XML is the angle broke-kets, i.e. the external format. This is entirely independent of a given schema language, such as XSD or RelaxNG. You could just as easily have come up with an efficient binary format that could be described by one of the existing schema languages. Learning to separate abstract syntax -- such as schema -- from concrete syntax -- such as a binary or ASCII or unicode form -- has been one of the major advances in computing.
I think this offering is indicative of the industry as having lost the thread. The point of XML is *not* the angle brackets, but rather self-describing data. By moving in this way, Google, because of its size and influence, has taken a step backwards.
Hi all,
To answer a few questions:
@Lapo: Sorry, I personally am not very familiar with ASN.1 DER. A brief look at some documentation suggests to me that it is more complicated than Protocol Buffers, which can be good or bad depending on whether you need that complication.
@Lawouach: "Faster than XML" means "In typical use cases, a reasonably-structured protocol buffer will be significantly faster to encode and decode than an equivalent reasonably-structured XML type." It's hard to be more specific since there are so many different use cases. If in doubt, it would probably be a good idea to benchmark XML vs. Protocol Buffers for whatever case you have in mind. Don't forget to use "option optimize_for = SPEED;". :)
@boris: You're probably right that the ease-of-use argument goes away if you use an XML data binding toolkit. I haven't actually seen such things used before but that's probably because we just don't use XML that much in the first place at Google.
@Jose: JSON and Protocol Buffers are logically very similar. In fact, it would make sense to encode Protocol Buffers in JSON format when communicating with AJAX front-ends -- some projects at Google do exactly this. The Protocol Buffer API supports reflection on message classes (even in C++), which makes it easy to write a general JSON encoder/decoder for protocol messages.
@Channel Hopper: Embarrassingly for me, I had originally planned to get this released well over a year ago, but it turned out to be a lot more work than I expected. Sorry. :(
@Marko: Sorry, I don't know anything about HDF5, so I don't think I can provide a fair comparison there.
Channel Hopper:
There was always thrift: http://developers.facebook.com/thrift/
I like it :).
Is Google going to use it in APIs?
I can understand your reasons for not wanting to use XML. But if you need a cross-language solution for serializing data structures, then they don't come much simpler than YAML.
www.yaml.org
This is great news. Have you thought about also making this an Open Standard? I'm sure the folks at OMG would be happy to help you.
I have my own data format that is an alternative to XML as well. It works by normalizing the data into records which all contain the same number of fields, and placing an agreed-upon delimiter between each field. The end of the record is indicated by a newline.
I think this "delimited" format has a lot of potential.
Looks cool. No-one seems to have mentioned the similarity to RFCs 1832 and 1831 though (XDR and RPC from Sun, originally specified in the late 80s). Why not just use those? I do. The libraries already exist for many languages.
Hi Google,
you did not specify which license. A lot of us will be really interested if the license makes sense to us.
Can you add license information when posting such an announcement next time?
thank you,
BR,
~A
@anjanbacchu: seems to be licensed under Apache v2.
@Lapo: The first thing I too thought about when I read this was "why not use ASN.1? That's what you just described." As mentioned previously it is used all over the computing world (though few realize it), most notably in telecom where you can have very tight constraints on memory, CPU processing power and throughput. It has proven itself in all of those categories since the mid-80's.
@Kenton: Is there any way to get some information from other Google team members about whether or not ASN.1 was considered when Protocol Buffers was being developed? Soon I am going to be pitching an efficient way to format data throughout my organization, and ASN.1 is my go-to after initial research. Is there a specific reason why it was not used, or why Protocol Buffers is or would be better? I'd like to hear it if so; perhaps I'm missing something.
Jose, if you wanted JSON representation and are using Java objects, take a look at the Gson project: http://code.google.com/p/google-gson
@Kenton: You'll admit that it's hard to swallow that it's up to me to tell you why your protocol would be faster than XML :)
What if I was writing a blog post saying the opposite without explaining any use case or any details on how/why it may be faster?
What bothers me here is not the claim you make (heck, like I was saying elsewhere, it would make little sense to release/use it otherwise) but the fact that it's a bold claim that lacks professionalism.
I added the project on Ohloh: http://www.ohloh.net/projects/protobuf
@nharward and @Lapo: I had exactly the same thought! Why not just (a subset of) ASN.1? And DER is only one form of concrete encoding rules; there's also XER (XML-based) and the more binary-compact PER.
seems ok to me so far
but hopefully it won't become as complex as XML :-)
Great technique for in-house development.
I think it is far from an enterprise grade technology, even though Google is using it. It has nothing over XML or ASN.1.
In 20 years' time when you migrate this to a new platform, you will wish you had done it in XML.
In two years, when you plan to extend this, you'll wish you had done it in XML.
Performance, granted, I'm sure it is faster. But parsing ".properties" files is even faster.
Say, one day you decide you have the perfect protocol buffer person:
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
It is so perfect that you foolishly decide every application in the house should use it, right?
Then someone needs a unique identifier, sort of an SSN, and that's mandatory. This would leave the Person protocol buffer useless for this application and if you want to change it across the board it will be backwards incompatible. All of a sudden you have to write glue/workaround code and logic.
In XML, you just craft a new DTD or XSD and use a separate namespace, and the other applications (if parsing appropriately) can even share the same files or messages; they won't notice the change.
But maybe this can be solved, I'm sure it could in the future.
If I were to use it instead of XML or any other similar technology, I would be very careful and think hard about the future of the application that will use it.
Hey, why not throw an XSL for this? Any one interested?
I'm reading the documentation of ProtocolBuffers and it looks interesting, but it looks to me that Google reinvented the wheel.
ASN.1 has existed for years and years and is an ITU standard. It is extensively used in the telecom industry, e.g. call records of GSM switches, among other platforms like GPRS, MMS, SMS and so on.
ASN.1 has different types of encoding rules, BER being the most common one. There are also DER, PER and even XER, which actually encodes the protocol in XML.
What is not so good about ASN.1? Its learning curve is steeper than PB's, for sure, and, because it is not so widespread, the few compilers available are really really expensive.
There is a good open source compiler called asn1c and you can also use the ASN.1 module that comes with Erlang (remember that Ericsson is a telecom company).
There are two books about ASN.1 available for free on the Internet.
Anyway, good work guys.
Maybe now, because Google's name is associated with this initiative, people will wake up and see that, despite XML being a nice technology, it is not the universal panacea to solve all interfacing problems, configuration files, etc etc.
What I never liked about XML is that it costs a lot to process.
Cheers,
@Andretti: "far from an enterprise grade technology" -- Are you saying this because Protocol Buffers are simple? Something doesn't have to be complicated to be "enterprise grade technology", and since Google uses Protocol Buffers "for almost all of its internal RPC protocols" (as it is written on the Google Code homepage) it surely isn't a toy-project. Moreover, your comment about lack of extensibility is wrong: Protocol Buffers were designed specifically to gracefully handle the use-case you mentioned.
According to the docs, one of the scalar value types is uint32. The mapping to Java is to a Java "int", which is indeed 32 bits, but is signed. Which means if you're writing to the value, say "2^32 - 1" via C++ as a uint32, and then reading it back in Java, you'll get a signed number back. That is, what you put in isn't what you get out.
Am I missing something here? Seems like the Java native type for uint32 should be "long", something big enough to actually hold the value.
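The effect described in this comment can be reproduced without any protobuf code. The sketch below uses Python's `struct` module to reinterpret the same 32 bits as unsigned and as signed; the helper names are made up for illustration. The point it shows is that no bits are lost in a signed 32-bit slot, only the interpretation changes, so the unsigned value can be recovered by widening and masking (in Java that would be `value & 0xFFFFFFFFL`).

```python
import struct


def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


def as_unsigned_32(value: int) -> int:
    """Recover the unsigned reading of a signed 32-bit value by masking."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    sent = 2**32 - 1                  # written as uint32 by the sender
    seen = as_signed_32(sent)         # what a signed 32-bit reader reports
    recovered = as_unsigned_32(seen)  # same bits, read as unsigned again
    print(sent, seen, recovered)
```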
@andretti:
If I've understood you correctly I think you've missed a facet of Protocol Buffers:
Those =1, =2, =3, etc. in the format spec are field identifiers. This means that you could add a UID field with field ID 4 - older code that didn't recognize the new field could safely ignore it, while new code could take advantage of it.
If the author of a message is legacy code and the receiver is new code expecting a UID field - well, then you're stuck in any case.
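The field-number argument above can be sketched in a few lines of plain Python. This is a hand-written reader, not the Protocol Buffers library: it handles only the varint and length-delimited wire types, the `uid = 4` field is the hypothetical addition from the comment, and the schema dictionaries and function names are assumptions for illustration. The idea it shows is that each field announces its own number and wire type, so a reader can step over a field it has never heard of.

```python
def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def parse_fields(buf: bytes, known: dict[int, str]) -> dict[str, object]:
    """Decode the fields listed in `known`; silently skip everything else."""
    pos = 0
    out: dict[str, object] = {}
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = known.get(field_number)
        if name is not None:  # unknown field numbers are simply dropped
            out[name] = value
    return out


# Bytes for name = "Al" (field 1), id = 3 (field 2), uid = 99 (new field 4).
NEW_MESSAGE = bytes([0x0A, 0x02, 0x41, 0x6C, 0x10, 0x03, 0x20, 0x63])
OLD_SCHEMA = {1: "name", 2: "id", 3: "email"}
NEW_SCHEMA = {**OLD_SCHEMA, 4: "uid"}

if __name__ == "__main__":
    print(parse_fields(NEW_MESSAGE, OLD_SCHEMA))  # old reader: no uid key
    print(parse_fields(NEW_MESSAGE, NEW_SCHEMA))  # new reader: uid present
```

The reverse direction also works in this sketch: a new reader given bytes from an old writer just finds no `uid` key, which is why the new field has to be treated as optional.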
HDF is a very mature API (C/C++/FORTRAN) for storing huge amounts (read: gigabytes to terabytes) of structured scientific and simulation data. Protocol buffers and HDF don't really address the same problem space and are not in 'conflict'.
Before rushing into using protocol buffers I would suggest looking at the code generators from CodeSynthesis and Liquid Technologies that generate C++/Java/C# from XML .XSD files. If speed/data size is not an issue then they provide the same ease of use.
@tsuna
You are right, it does not have to be complex to be enterprise grade. But it has to be proven outside Google's ecosystem and withstand the test of time.
In other words it lacks maturity, and that is something you get with time. You cannot produce a mature product from zero, and Google knows this; that is why it is labeled as beta.
As for the example: I haven't read it all, just a few pages of documentation, and my impression is that you will have to extend the original definition. That will work with your new application, but you cannot exchange messages directly with the apps using the base definition without manual conversion.
You have to deserialize/serialize from PersonWithSSN to Person in order to talk to older apps.
@Andretti: it's not because Google releases it publicly now that this code is recent/immature. It's labeled as beta because ... well I guess because Google likes to have beta. But within Google this has been used for years (go to research.google.com, several scientific publications on Google's infrastructure mention the existence of protocol buffers) and every day, millions of people query Google's services, and all of these people (even you right now) are using Protocol Buffers. So this is very mature and broadly used technology.
And no, if you extend the definition of a person by adding a SSN field, older code will still be able to read new messages (without recompiling/restarting servers). This is one of the most important features of Protocol Buffers, they ARE backward compatible.
@Andretti: You're right in one sense, the new data can never be 'required', it has to be 'optional', otherwise it won't be backward compatible. I guess that's just the way PB handles backward compatibility; but it has been working great.
Sounds a bit like Facebook Thrift. The lightweight representation seems great!
Regarding ASN.1: I don't want to get in a flamewar here, but basically my feeling on ASN.1 is that it is a very complicated standard, and the more complicated a standard is, the harder it is to implement it well. Protocol buffers are a very simple spec with a very high-quality implementation.
Regarding JSON: JSON is structured similarly to Protocol Buffers, but protocol buffer binary format is still smaller and faster to encode. JSON makes a great text encoding for protocol buffers, though -- it's trivial to write an encoder/decoder that converts arbitrary protocol messages to and from JSON, using protobuf reflection. This is a good way to communicate with AJAX apps, since making the user download a full protobuf decoder when they visit your page might be too much.
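As a rough illustration of the reflection idea, the sketch below writes one generic encoder/decoder that works for any message class by walking its field list. It does not use the protobuf reflection API: Python dataclass introspection stands in for it here, and the `Person` class and both function names are assumptions made up for the example.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Optional


@dataclass
class Person:
    """Hypothetical stand-in for a generated message class."""
    name: str
    id: int
    email: Optional[str] = None


def message_to_json(message) -> str:
    """Serialize any dataclass instance by walking its declared fields."""
    if not is_dataclass(message) or isinstance(message, type):
        raise TypeError("expected a dataclass instance")
    payload = {}
    for field in fields(message):
        value = getattr(message, field.name)
        if value is not None:  # unset optional fields are left out
            payload[field.name] = value
    return json.dumps(payload, sort_keys=True)


def message_from_json(cls, text: str):
    """Build an instance of cls from JSON, ignoring keys it does not declare."""
    data = json.loads(text)
    declared = {field.name for field in fields(cls)}
    return cls(**{key: value for key, value in data.items() if key in declared})


if __name__ == "__main__":
    encoded = message_to_json(Person(name="Al", id=3))
    print(encoded)
    print(message_from_json(Person, encoded))
```

Neither function mentions `Person` by name, which is the property the comment is pointing at: one encoder covers every message type.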
@Andretti: If you add a new field to a protocol buffer type, you can still use it to read messages written before that field existed, and old software can read new messages that include the field without any problems (it just ignores the new field). This is arguably the most important feature of protocol buffers. I'm not sure what you mean about protocol buffers not being "enterprise-class", but we have used them successfully for almost all our communications and file formats at Google for about seven years now.
@Lawouach: Sorry to disappoint. Creating realistic benchmarks is really hard, since different use cases will produce completely different results. Even if we did provide benchmarks, you would still need to benchmark it for yourself before using it to make sure it works well for your particular use case, whatever that might be.
Pity you chose to reinvent Corba IDL and IIOP instead of using it. Corba has some quirks, but it is a mature and proven technique. There are excellent open source products like JacORB, TAO or IIOP.NET that interoperate very well with each other and J2EE.
Google could have been compatible with the Corba world.
Regarding ASN.1, I think ASN.1 is not so complex if you only use a defined subset of it. And I recently found out about a Java Open-Source Compiler called openASN.1 ( http://www.openasn1.org ). It's just a pity it is in German only. Maybe someone with a better grasp of English than I have could translate the documentation.
I really would like to see a detailed comparison between protocol buffers and ASN.1 so I could decide what would be better.
Regarding comparison to HDF5, The HDF Group has posted an initial comparison at http://wiki.hdfgroup.org/Google+Protocol+Buffers+and+HDF5
If you would like to add comments, the invite key for write/comment privileges is hdf5wiki.
@Wabble: Protocol Buffers is definitely not Corba IIOP. Please correct me if I'm wrong but in Corba all messages go through the "all knowing" ORB. This is what degraded performance, you could never have a conversation directly with another client/server without the ORB bottleneck. What is appealing to me about Protocol Buffers is the fact that there is no ORB. And the IDL seems much simpler.
I hope to develop a benchmarking tool that compares SOAP, JSON, and Protocol Buffer based web services for a presentation. The presentation will cover the pros/cons of each technology when developing web services. I will compare things like ease of development, supported platforms, and throughput.
Interesting - it does look very similar to ASN.1. In which case it does seem like a bit of 'not invented here' wheel re-inventing; there might have been some common simplifications in using ASN.1 for both purposes. (For instance ASN1 is used in both LDAP and digital certificates, so using it for generic serialisation might work nicely with some crypto / directory apps).
The apparent complexity of ASN.1 is largely due to its flexibility - if you're using only the sort of functionality that pbuffer gives you, it would be pretty much the same, I would think. Conversely, it seems likely that pbuffer will need to grow to handle more complex structured data (or it risks simply being a blob transportation protocol).
Anyway, agree the concept is good; but it seems a bit strange not to learn from the past. Maybe that's just my inner grumpy old developer talking :-). Some of the tools around pbuffer are certainly things that could be usefully added while keeping ASN1 as the wire protocol; the .proto files are a lot easier than using an ASN1 compiler but can do the same job.
@Dan: Yes, everything goes through the ORB. But, technically, the Orb is a library that helps generated code to serialize data to IIOP and put it onto a socket, so there is not a lot of overhead. I agree, the Corba IDL has some quirks, i.e. you have to model an optional type T as sequence of T (that can have a length of 0).
I'd volunteer to provide the Corba and J2EE/EJB implementations of your benchmark.
Thank Google for sharing that with us.
As far as I understand, PB cannot be used to parse a stream in SAX-parser style. If so, its use is pretty much limited to small amounts of data. Also, there has to be some kind of framing around messages so the application knows when data is ready to be parsed. Am I missing something in its documentation on that?
Cheers
Alex
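On the framing question in this comment: one common way to put several messages on a single stream is to write each one with a length prefix. The sketch below is an assumption about how an application might do that, not something the post or the library documentation prescribes. It uses a fixed 4-byte big-endian length, and the function names are invented for the example.

```python
import io
import struct
from typing import BinaryIO, Iterator


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write one message preceded by its length as a 4-byte big-endian integer."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each framed message in turn until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:  # clean end of stream between frames
            return
        if len(header) < 4:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated payload")
        yield payload


if __name__ == "__main__":
    buffer = io.BytesIO()
    for message in (b"first", b"", b"third"):
        write_frame(buffer, message)
    buffer.seek(0)
    print(list(read_frames(buffer)))
```

Each payload would be one serialized message, so a large dataset becomes a sequence of small records that can be processed one at a time instead of a single huge message.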
I don't see a scalar value type for date/time. Does Google have a convention for handling dates within Protocol Buffers (e.g., use an int64 to store the number of milliseconds since January 1, 1970, 00:00:00 GMT)?
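The convention floated in this question, an int64 holding milliseconds since January 1, 1970, 00:00:00 GMT, is easy to sketch. Whether Google follows it is not stated in the thread, so treat the helper names and the choice of milliseconds below as assumptions taken from the question itself.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MILLISECOND = timedelta(milliseconds=1)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; attach a timezone first")
    return (moment - EPOCH) // ONE_MILLISECOND  # integer math, no float rounding


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


if __name__ == "__main__":
    posted = datetime(2008, 7, 7, tzinfo=timezone.utc)
    stored = to_epoch_millis(posted)  # the value that would go in an int64 field
    print(stored, from_epoch_millis(stored).isoformat())
```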
Thank you very much!
The primary reason for XML is so that data can be self-describing and free of schema; it looks to me that Google (a web giant) has just taken a step backwards away from the web...
This looks so much like the SLICE data interchange definition language from Zeroc, which evolved out of Corba.
I guess I must ask, given the extreme ease of use of ICE, why reinvent this particular wheel?
> I guess I must ask, given the extreme ease of use of ICE, why reinvent this particular wheel
ICE is much more than PB; it provides RPC too. But ICE is GPL while PB is under the Apache license, which matters to some.
Another interesting thing to take into account: will we see a collection of existing protocol descriptions in protocol buffer format?
So, if I need to implement ICCP/TASE.2 in my application, I just download the .proto file, generate code and merge it with my app.
How are GPB messages transported? By TCP/IP? Can I transport them without TCP/IP, or is it a requirement?