Kenneth P. Birman
N. Rama Rao Professor of Computer Science
435 Gates Hall, Cornell University
Ithaca, New York 14853
W: 607-255-9199; M: 607-227-0894; F: 607-255-9143
Email: [email protected]
CV: Jan 2017
|
I'm on sabbatical, hence not teaching in F16 or S17, but I've been writing a kind of technology blog (really more of a series of essays). I welcome comments.
Current Research (full
publications list). At the end of this subsection is a list of links
to recent recorded videos.
- Derecho.
This project looks at ways of leveraging remote direct memory access (RDMA) technologies to move large data objects at wire speeds. The download site is derecho.codeplex.com.
Derecho
is the world's fastest Paxos library: it outperforms Corfu (the prior record-setting
system) and also scales to very large numbers of replicas.
Derecho also has an in-memory multicast mode that runs several orders of magnitude faster than any previous atomic multicast, and is more than 15,000x faster than our own Vsync library (embarrassing to admit, but the fact is that RDMA is just surreal and will change our world, so if someone is going to eat my lunch, it might as well be me!). A very cool feature is that Derecho automatically manages subgroups of a group (you associate them with classes in your code, so there might be a load-balancer subgroup, a cache layer, and a back-end). Plus those subgroups can be sharded: the cache could be 1000 nodes in shards of 2 each, for example. We currently do this as a one-level flat hierarchy, but the protocols can actually support a kind of recursive application of the same ideas.
Moreover, Derecho is highly efficient: if an application as a whole has 1025 members in the top-level group but most of the traffic involves multicasts in little shards of 2 or 3 members each, Derecho offers strong consistency for those multicasts (atomic multicast or Paxos), and yet only the participants in a particular exchange of bytes see those bytes or incur any load at all. This is unlike prior multicast systems, where subgroup multicasts involved sending to everyone and then discarding unwanted data.
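To make the subgroup and shard idea concrete, here is a minimal C++ sketch of how an application might describe such a layout. The names (Shard, Subgroup, make_subgroup) are invented for illustration; this is not the actual Derecho API, just the concept described above.

// Hypothetical sketch: a top-level group whose members are partitioned into
// role-based subgroups, each of which is further sharded.  These names are
// illustrative only; they are not the real Derecho API.
#include <iostream>
#include <string>
#include <vector>

struct Shard {
    std::vector<int> members;   // node ranks assigned to this shard
};

struct Subgroup {
    std::string role;           // e.g. "load-balancer", "cache", "back-end"
    std::vector<Shard> shards;  // replication happens within each shard
};

// Partition node_count nodes (starting at first_rank) into shards of shard_size.
Subgroup make_subgroup(const std::string& role, int first_rank,
                       int node_count, int shard_size) {
    Subgroup sg{role, {}};
    for (int base = 0; base < node_count; base += shard_size) {
        Shard s;
        for (int i = base; i < base + shard_size && i < node_count; ++i) {
            s.members.push_back(first_rank + i);
        }
        sg.shards.push_back(std::move(s));
    }
    return sg;
}

int main() {
    // A layout in the spirit of the text: a small load-balancer tier, a
    // 1000-node cache tier in shards of 2, and a back-end tier.
    std::vector<Subgroup> layout;
    layout.push_back(make_subgroup("load-balancer", /*first_rank=*/0,    5,    5));
    layout.push_back(make_subgroup("cache",         /*first_rank=*/5,    1000, 2));
    layout.push_back(make_subgroup("back-end",      /*first_rank=*/1005, 20,   4));

    for (const auto& sg : layout) {
        std::cout << sg.role << ": " << sg.shards.size() << " shard(s)\n";
    }
    // A multicast sent within one cache shard involves only its 2 members;
    // the other 1000+ nodes never see those bytes or do any work for them.
}

The point of the sketch is just the locality property: sends are addressed to a shard within a role, not to the full 1000+ member group.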
Derecho also scales in a fantastically
impressive way. We've tested with up to 256 replicas. For
smaller levels of replication, it comes nearly for free: you can make one
replica at 12.5GB/s, for example, or you can make 4 with essentially the same delay. You'll see some slight increase in delay and loss
of throughput by 64 replicas, and we are down by a factor of 3 or so in
the 256 case, but we think this could actually go away once we begin to
use a new feature of the verbs package called cross-channel
synchronization.
Core papers (both under submission, so cite these as technical reports/preprints, dated Sept 21, 2016):
1) The Derecho subgroup and sharding model.
2) The Derecho Paxos and atomic multicast protocol.
Derecho is built from two components: one to replicate data at RDMA speeds (RDMC) and one to build strong Paxos-style protocols over the replication substrate (SST). Over these components, we offer a very clean C++17 group communication API.
RDMC is our underlying data replication layer (accessible stand-alone, but normally used through the Derecho API, which offers RDMC as its "raw" mode). It can copy (replicate) huge blocks of memory (even gigabytes) or files at very close to the full data rate of an optical network. For example, on 100Gbps Mellanox RDMA hardware (more than 3x faster than memory-to-memory copying with memcpy on a fast server!) we can make 2 replicas of data in source memory at 112Gbps, and in scalability experiments we found that we could make 512 replicas for about 3x or 4x the time needed to make just one, which is kind of crazy if you think about it. With small events we are sending about 300,000 per second in groups of these sizes. It is almost embarrassing to point out that these speeds are about 10,000x faster than Vsync or any version of Paxos you've ever heard of.
The second component, which we call SST, is used to transform RDMC into a strong consistency model, much like the one used in Vsync. The novelty here centers on a concept we are calling monotonic properties and logic, which (as the name suggests) is all about things that stay true once a program learns them (stable in a formal sense), and ways to build up stronger protocols over these basic primitives.
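The monotonic idea is easy to illustrate. Below is a small, self-contained C++ sketch, with invented names (SharedTable, report_received, all_received) rather than the real SST code: each node publishes a counter that only grows, so a predicate such as "every node has received at least k messages" is stable in exactly the sense described above; once it becomes true it stays true, and a protocol step can safely be triggered by observing it.

// Illustrative sketch of the "monotonic predicate" idea described above.
// Each node publishes a counter that only grows (e.g., messages received);
// any predicate of the form "min over all counters >= k" is therefore
// stable: once true, it remains true forever.  Not the real SST code.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

class SharedTable {
    std::vector<uint64_t> received_;   // one monotonic counter per node
public:
    explicit SharedTable(std::size_t nodes) : received_(nodes, 0) {}

    // In the real system this cell would be updated over RDMA; here we
    // simply write local memory, and never let the value decrease.
    void report_received(std::size_t node, uint64_t count) {
        received_[node] = std::max(received_[node], count);
    }

    // Stable predicate: "every node has received at least k messages".
    bool all_received(uint64_t k) const {
        return std::all_of(received_.begin(), received_.end(),
                           [k](uint64_t c) { return c >= k; });
    }
};

int main() {
    SharedTable table(3);
    table.report_received(0, 5);
    table.report_received(1, 5);
    table.report_received(2, 4);
    std::cout << "deliverable up to 5? " << table.all_received(5) << '\n';  // 0
    table.report_received(2, 6);
    std::cout << "deliverable up to 5? " << table.all_received(5) << '\n';  // 1
    // Once true, all_received(5) can never revert to false, so a protocol
    // can deliver (or commit) message 5 the moment it observes this.
}

In the real system the table would be shared over RDMA rather than held in one process, but the reasoning pattern over stable predicates is the same.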
A talk on Derecho is here.
Key people: The PhD students playing lead
roles on this work are Sagar Jha,
Matt Milano and Edward Tremel. Robbert van
Renesse and I work closely with them. We
are treating Derecho as an open-source project and hoping to create a real
community around it that would include industry as well as other academic
groups. Why not join us in creating an RDMA revolution? The
download site is Derecho.codeplex.com. But I should warn that some
parts of the code base are still evolving as of fall 2016. We expect
it to be stable by the beginning of the winter.
- Freeze
Frame File System. We also have a cool
new file system for real-time applications. It soaks up updates from
time-synchronized sensors or other sources and captures the data into an
in-memory data structure. Then you can read this data at any instant in time that you wish, with very good temporal precision and also with
logical consistency guarantees. We attached our new file system to
Spark and it actually outperforms the Berkeley Spark+HDFS
setup, and we think we can also support the widely popular Kafka API,
which would make the solution immediately useful to two large user
communities. So our vision is that perhaps you capture real-time
data from some kind of IoT setting, or from
smart cars, or the smart grid. Then you flag "events of
interest" for instant analysis, and the analytic program (running on
Hadoop or on MPI) might want to visit snapshots of the past -- maybe lots
of them, at fine temporal resolution. We can materialize that data
instantly for you, on demand, replicate it onto a cluster of nodes running
Spark, and give all sorts of guarantees... a very cool new option!
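To give a feel for the kind of temporal read just described, here is a conceptual C++ sketch of an in-memory, timestamp-ordered log of sensor updates from which the state "as of time t" can be materialized by binary search. The names (SensorLog, value_at) are invented for illustration; this is not the FFFS implementation, just the idea of a snapshot read at a chosen instant.

// Conceptual sketch of a temporal snapshot read over an append-only,
// timestamp-ordered log of sensor updates (names invented; not FFFS code).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct Update {
    uint64_t timestamp_us;  // time from a synchronized clock, in microseconds
    double   value;         // the sensor reading recorded at that time
};

class SensorLog {
    std::vector<Update> log_;  // stays sorted because updates arrive in time order
public:
    void append(uint64_t ts_us, double value) { log_.push_back({ts_us, value}); }

    // "Read the state as of time t": the latest update with timestamp <= t.
    std::optional<double> value_at(uint64_t t_us) const {
        auto it = std::upper_bound(
            log_.begin(), log_.end(), t_us,
            [](uint64_t t, const Update& u) { return t < u.timestamp_us; });
        if (it == log_.begin()) return std::nullopt;  // nothing recorded yet at t
        return std::prev(it)->value;
    }
};

int main() {
    SensorLog voltage;
    voltage.append(1'000'000, 119.9);
    voltage.append(1'033'000, 120.2);
    voltage.append(1'066'000, 118.7);
    if (auto v = voltage.value_at(1'050'000))   // snapshot at t = 1.05 s
        std::cout << "voltage at t=1.05s: " << *v << '\n';  // prints 120.2
}

An analytic job that wants many closely spaced snapshots simply repeats this kind of read at a sequence of instants; the system-level work is in doing it consistently across many sensors and nodes, which is what FFFS provides.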
The two lead authors on this work are Weijia Song and Theo Gkountouvas. Weijia will present our first paper on FFFS at the ACM Symposium on Cloud Computing (SOCC) 2016.
- Cloud computing, Vsync. My longer-term research (talking now about the past 25 years!) is mostly concerned with what people are calling "Cloud Computing." Of course, Cloud Computing is just the buzzword of the day: back before these terms gained wide use, we would have said that I work on reliable distributed computing, applications involving reliable information collection or dissemination, and problems associated with security in complex distributed systems. Over the years, one of my main activities has been to create software libraries for people who want to build this kind of system; I've always been a bit of a hacker, and writing code keeps me sane.
Until we created Derecho, my most recent such
library was the one now called Vsync, and I've
recorded a series of detailed videos explaining precisely how to use it.
You'll find links to the videos (and even suggestions for projects you
could do) on vsync.codeplex.com
and hopefully, with this help, the library itself will be easy for you to
install and start using. I maintain it myself and welcome email with
bug reports, requests for help, suggestions, etc. Vsync includes simple ways to replicate data and build
fault-tolerant services, implementations of state machine replication and
virtual synchrony, a version of the famous Paxos
protocol, and more elaborate mechanisms like distributed locking and
key-value storage, real-time data aggregation and other features.
People work with it mostly from C#, C++/CLI (a version of C++ with some
extra .NET features to deal with managed memory) and IronPython
(one of the very popular versions of Python).
But
even though Derecho has an API a lot like Vsync,
be aware that Vsync is painfully slow compared
to Derecho. In some sense, Vsync is a toy
system once you compare it with this sort of cutting edge option.
- Connection to Paxos.
These days there is intense interest in Paxos, which makes it interesting to note that there is a core protocol called gbcast (short for "group atomic broadcast") in Vsync that dates back to 1983 and was probably the first of the "Paxos-equivalent" protocols to have become widely used -- Leslie Lamport's Paxos paper was first released in TR form in 1990. Gbcast doesn't actually look like Paxos, but in fact it does bisimulate classic Paxos and can even be transformed into Paxos, or Paxos into it, through a series of relatively simple steps. (In this use, bisimulate means that every run Gbcast could produce is a run Paxos could produce, and vice versa.) This said, I don't think of Gbcast as the first Paxos protocol, mostly because (1) Gbcast looks nothing like Paxos, (2) Tommy Joseph and I proved Gbcast correct in a 1983 paper, but our proof wasn't the kind of proof by refinement for which Leslie Lamport's work became so famous, and (3) the detailed handling of durability, namely recovery from a complete outage, is different from that used in Paxos: Paxos defines durability in terms of a complete log that captures and saves protocol state, but in virtual synchrony we focus more on application durability.
These days my view is that you really need
both. Virtual synchrony is the right model for partition-free
membership management in a group communication system (no "split
brain"), but the Paxos ordering and
durability properties are important too ("no rollback, state machine
replication"). You simply want both. In virtual synchrony
we struggled with speed and ended up with an optimistic early delivery
mode and a flush that you have to call before taking actions with external
consequences. This is clumsy, and in Derecho we do away with
it. But the Paxos log makes little sense
unless you actually want to use the log as the entire store of application
state. So you want an atomic multicast, and sometimes you want true
durable Paxos. To me this insight was a
long time arriving. I think Leslie Lamport
actually agrees with me on all of these points, by the way: they aren't
massive topics for heated debate.
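The optimistic-delivery-plus-flush pattern mentioned above looks roughly like the sketch below. The Group type and its send/flush methods are placeholders chosen for illustration, not the exact Vsync API: the point is simply that updates are delivered optimistically for speed, and a flush barrier is required before any action with external consequences.

// Sketch of the virtual synchrony pattern described above: updates are
// delivered optimistically for speed, but before any externally visible
// action the sender must flush, ensuring the updates are stable everywhere.
// The Group type and its methods are placeholders, not a real library API.
#include <iostream>
#include <string>

struct Group {
    void send(const std::string& update) {
        // Optimistic multicast: receivers may deliver before stability.
        std::cout << "multicast: " << update << '\n';
    }
    void flush() {
        // Blocks until all prior optimistically delivered messages are
        // stable (known to every member), so no "rollback" is possible.
        std::cout << "flush: prior updates now stable\n";
    }
};

void reply_to_client(const std::string& msg) {
    std::cout << "external reply: " << msg << '\n';
}

int main() {
    Group g;
    g.send("debit account 17 by $100");
    // ... perhaps more sends, all delivered optimistically and very fast ...
    g.flush();                          // required before external actions
    reply_to_client("debit accepted");  // safe: the update cannot be lost
}

In Derecho this explicit flush step goes away, which is exactly the clumsiness the text says we do away with.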
Today there are many protocols that are equivalent
to Paxos in this same bisimulation
sense. Leslie Lamport likes to call our
version "virtually synchronous Paxos,"
and while at first I found that a bit odd, now I like it more and
more. Of course there is a mind-boggling causality issue raised, but
the universe doesn't seem to have ended or anything like that.
Calling the gbcast protocol an early Paxos implementation might help people appreciate that
virtual synchrony isn't somehow in opposition to Paxos.
- Using all of this technology for IoT applications, notably the Smart Electric Power Grid. Under a new grant from the Department of Energy GENI program and with further support from NSF and DARPA, we're applying our insights concerning high-assurance cloud computing and Vsync to the challenge of controlling the smart power grid. We're currently working to transition the GridCloud system (we built it using Vsync, but also incorporating best-of-breed electric power technologies from our colleagues at Washington State University in Pullman) for experimental use by the New England and New York ISOs and NYPA, a major TO. Beyond the smart grid, we believe that this style of embedded, cloud-hosted real-time system could have many other applications. Some of the more exciting components of GridCloud include CloudMake, a new management tool, and the forensic file system mentioned above, for capturing real-time data streams. Theo Gkountouvas leads on CloudMake, and he and Weijia Song have collaborated on the file system. Z Teo's work on IronStack is also very promising: he has a way to use SDN network routers in very demanding settings like weather-challenged environments. And Edward Tremel is looking at options for privacy-preserving data mining in large deployments.
Video links:
- A talk I
gave at MesosCon on our new RDMA research
(Derecho and the Freeze Frame File System) here.
- SOSP '15 History Day talk
on fault-tolerance and consistency, the CATOCS controversy, and the
modern-day CAP conjecture. My video is here and an
accompanying essay is here.
- Robbert van Renesse and me discussing how we got into this area of
research: here.
- Instruction
videos about learning to use the Vsync library
(previously known as Isis2): On the documentation page here or on YouTube (tagged "Vsync").
I'm planning to revise these into a series of short 10-minute
segments. We also plan to do a set of short segments on using
Derecho.
Older
work. I've really worked in Cloud Computing for most of my career, although it
obviously wasn't called cloud computing in the early days. As a result, our
papers in this area date back to 1985. Some examples of mission-critical
systems on which my software was used in the past include the New York Stock
Exchange and Swiss Exchange, the French Air Traffic Control system, the AEGIS
warship and a wide range of applications in settings like factory process
control and telephony. In fact, every stock quote or trade on the NYSE from
1995 until early 2006 was reported to the overhead trading consoles through
software I personally implemented - a cool (but also scary) image, for me at
least! During the ten years this system was running, many computers crashed
during the trading day and many network problems occurred - but the design we developed and implemented reconfigured itself automatically and kept the overall system up, without exception. They didn't
have a single trading disruption during the entire period. As far as I know,
the other organizations listed above have similar stories to report.
Today, these kinds of ideas are gaining "mainstream" status. For example, IBM's WebSphere
6.0 product includes a multicast layer used to replicate data and other runtime
state for high-availability web service applications and web sites. Although
IBM developed its own implementation of this technology, we've been told by the
developers that the architecture was based on Cornell's Horus and Ensemble
systems, described more fully below. The CORBA architecture includes a
fault-tolerance mechanism based on some of the same ideas. And we've also
worked with Microsoft on the technology at the core of the next generation of
that company's clustering product. So, you'll find Cornell's research not just
on these web pages, but also on web sites worldwide and in some of the world's
most ambitious data centers and high availability computing systems.
In fact we still
have very active dialogs with many of these companies: Cisco, IBM, Intel,
Microsoft, Amazon, and others. An example of an ongoing dialog is this: we
recently worked with Cisco to invent a new continuous availability option for
their core Internet routers, the CRS-1 series. You can read about this work here.
My group often works with vendors and
industry researchers. We maintain a very active dialog with the US government
and military on research challenges emerging from the future-generation communication systems now being planned by organizations like
the Air Force and the Navy. We've even worked on new ways of controlling the
electric power grid, but not in time to head off the big blackout in 2003!
Looking to the future, we are focused on needs arising in financial systems,
large-scale military systems, and even health-care networks. (In this
connection, I should perhaps mention that although we do get research support
from the government and the US military, none of our research is classified or
even sensitive, and all of it focuses on widely used commercial standards and
platforms. Most of our software is released for free, under open source
licenses.)
I'm just one of several members of a group
in this area at Cornell. My closest colleagues and co-leaders of the group are Robbert van Renesse and Hakim
Weatherspoon. We also collaborate with Gun Sirer,
Paul Francis, Al Demers and Johannes Gehrke, as well
as with other systems faculty members at Cornell: Andrew, Fred, Rafael, Joe,
etc. The systems group is close-knit, and many of our students are jointly
advised by other faculty members in the systems area. Werner Vogels worked with us until September 2004, when he joined
Amazon.com as Vice President and Director for Systems Research.
Four generations of reliable distributed systems research! Overall, our group has developed three generations of technology and is now working on a fourth-generation system: the Isis Toolkit (developed mostly during 1987-1993), the Horus system (developed from 1990 until around 1995), and the Ensemble system (1995-1999). Right now we're developing a number of new systems, including Isis2, Gradient, and the reliable TCP solution mentioned above, and working with others to integrate those solutions into settings where reliability, security, consistency and scalability are make-or-break requirements.
Older Research web pages:
Live Objects, Quicksilver, Maelstrom, Ricochet and Tempest projects
Ensemble project
Horus project
Isis Toolkit (really
old stuff! This is from the very first version of Isis).
A collection of papers on Isis, which I edited with Robbert van Renesse,
may still be available -- it was called Reliable Distributed Computing with
the Isis Toolkit and was in the IEEE Press Computer Science series.
Graduate
Studies in Computer Science at Cornell: At this time of the year, we get large numbers of inquiries about our
PhD program. I want to recommend that people interested in the program not contact
faculty members like me directly with routine questions like "can your
research group fund me". As you'll see from the web page, Cornell does
admissions by means of a committee, so individual faculty members don't
normally play a role. This is different from many other schools -- I realize that
at many places, each faculty member admits people into her/his own group. But
at Cornell, we admit you first, then you come here, and then you affiliate with
a research group after a while. Funding is absolutely guaranteed for people in
the MS/PhD program during the whole time they are at Cornell. On the other
hand, students in the MEng program generally need to pay their own way.
Obviously, some people have more direct,
specific questions, and there is no problem sending those to me or to anyone
else. But as for the generic "can I join your research group?" the
answer is that while I welcome people into the group if they demonstrate good
ideas and talent in my area, until you are here and take my graduate course and
spend time talking with me and my colleagues, how can we know if the match is
good? And most such inquiries are from people who haven't yet figured out quite
how many good projects are underway at Cornell. Perhaps, on arrival, you'll
take Andrew Myers's course in language-based security and realize that this is
your passion. So at Cornell, we urge you to take time to find out what areas we
cover and who is here, to take some courses, and only then affiliate with a
research group. But please knock on my door any time you like! I'm more than happy
to talk to any student in the department about anything we're doing here!