Kenneth P. Birman
N. Rama Rao Professor of Computer Science
435 Gates Hall, Cornell University
Ithaca, New York 14853
W: 607-255-9199; M: 607-227-0894; F: 607-255-9143
Email: [email protected]
CV: Jan 2017
|
I'm on sabbatical, hence not teaching in F16 or S17, but I've been writing a kind of technology blog (really more of a series of essays). I welcome comments.
Current Research (full
publications list). At the end of this subsection is a list of links
to recent recorded videos.
- Derecho.
This project looks at ways of leveraging remote direct memory access (RDMA) technologies to move large data objects at wire speeds. The download site is derecho.codeplex.com.
Derecho
is the world's fastest Paxos library: it outperforms Corfu (the prior record-setting
system) and also scales to very large numbers of replicas.
Derecho also has an in-memory multicast mode that runs several orders of magnitude faster than any previous atomic multicast, and is more than 15,000x faster than our own Vsync library (embarrassing to admit, but the fact is that RDMA is just surreal and will change our world, so if someone is going to eat my lunch, it might as well be me!). A very cool feature is that Derecho automatically manages subgroups of a group (you associate them with classes in your code, so there might be a load-balancer subgroup, a cache layer, and a back-end). Plus those subgroups can be sharded: the cache could be 1000 nodes in shards of 2 each, for example. We currently do this as a one-level flat hierarchy, but the protocols can actually support a kind of recursive application of the same ideas.
Moreover, Derecho is highly efficient: if an application as a whole has 1025 members in the top-level group but most of the traffic involves multicasts in little shards of 2 or 3 members each, Derecho offers strong consistency for those multicasts (atomic multicast or Paxos), and yet only the participants in a particular exchange of bytes see those bytes or incur any load at all. This is unlike prior multicast systems, where subgroup multicasts involved sending to everyone and then discarding unwanted data.
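To make the subgroup and shard idea concrete, here is a minimal C++ sketch of how an application might describe such a layout. The names (Shard, Subgroup, make_subgroup) are invented for illustration; this is not the actual Derecho API, just the concept described above.

// Hypothetical sketch: a top-level group whose members are partitioned into
// role-based subgroups, each of which is further sharded.  These names are
// illustrative only; they are not the real Derecho API.
#include <iostream>
#include <string>
#include <vector>

struct Shard {
    std::vector<int> members;   // node ranks assigned to this shard
};

struct Subgroup {
    std::string role;           // e.g. "load-balancer", "cache", "back-end"
    std::vector<Shard> shards;  // replication happens within each shard
};

// Partition node_count nodes (starting at first_rank) into shards of shard_size.
Subgroup make_subgroup(const std::string& role, int first_rank,
                       int node_count, int shard_size) {
    Subgroup sg{role, {}};
    for (int base = 0; base < node_count; base += shard_size) {
        Shard s;
        for (int i = base; i < base + shard_size && i < node_count; ++i) {
            s.members.push_back(first_rank + i);
        }
        sg.shards.push_back(std::move(s));
    }
    return sg;
}

int main() {
    // A layout in the spirit of the text: a small load-balancer tier, a
    // 1000-node cache tier in shards of 2, and a back-end tier.
    std::vector<Subgroup> layout;
    layout.push_back(make_subgroup("load-balancer", /*first_rank=*/0,    5,    5));
    layout.push_back(make_subgroup("cache",         /*first_rank=*/5,    1000, 2));
    layout.push_back(make_subgroup("back-end",      /*first_rank=*/1005, 20,   4));

    for (const auto& sg : layout) {
        std::cout << sg.role << ": " << sg.shards.size() << " shard(s)\n";
    }
    // A multicast sent within one cache shard involves only its 2 members;
    // the other 1000+ nodes never see those bytes or do any work for them.
}

The point of the sketch is just the locality property: sends are addressed to a shard within a role, not to the full 1000+ member group.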
Derecho also scales in a fantastically
impressive way. We've tested with up to 256 replicas. For
smaller levels of replication, it comes nearly for free: you can make one
replica at 12.5GB/s, for example, or you can make 4 with essentially the same delay. You'll see some slight increase in delay and loss
of throughput by 64 replicas, and we are down by a factor of 3 or so in
the 256 case, but we think this could actually go away once we begin to
use a new feature of the verbs package called cross-channel
synchronization.
Core papers (both under submission, so cite these as technical reports/preprints, dated Sept 21, 2016):
1) The Derecho subgroup and sharding model.
2) The Derecho Paxos and atomic multicast protocol.
Derecho is built from two components: one to replicate data at RDMA speeds (RDMC) and one to build strong Paxos-style protocols over the replication substrate (SST). Over these components, we offer a very clean C++17 group communication API.
RDMC is our underlying data replication layer (accessible stand-alone, but normally used through the Derecho API, which offers RDMC as its "raw" mode). It can copy (replicate) huge blocks of memory (even gigabytes) or files at very close to the full data rate of an optical network. For example, on 100Gbps Mellanox RDMA hardware (more than 3x faster than memory-to-memory copying with memcpy on a fast server!) we can make 2 replicas of data in source memory at 112Gbps, and in scalability experiments we found that we could make 512 replicas for about 3x or 4x the time needed to make just one, which is kind of crazy if you think about it. With small events we are sending about 300,000 per second in groups of these sizes. It is almost embarrassing to point out that these speeds are about 10,000x faster than Vsync or any version of Paxos you've ever heard of.
The second component, which we call SST, is used to transform RDMC into a strong consistency model, much like the one used in Vsync. The novelty here centers on a concept we are calling monotonic properties and logic, which (as the name suggests) is all about things that stay true once a program learns them (stable in a formal sense), and ways to build up stronger protocols over these basic primitives.
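The monotonic idea is easy to illustrate. Below is a small, self-contained C++ sketch, with invented names (SharedTable, report_received, all_received) rather than the real SST code: each node publishes a counter that only grows, so a predicate such as "every node has received at least k messages" is stable in exactly the sense described above; once it becomes true it stays true, and a protocol step can safely be triggered by observing it.

// Illustrative sketch of the "monotonic predicate" idea described above.
// Each node publishes a counter that only grows (e.g., messages received);
// any predicate of the form "min over all counters >= k" is therefore
// stable: once true, it remains true forever.  Not the real SST code.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

class SharedTable {
    std::vector<uint64_t> received_;   // one monotonic counter per node
public:
    explicit SharedTable(std::size_t nodes) : received_(nodes, 0) {}

    // In the real system this cell would be updated over RDMA; here we
    // simply write local memory, and never let the value decrease.
    void report_received(std::size_t node, uint64_t count) {
        received_[node] = std::max(received_[node], count);
    }

    // Stable predicate: "every node has received at least k messages".
    bool all_received(uint64_t k) const {
        return std::all_of(received_.begin(), received_.end(),
                           [k](uint64_t c) { return c >= k; });
    }
};

int main() {
    SharedTable table(3);
    table.report_received(0, 5);
    table.report_received(1, 5);
    table.report_received(2, 4);
    std::cout << "deliverable up to 5? " << table.all_received(5) << '\n';  // 0
    table.report_received(2, 6);
    std::cout << "deliverable up to 5? " << table.all_received(5) << '\n';  // 1
    // Once true, all_received(5) can never revert to false, so a protocol
    // can deliver (or commit) message 5 the moment it observes this.
}

In the real system the table would be shared over RDMA rather than held in one process, but the reasoning pattern over stable predicates is the same.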
A talk on Derecho is here.
Key people: The PhD students playing lead
roles on this work are Sagar Jha,
Matt Milano and Edward Tremel. Robbert van
Renesse and I work closely with them. We
are treating Derecho as an open-source project and hoping to create a real
community around it that would include industry as well as other academic
groups. Why not join us in creating an RDMA revolution? The
download site is Derecho.codeplex.com. But I should warn that some
parts of the code base are still evolving as of fall 2016. We expect
it to be stable by the beginning of the winter.
- Freeze
Frame File System. We also have a cool
new file system for real-time applications. It soaks up updates from
time-synchronized sensors or other sources and captures the data into an
in-memory data structure. Then you can read this data at any instant in time that you wish, with very good temporal precision and also with
logical consistency guarantees. We attached our new file system to
Spark and it actually outperforms the Berkeley Spark+HDFS
setup, and we think we can also support the widely popular Kafka API,
which would make the solution immediately useful to two large user
communities. So our vision is that perhaps you capture real-time
data from some kind of IoT setting, or from
smart cars, or the smart grid. Then you flag "events of
interest" for instant analysis, and the analytic program (running on
Hadoop or on MPI) might want to visit snapshots of the past -- maybe lots
of them, at fine temporal resolution. We can materialize that data
instantly for you, on demand, replicate it onto a cluster of nodes running
Spark, and give all sorts of guarantees... a very cool new option!
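To give a feel for the kind of temporal read just described, here is a conceptual C++ sketch of an in-memory, timestamp-ordered log of sensor updates from which the state "as of time t" can be materialized by binary search. The names (SensorLog, value_at) are invented for illustration; this is not the FFFS implementation, just the idea of a snapshot read at a chosen instant.

// Conceptual sketch of a temporal snapshot read over an append-only,
// timestamp-ordered log of sensor updates (names invented; not FFFS code).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct Update {
    uint64_t timestamp_us;  // time from a synchronized clock, in microseconds
    double   value;         // the sensor reading recorded at that time
};

class SensorLog {
    std::vector<Update> log_;  // stays sorted because updates arrive in time order
public:
    void append(uint64_t ts_us, double value) { log_.push_back({ts_us, value}); }

    // "Read the state as of time t": the latest update with timestamp <= t.
    std::optional<double> value_at(uint64_t t_us) const {
        auto it = std::upper_bound(
            log_.begin(), log_.end(), t_us,
            [](uint64_t t, const Update& u) { return t < u.timestamp_us; });
        if (it == log_.begin()) return std::nullopt;  // nothing recorded yet at t
        return std::prev(it)->value;
    }
};

int main() {
    SensorLog voltage;
    voltage.append(1'000'000, 119.9);
    voltage.append(1'033'000, 120.2);
    voltage.append(1'066'000, 118.7);
    if (auto v = voltage.value_at(1'050'000))   // snapshot at t = 1.05 s
        std::cout << "voltage at t=1.05s: " << *v << '\n';  // prints 120.2
}

An analytic job that wants many closely spaced snapshots simply repeats this kind of read at a sequence of instants; the system-level work is in doing it consistently across many sensors and nodes, which is what FFFS provides.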
The two lead authors on this work are Weijia Song and Theo Gkountouvas. Weijia will present our first paper on FFFS at the ACM Symposium on Cloud Computing (SOCC) 2016.
- Cloud computing, Vsync. My longer-term research (talking now about the past 25 years!) is mostly concerned with what people are calling "Cloud Computing." Of course, Cloud Computing is just the buzzword of the day: back before these terms gained wide use, we would have said that I work on reliable distributed computing, applications involving reliable information collection or dissemination, and problems associated with security in complex distributed systems. Over the years, one of my main activities has been to create software libraries for people who want to build this kind of system; I've always been a bit of a hacker, and writing code keeps me sane.
Until we created Derecho, my most recent such
library was the one now called Vsync, and I've
recorded a series of detailed videos explaining precisely how to use it.
You'll find links to the videos (and even suggestions for projects you
could do) on vsync.codeplex.com
and hopefully, with this help, the library itself will be easy for you to
install and start using. I maintain it myself and welcome email with
bug reports, requests for help, suggestions, etc. Vsync includes simple ways to replicate data and build
fault-tolerant services, implementations of state machine replication and
virtual synchrony, a version of the famous Paxos
protocol, and more elaborate mechanisms like distributed locking and
key-value storage, real-time data aggregation and other features.
People work with it mostly from C#, C++/CLI (a version of C++ with some
extra .NET features to deal with managed memory) and IronPython
(one of the very popular versions of Python).
But
even though Derecho has an API a lot like Vsync,
be aware that Vsync is painfully slow compared
to Derecho. In some sense, Vsync is a toy
system once you compare it with this sort of cutting edge option.
- Connection to Paxos.
These days there is intense interest in Paxos, which makes it interesting to note that there is a core protocol called gbcast (short for "group atomic broadcast") in Vsync that dates back to 1983 and was probably the first of the "Paxos-equivalent" protocols to have become widely used -- Leslie Lamport's Paxos paper was first released in TR form in 1990. Gbcast doesn't actually look like Paxos, but in fact it does bisimulate classic Paxos and can even be transformed into Paxos, or Paxos into it, through a series of relatively simple steps. (In this use, bisimulate means that every run Gbcast could produce is a run Paxos could produce, and vice versa.) This said, I don't think of Gbcast as the first Paxos protocol, mostly because (1) Gbcast looks nothing like Paxos, (2) Tommy Joseph and I proved Gbcast correct in a 1983 paper, but our proof wasn't the kind of proof by refinement for which Leslie Lamport's work became so famous, and (3) the detailed handling of durability, namely recovery from a complete outage, is different from that used in Paxos: Paxos defines durability in terms of a complete log that captures and saves protocol state, but in virtual synchrony we focus more on application durability.
These days my view is that you really need
both. Virtual synchrony is the right model for partition-free
membership management in a group communication system (no "split
brain"), but the Paxos ordering and
durability properties are important too ("no rollback, state machine
replication"). You simply want both. In virtual synchrony
we struggled with speed and ended up with an optimistic early delivery
mode and a flush that you have to call before taking actions with external
consequences. This is clumsy, and in Derecho we do away with
it. But the Paxos log makes little sense
unless you actually want to use the log as the entire store of application
state. So you want an atomic multicast, and sometimes you want true
durable Paxos. To me this insight was a
long time arriving. I think Leslie Lamport
actually agrees with me on all of these points, by the way: they aren't
massive topics for heated debate.
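The optimistic-delivery-plus-flush pattern mentioned above looks roughly like the sketch below. The Group type and its send/flush methods are placeholders chosen for illustration, not the exact Vsync API: the point is simply that updates are delivered optimistically for speed, and a flush barrier is required before any action with external consequences.

// Sketch of the virtual synchrony pattern described above: updates are
// delivered optimistically for speed, but before any externally visible
// action the sender must flush, ensuring the updates are stable everywhere.
// The Group type and its methods are placeholders, not a real library API.
#include <iostream>
#include <string>

struct Group {
    void send(const std::string& update) {
        // Optimistic multicast: receivers may deliver before stability.
        std::cout << "multicast: " << update << '\n';
    }
    void flush() {
        // Blocks until all prior optimistically delivered messages are
        // stable (known to every member), so no "rollback" is possible.
        std::cout << "flush: prior updates now stable\n";
    }
};

void reply_to_client(const std::string& msg) {
    std::cout << "external reply: " << msg << '\n';
}

int main() {
    Group g;
    g.send("debit account 17 by $100");
    // ... perhaps more sends, all delivered optimistically and very fast ...
    g.flush();                          // required before external actions
    reply_to_client("debit accepted");  // safe: the update cannot be lost
}

In Derecho this explicit flush step goes away, which is exactly the clumsiness the text says we do away with.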
Today there are many protocols that are equivalent
to Paxos in this same bisimulation
sense. Leslie Lamport likes to call our
version "virtually synchronous Paxos,"
and while at first I found that a bit odd, now I like it more and
more. Of course there is a mind-boggling causality issue raised, but
the universe doesn't seem to have ended or anything like that.
Calling the gbcast protocol an early Paxos implementation might help people appreciate that
virtual synchrony isn't somehow in opposition to Paxos.
- Using all of this technology for IoT applications, notably the Smart Electric Power Grid. Under a new grant from the Department of Energy GENI program and with further support from NSF and DARPA, we're applying our insights concerning high-assurance cloud computing and Vsync to the challenge of controlling the smart power grid. We're currently working to transition the GridCloud system (we built it using Vsync, but also incorporating best-of-breed electric power technologies from our colleagues at Washington State University in Pullman) for experimental use by the New England and New York ISOs and NYPA, a major TO. Beyond the smart grid, we believe that this style of embedded, cloud-hosted real-time system could have many other applications. Some of the more exciting components of GridCloud include CloudMake, a new management tool, and the forensic file system mentioned above, for capturing real-time data streams. Theo Gkountouvas leads on CloudMake, and he and Weijia Song have collaborated on the file system. Z Teo's work on IronStack is also very promising: he has a way to use SDN network routers in very demanding settings like weather-challenged environments. And Edward Tremel is looking at options for privacy-preserving data mining in large deployments.
Video links:
- A talk I
gave at MesosCon on our new RDMA research
(Derecho and the Freeze Frame File System) here.
- SOSP '15 History Day talk
on fault-tolerance and consistency, the CATOCS controversy, and the
modern-day CAP conjecture. My video is here and an
accompanying essay is here.
- Robbert van Renesse and me discussing how we got into this area of
research: here.
- Instruction
videos about learning to use the Vsync library
(previously known as Isis2): On the documentation page here or on YouTube (tagged "Vsync").
I'm planning to revise these into a series of short 10-minute
segments. We also plan to do a set of short segments on using
Derecho.
Older
work. I've really worked in Cloud Computing for most of my career, although it
obviously wasn't called cloud computing in the early days. As a result, our
papers in this area date back to 1985. Some examples of mission-critical
systems on which my software was used in the past include the New York Stock
Exchange and Swiss Exchange, the French Air Traffic Control system, the AEGIS
warship and a wide range of applications in settings like factory process
control and telephony. In fact, every stock quote or trade on the NYSE from
1995 until early 2006 was reported to the overhead trading consoles through
software I personally implemented - a cool (but also scary) image, for me at
least! During the ten years this system was running, many computers crashed
during the trading day and many network problems occurred - but the design we developed and implemented reconfigured itself automatically and kept the overall system up, without exception. They didn't
have a single trading disruption during the entire period. As far as I know,
the other organizations listed above have similar stories to report.
Today, these kinds of ideas are gaining "mainstream" status. For example, IBM's WebSphere
6.0 product includes a multicast layer used to replicate data and other runtime
state for high-availability web service applications and web sites. Although
IBM developed its own implementation of this technology, we've been told by the
developers that the architecture was based on Cornell's Horus and Ensemble
systems, described more fully below. The CORBA architecture includes a
fault-tolerance mechanism based on some of the same ideas. And we've also
worked with Microsoft on the technology at the core of the next generation of
that company's clustering product. So, you'll find Cornell's research not just
on these web pages, but also on web sites worldwide and in some of the world's
most ambitious data centers and high availability computing systems.
In fact we still
have very active dialogs with many of these companies: Cisco, IBM, Intel,
Microsoft, Amazon, and others. An example of an ongoing dialog is this: we
recently worked with Cisco to invent a new continuous availability option for
their core Internet routers, the CRS-1 series. You can read about this work here.
My group often works with vendors and
industry researchers. We maintain a very active dialog with the US government
and military on research challenges emerging from the future-generation communication systems now being planned by organizations like
the Air Force and the Navy. We've even worked on new ways of controlling the
electric power grid, but not in time to head off the big blackout in 2003!
Looking to the future, we are focused on needs arising in financial systems,
large-scale military systems, and even health-care networks. (In this
connection, I should perhaps mention that although we do get research support
from the government and the US military, none of our research is classified or
even sensitive, and all of it focuses on widely used commercial standards and
platforms. Most of our software is released for free, under open source
licenses.)
I'm just one of several members of a group
in this area at Cornell. My closest colleagues and co-leaders of the group are Robbert van Renesse and Hakim
Weatherspoon. We also collaborate with Gun Sirer,
Paul Francis, Al Demers and Johannes Gehrke, as well
as with other systems faculty members at Cornell: Andrew, Fred, Rafael, Joe,
etc. The systems group is close-knit, and many of our students are jointly
advised by other faculty members in the systems area. Werner Vogels worked with us until September 2004, when he joined
Amazon.com as Vice President and Director for Systems Research.
Four generations of reliable distributed systems research! Overall, our group has developed three generations of technology and is now working on a fourth-generation system: the Isis Toolkit (developed mostly during 1987-1993), the Horus system (developed from 1990 until around 1995), and the Ensemble system (1995-1999). Right now we're developing a number of new systems, including Isis2, Gradient, and the reliable TCP solution mentioned above, and working with others to integrate those solutions into settings where reliability, security, consistency and scalability are make-or-break requirements.
Older Research web pages:
Live Objects, Quicksilver, Maelstrom, Ricochet and Tempest projects
Ensemble project
Horus project
Isis Toolkit (really
old stuff! This is from the very first version of Isis).
A collection of papers on Isis, which I edited with Robbert van Renesse,
may still be available -- it was called Reliable Distributed Computing with
the Isis Toolkit and was in the IEEE Press Computer Science series.
Graduate
Studies in Computer Science at Cornell: At this time of the year, we get large numbers of inquiries about our
PhD program. I want to recommend that people interested in the program not contact
faculty members like me directly with routine questions like "can your
research group fund me". As you'll see from the web page, Cornell does
admissions by means of a committee, so individual faculty members don't
normally play a role. This is different from many other schools -- I realize that
at many places, each faculty member admits people into her/his own group. But
at Cornell, we admit you first, then you come here, and then you affiliate with
a research group after a while. Funding is absolutely guaranteed for people in
the MS/PhD program during the whole time they are at Cornell. On the other
hand, students in the MEng program generally need to pay their own way.
Obviously, some people have more direct,
specific questions, and there is no problem sending those to me or to anyone
else. But as for the generic "can I join your research group?" the
answer is that while I welcome people into the group if they demonstrate good
ideas and talent in my area, until you are here and take my graduate course and
spend time talking with me and my colleagues, how can we know if the match is
good? And most such inquiries are from people who haven't yet figured out quite
how many good projects are underway at Cornell. Perhaps, on arrival, you'll
take Andrew Myers's course in language-based security and realize that this is
your passion. So at Cornell, we urge you to take time to find out what areas we
cover and who is here, to take some courses, and only then affiliate with a
research group. But please knock on my door any time you like! I'm more than happy
to talk to any student in the department about anything we're doing here!