Creating a sandbox—a safe area in which to run untrusted code—is a difficult problem. The successful sandbox implementations tend to come with completely new languages (e.g. Java) that are specifically designed to support that functionality. Trying to sandbox C code is a much more difficult task, but one that the Google Chrome web browser team has been working on.
The basic idea is to restrict the WebKit-based renderer—along with the various image and other format libraries that are linked to it—so that browser-based vulnerabilities are unable to affect the system as a whole. A successful sandbox for the browser would eliminate a whole class of problems that plague Firefox and other browsers that require frequent, critical security updates. Essentially, the browser would protect users from bugs in the rendering of maliciously-crafted web pages, so that they could not lead to system or user data compromise.
The Chrome browser, and its free software counterpart, Chromium, are designed around the idea of separate processes for each tab, both for robustness and security. A misbehaving web page can only affect the process controlling that particular tab, so it won't bring the entire browser down if it causes the process to crash. In addition, these processes are considered to be "untrusted", in that they could have been compromised by some web page exploiting a bug in the renderer. The sandbox scheme works by severely restricting the actions that untrusted processes can take directly.
At some level, Linux already has a boundary that isolates programs from the underlying system: system calls. A program that does no system calls should not be able to affect anything else, at least permanently. But it is a trivial program indeed that does not need to call on some system services. A largely unknown kernel feature, seccomp, allows processes to call a very small subset of system calls—just read(), write(), sigreturn(), and exit()—aborting a process that attempts to call any other. That is the starting point for the Chromium sandbox.
But, there are other system calls that the browser might need to make. For one thing, memory allocation might require the brk() system call. Also, the renderer needs to be able to share memory with the X server for drawing. And so on. Any additional system calls, beyond the four that seccomp allows, have to be handled differently.
A proposed change to seccomp that would allow finer-grained control over which system calls were allowed didn't get very far. In any case, that wasn't a near-term solution, so Markus Gutschke of the Chrome team went in another direction. By splitting the renderer process into trusted and untrusted threads, some system calls could be allowed for the untrusted thread by making the equivalent of a remote procedure call (RPC) to the trusted thread. The trusted thread could then verify that the system call, and its arguments, were reasonable and, if so, perform the requested action.
Chrome team member Adam Langley describes it this way:
The trusted thread can receive requests to make system calls from the untrusted thread over a socket pair, validate the system call number and perform them on its behalf. We can stop the untrusted thread from breaking out by only using CPU registers and by refusing to let the untrusted code manipulate the VM in unsafe ways with mmap, mprotect etc.
There are still problems with that approach, however. For one thing, the renderer code is large, with many different system calls scattered throughout. Turning each of those into an RPC is possible, but the resulting changes would then have to be maintained by the Chromium developers going forward. The upstream projects (WebKit, et al.) would not be terribly interested in those changes, so each new revision from upstream would need to be patched and then checked for new system calls.
Another approach might be to use LD_PRELOAD trickery to intercept the calls in glibc. That has its own set of problems as Langley points out: "we could try and intercept at dynamic linking time, assuming that all the system calls are via glibc. Even if that were true, glibc's functions make system calls directly, so we would have to patch at the level of functions like printf rather than write."
So, a method of finding and patching the system calls at runtime was devised. It uses a disassembler on the executable code, finds each system call, and turns it into an RPC to the trusted thread. Correctly parsing x86 machine code is notoriously difficult, but it doesn't have to be perfect. Because the untrusted thread runs in seccomp mode, any system call that is missed will not lead to a security breach; the kernel will abort the thread if it attempts any but the trusted four system calls.
The last piece of the puzzle is handling time-of-check-to-time-of-use race conditions. System call arguments that are passed in memory, via pointers or for system calls with too many arguments to fit in registers, can be changed by the, presumably subverted, untrusted thread between the time they are checked for validity and when they are used. To handle that, a trusted process, which is shared between all of the renderers, is created to check system calls that cannot be verified within the address space of the untrusted renderer.
The trusted process shares a few pages of memory with each trusted thread, which are read-only to the trusted thread, and read-write for the trusted process. System calls that cannot be handled by the trusted thread, either because some of the arguments live in memory, or because the verification process is too complex to be reasonably done in assembly code, are handed off to the trusted process. The arguments are copied by the trusted process into its address space, so they are immune to changes from the untrusted code.
While the current implementation is for x86 and x86-64—though there are still a few issues to be worked out with the V8 JavaScript engine on x86-64—there is a clear path to other architectures. Adapting or writing a disassembler and writing the assembly language trusted thread are the two pieces needed to support each additional architecture.
There are some potential pitfalls in this sandbox mechanism. Bugs in the implementation of the trusted pieces—either coding errors or mistakes made in determining which system calls and arguments are "safe"—could certainly lead to problems. Currently, deciding which calls to allow is done on an ad hoc basis: running the renderer, seeing which calls it makes, and deciding which are reasonable. The outcome of those decisions is then codified in syscall_table.c.
One additional, important area that is not covered by the sandbox is plugins like Flash. Restricting what plugins can do does not fit well with what users expect, which makes plugins a major vector for attack. Langley said that plugin support on Linux is relatively new, but "our experience on Windows is that, in order for Flash to do all the things that various sites expect it to be able to do, the sandbox has to be so full of holes that it's rather useless". He is currently looking at SELinux as a way to potentially restrict plugins but, for now, they are wide open.
This is a rather—some would say overly—complex scheme. It is still in the experimental stage, so changes are likely, but it does show one way to protect browser users from bugs in the HTML renderer that might lead to system or data compromise. It certainly doesn't solve all of the web's security problems, but could, over time, largely eliminate a whole class of attacks. It is definitely a project worth keeping an eye on.
[ Many thanks to Adam Langley, whose document was used as a basis for this article, and who patiently answered questions from the author. ]
Google's Chromium sandbox
Posted Aug 19, 2009 15:37 UTC (Wed) by johill (subscriber, #25196) [Link]
Also -- I first wondered why they weren't using processes to start with to get the secure/insecure boundary more defined, but once you think about it more it doesn't seem like you could then do the disasm stuff ... might be worth mentioning that :)
Either way, interesting method, and nice article!
Google's Chromium sandbox
Posted Aug 19, 2009 16:23 UTC (Wed) by jake (editor, #205) [Link]
I don't think, but don't know for sure, that it is required to have a thread to do the disassembling. I believe that is done by the untrusted thread before it handles any user input, and before it enters seccomp mode.
jake
Google's Chromium sandbox
Posted Aug 20, 2009 0:43 UTC (Thu) by cventers (subscriber, #31465) [Link]
I should have been more clear about why a thread is needed. Certain operations, memory allocation for example, cannot be done in one process on behalf of another because they don't share address space.
On the contrary, I experimented with a technique to do just that. This may not be the perfect solution for Chrome's needs, but I played around with the idea of open()ing a shared memory segment on the vfs, using ftruncate() to resize it, and then sending the fd via a UNIX-domain socket to the untrusted process and allowing it to mmap() the pages.
Now, in my case, I was using this technique to allow dynamically-grown, runtime-allocated shared memory segments between untrusted processes. There are still complications (such as the need to install a SIGBUS handler, since the untrusted process might ftruncate() the mmap()ed fd to 0, causing the trusted process to fault when it tries to access its own mapping), and perhaps the requirements for this kind of an implementation are not easy to satisfy for desktop applications. But it's Linux, and there's more than one way to do it. My implementation had the advantage of being architecture-agnostic, as well-behaved user-space code should be.
Google's Chromium sandbox
Posted Aug 20, 2009 0:58 UTC (Thu) by agl (guest, #4541) [Link]
Google's Chromium sandbox
Posted Aug 20, 2009 8:59 UTC (Thu) by mingo (subscriber, #31122) [Link]
Btw., (and I raised this on lkml too in the past - at that time the code I referred to was not upstream yet) there's a way you could further increase the restrictions (and hence, the security) of the untrusted seccomp thread: by using the C-expression filter engine that is included in the upstream kernel (right now used by ftrace; it will also be used by perfcounters).
The engine accepts an ASCII C-ish expression at runtime, such as:
"fd <= 2 && addr == 0x1234000 && len == 4096"
... and parses that into a cached list of safe predicates that the kernel will execute atomically on syscall arguments. Once parsed (by the kernel), the execution of the filter expression is very fast.
Despite it being used for tracing currently, the filter engine is generic and can be reused not just to limit trace entries of syscalls, but also to restrict execution on syscalls.
This is real, working code very close to what you need. With latest -tip you can use the filter engine on a per syscall basis, and the kernel knows about the parameter names of system calls. So on a testbox i can do this:
# cd /debug/tracing/events/syscalls/sys_enter_read
# echo "fd <= 2 && buf == 0x120000 && count == 1024" > filter
# cat filter
fd <= 2 && buf == 0x120000 && count == 1024
... and from that point on the kernel can execute that filter expression to limit trace entries that match the expression.
All you need is a small extension to seccomp to allow the installation of such expressions from user-space, by passing in the ASCII string. The filter engine can be used by unprivileged user-space as well. (but obviously the untrusted sandboxed thread should not be allowed to modify it.)
The filter engine has no deep dependence on tracing (other than being used by it currently) - it is a safe parser and atomic script execution engine that can be utilized by unprivileged tasks too and so it could be reused in seccomp and could be reused by other Linux security frameworks as well, such as selinux or netfilter.
Google's Chromium sandbox
Posted Aug 20, 2009 14:41 UTC (Thu) by paragw (guest, #45306) [Link]
How would one deal with which process can specify which other process or thread can do what syscalls with what arguments, and is the change permanent and localized w.r.t. the target thread? How does one go about safely modifying the restrictions dynamically - say the restricted thread needs to open a FD with user permission that wasn't in the originally specified restrictions list?

From what you described there seem to be some significant usability problems (need to have tracing enabled, the debug file system mounted, user-space access to the filtering mechanism, per-PID operation, etc.) that need to be addressed before it can become generally usable?
Google's Chromium sandbox
Posted Aug 20, 2009 19:33 UTC (Thu) by mingo (subscriber, #31122) [Link]
Does this approach work on a per process basis? I.e. do the restrictions apply to a particular process/thread while others are not impacted?
It's an engine - and as such it takes ASCII strings, turns them into a 'filter object' in essence which you can then attach to anything and pass in values to evaluate.
Note that there's nothing 'tracing' about that concept.
Right now we attach such filters to tracepoints - such as syscall tracepoints.
It could be attached via seccomp and to an untrusted process as well, with minimal amount of code, if there's interest to share this facility for such purposes.
Google's Chromium sandbox
Posted Aug 19, 2009 15:58 UTC (Wed) by johill (subscriber, #25196) [Link]
Why, for example, can an untrusted process look into my filesystem using getdents() without any checking?
I think that file should come with comments as to why each call is allowed, etc., because otherwise it's just a collection of arbitrary things; with that information it would at least be verifiable why/that each is needed.
Google's Chromium sandbox
Posted Aug 19, 2009 16:32 UTC (Wed) by foom (subscriber, #14868) [Link]
Why, for example, can an untrusted process look into my filesystem using getdents() without any checking?

Presumably because getdents() takes an already-open fd, and open() is sandboxed.
Qemu user space emulation
Posted Aug 19, 2009 16:07 UTC (Wed) by leonb (guest, #3054) [Link]
- L.
Qemu user space emulation
Posted Aug 19, 2009 16:19 UTC (Wed) by johill (subscriber, #25196) [Link]
VEX
Posted Aug 19, 2009 17:55 UTC (Wed) by abacus (subscriber, #49001) [Link]
VEX
Posted Aug 19, 2009 19:04 UTC (Wed) by agl (guest, #4541) [Link]
But also, we wouldn't want to transform all the code back and forth. By patching the code rather than transforming it we can reuse nearly all the .text pages and save memory.
Google's Chromium sandbox
Posted Aug 19, 2009 20:54 UTC (Wed) by kjp (subscriber, #39639) [Link]
Was there consideration of using x86 ring 1 or 2 for this purpose? Is that too architecture dependent?
Anyway... still an interesting idea. The syscall table looks refreshingly small. I noticed things like socket, connect aren't in there... I take it the network IO is still running in the trusted/main process?
Google's Chromium sandbox
Posted Aug 19, 2009 22:03 UTC (Wed) by agl (guest, #4541) [Link]
Also, you're correct that all network IO runs in the main browser process. This is actually a little unfortunate: it would be best to have a separate, sandboxed process for that but, alas, that's only a wishlist item for now.
Google's Chromium sandbox
Posted Aug 19, 2009 22:22 UTC (Wed) by ikm (subscriber, #493) [Link]
Google's Chromium sandbox
Posted Aug 19, 2009 23:36 UTC (Wed) by ncm (subscriber, #165) [Link]
Google's Chromium sandbox
Posted Aug 20, 2009 1:33 UTC (Thu) by njs (guest, #40338) [Link]
Google's Chromium sandbox
Posted Aug 20, 2009 2:40 UTC (Thu) by ncm (subscriber, #165) [Link]
Google's Chromium sandbox
Posted Oct 15, 2009 21:57 UTC (Thu) by SEJeff (subscriber, #51588) [Link]
Sandboxing made easy
Posted Aug 20, 2009 0:14 UTC (Thu) by man_ls (guest, #15091) [Link]
This is probably a stupid question, but I have to ask. Why not use read() and write() to make the untrusted part communicate with the trusted part, via a pipe? The untrusted part (a process) could decipher the HTML, and then send the result in an intermediate form to the trusted part (another process) for it to display that on the screen. Any compromise would have to generate an intermediate "poisoned" form that did something bad to the trusted part, but sending the malicious payload would be really difficult.

It does look quite complex, but the sandboxing is not trivial either.
Sandboxing made easy
Posted Aug 20, 2009 0:33 UTC (Thu) by Simetrical (guest, #53439) [Link]
Sandboxing made easy
Posted Aug 20, 2009 18:13 UTC (Thu) by man_ls (guest, #15091) [Link]
Ah, but of course -- sounds obvious once it is pointed out. Stupid dangers of memory management!
Sandboxing made easy
Posted Aug 20, 2009 16:25 UTC (Thu) by martine (guest, #59979) [Link]
This article describes the architecture used to make the HTML-decoding process both sandboxed and still powerful enough to convert HTML into images (which are then sent back to the trusted process).
Generic sandbox needed
Posted Aug 22, 2009 12:34 UTC (Sat) by Wout (guest, #8750) [Link]
If the kernel provided a flexible mechanism for an application to limit what it can do, the threat of hostile data could be reduced. A combination of user-level chroot ("This application doesn't need anything outside this directory.") and an allowed system call mask ("This application will only use these system calls, it doesn't need the rest.") should severely limit what an attacker can do.
Generic sandbox needed
Posted Sep 4, 2009 20:18 UTC (Fri) by cmccabe (guest, #60281) [Link]
I thought that this was what selinux was all about.
The basic idea behind selinux is that rather than using identity-based security, you use capability-based security.
Identity-based security works like this: I am a process started by bob, therefore I can do everything bob can do. Capability-based security works like this: bob starts a process and gives it only the capabilities it needs to do the work it's supposed to do.
So if bob runs a spell-checker program (aspell or whatever), it shouldn't have the capability to open network sockets and send messages to evilhackers.com. It's the difference between giving the application a few keys, to open the doors it needs, and giving it the whole keyring, which is what we do with traditional uid / gid based security.
It seems like what the google people are trying to do here is to reinvent the selinux concept with seccomp. I'm curious as to why. I guess selinux is difficult to set up and configure, and a lot of distributions have been slow to adopt it. Perhaps they are also trying to be cross-platform?
I'm also curious why Google is using threads rather than processes here. If you don't want to share your memory with the untrusted guy, processes are the obvious solution. As others have noted, you can always use posix shared memory if you feel the need to directly access the memory of the untrusted guy. As a bonus, you could run the untrusted processes as "nobody," and prevent them from doing a lot of nasty things -- even on a system like openBSD, where seccomp and selinux are unheard-of.
P.S.
I seem to remember that the openBSD ssh daemon was written in a similar way. There was a trusted part which ran as root, and an untrusted part which ran as a regular user.
Google's Chromium sandbox
Posted Aug 23, 2009 8:47 UTC (Sun) by oak (guest, #2786) [Link]
And btw, one can easily do a DoS with memory allocations. Just allocate a large enough amount of memory (but not so large that it would trigger the OOM killer) and then constantly write over it. The device is frozen, swapping, until the process is killed.
As to LD_PRELOAD and ptrace(), the former doesn't catch syscalls done directly in ASM, and AFAIK ptrace is racy (if I remember correctly, this was mentioned in the discussions about utrace).
Regarding things like Flash: until that can be secured, this doesn't really make the browser any safer for normal users. Most of the content on the web that non-technical people use and are interested in uses Flash in some way, especially for media delivery. What's the point of securing a mouse hole if the barn doors are wide open?
Google's Chromium sandbox
Posted Aug 23, 2009 14:49 UTC (Sun) by i3839 (guest, #31386) [Link]
For its design see http://www.cs.vu.nl/~guido/publications/ps/secrypt07.pdf
The rewritten version does some things differently and doesn't yet support all features of the original one. The code isn't released yet, but we plan to release it under a BSD-like license. If interested email Guido or me ([email protected]).
Google's Chromium sandbox
Posted Aug 29, 2009 5:20 UTC (Sat) by gmatht (subscriber, #58961) [Link]
However, I am "interested" in packaging this for Ubuntu. I really don't have time now, but I may drop you an email in a few months. Having an easy-to-use sandbox tool would be very nice.
Google's Chromium sandbox
Posted Oct 12, 2009 21:01 UTC (Mon) by cwitty (subscriber, #4600) [Link]
"Forbidden
You don't have permission to access /~guido/publications/ps/secrypt07.pdf on this server."
Google's Chromium sandbox
Posted Oct 21, 2009 10:36 UTC (Wed) by i3839 (guest, #31386) [Link]
Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds