Ayende @ Rahien

Inside RavenDB 4.0: Chapter 6 is done

Mon, 17 Jul 2017 06:00:00 GMT

I’ve just completed writing chapter 6 (distributed RavenDB) and pushed a preview up. This puts the page count at over 200 pages so far, with another two thirds or so left.

This chapter was really hard to write, and I would really appreciate any feedback that you have on the text and on the distributed nature of RavenDB 4.0 in general. It is both very similar to 3.x and an entirely different beast.

Reviewing Resin: Part III

Fri, 14 Jul 2017 09:00:00 GMT

In the previous part, I started looking at UpsertTransaction, but got sidetracked into the utils functions. Let us focus back on this. The key parts of UpsertTransaction are:

Let us see what they are. The DocumentStream is the source of the documents that will be written in this transaction. Its job is to get the documents to be indexed, give them a unique id if they don’t already have one, and hash them.

I’m not sure yet what is the point there, but we have this:

Which sounds bad. The likelihood is small, but it isn’t a crypto hash, so likely very easily broken. For example, look at what happened to MurmurHash.

I think that this is later used to handle some partitioning in the trie, but I’m not sure yet. We’ll look at the _storeWriter later. Let us see what the UpsertTransaction does. It builds a trie, then pushes each of the documents from the stream through the trie. The code is doing a lot of allocations, but I’m going to stop harping on that from now on.

The trie is called for each term for each document with the following information:

The code isn’t actually using tuple, I just collapsed a few classes to make it clear what the input is.

This is what will eventually allow the trie to do lookups on a term and find the matching document, I’m assuming.

That method is going to start a new task for that particular field name, if it is new, and push the new list of words for that field into the work queue for that task. The bad thing here is that we are talking about a blocking task, so if you have a lot of fields, you are going to spawn off a lot of threads, one per field name.

What I know now is that we are going to have a trie per field, and it is likely, based on the design decisions made so far, that a trie isn’t a small thing.

Next, the UpsertTransaction needs to write the document. This is done by taking the document we are processing and turning it into a dictionary of short to string. I’m not sure how it is supposed to handle multiple values for the same field, but I’ll ignore that for now. That dictionary is then saved into a file and its length and position are returned.

I know that I said that I won’t talk about performance, but I looked at the serialization code and I saw that it is using compression, like this. This is done on a field by field basis, while you could probably benefit from compressing them all together.

Those are a lot of allocations, and then we go a bit deeper:

First, we have the allocation of the memory stream, then the ToArray call, and that happens, per field, per document. Actually, if we go up, we’ll see:

So it is allocations all the way down.

Okay, let us focus on what is going on in terms of files:

"write.lock" – this one is pretty obvious
*.da – stands for document address. Holds a series of (long Position, int Size) of document addresses. I assume that this is using the same sort as something else, not sure yet. The fact that this is fixed size means that we can easily skip into it.
*.rdoc – documents are stored here. Contains the actual serialized data for the documents (the Dictionary<short, Field>), this is the target for the addresses that are held by the “*.da” files.
*.pk – holds document hashes. Holds a list of document pk hash and a flag saying if it is deleted, I’m assuming. From context, it looks like the hash is a way to update documents across transactions.
*.kix – key index. Text file holding the names of all the fields across the entire transaction.
*.pos – posting file. This one holds the tries that were built during the transaction. This is basically just List<(int DocumentId, int Count)>, but I’m not sure how they are used just yet. It looks like this is how Resin is able to get the total term frequency per document. It looks like this is also sorted.
*.tri – the trie files that actually contain the specific values for a particular field. The name pattern is “{indexVersion}-{fieldName}.tri”. That means that your field names are limited to valid file names, by the way.

The last part of the UpsertTransaction is the commit, which essentially boils down to this:

I think that this was a very insightful read, and I have a much better understanding of how Resin actually works. I’m going to speculate wildly, and then use my next post to check further into that.

Let us say that we want to search for all users who live in New York City. We can do that by opening the “636348272149533175-City.tri” file. The 636348272149533175 is the index version, by the way.

Using the trie, we search for the value of New York City. The trie value actually gives us a (long Position, int Size) into the 636348272149533175.pos file, which holds the postings. Basically, we now have an array of (int DocumentId, int Count) of the documents that matched that particular value.

If we want to retrieve those documents, we can use the 636348272149533175.da file, which holds the addresses of the documents. Again, this is effectively an array of (long Position, int Size) that we can index into using the DocumentId. This points to the location on the 636348272149533175.rdoc file, which holds the actual document data.
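
To make the last step concrete, here is a minimal sketch of indexing into a fixed-size address file. It is written in Python rather than Resin's C#, it uses in-memory streams instead of the real *.da and *.rdoc files, and the little-endian byte order and the function names are my assumptions, not something taken from the Resin code.

```python
import io
import struct

# Assumed record layout: (long Position, int Size), little-endian, 12 bytes.
ADDRESS = struct.Struct("<qi")


def write_documents(docs):
    """Write documents to a data stream and their addresses to an address stream."""
    data, addresses = io.BytesIO(), io.BytesIO()
    for doc in docs:
        position = data.tell()
        data.write(doc)
        addresses.write(ADDRESS.pack(position, len(doc)))
    return data, addresses


def read_document(data, addresses, document_id):
    """Seek straight to record number document_id, then read the document it points at."""
    addresses.seek(document_id * ADDRESS.size)
    position, size = ADDRESS.unpack(addresses.read(ADDRESS.size))
    data.seek(position)
    return data.read(size)


data, addresses = write_documents([b"alice", b"bob in New York City", b"carol"])
print(read_document(data, addresses, 1))
```

Because every address record has the same size, finding document N is a single multiplication and a seek, with no scanning of the address file.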

I’m not sure yet what the point of *.pk and *.kix is, but I’m sure we’ll figure it out in the next post.

RavenDB 4.0: The admin’s backdoor is piping hot

Thu, 13 Jul 2017 09:00:00 GMT

We take security very seriously. With the move to X509 certificates only for authentication (on all RavenDB editions) I feel that we have a really good story around securing RavenDB and controlling access to it.

Almost. One of the more annoying things about security is that you also need to consider the hard cases, such as the administrators messing up badly. As in, losing the credentials that allow you to administer RavenDB. This can happen because the database has just run without issue for so long that no one can remember where the keys are. That isn’t supposed to happen, but RavenDB has been in production usage for close to a decade now, which means that we have seen our fair share of mess ups (both our own and by customers).

In some cases, we have had to help a customer manage a third system handover between different hosting providers, which felt half like forensics and half like hacking. In short, when we design a system now, we also consider the fact that as secure as we want the system to be, there must be a way for an authorized person to get in.

If this made you cringe, you are in good company. I both love and hate this feature. I love it because it is going to be very useful, I hate it because it was a headache to get right. But I’m getting ahead of myself. What is this backdoor that I’m talking about?

Properly configured RavenDB will require a client certificate (that was registered in the cluster) to access the server. However, in addition to listening over HTTPS, RavenDB will also listen for commands on standard input. An admin can use the standard input / output as a way to talk with RavenDB without requiring any authentication. Basically, we expose a mini shell that you can use to enter commands and inspect and change our state.

Here is what it looks like when running in console mode:

From a security point of view, if a user is able to access my standard input, that usually means that they are the one who ran this process or are able to do so. RavenDB obviously won’t have any setuid bits turned on, so there is no need to worry about a user tricking us into doing something that the user doesn’t have permissions to do.

So using the console is a really nice way for us to offer the administrator an escape hatch to start messing with the internals of RavenDB in interesting ways. However, that only works if you are running RavenDB in interactive mode. What about when running as a service or daemon? They don’t have a standard input that is available to the admin. In fact, in most production deployments, you won’t have an easy time at all trying to connect to the console.

So that option is out, sadly. Or is it?

The nice thing about operating systems is that we can lean on them. In this case, we expose the exact same console that we have for stdin / stdout using Named Pipes (actually, Unix Sockets in Linux / Mac, but pretty much the same idea). The idea is that those are both methods for inter process communication that are local to the machine and can be secured by the operating system directly. In this case, we make sure that the pipe is only accessible to the RavenDB user (and to root / Administrator, obviously). That means that an admin can log into the box, run a single command and land in the RavenDB admin shell where they can manage the server. For example, by registering a new certificate in the server.

Because only the user running the RavenDB process or an administrator / root can access the pipe (ensured by setting the proper ACL on the pipe during creation) we know that there isn’t any security risk here. An admin can already override any security in the box, and the permissions are always on the user level, not the process level, so if you are running as the same user as the RavenDB process you can already do anything that RavenDB can do.
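
As an illustration of the Unix side of this idea, here is a minimal Python sketch of creating a Unix domain socket that only the owning user (and root) can connect to. This is not RavenDB's implementation; the socket path and function name are hypothetical, and the sketch relies on the common Unix behavior of deriving the socket file's mode from the process umask.

```python
import os
import socket
import stat
import tempfile


def open_admin_socket(path):
    """Bind a Unix domain socket that only the owning user (and root) can connect to."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Restrict the mode at creation time, so the socket is never briefly world-accessible.
    old_umask = os.umask(0o177)
    try:
        server.bind(path)
    finally:
        os.umask(old_umask)
    server.listen(1)
    return server


path = os.path.join(tempfile.mkdtemp(), "admin.sock")  # hypothetical location
server = open_admin_socket(path)
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))
server.close()
```

Setting the umask before the bind, rather than calling chmod afterwards, avoids a window in which another local user could connect.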

After we ensured that our security isn’t harmed by this option, we can relax knowing that we have an easy (and safe) way for the administrator to manage the server in an emergency.

In fact, the most obvious usage of this feature is during initial cluster setup, when you don’t have anything yet. This allows you to enter the system as a trusted party and do the initial configuration.

Reviewing Resin: Part II

Wed, 12 Jul 2017 09:00:00 GMT

In the first part of this series, I looked into how Resin is tokenizing and analyzing text. I’m still reading the code from the tests (this is because the Tests folder sorted higher than the Resin folder, basically) and I now moved to the second file, CollectorTests.

That one has a really interesting start:

There are a lot of really interesting things here: UpsertTransaction, document structure, issuing queries, etc. UpsertTransaction is a good place to start looking around, so let us poke in. When looking at it, we can see a lot of usage of the Utils class, so I’ll look at that first.

This is probably a bad idea. While using the current time ticks seems like it would generate ever increasing values, that is actually not the case, certainly not with local time (clock shift, daylight saving, etc). Using that for the purpose of generating a file id is probably a mistake. It is better to use our own counter, and just keep track of the last one we used on the file system itself.
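
A minimal sketch of what such a counter could look like, in Python rather than C#. The "last-id" file name and the function name are hypothetical, and the sketch assumes a single writer at a time (for example, whoever holds the write lock).

```python
import os
import tempfile


def next_file_id(directory):
    """Return the next id, persisting the last one used in a small counter file."""
    counter_path = os.path.join(directory, "last-id")  # hypothetical file name
    try:
        with open(counter_path, "r", encoding="ascii") as f:
            last = int(f.read())
    except FileNotFoundError:
        last = 0
    current = last + 1
    # Write to a temporary file and rename, so a crash never leaves a half-written counter.
    tmp_path = counter_path + ".tmp"
    with open(tmp_path, "w", encoding="ascii") as f:
        f.write(str(current))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, counter_path)
    return current


directory = tempfile.mkdtemp()
ids = [next_file_id(directory) for _ in range(3)]
print(ids)
```

Unlike clock ticks, the values here only ever go up, regardless of clock shifts or daylight saving changes.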

Then we have this:

It took me a while to figure out what was going on there, and then more to frantically search where this is used. Basically, this is used in fuzzy searches, and it will allocate a new instance of the string on each call. Given that fuzzy search is popular in full text search usage, and that this is called a lot during any such search, this is going to allocate like crazy. It would be better to move the entire thing to using mutable buffers, instead of passing strings around.

Then we go to the locking, and I had to run it a few times to realize what is going on.

And this isn’t the way to do this at all. Basically, this relies on the file system to fail when you are trying to copy a file into an already existing file. However, that is a really bad way to go about doing that. The OS and the file system already have locking primitives that you can use, and they are going to be much better than this option. For example, consider what happens after a crash: is the directory locked or not? There is no real way to answer that, since the process might have crashed, leaving the file in place, or it might be doing things, expecting that this is locked.
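
For comparison, here is a minimal sketch of leaning on an OS locking primitive, using Python's fcntl.flock (Unix only) rather than anything from Resin. The try_lock helper is hypothetical. The point it shows is that the lock belongs to the open file handle, so it goes away when the handle is closed or the process dies, even though the write.lock file itself stays on disk.

```python
import fcntl
import os
import tempfile


def try_lock(directory):
    """Try to take an exclusive advisory lock; return the open file, or None if it is held."""
    lock_file = open(os.path.join(directory, "write.lock"), "a")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file


directory = tempfile.mkdtemp()
first = try_lock(directory)       # acquired
second = try_lock(directory)      # refused while the first handle holds the lock
first.close()                     # closing (or crashing) releases the lock
third = try_lock(directory)       # acquired again, even though write.lock still exists
print(first is not None, second is None, third is not None)
```

With this approach, a leftover lock file after a crash is harmless, which answers the "is the directory locked or not?" question directly.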

Moving on, we have this simple looking method:

I know I’m harping on that, but this method is doing a lot of allocations by using lambdas, and depending on the number of files, the delegate indirection can be quite costly. For that matter, there is also the issue of error handling. If there is a lock file in this directory when this is called, this will throw.

Our final code for this post is:

I really don’t like this code. It is something that looks like it is cheap, but it will:

Sort all the index files in the folder
Open all of them
Read some data
Sum over that data

Leaving aside that the deserialization code has the typical issue of not checking that the entire buffer was read, this can cause a lot of I/O on the system, but luckily this function is never called.
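
The "entire buffer was read" issue is worth a small illustration. A read call on a stream is allowed to return fewer bytes than requested, so deserialization code needs to loop until it has everything or fail loudly. A minimal Python sketch of the idea follows; the helper names and the little-endian byte order are my assumptions, not Resin's code.

```python
import io
import struct


def read_exact(stream, count):
    """Read exactly count bytes, looping over short reads and failing on early end of stream."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"expected {count} bytes, got {count - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_int32(stream):
    """Deserialize a little-endian 32-bit integer (the byte order is an assumption)."""
    return struct.unpack("<i", read_exact(stream, 4))[0]


print(read_int32(io.BytesIO(struct.pack("<i", 42))))
try:
    read_int32(io.BytesIO(b"\x01\x02"))  # truncated input
except EOFError as e:
    print("truncated:", e)
```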

Okay, so we didn’t actually get to figure out what UpsertTransaction is, we’ll look at that in the next post.

Reviewing Resin: Part I

Tue, 11 Jul 2017 09:00:00 GMT

Resin is a “Cross-platform document database and search engine with query language, API and CLI”. It is written in C#, and while I admit that reading C# code isn’t as challenging as diving into a new language, a project that has a completely new approach to a subject that is near and dear to my heart is always welcome. It is also small, coming in at about 6,500 lines of code, so that makes for quick reading.

I’m reviewing commit ddbffff88995226fa52236f6dd6af4a48c833f7a.

As usual, I’m going to start reading the code in alphabetical file order, and then jump around as it makes sense. The very first file I run into is Tests/AnalyzerTests where we find the following:

This is really interesting, primarily because of what it tells me. Analyzers are pretty much only used for full text search, such as Lucene or Noise. Jumping into the analyzer, we see:

This tells me quite a few things. To start with, this is a young project. The first commit is less than 18 months ago and I’m judging it with the same eye I use when looking at our own code. This code needs to be improved, for several reasons.

First, we have a virtual method call here, probably intended to be an extension point down the line. Currently, it isn’t used, and we pay the virtual call cost for no reason. Next we have the return value. IEnumerable is great, but this method is using yield, which means that we’ll have a new instance created per document. For the same reason, the tokenDic is also problematic. This one is created per field’s value, which is going to cost.

One of the first things you want to have when you start worrying about performance is control over your allocations. Reducing allocations in this case, by reusing the dictionary instance or avoiding the yield, would help. Lucene did a lot of stuff right in that regard, and it ensures that you can reuse instances wherever possible (almost always), since that can dramatically improve performance.

Other than this, we can also see that we have Analyze and Index features, for now I’m going to assume that they are identical to Lucene until proven otherwise. This was the analyzer, but what is going on with the tokenizer? Usually that is a lot more low level.

The core of the tokenizer is this method (I prettified it a bit to make it fit better on screen):

As far as I can tell so far, most of the effort in the codebase has gone into the data structures used, not to police allocations or get absolute performance. However, even so this would be one of the first places I would look at whenever performance work would start. (To be fair, speaking with the author of this code, I know there hasn’t been any profiling / benchmarking on the code).

This code is going to be at the heart of any indexing, and for each value, it is going to:

Allocate another string with the lowered case value.
Allocate a character buffer of the same size as the string.
Process that character buffer.
Allocate another string from that buffer.
Split that string.
Use a lambda on each of the parts and evaluate that against the stopwords.

That is going to have a huge amount of allocations / computation that can be saved. Without changing anything external to this function, we can write the following:

This will do the same, but at a greatly reduced allocation cost. A better alternative here would be to change the design. Instead of having to allocate a new list, send a buffer, and don’t deal with strings directly; instead, deal with spans. Until .NET Core 2.0 is out, I’m going to skip spans and just use direct tokens, like so:

There are a few important things here. First, the code now doesn’t do any string allocations. Instead, it is operating on the string characters directly. We have the IsStopword method that is now more complex, because it needs to do the check without allocating a string and while being efficient about it. How is left as an exercise for the reader, but it shouldn’t be too hard.

One thing that might not be obvious is the tokens list that we get as an argument. The idea here is that the caller code can reuse this list (and memory) between calls, but that would require a major change in the code.
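
To show the shape of this design, here is a rough sketch of a tokenizer that emits (start, length) tokens into a caller-owned list. It is written in Python, not C#, so it says nothing about actual allocation costs (Python allocates the tuples anyway); the stopword list, the function names and the "alphanumeric characters form a word" rule are all my assumptions for illustration.

```python
STOPWORDS = {"the", "a", "of"}  # hypothetical stopword list


def is_stopword(text, start, length):
    """Check a (start, length) token against the stopwords without slicing the string."""
    for word in STOPWORDS:
        if len(word) == length and all(
            text[start + i].lower() == word[i] for i in range(length)
        ):
            return True
    return False


def tokenize(text, tokens):
    """Append (start, length) pairs for each word in text to the caller-owned tokens list."""
    tokens.clear()
    start = -1
    for i, ch in enumerate(text):
        if ch.isalnum():
            if start < 0:
                start = i
        elif start >= 0:
            if not is_stopword(text, start, i - start):
                tokens.append((start, i - start))
            start = -1
    if start >= 0 and not is_stopword(text, start, len(text) - start):
        tokens.append((start, len(text) - start))
    return tokens


tokens = []  # reused between calls
text = "The Trie of New York"
for start, length in tokenize(text, tokens):
    print(text[start:start + length])
```

The caller decides when, and whether, to materialize a token as a string; the tokenizer itself only reports positions in the original text.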

In general, not operating on strings at all would be a great thing indeed. We can work with direct character buffers, which will allow us to be mutable. Again, spans would probably fit right into this and be of great help.

That is enough for now, I know we just started, but I’ll continue this in the next post.

RavenDB 4.0: Securing the keys to the kingdom

Mon, 10 Jul 2017 09:00:00 GMT

A major design goal for RavenDB is that it would be easy and convenient to use. A major constraint is that it must be secured. As you can imagine, those two quite often work against one another. Security is often anything but easy to use, and it is rarely convenient.

Previously, we have used Windows Authentication and OAuth to secure access to RavenDB. That works and has been deployed in the wild for quite some time. It is also a major pain whenever there is an issue. If the connection to the domain controller drops, we might have authentication delays of many seconds, and trying to debug Active Directory issues in production deployments can be… a bit of a pain, in the same way that an audit by the IRS that starts with a SWAT team bashing down your door is mildly annoying. OAuth, on the other hand, is much better, since it is under our control, and we can figure out exactly what is going on with it if need be.

Since RavenDB 4.0 is running on Windows, Linux & Mac, we decided to drop the Windows Authentication support and just use OAuth. The problem is that if we choose to support HTTP, we have to rely on extremely complex protocols that attempt to secure authentication using plain text, but don’t usually deliver good results and are typically a pain to debug and support. Or, we can use HTTPS and just let SSL/TLS handle it all for us. A good example of the difference can be seen in OAuth 1.0 vs OAuth 2.0.

When we built RavenDB 1.0, roughly around 2009, the operating environment was quite different. In 2017, not using HTTPS is pretty much a sin in itself. As we started security modeling for RavenDB 4.0, it became obvious that we couldn’t really support any security on top of HTTP without effectively having to implement most of the properties of HTTPS ourselves. I’m many things, but I’m not a security expert, not by a long shot. Given the chance to implement my own security protocol, I would gladly do that, for a toy project or a weekend hackfest. But there is no way I would trust my own security in production against serious attacks. That pretty much led us to the realization that we have to require HTTPS for anything that requires security.

That includes running inside the organization, exposed to the public internet, running inside the cloud or in a shared datacenter, etc. Pretty much, unless you have HTTPS, there is no real point in talking about security. Given that, it meant that we could shift our baseline approach to security. If we are always going to require HTTPS for security, it means that we are operating in an environment that is much nicer for us to apply security.

Now, you can choose to run HTTP only, and avoid the need for certificate management, etc. However, at that point, you aren’t running a secure system, or you are already running it in a trusted and secured environment. In that case, we want to be clear that there isn’t any point to try to apply security policy (such as who can access what). Any network sniffer can figure out the access tokens and pretend to be whomever they want, if you are using HTTP.

With HTTPS required, we now move to the realm of having the admin take care of the certificates, securing them, renewal, etc. That is the part where it isn’t as easy or convenient as we could wish for. However, once we had that as a baseline, it opens an interesting path for security. Instead of relying on our own solution, we can use the builtin one and use x509 certificates from the client for authentication. This has the advantage that it is widely supported, standardized and secured. It is a bit less convenient than just a password, but the advantage is that any security system already in place knows how to deal with, store, authorize and manage access to certificates.

The idea is that you can go to RavenDB and either register or generate a x509 certificate. To that certificate an administrator can assign permissions (such as what dbs it is allowed to access). From that point on, a client (RavenDB, browser, curl, etc) can connect to RavenDB and just issue REST requests. There is no need to do anything else for the system to work. Contrast that with how you would typically have to deal with authentication using OAuth, by sending the token, keeping it fresh manually, etc.

Using x509 also has the distinct advantage that it is widely trusted. We intend to provide this level of security to all editions of RavenDB (so the Community Edition will also be able to use it).

A nice accidental feature of this decision is that we are going to be able to apply authentication at the connection level, and connection pooling means that we are likely going to have connections live for a long time. That means that we only need to pay the authentication cost once, instead of per request, with OAuth.

To simplify matters, we’ll likely just use the client certificates for authenticating the client, so we’ll not care if they are from a trusted root, etc. We’ll just require that the admin register the valid certificate with the cluster so they will be recognized. If you need to stop using a certificate, you can delete its registration or generate a new certificate to take its place. On the client side, it means that the DocumentStore will expose a X509Certificate property that you can set (or the equivalent in other clients). That means that you can use your own policies on the client to determine how to store the certificate.

On the server side, by the way, we’ll expose an extension point that will allow you to retrieve the certificate using your own policies. For example, if you are using Azure Key Vault or Hashicorp Vault or even your own HSM. This is done by invoking a process you specify, so you can write your own scripts / mini programs and apply whatever logic you need. This creates a clean separation between RavenDB and the secret store in use.

Authentication between servers is also done using SSL and certificates. We expect that we’ll commonly have all the servers running the same wildcard certificate, in which case they will obviously trust each other. Alternatively, you can also specify additional certificates that will be treated as servers. This is useful for when you are running with a separate certificate for each server, but it is also a critical part of certificate rotation. When your certificate is about to expire, the admin will register the new certificate as trusted, and then start replacing the certificates of each of the nodes in turn. This allows us to run with both old and new certificates concurrently during this process.

We considered relying on some properties of the certificate itself, but it seemed like an error prone process. It is better to have the admin explicitly state, for both client and server certificates, which ones we should actually trust, and at what level.
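
The rotation flow above can be modeled as an explicit registry of trusted certificates. The following Python sketch is only a toy model under my own assumptions: certificates are identified by the SHA-256 hash of their bytes, the class and method names are hypothetical, and the byte strings are placeholders rather than real certificates. It is not how RavenDB stores or checks certificates.

```python
import hashlib


class TrustedCertificates:
    """Explicit registry of certificates, keyed by the SHA-256 of their encoded bytes."""

    def __init__(self):
        self._thumbprints = set()

    @staticmethod
    def thumbprint(cert_bytes):
        return hashlib.sha256(cert_bytes).hexdigest()

    def register(self, cert_bytes):
        self._thumbprints.add(self.thumbprint(cert_bytes))

    def revoke(self, cert_bytes):
        self._thumbprints.discard(self.thumbprint(cert_bytes))

    def is_trusted(self, cert_bytes):
        return self.thumbprint(cert_bytes) in self._thumbprints


old_cert, new_cert = b"old certificate bytes", b"new certificate bytes"  # placeholders
trusted = TrustedCertificates()
trusted.register(old_cert)
trusted.register(new_cert)   # rotation starts: both are accepted
during = (trusted.is_trusted(old_cert), trusted.is_trusted(new_cert))
trusted.revoke(old_cert)     # every node has switched over
after = (trusted.is_trusted(old_cert), trusted.is_trusted(new_cert))
print(during, after)
```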

I would really appreciate any commentary you have about this feature, both in terms of ease of use, acceptability and obviously its security.

Bad bugs makes for self assigning issues

Fri, 07 Jul 2017 09:00:00 GMT

One of our developers just added the following bug:

This is in an area of the code that this particular developer is not regularly traversing*. The image above includes the full contents of the bug. And that caused me to immediately assign it back to its opener.

The problem? If you say that you got an error, include the error. In many cases, you can save a lot of time and guessing.

For an internal bug, where the person who opened it is available, we have a much lower bar for bug report quality. Most bugs are closed relatively quickly anyway. But a lower bar for bug report quality still means there is a bar.

* I started to say, not responsible for, but we don’t have code ownership, so that wouldn’t have been right.

The ghost of the zombie of revisions past

Thu, 06 Jul 2017 09:00:00 GMT

I talked about difficult naming decisions, and this one was certainly one of the more lively ones.

We bounced between zombies, orphans and ghosts, with a bunch of crazy stuff going in between. At one point it was suggested we’ll just make up a word, but my suggestion to use Welchet was sadly declined by all, including a rather rude comment by the author of this blog about what kind of jokes are appropriate for the workplace.

After we settled the discussion on ghosts, there was another discussion about whether we should use Inky, Blinky, Pinky and Clyde. I tell you, when we aren’t building distributed databases, the office is a hotbed for nerd references.

And then an idea came along. Which I really liked, so we talked about this in the morning and I’m showing screenshots in a blog post a bit before midnight. The feature is called the revision bin.

In the UI, you can see it as one of the top level elements.

In essence, this is a recycle bin for revisions. RavenDB can be configured to keep revisions of documents as they change, and even keep track of them after they were deleted. However, that presented a problem. If you deleted a document that had revisions, how would you tell that it was there in the first place? Just knowing the document id and looking for that wouldn’t work very well. So we created the revisions bin, whose content looks like this:

And from there you can go to:

For that matter, if we recreate this document again, you’ll be able to see its entire history, including across deletes.

Now admittedly this is a nice looking UI, and the skull on the menu is a nice touch, if a bit morbid. However, why make such a noise about such a feature?

The answer is that the revisions bin isn’t that important, but keeping track of deletes of documents using revisions is quite important, since it allows subscriptions and ETL to handle them in a clean and easy to grok manner. And in order to actually explain that, we needed to be able to show the users what we are talking about.

RavenDB 4.0 on Mac OSX

Wed, 05 Jul 2017 09:00:00 GMT

So we just got this result:

We are now in the process of making sure that it all actually works, but it is very encouraging that we have been able to get there.

This will very likely be in the next beta build for RavenDB.

RavenDB 4.0: Unbounded results sets

Tue, 04 Jul 2017 09:00:00 GMT

Unbounded result sets are a pet peeve of mine. I have seen them destroy application performance more than once. With RavenDB, I decided to cut that problem at the knees and placed a hard limit on the number of results that you can get from the server. Unless you configured it differently, you couldn’t get more than 1,024 results per query. I was very happy with this decision, and there have been numerous cases where this has been able to save an application from serious issues.

Unfortunately, users hated it. Even though it was configurable, and even though you could effectively turn it off, just the fact that it was there was enough to make people angry.

Don’t get me wrong, I absolutely understand some of the issues raised. In particular, if the data goes over a certain size we suddenly show wrong results or an error, leaving the app in a “we need to fix this NOW” state. It is an easy mistake to make. In fact, in this blog, I noticed a few months back that I couldn’t get entries from 2014 to show up in the archive. The underlying reason was exactly that: I’m getting the number of items per month, and I’ve been blogging for more than 128 months, so the data got truncated.

In RavenDB 4.0 we removed the limit. If you don’t specify a limit in a query, you’ll get exactly how many results there are in the database. You can ask RavenDB to raise an error if you didn’t specify a limit clause, which is a way for you to verify that you won’t run into this issue in production, but it is off by default and will probably better match the new user expectations.

The underlying issue of loading too many results is still there, of course. And we still want to do something about it. What we did was raise alerts.

I have made a query on a large set (160,000 results, about 400 MB in all) and the following popped up in the RavenDB Studio:

This tells the admin that there is some information that they need to look at. This is intentionally unobtrusive.

When you click on the notifications, you’ll get the following message.

And if you click on the details, you’ll see the actual details of the operations that triggered this warning.

I actually created an issue so we’ll supply you with more information (such as the index, the query, duration and the total size that it generated over the network).

I think that this gives the admin enough information to act upon, but will not cause hardship to the application. This makes it something that we Should Fix instead of Get the OnCall Guy.

Batch processing with subscriptions in RavenDB 4.0

Mon, 03 Jul 2017 09:00:00 GMT

Subscription is a somewhat neglected feature in RavenDB. It was created to handle a specific customer need and grew from there, but it had relatively little traction and was a bit of a pain to use. When we looked at the things we wanted to do in RavenDB 4.0 re-working how people use subscription was high enough in the list that it got a dedicated dev for about a year.

Here is what a subscription looks like in RavenDB 3.x.

It is only available from code, and the model used is heavily influenced by Reactive Extensions. It gives you a reliable subscription to data: even if the client or server went down, it could recover on restart. But it was complex to do the more advanced things. There are events that you can register to respond to things that are happening, but there isn’t a complete story. Other things, such as automatic failover or responding to deletes, were flat out impossible.

With RavenDB 4.0, we decided to do things differently. I talked about this before several times, but recently we completed a major restructuring and simplification of the user visible behavior that I’m really happy about. To start with, we ditched the Reactive Extensions and IObservable model. This is just not the right fit for the kind of things we want to do. Instead, we are going with full blown batch processing.

Instead of being called once per item, we are going to call you one per batch. This is actually how things are going over the wire, and exposing it directly to the user make our life a lot easier. It also means that you have much better model to actually do things in a batch mode. Such as applying modification to all the items in the batch and saving them back in a single operation.
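
The difference between the two models is easy to show in the abstract. The following Python sketch is not the RavenDB client API; the function names, the batch size and the document ids are all made up. It only illustrates a handler that is invoked once per batch and can therefore save its work in a single operation.

```python
from itertools import islice


def batches(items, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run(items, batch_size, process_batch):
    """Call process_batch once per batch rather than once per item."""
    calls = 0
    for batch in batches(items, batch_size):
        process_batch(batch)
        calls += 1
    return calls


saved = []


def mark_and_save(batch):
    # Modify every item in the batch, then "save" them all in a single operation.
    saved.append([order.upper() for order in batch])


calls = run(["orders/1", "orders/2", "orders/3", "orders/4", "orders/5"], 2, mark_and_save)
print(calls, saved)
```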

Subscriptions in RavenDB 4.0 are also fault tolerant and highly available (both client & server), give you access to versioned and deleted snapshots, let you apply complex filtering and transformations on the server side, and in general are a lot more suitable for the task we intend them for.

Perhaps what is more exciting is that subscriptions are available to all the clients, and in some cases, it just makes more sense to write them as a batch processing script. Consider:

This is the kind of thing that can really make the operations team happy, because they can do targeted jobs with very little friction. I spent the whole of Chapter 5 talking about subscriptions, and I think it is well worth it.

We won’t be fixing this race condition

Fri, 30 Jun 2017 09:00:00 GMT

During the work on restoring backup, the developer in charge came up with the following problematic scenario.

Start restoring a backup of database Northwind on node A, which can take quite some time for a large database.
Create a database named Northwind on node B while the restore is taking place.

The problem is that during the restore the database doesn’t exist in a proper form in the cluster until it is done restoring. During that time, if an administrator attempts to create a database, it will look like it is working, but it will actually create a new database on all the other nodes and fail on the node where the restore is going on.

When the restore completes, it will either remove the previously created database or it will join it and replicate the restored data to the rest of the nodes, depending on exactly when the restore and the new db creation happened.

Now, trying to resolve this issue involves coordinating the restore process around the cluster. However, that also means that we need to do heartbeats during the restore process (to the entire cluster), handle timeouts and recovery, and effectively take on a pretty big burden of pretty complicated code. Indeed, the first draft of the fix for this issue suffered from the weakness that it would only work when running on a single node, and only work in a cluster mode in very specific cases.

In this case, it is a very rare scenario that requires an admin (not just a standard user) to do two things that you wouldn’t usually expect to see together. The outcome is a bit confusing even if you manage to trigger it, but there isn’t any data loss.

The solution was to document that during the restore process you shouldn’t create a database with the same name, but instead let RavenDB complete the restore and then let the database span additional nodes. That is a much simpler alternative to taking on distributed reasoning just for something that is an operator error in the first place.

Bug stories: The memory ownership in the timeout

Thu, 29 Jun 2017 09:00:00 GMT

We are running a lot of tests right now on RavenDB, in all sorts of interesting configurations. Some of the more interesting results came from testing wildly heterogeneous systems. Put a node on a fast Windows machine, connect it to a couple of Raspberry Pis, a cheap Windows tablet over WiFi and a slow Linux machine, and see how that kind of cluster handles high load.

This has turned up a number of bugs. The issue with the TCP read buffer corruption is one such example, but another is the reason for this post. In one of our test runs, the RavenDB process crashed with an invalid memory access. That was interesting to see. Tracking down the issue led us to the piece of code that handles incoming replication. In particular, the issue was possible if the following happened:

Node A is significantly slower than node B, primarily with regards to disk I/O.
Node B is being hit with enough load that it sends large requests to node A.
There is a Node C that is also trying to replicate the same information (because it noticed that node A isn’t catching up fast enough and is trying to help).

The root cause was that we had a bit of code that looked like this:

Basically, we read the data from the network into a buffer, and then we hand it off to the transaction merger to run. However, if there is a lot of load on the server, it is possible that the transaction merger will not have a chance to return in time. We try to abort the connection here, since something is obviously wrong, and we do just that. The problem is that we sent a buffer to the transaction merger, and while it might not have gotten to processing our request yet (or hasn’t completed it, at least), there is no way for us to pull the request out (it might have already started executing, after all).

The code didn’t consider that. When we did get a timeout, the buffer was returned to the pool. If the memory was freed in time, we would get an access violation exception if we were lucky. Otherwise the merger would read garbage from the buffer (which we had previously validated, so we didn’t check again), and that would likely also cause a crash.

The solution was to wait for the task to complete, but ping the other host to let it know that we are still alive, and that the connection shouldn’t be aborted.
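The fix can be sketched roughly as follows. This is a Python illustration under my own assumptions, not the actual RavenDB code: `wait_with_heartbeat`, `merge_transaction` and the heartbeat callback are hypothetical names, and a thread pool stands in for the transaction merger. The point is that a timeout no longer releases the buffer; it only triggers a keep-alive, and the buffer is released after the work has really finished.

```python
import concurrent.futures as cf
import time


def wait_with_heartbeat(future, interval, send_heartbeat):
    """Wait for `future`, calling `send_heartbeat` each time `interval` elapses."""
    while True:
        try:
            return future.result(timeout=interval)
        except cf.TimeoutError:
            # The work is still pending or running and still owns the buffer,
            # so we must not release it. Just tell the peer we are alive.
            send_heartbeat()


def merge_transaction(buffer):
    """Hypothetical stand-in for the transaction merger: slow, and it reads the buffer."""
    time.sleep(0.2)
    return bytes(buffer)


heartbeats = []
buffer = bytearray(b"replicated-docs")

with cf.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(merge_transaction, buffer)
    result = wait_with_heartbeat(
        future, 0.05, lambda: heartbeats.append(time.monotonic()))

# Only now, after the merger is done with it, is it safe to recycle the buffer.
buffer.clear()
print(result, "after", len(heartbeats), "heartbeats")
```

The design choice is about ownership: once the buffer has been handed to another component, the sender cannot reclaim it on a timer, only when the other component reports completion.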

The things that come out late at night

Wed, 28 Jun 2017 09:47:00 GMT

The following are the opening paragraphs for the discussion of the RavenDB 4.0 clustering and distribution model in the Inside RavenDB 4.0 book.

You might be familiar with the term "murder of crows" as a way to refer to a group of crows[1]. It has been used in literature and arts many times. Of less renown is the group term for ravens, which is "unkindness". Personally, in the name of all ravens, I'm torn between being insulted and amused.

Professionally, setting up RavenDB as a cluster on a group of machines is a charming exercise (however, that term is actually reserved for finches) that brings a sense of exaltation (taken too, by larks) by how pain free this is. I'll now end my voyage into the realm of ornithology's etymology and stop speaking in tongues.

On a more serious note, the fact that RavenDB clustering is easy to set up is quite important, because it means that it is much more approachable.

[1] If you are interested in learning why, I found this answer fascinating

It was amusing to write, and it got me to actually start writing that part of the book. Although I’m not sure if this will survive editing and actually end up in the book.

Bug stories: How do I call myself?

Wed, 28 Jun 2017 09:00:00 GMT

This bug is actually one of the primary reasons we had a Beta 2 release for RavenDB 4.0 so quickly.

The problem is easy to state: in any non trivial deployment setup, clients would be utterly unable to connect to us. Let us examine what I mean by non trivial setup, shall we?

A trivial setup is when you are running locally, binding to “http://localhost:8080”. In this case, everything is simple, and you can bind to the appropriate interface and when a client connects to you, you let it know that your URL is “http://localhost:8080”.

Hm… this doesn’t make sense. If a client just connected to us, why do we need to let it know what URL it needs to use to connect to us?

Well, if there is just a single node, we don’t. But RavenDB 4.0 allows you to connect to any node in the cluster and ask it where a particular database is located. So the first thing that happens when you connect to a RavenDB server is that you find out where you really need to go. In the case of a single node, the answer is “you are going to talk to me”, but in the case of a cluster, it might be some other node entirely. And this is where things begin to be a bit problematic. The problem is that we need to know what to call ourselves when a client connects to us.

That isn’t as easy as it might sound. Consider the case where the user configures the server url to be “http://0.0.0.0:8080”. We can’t give that to the client, so we default to sending back the host name in that case. And this is where things started to get tricky. In many cases, the host name is not something that makes sense.

Oh, for internal deployments, you can usually rely on it, but if you are deploying to AWS, for example, the machine host name is of very little use in routing to that particular machine. Or, for that matter, a docker container host name isn’t particularly useful when you consider it from the outside.

The problem is that with RavenDB, we had a single configuration value that was used both for the binding to the network and for letting the user know how to connect to us. That didn’t work when you had routers in the middle. For example, if my public docker IP is 10.0.75.2, that doesn’t mean that this is the IP that I can bind to inside the container. And the same is true whenever you have any complex network topology (putting nginx in front of the server, for example).

The resolution for that was pretty simple: we added a new configuration value that separates the host that we bind to from the host that we report to the outside world. In this manner, you can bind to one IP but let the world know that you should be reached via another.
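As a rough illustration of that split, here is a small Python sketch. The names `ServerConfig`, `bind_url`, `public_url` and `advertised_url` are hypothetical and are not the actual RavenDB setting names; the fallback to the host name mirrors the default behavior described above.

```python
import socket
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class ServerConfig:
    bind_url: str                     # what we listen on
    public_url: Optional[str] = None  # what we tell clients to use, if set

    def advertised_url(self, hostname: Optional[str] = None) -> str:
        """Return the URL that clients should be told to connect to."""
        if self.public_url:
            return self.public_url
        parts = urlsplit(self.bind_url)
        if parts.hostname in ("0.0.0.0", "::"):
            # A wildcard address is meaningless to a client, so fall back
            # to the machine host name (which may not be routable either).
            host = hostname or socket.gethostname()
            netloc = host if parts.port is None else f"{host}:{parts.port}"
            return f"{parts.scheme}://{netloc}"
        return self.bind_url


in_container = ServerConfig("http://0.0.0.0:8080", "http://10.0.75.2:8080")
print(in_container.advertised_url())
```

Without the second value, the server has to guess what to call itself, and behind Docker, AWS or nginx the guess is usually wrong.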

Bug stories: The data corruption in the cluster

Tue, 27 Jun 2017 09:00:00 GMT

The bug started as pretty much all others. “We have a problem when replicating from a Linux machine to a Windows machine, I’m seeing some funny values there”. This didn’t raise any alarm bells, after all, that was the point of checking what was going on in a mixed mode cluster. We didn’t expect any issues, but it wasn’t surprising that they happened.

The bug in question showed up as an invalid database id in some documents. In particular, it meant that we might have node A, node B and node C in the cluster, and running a particular scenario suddenly started also reporting node Ω, node Σ and other fun stuff like that.

And so the investigation began. We were able to reproduce this error once we put enough load on the cluster (typically around the 20 millionth document write or so), and it was never consistent.

We looked at how we save the data to disk, we looked at how we read it, we scanned all the incoming and outgoing data. We sniffed raw TCP sockets and we looked at everything from the threading model to random corruption of data on the wire to our own code reading the data to manual review of the TCP code in the Linux kernel.

The latter might require some explanation: it turned out that setting TCP_NODELAY on Linux would make the issue go away. That only made things a lot harder to figure out. What was worse, this corruption only ever happened in this particular location, never anywhere else. It was maddening, and about three people worked on this particular issue for over a week with the sole result being: “We know roughly where it is happening, but no idea why or how”.

That in itself was a very valuable thing to have, and along the way we were able to fix a bunch of other stuff that was found under this level of scrutiny. But the original problem persisted, quite annoyingly.

Eventually, we tracked it down to this method:

We were there before, and we looked at the code, and it looked fine. Except that it wasn’t. In particular, there is a problem when the range we want to move overlaps with the range we want to move it to.

For example, consider that we have a buffer of 32KB, and we read from the network 7 bytes. We then consumed 2 of those bytes. In the image below, you can see that as the Origin, with the consumed bytes shown as ghosts.

What we need to do now is to move the “Joyou” to the beginning of the buffer, but note that we need to move it from 2 – 7 to 0 – 5, which are overlapping ranges. The issue is that we want to be able to fully read “Joyous”, which requires us to do some work to make sure that we can do that. This ReadExactly piece of code was written with the knowledge that at most it will be called with 16 bytes to read, and the buffer size is 32KB, so there was an implicit assumption that those ranges can’t overlap.

When they do… Well, you can see in the image how the data is changed with each iteration of the loop. The end result is that we have corrupted our buffer and messed everything up. The Linux TCP stack had no issue, it was all in our code. The problem is that while it is rare, it is perfectly fine to fragment the data you send into multiple packets, each with very small length. The reason why TCP_NODELAY “fixed” the issue was that it probably didn’t trigger the multiple small buffers one after another in that particular scenario. It is also worth noting that we tracked this down to a specific load pattern that would cause the sender to split packets in this way to generate this error condition.

That didn’t actually fix anything, since it could still happen, but I traced the code, and I think that this happened with more regularity since we hit the buffer just right to send a value over the buffer size in just the wrong way. The fix for this, by the way, is to avoid the manual buffer copying and to use memmove(), which is safe to use for overlapped ranges.
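To see how an overlapping copy can go wrong, here is a small Python sketch. The original loop isn't reproduced here, so `copy_backward` is a hypothetical stand-in: one way a hand-written copy can read bytes that it has already overwritten. `move_within` plays the role of memmove(), because the slice on the right-hand side is copied out before anything is written.

```python
def copy_backward(buf: bytearray, src: int, dst: int, count: int) -> None:
    """Hand-rolled copy that walks from the last byte to the first.

    Safe when the ranges don't overlap. When dst < src and the ranges
    overlap, later iterations read bytes that earlier ones overwrote.
    """
    for i in reversed(range(count)):
        buf[dst + i] = buf[src + i]


def move_within(buf: bytearray, src: int, dst: int, count: int) -> None:
    """memmove-like move: the source bytes are copied out before writing."""
    buf[dst:dst + count] = buf[src:src + count]


# 7 bytes were read, 2 were consumed; "Joyou" lives at 2..7 and must go to 0..5.
broken = bytearray(b"heJoyou")
copy_backward(broken, src=2, dst=0, count=5)
print(bytes(broken[:5]))  # b'uouou': the data was corrupted

fixed = bytearray(b"heJoyou")
move_within(fixed, src=2, dst=0, count=5)
print(bytes(fixed[:5]))   # b'Joyou'
```

A copy loop is only correct for overlapping ranges if its direction matches the direction of the move, which is exactly the bookkeeping that memmove() does for you.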

That leaves us with the question, why did it take us so long to find this out? For that matter, how could this error surface only in this particular case? There is nothing really special with the database id, and this particular method is called a lot by the code.

Figuring this out took even more time. Basically, this bug was hidden by the way our code validates the incoming stream. We don’t trust data from the network, and we run it through a set of validations to ensure that it is safe to consume. When this error happened in the normal course of things, higher level code would typically detect that as corruption and close the connection. The other side would retry, and since this is timing dependent, it would very likely be able to proceed. The issue with database ids is that they are opaque binary values (they are guids, so no structure at all that is meaningful for the application). That means that only when we got this particular issue on that particular field (and no other field at all) would the data pass validation and the error actually show up.

The fix was annoyingly simple given the amount of time we spent finding it, but we have been able to root out a significant bug as a result of the real world tests we ran.

Zombies vs. Ghosts: The great debate

Mon, 26 Jun 2017 09:00:00 GMT

We have a feature in RavenDB that may leave behind some traces when a document is gone. The actual details aren’t really important for the story. Those traces are there for a reason, and a user has a good reason to want to see them in the UI.

That meant that we needed to come up with a name for them. After a short pause, we selected Zombies, because they are the remnants of real documents that are hanging around. That seemed to mesh well with the technical terminology already in use (zombie processes, for example) and was a reference to the current popularity of zombies in culture (books, movies, etc.), which many of our guys enjoy.

Note that in this case, I’m specifically using the term guys to refer to our male developers. One of our female developers didn’t like the terminology. Because Zombies are creepy, and we don’t want that in our UI.

There was a discussion on the terminology we’ll use that was very interesting, because it was on clearly defined gender lines. None of the guys had any issue with the term, and that included a few that considered zombie movies to be yucky as well. All the women, on the other hand, thought (to varying degrees) that zombies isn’t the appropriate term to use.

We threw a few other ones around, such as orphans, but one of the features we wanted to have is the ability to wipe those traces, and “kill all orphans” is not something that I think would go well in our UI.

Eventually the idea to use the term ghosts was brought up, and it was liked by all. It has all the connotations desired to explain what this is (the remnants of a deleted document), but the images it evoked were Casper the Friendly Ghost and Pacman, apparently.

None of the guys thought there was a problem with zombies, but no one was particularly attached to the name either. On the other hand, we had strong opposition to the term and an alternative that made everyone happy, so we switched to that terminology.

Fun fact, I was telling my wife this story and I wasn’t able to complete the description of the debate before she suggested using the Pacman image.

Inventory management in MongoDB: A design philosophy I find baffling

Sat, 24 Jun 2017 11:45:00 GMT

I’m reading MongoDB in Action right now. It is an interesting book and I wanted to learn more about the approach to using MongoDB, rather than just be familiar with the feature set and what it can do. But this post isn’t about the book, it is about something that I read, and as I was reading it I couldn’t help but put down the book and actually think it through.

More specifically, I’m talking about this little guy. This is a small Ruby class that was presented in the book as part of an inventory management system. In particular, this piece of code is supposed to allow you to sell limited inventory items and ensure that you won’t sell stuff that you don’t have. The example is that if you have 10 rakes in the store, you can only sell 10 rakes. The approach that is taken is quite nice, by simulating the notion of having a document for each of the rakes in the store and allowing users to place them in their cart. In this manner, you prevent the possibility of selling more than you actually have.

What I take strong issue with is the way this is implemented. MongoDB doesn’t have multi document transactions, but the solution presented requires them. Therefore, the approach outlined in the book is to try to build transactional semantics from the client side. I write databases for a living, and I find that concept utterly baffling. Clients shouldn’t try to do stuff like that: not only would they most likely get it wrong, but they’ll do it extremely inefficiently.

Let us consider the following tidbit of code:

The idea here is that the fetcher is supposed to be able to atomically add the products to the order. If there aren’t enough available products to be added, the entire thing is supposed to be rolled back. As a business operation, this makes a lot of sense. The actual implementation, however, made me wince.

What it does, if it was SQL, is the following:

I intentionally used SQL here, both to simplify the issue for people who aren’t familiar with MongoDB and to explain the major dissonance that I have with this approach. That little add_to_cart call that we had earlier resulted in no less than eight network roundtrips. That is in the happy case. There is also the failure mode to consider, which involved resetting all the work done so far.

The thing that really bothers me is that I can’t believe that this is something that you’ll actually want to do except as an intellectual exercise. I mean, sure, how we can pretend to get transactions from non transactional store is interesting, but given the costs of doing this or the possibility of failure or the fact that this is a non atomic state transition or… you get my point, right?

In the case of this code, the whole process is non atomic. That means that outside observers can see the changes as they are happening. It also opens you up for a lot of Bad Stuff in terms of abusing the system. If the user is malicious, they can use the fact that this “transaction” is going to be running back and forth to the database (and thus taking a lot of time) and just open another tab to initiate an action while this is going on, resulting in operations on invalid state. In the example that the book gives, we can use that to force purchases of invalid items.

If you think that this is unrealistic, consider this page, which talks about doing things like making money appear from thin air using just this sort of approach.

Another thing that really bugged me about this code is that it has “error handling”. I use that in quotes because it is like a security blanket for a 2 year old. Having it there might calm things down, but it doesn’t actually change anything. In particular, this kind of error handling looks right, but it is horribly broken if you consider what kind of actual errors can happen here. If the process running this code failed for any reason, the “transaction” is going to stay in an invalid state. It is possible that one of your rakes will just disappear into thin air, as a result. It is supposed to be in someone’s cart, but it isn’t. The same can be the case if the server had an issue midway or there was just a regular network hiccup.

I’m aware that this is code that was written explicitly to be readable and easy to explain, rather than to withstand the vagaries of production, but still, this is a very dangerous thing to do.

As an aside, not quite related to the topic of this post, one thing that really bugged me in the book so far is the number of remote requests that are commonly required to do things. Is there an assumption that the database in question is nearby or very cheap to access? The entire design philosophy I use is to assume that going over the network is expensive, so let us give the users a lot of ways to reduce that cost. In contrast, at least in the book, there is a lot of stuff that is just making remote calls like there is a fire sale that will close in 5 minutes.

To be fair to the book, it notes that there is a possibility of failure here, explains how to handle one part of it (it missed the error conditions in the error handling) and calls this out explicitly as something that should be done with consideration.

PR Review: avoid too many parameters

Fri, 23 Jun 2017 09:00:00 GMT

During code review I ran into these two sections, which raised a flag. Can you tell why?

The problem with this type of code is twofold. First, we add optional parameters to reduce the number of breaking changes we have. The problem with that is that we already have parameters on the call, and eventually you’ll get to something like this:

Which is the queen of optional parameters method, and you can probably guess how it looks internally.

In the first case, we can add the new optional parameter to the… options variable that we are already sending this method. This way, we don’t have to worry about breaking changes, and we already have a way to setup options, determine defaults, etc.

In the second case, we are passing two bools to the method, and there isn’t a preexisting parameters object. Instead of creating one, we can use a Flags enum, whose bits we can set to determine what exactly the behavior of this method should be. That is generally much easier to maintain in the long run.
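Here is a minimal Python sketch of the flags approach (the reviewed code would use a C# [Flags] enum; Python's `enum.Flag` is the closest equivalent). `ExportOptions`, its members and `describe_export` are hypothetical names that I made up for illustration; they are not taken from the code under review.

```python
from enum import Flag
from typing import List


class ExportOptions(Flag):
    NONE = 0
    INCLUDE_EXPIRED = 1
    STRIP_REVISIONS = 2


def describe_export(options: ExportOptions = ExportOptions.NONE) -> List[str]:
    """One flags parameter instead of (include_expired: bool, strip_revisions: bool)."""
    steps = ["export documents"]
    if ExportOptions.INCLUDE_EXPIRED in options:
        steps.append("include expired documents")
    if ExportOptions.STRIP_REVISIONS in options:
        steps.append("strip revisions")
    return steps


# The call site names what it asks for, instead of passing (True, False).
plan = describe_export(ExportOptions.INCLUDE_EXPIRED)
both = describe_export(ExportOptions.INCLUDE_EXPIRED | ExportOptions.STRIP_REVISIONS)
print(plan)
print(both)
```

Adding a third behavior later means adding one more enum member, and existing callers keep compiling and keep their current behavior.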

Inside RavenDB 4.0 book–Chapter 4 & 5 are done

Thu, 22 Jun 2017 09:00:00 GMT

The RavenDB 4.0 book is going really well. This week I managed to write about 20,000 words and the current page count is at 166. At this rate, it is going to turn into a monster in terms of how big it is going to be.

The book so far covers the client API, how to model data in a document database and how to use batch processing in RavenDB with subscriptions. The full drafts are available here, and I would really appreciate any feedback you have.

Next topic is to start talking about clustering and this is going to be real fun.

I’m also looking for a technical editor for the book. Someone who can both tell me that I missed a semicolon in a code listing and tell me why my phrasing / spelling / grammar is poor (and likely all three at once and some other stuff that I don’t even know). If you know someone, or better yet, are interested and capable, drop me a line.