This is an overview of Cassandra architecture aimed at Cassandra users.
Developers should probably look at the Developers links on the wiki's front page.
The information is mainly based on Jonathan Ellis' OSCON 09 presentation.
Why Cassandra
- MySQL drives too many random I/Os
- File-based solutions require far too many locks
The new face of data
- Scale out, not up
- Online load balancing, cluster growth
- Flexible schema
- Key-oriented queries
- CAP-aware
CAP theorem
The CAP theorem (Brewer) states that you have to pick two of Consistency, Availability, and Partition tolerance: you can't have all three at the same time and still get acceptable latency.
Cassandra values Availability and Partition tolerance (AP). Tradeoffs between consistency and latency are tunable in Cassandra: you can get strong consistency with Cassandra (at the cost of increased latency). But you can't get row locking: that is a definite win for HBase.
Note: HBase values Consistency and Partition tolerance (CP)
History and approaches
Two famous papers
- Bigtable: A distributed storage system for structured data, 2006
- Dynamo: Amazon's highly available key-value store, 2007
Two approaches
- Bigtable: "How can we build a distributed db on top of GFS?"
- Dynamo: "How can we build a distributed hash table appropriate for the data center?"
Cassandra 10,000 ft summary
- Dynamo partitioning and replication
- Log-structured ColumnFamily data model similar to Bigtable's
Cassandra highlights
- High availability
- Incremental scalability
- Eventually consistent
- Tunable tradeoffs between consistency and latency
- Minimal administration
- No SPF (Single Point of Failure)
p2p distribution model -- which drives the consistency model -- means there is no single point of failure.
Key distribution and partitioning
Dynamo architecture & Lookup
In a ring of nodes A, B, C, D, E, F, and G, nodes B, C, and D store keys in the range (a,b), including key k.
You can decide where a key should go in Cassandra using the InitialToken parameter for your Partitioner; see Storage Configuration.
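As a rough illustration of ring placement and the role of tokens, here is a minimal Python sketch. The MD5 token space mirrors what RandomPartitioner does, but the node names, the evenly spaced tokens, and the replication factor are made-up example values, not a description of Cassandra's actual code.

```python
# Minimal sketch of Dynamo-style ring placement, assuming RandomPartitioner-like
# tokens (MD5 hashed into a 0..2**127 space). Node names, token spacing and the
# replication factor are made-up values for this example.
import hashlib
from bisect import bisect_right

NODES = ["A", "B", "C", "D", "E", "F", "G"]
RING_SIZE = 2 ** 127

# Evenly spaced tokens, one per node (roughly what you would set as InitialToken).
ring = sorted((i * RING_SIZE // len(NODES), node) for i, node in enumerate(NODES))
token_values = [token for token, _ in ring]

def key_token(key: str) -> int:
    """Hash a row key onto the ring, RandomPartitioner-style (MD5)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas(key: str, replication_factor: int = 3) -> list:
    """Walk clockwise from the key's position and take the next RF nodes."""
    start = bisect_right(token_values, key_token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

print(replicas("k"))   # three consecutive nodes, walking clockwise from where 'k' hashes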
Architecture details
- O(1) node lookup
- Explicit replication
- Eventually consistent
Architecture layers
Core Layer        | Middle Layer | Top Layer
Messaging service | Commit log   | Tombstones
Writes
Find details on the Cassandra Write Path here
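In outline, a write is appended to the commit log for durability, applied to an in-memory memtable, and later flushed to an immutable SSTable on disk. The following is a simplified, single-node Python sketch of that flow; the file name, variable names, and flush threshold are invented for illustration and are not Cassandra's actual internals.

```python
# Simplified single-node sketch of the log-structured write path:
# commit log append -> memtable update -> SSTable flush. Names are invented.
import json

FLUSH_THRESHOLD = 4                       # flush the memtable after 4 entries
memtable = {}                             # key -> (value, timestamp), in memory
sstables = []                             # list of immutable "on-disk" segments

def write(key, value, timestamp, commit_log_path="commitlog.jsonl"):
    # 1. Append to the commit log first: sequential I/O, used for crash recovery.
    with open(commit_log_path, "a") as log:
        log.write(json.dumps([key, value, timestamp]) + "\n")
    # 2. Apply to the memtable; the newest timestamp wins.
    current = memtable.get(key)
    if current is None or timestamp >= current[1]:
        memtable[key] = (value, timestamp)
    # 3. When the memtable is large enough, flush it as a sorted, immutable SSTable.
    if len(memtable) >= FLUSH_THRESHOLD:
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

write("k", "v", timestamp=1)
```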
Reads
Find details on the Cassandra Read Path here
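In outline, a read consults the memtable and then the SSTables, merging the candidates so that the value with the newest timestamp wins. Below is a minimal, standalone Python sketch under those assumptions; the function and variable names are invented, and real Cassandra uses bloom filters and indexes per SSTable rather than scanning them all.

```python
# Simplified sketch of the read path: consult the memtable first, then each
# SSTable; the newest timestamp wins. Names are invented for illustration.
def read(key, memtable, sstables):
    candidates = []                       # (value, timestamp) pairs found anywhere
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    value, _ = max(candidates, key=lambda vt: vt[1])   # last write wins
    return value

# Example: an older SSTable value is shadowed by a newer memtable entry.
print(read("k", {"k": ("new", 2)}, [{"k": ("old", 1)}]))   # -> 'new'
```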
Remove
- Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
- Read repair complicates things a little
- Eventual consistency complicates things more
- Solution: configurable delay before tombstone GC, after which tombstones are not repaired
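A small, self-contained Python sketch of that behavior follows; the sentinel value and helper names are invented, and GC_GRACE_SECONDS stands in for the GCGraceSeconds storage setting.

```python
# Sketch of tombstone handling: a delete is a timestamped marker that shadows
# older values until compaction. GC_GRACE_SECONDS mirrors Cassandra's
# GCGraceSeconds setting (default 864000 seconds, i.e. 10 days).
import time

TOMBSTONE = "__TOMBSTONE__"     # invented sentinel for this example
GC_GRACE_SECONDS = 864000

def merge(key, *tables):
    """Last-write-wins merge across memtable/SSTables, tombstones included."""
    found = [t[key] for t in tables if key in t]
    return max(found, key=lambda vt: vt[1]) if found else None

def read_visible(key, *tables):
    """A read that respects tombstones: a deleted key looks absent to clients."""
    merged = merge(key, *tables)
    return None if merged is None or merged[0] == TOMBSTONE else merged[0]

def tombstone_collectable(written_at, now=None):
    """Tombstones may only be dropped at compaction after the grace period;
    past that delay they are also no longer repaired to other replicas."""
    now = time.time() if now is None else now
    return now - written_at > GC_GRACE_SECONDS

# The tombstone (timestamp 2) shadows the older value (timestamp 1).
print(read_visible("k", {"k": ("v", 1)}, {"k": (TOMBSTONE, 2)}))   # -> None
```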
Consistency
See also the API documentation.
Consistency describes how and whether a system is left in a consistent state after an operation. In distributed data systems like Cassandra, this usually means that once a writer has written, all readers will see that write.
In contrast to the strong consistency used in most relational databases (ACID: Atomicity, Consistency, Isolation, Durability), Cassandra sits at the other end of the spectrum (BASE: Basically Available, Soft-state, Eventual consistency). Cassandra's weak consistency comes in the form of eventual consistency, which means the database eventually reaches a consistent state. Because the data is replicated, the latest version of a value sits on some node in the cluster while older versions may still be present on other nodes, but eventually all nodes will see the latest version.
More specifically:
- R = read replica count
- W = write replica count
- N = replication factor
- Q = QUORUM (Q = N / 2 + 1)
If W + R > N, you will have consistency
- W=1, R=N
- W=N, R=1
- W=Q, R=Q where Q = N / 2 + 1
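For example, with a replication factor N = 3, Q = 3 / 2 + 1 = 2 (integer division), so writing and reading at QUORUM gives W + R = 2 + 2 = 4 > 3 and the read is guaranteed to see the write, whereas W = 1 with R = 1 gives only 2, which is not greater than 3, so consistency is not guaranteed.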
Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor). A ConsistencyLevel of ONE means R or W is 1. A ConsistencyLevel of QUORUM means R or W is ceiling((N+1)/2). A ConsistencyLevel of ALL means R or W is N. So if you want to write with a ConsistencyLevel of ONE and then get the same data back when you read, you need to read with ConsistencyLevel ALL.
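From a client, R and W come from the ConsistencyLevel you pass per request. Here is a hedged sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, and table are placeholders for the example, and the same per-request consistency levels apply whatever client API you use.

```python
# Sketch using the DataStax Python driver (pip install cassandra-driver).
# The contact point, keyspace, and table names are placeholders.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# W = 1: fast write, acknowledged by a single replica.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, (42, "alice"))

# To be sure of reading that write back (R + W > N), read at ALL...
read_all = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ALL,
)
row = session.execute(read_all, (42,)).one()

# ...or use QUORUM for both reads and writes, which also satisfies R + W > N.
quorum_read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
```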