Chapter 1 NDB Cluster APIs: Overview and Concepts

Table of Contents

1.1 NDB Cluster API Overview: Introduction
1.1.1 NDB Cluster API Overview: The NDB API
1.1.2 NDB Cluster API Overview: The MGM API
1.2 NDB Cluster API Overview: Terminology
1.3 The NDB Transaction and Scanning API
1.3.1 Core NDB API Classes
1.3.2 Application Program Basics
1.3.3 Review of NDB Cluster Concepts
1.3.4 The Adaptive Send Algorithm

Abstract

This chapter provides a general overview of essential NDB Cluster, NDB API, and MGM API concepts, terminology, and programming constructs.

For an overview of Java APIs that can be used with NDB Cluster, see Section 4.1, “MySQL Cluster Connector for Java: Overview”.

For information about using Memcache with NDB Cluster, see Chapter 6, ndbmemcache—Memcache API for NDB Cluster.

For information about writing JavaScript applications using Node.js with MySQL, see Chapter 5, MySQL NoSQL Connector for JavaScript.

1.1 NDB Cluster API Overview: Introduction

Abstract

This section introduces the NDB Transaction and Scanning APIs as well as the NDB Management (MGM) API for use in building applications to run on NDB Cluster. It also discusses the general theory and principles involved in developing such applications.

1.1.1 NDB Cluster API Overview: The NDB API

The NDB API is an object-oriented application programming interface for NDB Cluster that implements indexes, scans, transactions, and event handling. NDB transactions are ACID-compliant in that they provide a means to group operations in such a way that they succeed (commit) or fail as a unit (rollback). It is also possible to perform operations in a no-commit or deferred mode, to be committed at a later time.

NDB scans are conceptually rather similar to the SQL cursors implemented in MySQL 5.0 and other common enterprise-level database management systems. These provide high-speed row processing for record retrieval purposes. (NDB Cluster naturally supports set processing just as does MySQL in its non-Cluster distributions. This can be accomplished through the usual MySQL APIs discussed in the MySQL Manual and elsewhere.) The NDB API supports both table scans and row scans; the latter can be performed using either unique or ordered indexes. Event detection and handling is discussed in Section 2.3.21, “The NdbEventOperation Class”, as well as Section 2.5.8, “NDB API Event Handling Example”.

In addition, the NDB API provides object-oriented error-handling facilities in order to provide a means of recovering gracefully from failed operations and other problems. (See Section 2.5.3, “NDB API Example: Handling Errors and Retrying Transactions”, for a detailed example.)

The NDB API provides a number of classes implementing the functionality described above. The most important of these include the Ndb, Ndb_cluster_connection, NdbTransaction, and NdbOperation classes. These model (respectively) database connections, cluster connections, transactions, and operations. These classes and their subclasses are listed in Section 2.3, “NDB API Classes, Interfaces, and Structures”. Error conditions in the NDB API are handled using NdbError.

Note

NDB API applications access the NDB Cluster's data store directly, without requiring a MySQL Server as an intermediary. This means that such applications are not bound by the MySQL privilege system; any NDB API application has read and write access to any NDB table stored in the same NDB Cluster at any time without restriction.

It is possible to distribute the MySQL grant tables, converting them from the default storage engine to NDB. Once this has been done, NDB API applications can access any of the MySQL grant tables. This means that such applications can read or write user names, passwords, and any other data stored in these tables.

1.1.2 NDB Cluster API Overview: The MGM API

The NDB Cluster Management API, also known as the MGM API, is a C-language programming interface intended to provide administrative services for the cluster. These include starting and stopping NDB Cluster nodes, handling NDB Cluster logging, backups, and restoration from backups, as well as various other management tasks. A conceptual overview of the MGM API and its uses can be found in Chapter 3, The MGM API.

The MGM API's principal structures model the states of individual nodes (ndb_mgm_node_state), the state of the NDB Cluster as a whole (ndb_mgm_cluster_state), and management server response messages (ndb_mgm_reply). See Section 3.4, “MGM API Structures”, for detailed descriptions of these.

1.2 NDB Cluster API Overview: Terminology

Abstract

This section provides a glossary of terms which are unique to the NDB and MGM APIs, or that have a specialized meaning when applied in the context of either or both of these APIs.

The terms in the following list are useful to an understanding of NDB Cluster and the NDB API, or have a specialized meaning when used in the context of one or both of these:

Backup.  A complete copy of all NDB Cluster data, transactions and logs, saved to disk.

Restore.  Return the cluster to a previous state, as stored in a backup.

Checkpoint.  Generally speaking, when data is saved to disk, it is said that a checkpoint has been reached. When working with the NDB storage engine, there are two sorts of checkpoints which work together in order to ensure that a consistent view of the cluster's data is maintained. These two types, local checkpoints and global checkpoints, are described in the next few paragraphs:

Local checkpoint (LCP).  This is a checkpoint that is specific to a single node; however, LCPs take place for all nodes in the cluster more or less concurrently. An LCP involves saving all of a node's data to disk, and so usually occurs every few minutes, depending upon the amount of data stored by the node.

More detailed information about LCPs and their behavior can be found in the MySQL Manual; see in particular Defining NDB Cluster Data Nodes.

Global checkpoint (GCP).  A GCP occurs every few seconds, when transactions for all nodes are synchronized and the REDO log is flushed to disk.

A related term is GCI, which stands for Global Checkpoint ID. This marks the point in the REDO log where a GCP took place.

Node.  A component of NDB Cluster. Three node types are supported:

  • A management (MGM) node is an instance of ndb_mgmd, the NDB Cluster management server daemon.

  • A data node is an instance of ndbd, the NDB Cluster data storage daemon, and stores NDB Cluster data. This may also be an instance of ndbmtd, a multithreaded version of ndbd.

  • An API node is an application that accesses NDB Cluster data. SQL node refers to a mysqld (MySQL Server) process that is connected to the NDB Cluster as an API node.

For more information about these node types, please refer to Section 1.3.3, “Review of NDB Cluster Concepts”, or to NDB Cluster Programs, in the MySQL Manual.

Node failure.  An NDB Cluster does not depend solely on the functioning of any single node making up the cluster; the cluster can continue to run even when one node fails.

Node restart.  The process of restarting an NDB Cluster node which has stopped on its own or been stopped deliberately. This can be done for several different reasons, listed here:

  • Restarting a node which has shut down on its own. (This is known as forced shutdown or node failure; the other cases discussed here involve manually shutting down the node and restarting it).

  • To update the node's configuration.

  • As part of a software or hardware upgrade.

  • In order to defragment the node's DataMemory.

Initial node restart.  The process of starting an NDB Cluster node with its file system having been removed. This is sometimes used in the course of software upgrades and in other special circumstances.

System crash (system failure).  This can occur when so many data nodes have failed that the NDB Cluster's state can no longer be guaranteed.

System restart.  The process of restarting an NDB Cluster and reinitializing its state from disk logs and checkpoints. This is required after any shutdown of the cluster, planned or unplanned.

Fragment.  Contains a portion of a database table. In the NDB storage engine, a table is broken up into and stored as a number of subsets, usually referred to as fragments. A fragment is sometimes also called a partition.

Replica.  Under the NDB storage engine, each table fragment has a number of replicas in order to provide redundancy.

Transporter.  A protocol providing data transfer across a network. The NDB API supports four different types of transporter connections: TCP/IP (local), TCP/IP (remote), SCI, and SHM. TCP/IP is, of course, the familiar network protocol that underlies HTTP, FTP, and so forth, on the Internet. SCI (Scalable Coherent Interface) is a high-speed protocol used in building multiprocessor systems and parallel-processing applications. SHM stands for Unix-style shared memory segments. For an informal introduction to SCI, see this essay at www.dolphinics.com.

NDB.  This originally stood for Network DataBase. It now refers to the MySQL storage engine (named NDB or NDBCLUSTER) used to enable the NDB Cluster distributed database system.

ACC (Access Manager).  An NDB kernel block that handles hash indexes of primary keys providing speedy access to the records. For more information, see The DBACC Block.

TUP (Tuple Manager).  This NDB kernel block handles storage of tuples (records) and contains the filtering engine used to filter out records and attributes when performing reads or updates. See The DBTUP Block, for more information.

TC (Transaction Coordinator).  Handles coordination of transactions and timeouts in the NDB kernel (see The DBTC Block). Provides interfaces to the NDB API for performing index and scan operations.

For more information, see NDB Kernel Blocks, elsewhere in this Guide.

See also NDB Cluster Overview, in the MySQL Manual.

1.3 The NDB Transaction and Scanning API

Abstract

This section discusses the high-level architecture of the NDB API, and introduces the NDB classes which are of greatest use and interest to the developer. It also covers most important NDB API concepts, including a review of NDB Cluster Concepts.

1.3.1 Core NDB API Classes

The NDB API is an NDB Cluster application interface that implements transactions. Its most fundamental classes are Ndb_cluster_connection, Ndb, NdbTransaction, and NdbOperation, introduced in Section 1.1.1, “NDB Cluster API Overview: The NDB API”.

In addition, the NDB API defines an NdbError structure, which contains the specification for an error.

It is also possible to receive events triggered when data in the database is changed. This is accomplished through the NdbEventOperation class.

The NDB event notification API is not supported prior to MySQL 5.1.

For more information about these classes as well as some additional auxiliary classes not listed here, see Section 2.3, “NDB API Classes, Interfaces, and Structures”.

1.3.2 Application Program Basics

The main structure of an application program is as follows:

  1. Connect to a cluster using the Ndb_cluster_connection object.

  2. Initiate a database connection by constructing and initialising one or more Ndb objects.

  3. Identify the tables, columns, and indexes on which you wish to operate, using NdbDictionary and one or more of its subclasses.

  4. Define and execute transactions using the NdbTransaction class.

  5. Delete Ndb objects.

  6. Terminate the connection to the cluster (terminate an instance of Ndb_cluster_connection).
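The six steps just listed can be sketched in outline as follows. Because compiling against the real NDB API requires the NDB client library and a running cluster, this sketch substitutes minimal stand-in classes whose names and method names mirror the real ones (Ndb_cluster_connection, Ndb, NdbTransaction); it illustrates only the order of calls, not actual cluster access.

```cpp
#include <string>
#include <vector>

// Stand-in classes so the sketch compiles without the NDB client library.
// In a real application these types come from NdbApi.hpp.
struct CallLog { std::vector<std::string> calls; };

struct NdbTransaction {
    CallLog* log;
    int execute() { log->calls.push_back("execute"); return 0; }
};

struct Ndb {
    CallLog* log;
    explicit Ndb(CallLog* l) : log(l) {}
    int init() { log->calls.push_back("init"); return 0; }
    NdbTransaction* startTransaction() {
        log->calls.push_back("startTransaction");
        return new NdbTransaction{log};
    }
    void closeTransaction(NdbTransaction* t) {
        log->calls.push_back("closeTransaction");
        delete t;
    }
};

struct Ndb_cluster_connection {
    CallLog* log;
    explicit Ndb_cluster_connection(CallLog* l) : log(l) {}
    int connect() { log->calls.push_back("connect"); return 0; }
};

// The steps above, in order: connect to the cluster, initialize an Ndb
// object, define and execute a transaction, then release everything.
// (Dictionary lookups, step 3, are omitted for brevity.)
int run_application(CallLog* log) {
    Ndb_cluster_connection conn(log);               // step 1
    if (conn.connect() != 0) return -1;
    Ndb ndb(log);                                   // step 2
    if (ndb.init() != 0) return -1;
    NdbTransaction* trans = ndb.startTransaction(); // step 4
    if (trans == nullptr) return -1;
    int rc = trans->execute();
    ndb.closeTransaction(trans);                    // step 5
    return rc;  // step 6: conn goes out of scope, ending the connection
}
```
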

1.3.2.1 Using Transactions

The procedure for using transactions is as follows:

  1. Start a transaction (instantiate an NdbTransaction object).

  2. Add and define operations associated with the transaction using instances of one or more of the NdbOperation, NdbScanOperation, NdbIndexOperation, and NdbIndexScanOperation classes.

  3. Execute the transaction (call NdbTransaction::execute()).

  4. The execute call can be of two different types, Commit or NoCommit:

    • If the operation is of type NoCommit, then the application program requests that the operation portion of a transaction be executed, but without actually committing the transaction. Following the execution of a NoCommit operation, the program can continue to define additional transaction operations for later execution.

      NoCommit operations can also be rolled back by the application.

    • If the operation is of type Commit, then the transaction is immediately committed. The transaction must be closed after it has been committed (even if the commit fails), and no further operations can be added to or defined for this transaction.

    See Section 2.3.30.5, “NdbTransaction::ExecType”.

1.3.2.2 Synchronous Transactions

Synchronous transactions are defined and executed as follows:

  1. Begin (create) the transaction, which is referenced by an NdbTransaction object typically created using Ndb::startTransaction(). At this point, the transaction is merely being defined; it is not yet sent to the NDB kernel.

  2. Define operations and add them to the transaction, using NdbTransaction::getNdbOperation(), NdbTransaction::getNdbScanOperation(), NdbTransaction::getNdbIndexOperation(), or NdbTransaction::getNdbIndexScanOperation(), along with the appropriate methods of the respective NdbOperation class (or possibly one or more of its subclasses).

    At this point, the transaction has still not yet been sent to the NDB kernel.

  3. Execute the transaction, using the NdbTransaction::execute() method.

  4. Close the transaction by calling Ndb::closeTransaction().

For an example of this process, see Section 2.5.1, “NDB API Example Using Synchronous Transactions”.

To execute several synchronous transactions in parallel, you can either use multiple Ndb objects in several threads, or start multiple application programs.

1.3.2.3 Operations

An NdbTransaction consists of a list of operations, each of which is represented by an instance of NdbOperation, NdbScanOperation, NdbIndexOperation, or NdbIndexScanOperation (that is, of NdbOperation or one of its child classes).

See Section 1.3.2.3.1, “NDB Access Types”, for general information about NDB Cluster access operation types.

1.3.2.3.1 NDB Access Types

The data node process has a number of simple constructs which are used to access the data in an NDB Cluster. We have created a very simple benchmark to check the performance of each of these.

There are four access methods:

  • Primary key access.  This is access of a record through its primary key. In the simplest case, only one record is accessed at a time, which means that the full cost of setting up a number of TCP/IP messages and a number of context switches is borne by this single request. In the case where multiple primary key accesses are sent in one batch, those accesses share the cost of setting up the necessary TCP/IP messages and context switches. If the TCP/IP messages are for different destinations, additional TCP/IP messages need to be set up.

  • Unique key access.  Unique key accesses are similar to primary key accesses, except that a unique key access is executed as a read on an index table followed by a primary key access on the table. However, only one request is sent from the MySQL Server, and the read of the index table is handled by the data node. Such requests also benefit from batching.

  • Full table scan.  When no indexes exist for a lookup on a table, a full table scan is performed. This is sent as a single request to the ndbd process, which then divides the table scan into a set of parallel scans on all NDB data node processes.

  • Range scan using ordered index.  When an ordered index is used, it performs a scan in the same manner as the full table scan, except that it scans only those records which are in the range used by the query transmitted by the MySQL server (SQL node). All partitions are scanned in parallel when all bound index attributes include all attributes in the partitioning key.
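The benefit of batching primary key accesses, described above, can be illustrated with a simple cost model. The cost constants here are arbitrary stand-ins for the message-setup and per-row costs, not measured NDB figures:

```cpp
#include <cstddef>

// Illustrative cost model for batched primary key access: every row pays a
// per-row cost, but the fixed message-setup cost is paid once per batch
// rather than once per row.
struct AccessCost {
    double setup_cost;    // fixed cost: TCP/IP message setup plus context switch
    double per_row_cost;  // marginal cost of one key lookup
};

// n separate single-row requests: each one pays the full setup cost.
double unbatched_cost(const AccessCost& c, std::size_t n) {
    return static_cast<double>(n) * (c.setup_cost + c.per_row_cost);
}

// One batch of n primary key accesses to the same destination: the setup
// cost is shared across all rows in the batch.
double batched_cost(const AccessCost& c, std::size_t n) {
    if (n == 0) return 0.0;
    return c.setup_cost + static_cast<double>(n) * c.per_row_cost;
}
```

With these stand-in constants, ten unbatched lookups pay the setup cost ten times, while a single batch of ten pays it once; for a single row the two are, of course, identical.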

1.3.2.3.2 Single-row operations

After the operation is created using NdbTransaction::getNdbOperation() or NdbTransaction::getNdbIndexOperation(), it is defined in the following three steps:

  1. Specify the standard operation type using NdbOperation::readTuple().

  2. Specify search conditions using NdbOperation::equal().

  3. Specify attribute actions using NdbOperation::getValue().

Here are two brief examples illustrating this process. For the sake of brevity, we omit error handling.

This first example uses an NdbOperation:

// 1. Retrieve table object
myTable= myDict->getTable("MYTABLENAME");

// 2. Create an NdbOperation on this table
myOperation= myTransaction->getNdbOperation(myTable);

// 3. Define the operation's type and lock mode
myOperation->readTuple(NdbOperation::LM_Read);

// 4. Specify search conditions
myOperation->equal("ATTR1", i);

// 5. Perform attribute retrieval
myRecAttr= myOperation->getValue("ATTR2", NULL);

For additional examples of this sort, see Section 2.5.1, “NDB API Example Using Synchronous Transactions”.

The second example uses an NdbIndexOperation:

// 1. Retrieve index object
myIndex= myDict->getIndex("MYINDEX", "MYTABLENAME");

// 2. Create an index operation (NdbIndexOperation) using this index
myOperation= myTransaction->getNdbIndexOperation(myIndex);

// 3. Define type of operation and lock mode
myOperation->readTuple(NdbOperation::LM_Read);

// 4. Specify Search Conditions
myOperation->equal("ATTR1", i);

// 5. Attribute Actions
myRecAttr = myOperation->getValue("ATTR2", NULL);

Another example of this second type can be found in Section 2.5.5, “NDB API Example: Using Secondary Indexes in Scans”.

We now discuss in somewhat greater detail each step involved in the creation and use of synchronous transactions.

  1. Define single row operation type.  The following operation types are supported: NdbOperation::insertTuple(), NdbOperation::updateTuple(), NdbOperation::writeTuple(), NdbOperation::deleteTuple(), and NdbOperation::readTuple().

    All of these operations operate on the unique tuple key. When NdbIndexOperation is used, then each of these operations operates on a defined unique hash index.

    Note

    If you want to define multiple operations within the same transaction, then you need to call NdbTransaction::getNdbOperation() or NdbTransaction::getNdbIndexOperation() for each operation.

  2. Specify Search Conditions.  The search condition is used to select tuples. Search conditions are set using NdbOperation::equal().

  3. Specify Attribute Actions.  Next, it is necessary to determine which attributes should be read or updated. It is important to remember that:

    • Deletes can neither read nor set values, but only delete them.

    • Reads can only read values.

    • Updates can only set values. Normally the attribute is identified by name, but it is also possible to use the attribute's identity instead.

    NdbOperation::getValue() returns an NdbRecAttr object containing the value as read. To obtain the actual value, one of two methods can be used: the application can either supply its own memory buffer to NdbOperation::getValue() (by means of the method's second argument), or let the NDB API allocate storage and read the value from the returned NdbRecAttr object after the transaction has been executed.

    The NdbRecAttr object is released when Ndb::closeTransaction() is called. For this reason, the application cannot reference this object following any subsequent call to Ndb::closeTransaction(). Attempting to read data from an NdbRecAttr object before calling NdbTransaction::execute() yields an undefined result.

1.3.2.3.3 Scan Operations

Scans are roughly the equivalent of SQL cursors, providing a means to perform high-speed row processing. A scan can be performed on either a table (using an NdbScanOperation) or an ordered index (by means of an NdbIndexScanOperation).

Scan operations have the following characteristics:

  • They can perform read operations which may be shared, exclusive, or dirty.

  • They can potentially work with multiple rows.

  • They can be used to update or delete multiple rows.

  • They can operate on several nodes in parallel.

After the operation is created using NdbTransaction::getNdbScanOperation() or NdbTransaction::getNdbIndexScanOperation(), it is carried out as follows:

  1. Define the standard operation type, using NdbScanOperation::readTuples().

    Note

    See Section 2.3.29.7, “NdbScanOperation::readTuples()”, for additional information about deadlocks which may occur when performing simultaneous, identical scans with exclusive locks.

  2. Specify search conditions, using NdbScanFilter, NdbIndexScanOperation::setBound(), or both.

  3. Specify attribute actions using NdbOperation::getValue().

  4. Execute the transaction using NdbTransaction::execute().

  5. Traverse the result set by means of successive calls to NdbScanOperation::nextResult().

Here are two brief examples illustrating this process. Once again, in order to keep things relatively short and simple, we forego any error handling.

This first example performs a table scan using an NdbScanOperation:

// 1. Retrieve a table object
myTable= myDict->getTable("MYTABLENAME");

// 2. Create a scan operation (NdbScanOperation) on this table
myOperation= myTransaction->getNdbScanOperation(myTable);

// 3. Define the operation's type and lock mode
myOperation->readTuples(NdbOperation::LM_Read);

// 4. Specify search conditions
NdbScanFilter sf(myOperation);
sf.begin(NdbScanFilter::OR);
sf.eq(0, i);   // Return rows with column 0 equal to i or
sf.eq(1, i+1); // column 1 equal to (i+1)
sf.end();

// 5. Retrieve attributes
myRecAttr= myOperation->getValue("ATTR2", NULL);

The second example uses an NdbIndexScanOperation to perform an index scan:

// 1. Retrieve index object
myIndex= myDict->getIndex("MYORDEREDINDEX", "MYTABLENAME");

// 2. Create an operation (NdbIndexScanOperation object)
myOperation= myTransaction->getNdbIndexScanOperation(myIndex);

// 3. Define type of operation and lock mode
myOperation->readTuples(NdbOperation::LM_Read);

// 4. Specify search conditions
// All rows with ATTR1 between i and (i+1)
myOperation->setBound("ATTR1", NdbIndexScanOperation::BoundGE, i);
myOperation->setBound("ATTR1", NdbIndexScanOperation::BoundLE, i+1);

// 5. Retrieve attributes
myRecAttr = myOperation->getValue("ATTR2", NULL);

Some additional discussion of each step required to perform a scan follows:

  1. Define Scan Operation Type.  It is important to remember that only a single operation is supported for each scan operation (NdbScanOperation::readTuples() or NdbIndexScanOperation::readTuples()).

    Note

    If you want to define multiple scan operations within the same transaction, then you need to call NdbTransaction::getNdbScanOperation() or NdbTransaction::getNdbIndexScanOperation() separately for each operation.

  2. Specify Search Conditions.  The search condition is used to select tuples. If no search condition is specified, the scan will return all rows in the table. The search condition can be an NdbScanFilter (which can be used on both NdbScanOperation and NdbIndexScanOperation) or bounds (which can be used only on index scans; see NdbIndexScanOperation::setBound()). An index scan can use both NdbScanFilter and bounds.

    Note

    When NdbScanFilter is used, each row is examined, whether or not it is actually returned. However, when using bounds, only rows within the bounds will be examined.

  3. Specify Attribute Actions.  Next, it is necessary to define which attributes should be read. As with transaction attributes, scan attributes are defined by name, but it is also possible to use the attributes' identities to define attributes as well. As discussed elsewhere in this document (see Section 1.3.2.2, “Synchronous Transactions”), the value read is returned by the NdbOperation::getValue() method as an NdbRecAttr object.

1.3.2.3.4 Using Scans to Update or Delete Rows

Scanning can also be used to update or delete rows. This is performed as follows:

  1. Scanning with exclusive locks using NdbOperation::LM_Exclusive.

  2. (When iterating through the result set:) For each row, optionally calling either NdbScanOperation::updateCurrentTuple() or NdbScanOperation::deleteCurrentTuple().

  3. (If performing NdbScanOperation::updateCurrentTuple():) Setting new values for records simply by using NdbOperation::setValue(). NdbOperation::equal() should not be called in such cases, as the primary key is retrieved from the scan.

Important

The update or delete is not actually performed until the next call to NdbTransaction::execute() is made, just as with single row operations. NdbTransaction::execute() also must be called before any locks are released; for more information, see Section 1.3.2.3.5, “Lock Handling with Scans”.

Features Specific to Index Scans.  When performing an index scan, it is possible to scan only a subset of a table using NdbIndexScanOperation::setBound(). In addition, result sets can be sorted in either ascending or descending order, using NdbIndexScanOperation::readTuples(). Note that rows are returned unordered by default unless sorted is set to true.

It is also important to note that, when using NdbIndexScanOperation::BoundEQ (see Section 2.3.23.1, “NdbIndexScanOperation::BoundType”) with a partition key, only fragments containing rows will actually be scanned. Finally, when performing a sorted scan, any value passed as the NdbIndexScanOperation::readTuples() method's parallel argument will be ignored and maximum parallelism will be used instead. In other words, all fragments which it is possible to scan are scanned simultaneously and in parallel in such cases.

1.3.2.3.5 Lock Handling with Scans

Performing scans on either a table or an index has the potential to return a great many records; however, Ndb locks only a predetermined number of rows per fragment at a time. The number of rows locked per fragment is controlled by the batch parameter passed to NdbScanOperation::readTuples().

In order to enable the application to handle how locks are released, NdbScanOperation::nextResult() has a Boolean parameter fetchAllowed. If NdbScanOperation::nextResult() is called with fetchAllowed equal to false, then no locks may be released as a result of the function call. Otherwise the locks for the current batch may be released.

This next example shows a scan delete that handles locks in an efficient manner. For the sake of brevity, we omit error-handling.

int check;

// Outer loop for each batch of rows
while((check = MyScanOperation->nextResult(true)) == 0)
{
  do
  {
    // Inner loop for each row within the batch
    MyScanOperation->deleteCurrentTuple();
  }
  while((check = MyScanOperation->nextResult(false)) == 0);

  // When there are no more rows in the batch, execute all defined deletes
  MyTransaction->execute(NoCommit);
}

For a more complete example of a scan, see Section 2.5.4, “NDB API Basic Scanning Example”.

1.3.2.3.6 Error Handling

Errors can occur either when operations making up a transaction are being defined, or when the transaction is actually being executed. Catching and handling either sort of error requires testing the value returned by NdbTransaction::execute(), and then, if an error is indicated (that is, if this value is equal to -1), using the two methods NdbTransaction::getNdbErrorLine() and NdbTransaction::getNdbErrorOperation() in order to identify the error's type and location:

This short example illustrates how to detect an error and to use these two methods to identify it:

theTransaction = theNdb->startTransaction();
theOperation = theTransaction->getNdbOperation("TEST_TABLE");
if(theOperation == NULL)
  goto error;

theOperation->readTuple(NdbOperation::LM_Read);
theOperation->setValue("ATTR_1", at1);
theOperation->setValue("ATTR_2", at1);  //  Error occurs here
theOperation->setValue("ATTR_3", at1);
theOperation->setValue("ATTR_4", at1);

if(theTransaction->execute(Commit) == -1)
{
  errorLine = theTransaction->getNdbErrorLine();
  errorOperation = theTransaction->getNdbErrorOperation();
}

Here, errorLine is 3, as the error occurred in the third method called on the NdbOperation object (in this case, theOperation). If the result of NdbTransaction::getNdbErrorLine() is 0, then the error occurred when the operations were executed. In this example, errorOperation is a pointer to the object theOperation. The NdbTransaction::getNdbError() method returns an NdbError object providing information about the error.

Note

Transactions are not automatically closed when an error occurs. You must call Ndb::closeTransaction() or NdbTransaction::close() to close the transaction.

See Section 2.3.16.2, “Ndb::closeTransaction()”, and Section 2.3.30.1, “NdbTransaction::close()”.

One recommended way to handle a transaction failure (that is, when an error is reported) is as shown here:

  1. Roll back the transaction by calling NdbTransaction::execute() with a special ExecType value for the type parameter.

    See Section 2.3.30.6, “NdbTransaction::execute()” and Section 2.3.30.5, “NdbTransaction::ExecType”, for more information about how this is done.

  2. Close the transaction by calling NdbTransaction::close().

  3. If the error was temporary, attempt to restart the transaction.
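The rollback-close-retry procedure just described can be sketched generically. Here attempt() and rollback_and_close() are hypothetical stand-ins for defining and executing the real transaction and for the rollback-then-NdbTransaction::close() sequence; only the control flow is shown:

```cpp
#include <functional>

// Outcome of one attempted transaction, standing in for the information an
// application would obtain from NdbError after NdbTransaction::execute().
enum class TxStatus { Ok, TemporaryError, PermanentError };

// Generic retry skeleton for the procedure above: on failure, roll back and
// close the transaction, then restart it only if the error was temporary.
TxStatus run_with_retries(const std::function<TxStatus()>& attempt,
                          const std::function<void()>& rollback_and_close,
                          int max_retries) {
    TxStatus status = TxStatus::PermanentError;
    for (int i = 0; i <= max_retries; ++i) {
        status = attempt();
        if (status == TxStatus::Ok)
            return status;                // committed; nothing to undo
        rollback_and_close();             // steps 1 and 2 above
        if (status == TxStatus::PermanentError)
            return status;                // step 3: only temporary errors retry
    }
    return status;  // still failing after max_retries temporary errors
}
```
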

Several errors can occur when a transaction contains multiple operations which are simultaneously executed. In this case the application must go through all operations and query each of their NdbError objects to find out what really happened.

Important

Errors can occur even when a commit is reported as successful. In order to handle such situations, the NDB API provides an additional NdbTransaction::commitStatus() method to check the transaction's commit status.

See Section 2.3.30.2, “NdbTransaction::commitStatus()”.

1.3.3 Review of NDB Cluster Concepts

Abstract

This section covers the NDB Kernel, and discusses NDB Cluster transaction handling and transaction coordinators. It also describes NDB record structures and concurrency issues.

The NDB Kernel is the collection of data nodes belonging to an NDB Cluster. The application programmer can for most purposes view the set of all storage nodes as a single entity. Each data node is made up of three main components:

  • TC: The transaction coordinator.

  • ACC: The index storage component.

  • TUP: The data storage component.

When an application executes a transaction, it connects to one transaction coordinator on one data node. Usually, the programmer does not need to specify which TC should be used, but in some cases where performance is important, the programmer can provide hints to use a certain TC. (If the node with the desired transaction coordinator is down, then another TC will automatically take its place.)

Each data node has an ACC and a TUP which store the indexes and data portions of the database table fragment. Even though a single TC is responsible for the transaction, several ACCs and TUPs on other data nodes might be involved in that transaction's execution.

1.3.3.1 Selecting a Transaction Coordinator

The default method is to select the transaction coordinator (TC) determined to be the "nearest" data node, using a heuristic for proximity based on the type of transporter connection. In order of nearest to most distant, these are:

  1. SCI

  2. SHM

  3. TCP/IP (localhost)

  4. TCP/IP (remote host)

If there are several connections available with the same proximity, one is selected for each transaction in a round-robin fashion. Optionally, you may set the method for TC selection to round-robin mode, where each new set of transactions is placed on the next data node. The pool of connections from which this selection is made consists of all available connections.
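The selection policy just described can be modeled as follows. This is a conceptual sketch of the nearest-first, round-robin behavior, not the actual NDB implementation:

```cpp
#include <cstddef>
#include <vector>

// Transporter proximity ranks, nearest first, as listed above.
enum class Transporter { SCI = 0, SHM = 1, TcpLocal = 2, TcpRemote = 3 };

// A sketch of the default TC selection: find the nearest proximity rank
// present in the connection pool, then rotate round-robin among the
// connections sharing that rank.
struct TcSelector {
    std::vector<Transporter> pool;  // one entry per available connection
    std::size_t next = 0;           // round-robin cursor

    // Returns the index (into pool) of the connection chosen for the next
    // transaction, or -1 if the pool is empty.
    int select() {
        if (pool.empty()) return -1;
        Transporter best = pool[0];
        for (Transporter t : pool)
            if (static_cast<int>(t) < static_cast<int>(best)) best = t;
        // Collect the indices of all equally-near candidates.
        std::vector<int> candidates;
        for (std::size_t i = 0; i < pool.size(); ++i)
            if (pool[i] == best) candidates.push_back(static_cast<int>(i));
        int chosen = candidates[next % candidates.size()];
        ++next;
        return chosen;
    }
};
```

Given a pool containing one remote TCP/IP connection and two SHM connections, for example, the two SHM connections are always preferred and alternate between successive transactions.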

As noted in Section 1.3.3, “Review of NDB Cluster Concepts”, the application programmer can provide hints to the NDB API as to which transaction coordinator should be used. This is done by providing a table and a partition key (usually the primary key). If the primary key is the partition key, then the transaction is placed on the node where the primary replica of that record resides. Note that this is only a hint; the system can be reconfigured at any time, in which case the NDB API chooses a transaction coordinator without using the hint. For more information, see Section 2.3.2.4.11, “Column::getPartitionKey()”, and Section 2.3.16.34, “Ndb::startTransaction()”.

The application programmer can specify the partition key from SQL by using the following construct:

CREATE TABLE ... ENGINE=NDB PARTITION BY KEY (attribute_list);

For additional information, see Partitioning, and in particular KEY Partitioning, in the MySQL Manual.

1.3.3.2 NDB Record Structure

The NDB storage engine used by NDB Cluster is a relational database engine that stores records in tables, as do other relational database systems. Table rows represent records as tuples of relational data. When a new table is created, its attribute schema is specified for the table as a whole, so each table row has the same structure; in this regard, NDB is typical of relational databases.

Primary Keys.  The primary key of each table consists of from 1 to 32 of the table's attributes.

Transactions.  Transactions are committed first to main memory, and then to disk, after a global checkpoint (GCP) is issued. Since all data are (in most NDB Cluster configurations) synchronously replicated and stored on multiple data nodes, the system can handle processor failures without loss of data. However, in the case of a system-wide failure, all transactions (committed or not) occurring since the most recent GCP are lost.

Concurrency Control.  NDB uses pessimistic concurrency control based on locking. If a requested lock (implicit, with its type depending on the database operation) cannot be acquired within a specified time, a timeout error results.

Concurrent transactions as requested by parallel application programs and thread-based applications can sometimes deadlock when they try to access the same information simultaneously. Thus, applications need to be written in a manner such that timeout errors occurring due to such deadlocks are handled gracefully. This generally means that the transaction encountering a timeout should be rolled back and restarted.
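The retry-on-timeout pattern described above can be sketched as follows. This is a minimal illustration only, assuming an already-connected Ndb object and using the NdbError::TemporaryError status classification to distinguish retryable errors; the helper name and retry policy are ours, not part of the NDB API.

```cpp
// Sketch: gracefully handling timeout errors by rolling back and retrying.
// Assumes an established Ndb object; maxRetries and the error handling
// policy shown here are illustrative.
#include <NdbApi.hpp>

bool executeWithRetry(Ndb *ndb, int maxRetries)
{
  for (int attempt = 0; attempt < maxRetries; attempt++)
  {
    NdbTransaction *trans = ndb->startTransaction();
    if (trans == nullptr)
      return false;

    // ... define the transaction's operations on trans here ...

    if (trans->execute(NdbTransaction::Commit) == 0)
    {
      ndb->closeTransaction(trans);   // success
      return true;
    }

    NdbError err = trans->getNdbError();
    ndb->closeTransaction(trans);     // closing an uncommitted transaction
                                      // rolls it back

    // Retry only temporary errors (such as lock wait timeouts);
    // give up immediately on permanent errors.
    if (err.status != NdbError::TemporaryError)
      return false;
  }
  return false;                        // retries exhausted
}
```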

Hints and Performance.  Placing the transaction coordinator in close proximity to the actual data used in the transaction can in many cases improve performance significantly. This is particularly true for systems using TCP/IP. For example, a Solaris system using a single 500 MHz processor has a cost model for TCP/IP communication which can be represented by the formula

[30 microseconds] + ([100 nanoseconds] * [number of bytes])

This means that the fixed per-message cost dominates for small messages: if we can route most traffic over a few heavily used links, we increase the buffering on those links and thus drastically reduce the per-operation cost of communication. The same system using SCI has a different cost model:

[5 microseconds] + ([10 nanoseconds] * [number of bytes])

This means that the efficiency of an SCI system is much less dependent on selection of transaction coordinators. Typically, TCP/IP systems spend 30 to 60% of their working time on communication, whereas for SCI systems this figure is in the range of 5 to 10%. Thus, employing SCI for data transport means that less effort from the NDB API programmer is required and greater scalability can be achieved, even for applications using data from many different parts of the database.

A simple example is an application performing many simple updates, where each transaction updates a single record. This record has a 32-bit primary key, which also serves as the partitioning key. In this case, keyData is set to the address of the integer primary key and keyLen is 4.
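The example above can be sketched with Ndb::startTransaction(), which accepts the table, keyData, and keyLen as a transaction coordinator hint. The table name is hypothetical, and error handling is minimal; this is an illustration of the hint mechanism, not a complete program.

```cpp
// Sketch: hinting TC selection for a table whose 32-bit primary key is
// also its partitioning key. The table name "my_table" is hypothetical.
#include <NdbApi.hpp>

NdbTransaction *startHintedTransaction(Ndb *ndb, Uint32 primaryKey)
{
  NdbDictionary::Dictionary *dict = ndb->getDictionary();
  const NdbDictionary::Table *table = dict->getTable("my_table");
  if (table == nullptr)
    return nullptr;

  // keyData is the address of the 4-byte key value; keyLen is 4, as in
  // the text. NDB uses this hint to start the transaction on the node
  // holding the record's primary replica.
  return ndb->startTransaction(table,
                               reinterpret_cast<const char *>(&primaryKey),
                               sizeof(primaryKey));
}
```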

1.3.4 The Adaptive Send Algorithm

Abstract

Discusses the mechanics of transaction handling and transmission in NDB Cluster and the NDB API, and the objects used to implement these.

When transactions are sent using NdbTransaction::execute(), they are not immediately transferred to the NDB Kernel. Instead, transactions are kept in a special send list (buffer) in the Ndb object to which they belong. The adaptive send algorithm decides when transactions should actually be transferred to the NDB kernel.

The NDB API is designed as a multi-threaded interface, and so it is often desirable to transfer database operations from more than one thread at a time. The NDB API keeps track of which Ndb objects are active in transferring information to the NDB kernel and the expected number of threads to interact with the NDB kernel. Note that a given instance of Ndb should be used in at most one thread; different threads should not share the same Ndb object.

There are four conditions leading to the transfer of database operations from Ndb object buffers to the NDB kernel:

  1. The NDB Transporter (TCP/IP, SCI, or shared memory) decides that a buffer is full and sends it off. The buffer size is implementation-dependent and may change between NDB Cluster releases. When TCP/IP is the transporter, the buffer size is usually around 64 KB. Since each Ndb object provides a single buffer per data node, the notion of a full buffer is local to each data node.

  2. The accumulation of statistical data on transferred information may force sending of buffers to all storage nodes (that is, when all the buffers become full).

  3. Every 10 milliseconds, a special transmission thread checks whether or not any send activity has occurred. If not, the thread forces transmission to all nodes. This means that 20 milliseconds is the maximum amount of time that database operations are kept waiting before being dispatched. A reduction of this limit to 10 milliseconds is likely in future releases of NDB Cluster; checks more frequent than this would require additional support from the operating system.

  4. For methods that are affected by the adaptive send algorithm (such as NdbTransaction::execute()), there is a force parameter that overrides its default behavior in this regard and forces immediate transmission to all nodes. See the individual NDB API class listings for more information.

The conditions listed above are subject to change in future releases of NDB Cluster.
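Condition 4 can be sketched as follows, using the force parameter of NdbTransaction::execute(). The abort option shown is one possible choice, and the wrapper function is ours; consult the class listings for the full set of parameter values.

```cpp
// Sketch: overriding the adaptive send algorithm. A nonzero force value
// in the third argument to execute() requests immediate transmission of
// buffered operations rather than waiting for the algorithm's conditions.
#include <NdbApi.hpp>

int executeForced(NdbTransaction *trans)
{
  return trans->execute(NdbTransaction::Commit,
                        NdbOperation::AbortOnError,
                        1 /* force immediate send */);
}
```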