Abstract
This chapter provides a general overview of essential NDB Cluster, NDB API, and MGM API concepts, terminology, and programming constructs.
For an overview of Java APIs that can be used with NDB Cluster, see Section 4.1, “MySQL Cluster Connector for Java: Overview”.
For information about using Memcache with NDB Cluster, see Chapter 6, ndbmemcache—Memcache API for NDB Cluster.
For information about writing JavaScript applications using Node.js with MySQL, see Chapter 5, MySQL NoSQL Connector for JavaScript.
Abstract
This section introduces the NDB Transaction and Scanning APIs as well as the NDB Management (MGM) API for use in building applications to run on NDB Cluster. It also discusses the general theory and principles involved in developing such applications.
The NDB API is an object-oriented application programming interface for NDB Cluster that implements indexes, scans, transactions, and event handling. NDB transactions are ACID-compliant in that they provide a means to group operations in such a way that they succeed (commit) or fail as a unit (rollback). It is also possible to perform operations in a “no-commit” or deferred mode, to be committed at a later time.
NDB scans are conceptually rather similar to the SQL cursors implemented in MySQL 5.0 and other common enterprise-level database management systems. These provide high-speed row processing for record retrieval purposes. (NDB Cluster naturally supports set processing just as does MySQL in its non-Cluster distributions. This can be accomplished through the usual MySQL APIs discussed in the MySQL Manual and elsewhere.) The NDB API supports both table scans and row scans; the latter can be performed using either unique or ordered indexes. Event detection and handling is discussed in Section 2.3.21, “The NdbEventOperation Class”, as well as Section 2.5.8, “NDB API Event Handling Example”.
In addition, the NDB API provides object-oriented error-handling facilities in order to provide a means of recovering gracefully from failed operations and other problems. (See Section 2.5.3, “NDB API Example: Handling Errors and Retrying Transactions”, for a detailed example.)
The NDB API provides a number of classes implementing the
functionality described above. The most important of these
include the Ndb,
Ndb_cluster_connection,
NdbTransaction, and
NdbOperation classes. These
model (respectively) database connections, cluster connections,
transactions, and operations. These classes and their subclasses
are listed in Section 2.3, “NDB API Classes, Interfaces, and Structures”. Error conditions in
the NDB API are handled using
NdbError.
NDB API applications access the NDB Cluster's data store
directly, without requiring a MySQL Server as an intermediary.
This means that such applications are not bound by the MySQL
privilege system; any NDB API application has read and write
access to any NDB table stored in
the same NDB Cluster at any time without restriction.
It is possible to distribute the MySQL grant tables,
converting them from the default storage engine to
NDB. Once this has been done, NDB
API applications can access any of the MySQL grant tables.
This means that such applications can read or write user
names, passwords, and any other data stored in these tables.
The NDB Cluster Management API, also known as the MGM API, is a C-language programming interface intended to provide administrative services for the cluster. These include starting and stopping NDB Cluster nodes, handling NDB Cluster logging, backups, and restoration from backups, as well as various other management tasks. A conceptual overview of the MGM API and its uses can be found in Chapter 3, The MGM API.
The MGM API's principal structures model the states of
individual nodes
(ndb_mgm_node_state), the state
of the NDB Cluster as a whole
(ndb_mgm_cluster_state), and
management server response messages
(ndb_mgm_reply). See
Section 3.4, “MGM API Structures”, for detailed descriptions of
these.
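The use of these structures can be sketched with a short status query. This is a sketch only: it assumes the MGM API header is available, uses a placeholder connect string, and omits most error handling.

```cpp
#include <mgmapi.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    // Connect to the management server (connect string is a placeholder)
    NdbMgmHandle handle = ndb_mgm_create_handle();
    ndb_mgm_set_connectstring(handle, "localhost:1186");
    if (ndb_mgm_connect(handle, /* no_retries */ 3, /* retry_delay */ 5, /* verbose */ 1) < 0)
        return 1;

    // ndb_mgm_get_status() returns an ndb_mgm_cluster_state, which
    // embeds one ndb_mgm_node_state per node in the cluster
    struct ndb_mgm_cluster_state *state = ndb_mgm_get_status(handle);
    if (state != NULL)
    {
        for (int i = 0; i < state->no_of_nodes; i++)
        {
            struct ndb_mgm_node_state *node = &state->node_states[i];
            std::printf("node %d: type=%d status=%d\n",
                        node->node_id, (int) node->node_type,
                        (int) node->node_status);
        }
        free(state);    // the caller frees the returned state structure
    }
    ndb_mgm_destroy_handle(&handle);
    return 0;
}
```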
Abstract
This section provides a glossary of terms which are unique to the NDB and MGM APIs, or that have a specialized meaning when applied in the context of either or both of these APIs.
The terms in the following list are useful to an understanding of NDB Cluster and the NDB API, or have a specialized meaning when used in the context of one of these:
Backup. A complete copy of all NDB Cluster data, transactions and logs, saved to disk.
Restore. Return the cluster to a previous state, as stored in a backup.
Checkpoint.
Generally speaking, when data is saved to disk, it is said that
a checkpoint has been reached. When working with the
NDB storage engine, there are two sorts of
checkpoints which work together in order to ensure that a
consistent view of the cluster's data is maintained. These two
types, local checkpoints
and global checkpoints,
are described in the next few paragraphs:
Local checkpoint (LCP). This is a checkpoint that is specific to a single node; however, LCPs take place for all nodes in the cluster more or less concurrently. An LCP involves saving all of a node's data to disk, and so usually occurs every few minutes, depending upon the amount of data stored by the node.
More detailed information about LCPs and their behavior can be found in the MySQL Manual; see in particular Defining NDB Cluster Data Nodes.
Global checkpoint (GCP). A GCP occurs every few seconds, when transactions for all nodes are synchronized and the REDO log is flushed to disk.
A related term is GCI, which stands for “Global Checkpoint ID”. This marks the point in the REDO log where a GCP took place.
Node. A component of NDB Cluster. Three node types are supported:
A management (MGM) node is an instance of ndb_mgmd, the NDB Cluster management server daemon.
A data node is an instance of ndbd, the NDB Cluster data storage daemon, and stores NDB Cluster data. This may also be an instance of ndbmtd, a multithreaded version of ndbd.
An API node is an application that accesses NDB Cluster data. SQL node refers to a mysqld (MySQL Server) process that is connected to the NDB Cluster as an API node.
For more information about these node types, please refer to Section 1.3.3, “Review of NDB Cluster Concepts”, or to NDB Cluster Programs, in the MySQL Manual.
Node failure. An NDB Cluster is not solely dependent upon the functioning of any single node making up the cluster; the cluster can continue to run even when one node fails.
Node restart. The process of restarting an NDB Cluster node which has stopped on its own or been stopped deliberately. This can be done for several different reasons, listed here:
Restarting a node which has shut down on its own. (This is known as forced shutdown or node failure; the other cases discussed here involve manually shutting down the node and restarting it).
To update the node's configuration.
As part of a software or hardware upgrade.
In order to defragment the node's
DataMemory.
Initial node restart. The process of starting an NDB Cluster node with its file system having been removed. This is sometimes used in the course of software upgrades and in other special circumstances.
System crash (system failure). This can occur when so many data nodes have failed that the NDB Cluster's state can no longer be guaranteed.
System restart. The process of restarting an NDB Cluster and reinitializing its state from disk logs and checkpoints. This is required after any shutdown of the cluster, planned or unplanned.
Fragment.
Contains a portion of a database table. In the
NDB storage engine, a table is broken up into,
and stored as, a number of subsets, usually referred to as
fragments. A fragment is sometimes also called a
partition.
Replica.
Under the NDB storage engine, each table
fragment has a number of replicas in order to provide redundancy.
Transporter. A protocol providing data transfer across a network. The NDB API supports four different types of transporter connections: TCP/IP (local), TCP/IP (remote), SCI, and SHM. TCP/IP is, of course, the familiar network protocol that underlies HTTP, FTP, and so forth, on the Internet. SCI (Scalable Coherent Interface) is a high-speed protocol used in building multiprocessor systems and parallel-processing applications. SHM stands for Unix-style shared memory segments. For an informal introduction to SCI, see this essay at www.dolphinics.com.
NDB.
This originally stood for “Network DataBase”. It
now refers to the MySQL storage engine (named
NDB or
NDBCLUSTER) used to enable the NDB
Cluster distributed database system.
ACC (Access Manager). An NDB kernel block that handles hash indexes of primary keys providing speedy access to the records. For more information, see The DBACC Block.
TUP (Tuple Manager). This NDB kernel block handles storage of tuples (records) and contains the filtering engine used to filter out records and attributes when performing reads or updates. See The DBTUP Block, for more information.
TC (Transaction Coordinator). Handles coordination of transactions and timeouts in the NDB kernel (see The DBTC Block). Provides interfaces to the NDB API for performing index and scan operations.
For more information, see NDB Kernel Blocks, elsewhere in this Guide.
See also NDB Cluster Overview, in the MySQL Manual.
Abstract
This section discusses the high-level architecture of the NDB API, and introduces the NDB classes which are of greatest use and interest to the developer. It also covers the most important NDB API concepts, including a review of NDB Cluster Concepts.
The NDB API is an NDB Cluster application interface that implements transactions. It consists of the following fundamental classes:
Ndb_cluster_connection
represents a connection to a cluster.
Ndb is the main class, and
represents a connection to a database.
NdbDictionary provides
meta-information about tables and attributes.
NdbTransaction represents a
transaction.
NdbOperation represents an
operation using a primary key.
NdbScanOperation represents
an operation performing a full table scan.
NdbIndexOperation
represents an operation using a unique hash index.
NdbIndexScanOperation
represents an operation performing a scan using an ordered
index.
NdbRecAttr represents an
attribute value.
In addition, the NDB API defines an
NdbError structure, which
contains the specification for an error.
It is also possible to receive events triggered when data in the
database is changed. This is accomplished through the
NdbEventOperation class.
The NDB event notification API is not
supported prior to MySQL 5.1.
For more information about these classes as well as some additional auxiliary classes not listed here, see Section 2.3, “NDB API Classes, Interfaces, and Structures”.
The main structure of an application program is as follows:
Connect to a cluster using the
Ndb_cluster_connection
object.
Initiate a database connection by constructing and
initializing one or more
Ndb objects.
Identify the tables, columns, and indexes on which you wish
to operate, using
NdbDictionary and one or
more of its subclasses.
Define and execute transactions using the
NdbTransaction class.
Delete Ndb objects.
Terminate the connection to the cluster (terminate an
instance of
Ndb_cluster_connection).
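The steps above fit together roughly as follows. This is an outline only, assuming the NdbApi.hpp header and a cluster reachable at a placeholder connect string; error checking and the actual operations are omitted.

```cpp
#include <NdbApi.hpp>

int main()
{
    ndb_init();                                      // initialize the NDB API once per process

    // 1. Connect to the cluster
    Ndb_cluster_connection conn("localhost:1186");
    conn.connect();                                  // retries and delays left at defaults here
    conn.wait_until_ready(30, 0);

    // 2. Initiate a database connection
    Ndb ndb(&conn, "TEST_DB");
    ndb.init();

    // 3. Obtain metadata for the tables to be used
    NdbDictionary::Dictionary *dict = ndb.getDictionary();
    const NdbDictionary::Table *table = dict->getTable("MYTABLENAME");

    // 4. Define and execute a transaction
    NdbTransaction *trans = ndb.startTransaction();
    // ... operations on *table* would be defined here ...
    trans->execute(NdbTransaction::Commit);
    ndb.closeTransaction(trans);

    // 5-6. The Ndb object and the cluster connection are released
    //      when ndb and conn go out of scope
    ndb_end(0);
    return 0;
}
```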
The procedure for using transactions is as follows:
Start a transaction (instantiate an
NdbTransaction object).
Add and define operations associated with the transaction
using instances of one or more of the
NdbOperation,
NdbScanOperation,
NdbIndexOperation, and
NdbIndexScanOperation
classes.
Execute the transaction (call
NdbTransaction::execute()).
The operation can be of two different
types—Commit or
NoCommit:
If the operation is of type
NoCommit, then the application
program requests that the operation portion of a
transaction be executed, but without actually
committing the transaction. Following the execution of
a NoCommit operation, the program
can continue to define additional transaction
operations for later execution.
NoCommit operations can also be
rolled back by the application.
If the operation is of type Commit,
then the transaction is immediately committed. The
transaction must be closed after it has been committed
(even if the commit fails), and no further operations
can be added to or defined for this transaction.
Synchronous transactions are defined and executed as follows:
Begin (create) the transaction, which is referenced by an
NdbTransaction object
typically created using
Ndb::startTransaction().
At this point, the transaction is merely being defined; it
is not yet sent to the NDB kernel.
Define operations and add them to the transaction, using
one or more of NdbTransaction::getNdbOperation(),
NdbTransaction::getNdbScanOperation(),
NdbTransaction::getNdbIndexOperation(), or
NdbTransaction::getNdbIndexScanOperation(), along with the
appropriate methods of the respective NdbOperation
class (or possibly one or more of its subclasses).
At this point, the transaction has still not yet been sent to the NDB kernel.
Execute the transaction, using the
NdbTransaction::execute()
method.
Close the transaction by calling
Ndb::closeTransaction().
For an example of this process, see Section 2.5.1, “NDB API Example Using Synchronous Transactions”.
To execute several synchronous transactions in parallel, you
can either use multiple Ndb
objects in several threads, or start multiple application
programs.
An NdbTransaction consists of
a list of operations, each of which is represented by an
instance of NdbOperation,
NdbScanOperation,
NdbIndexOperation, or
NdbIndexScanOperation (that
is, of NdbOperation or one of
its child classes).
See Section 1.3.2.3.1, “NDB Access Types”, for general information about NDB Cluster access operation types.
The data node process has a number of simple constructs which are used to access the data in an NDB Cluster. We have created a very simple benchmark to check the performance of each of these.
There are four access methods:
Primary key access. This is access of a record through its primary key. In the simplest case, only one record is accessed at a time, which means that the full cost of setting up a number of TCP/IP messages and a number of costs for context switching are borne by this single request. In the case where multiple primary key accesses are sent in one batch, those accesses share the cost of setting up the necessary TCP/IP messages and context switches. If the TCP/IP messages are for different destinations, additional TCP/IP messages need to be set up.
Unique key access. Unique key accesses are similar to primary key accesses, except that a unique key access is executed as a read on an index table followed by a primary key access on the table. However, only one request is sent from the MySQL Server, and the read of the index table is handled by the data node. Such requests also benefit from batching.
Full table scan.
When no indexes exist for a lookup on a table, a full
table scan is performed. This is sent as a single
request to the ndbd process, which
then divides the table scan into a set of parallel
scans on all NDB data node
processes.
Range scan using ordered index. When an ordered index is used, it performs a scan in the same manner as the full table scan, except that it scans only those records which are in the range used by the query transmitted by the MySQL server (SQL node). All partitions are scanned in parallel when all bound index attributes include all attributes in the partitioning key.
After the operation is created using NdbTransaction::getNdbOperation() or NdbTransaction::getNdbIndexOperation(), it is defined in the following three steps:
Specify the standard operation type using
NdbOperation::readTuple().
Specify search conditions using
NdbOperation::equal().
Specify attribute actions using
NdbOperation::getValue().
Here are two brief examples illustrating this process. For the sake of brevity, we omit error handling.
This first example uses an
NdbOperation:
// 1. Retrieve table object
myTable= myDict->getTable("MYTABLENAME");
// 2. Create an NdbOperation on this table
myOperation= myTransaction->getNdbOperation(myTable);
// 3. Define the operation's type and lock mode
myOperation->readTuple(NdbOperation::LM_Read);
// 4. Specify search conditions
myOperation->equal("ATTR1", i);
// 5. Perform attribute retrieval
myRecAttr= myOperation->getValue("ATTR2", NULL);
For additional examples of this sort, see Section 2.5.1, “NDB API Example Using Synchronous Transactions”.
The second example uses an
NdbIndexOperation:
// 1. Retrieve index object
myIndex= myDict->getIndex("MYINDEX", "MYTABLENAME");
// 2. Create an NdbIndexOperation on this index
myOperation= myTransaction->getNdbIndexOperation(myIndex);
// 3. Define type of operation and lock mode
myOperation->readTuple(NdbOperation::LM_Read);
// 4. Specify search conditions
myOperation->equal("ATTR1", i);
// 5. Retrieve attributes
myRecAttr = myOperation->getValue("ATTR2", NULL);
Another example of this second type can be found in Section 2.5.5, “NDB API Example: Using Secondary Indexes in Scans”.
We now discuss in somewhat greater detail each step involved in the creation and use of synchronous transactions.
Define single row operation type. The following operation types are supported:
NdbOperation::insertTuple():
Inserts a nonexisting tuple.
NdbOperation::writeTuple():
Updates a tuple if one exists, otherwise inserts a
new tuple.
NdbOperation::updateTuple():
Updates an existing tuple.
NdbOperation::deleteTuple():
Deletes an existing tuple.
NdbOperation::readTuple():
Reads an existing tuple using the specified lock
mode.
All of these operations operate on the unique tuple key.
When NdbIndexOperation
is used, then each of these operations operates on a
defined unique hash index.
If you want to define multiple operations within the
same transaction, then you need to call
NdbTransaction::getNdbOperation()
or
NdbTransaction::getNdbIndexOperation()
for each operation.
Specify Search Conditions.
The search condition is used to select tuples. Search
conditions are set using
NdbOperation::equal().
Specify Attribute Actions. Next, it is necessary to determine which attributes should be read or updated. It is important to remember that:
Deletes can neither read nor set values, but only delete them.
Reads can only read values.
Updates can only set values.
Normally the attribute is identified by name, but it is also possible to use the attribute's identity to determine the attribute.
NdbOperation::getValue()
returns an NdbRecAttr
object containing the value as read. To obtain the
actual value, one of two methods can be used; the
application can either
Use its own memory (passed through a pointer
aValue) to
NdbOperation::getValue(),
or
receive the attribute value in an
NdbRecAttr object
allocated by the NDB API.
The NdbRecAttr object
is released when
Ndb::closeTransaction()
is called. For this reason, the application cannot
reference this object following any subsequent call to
Ndb::closeTransaction().
Attempting to read data from an
NdbRecAttr object
before calling
NdbTransaction::execute()
yields an undefined result.
Scans are roughly the equivalent of SQL cursors, providing a
means to perform high-speed row processing. A scan can be
performed on either a table (using an
NdbScanOperation) or an
ordered index (by means of an
NdbIndexScanOperation).
Scan operations have the following characteristics:
They can perform read operations which may be shared, exclusive, or dirty.
They can potentially work with multiple rows.
They can be used to update or delete multiple rows.
They can operate on several nodes in parallel.
After the operation is created using
NdbTransaction::getNdbScanOperation()
or
NdbTransaction::getNdbIndexScanOperation(),
it is carried out as follows:
Define the standard operation type, using
NdbScanOperation::readTuples().
See Section 2.3.29.7, “NdbScanOperation::readTuples()”, for additional information about deadlocks which may occur when performing simultaneous, identical scans with exclusive locks.
Specify search conditions, using
NdbScanFilter,
NdbIndexScanOperation::setBound(),
or both.
Specify attribute actions using
NdbOperation::getValue().
Execute the transaction using
NdbTransaction::execute().
Traverse the result set by means of successive calls to
NdbScanOperation::nextResult().
Here are two brief examples illustrating this process. Once again, in order to keep things relatively short and simple, we forego any error handling.
This first example performs a table scan using an
NdbScanOperation:
// 1. Retrieve a table object
myTable= myDict->getTable("MYTABLENAME");
// 2. Create a scan operation (NdbScanOperation) on this table
myOperation= myTransaction->getNdbScanOperation(myTable);
// 3. Define the operation's type and lock mode
myOperation->readTuples(NdbOperation::LM_Read);
// 4. Specify search conditions
NdbScanFilter sf(myOperation);
sf.begin(NdbScanFilter::OR);
sf.eq(0, i); // Return rows with column 0 equal to i or
sf.eq(1, i+1); // column 1 equal to (i+1)
sf.end();
// 5. Retrieve attributes
myRecAttr= myOperation->getValue("ATTR2", NULL);
The second example uses an
NdbIndexScanOperation to
perform an index scan:
// 1. Retrieve index object
myIndex= myDict->getIndex("MYORDEREDINDEX", "MYTABLENAME");
// 2. Create an operation (NdbIndexScanOperation object)
myOperation= myTransaction->getNdbIndexScanOperation(myIndex);
// 3. Define type of operation and lock mode
myOperation->readTuples(NdbOperation::LM_Read);
// 4. Specify search conditions
// All rows with ATTR1 between i and (i+1)
myOperation->setBound("ATTR1", NdbIndexScanOperation::BoundGE, i);
myOperation->setBound("ATTR1", NdbIndexScanOperation::BoundLE, i+1);
// 5. Retrieve attributes
myRecAttr = myOperation->getValue("ATTR2", NULL);
Some additional discussion of each step required to perform a scan follows:
Define Scan Operation Type.
It is important to remember that only a single
operation is supported for each scan operation
(NdbScanOperation::readTuples()
or
NdbIndexScanOperation::readTuples()).
If you want to define multiple scan operations within
the same transaction, then you need to call
NdbTransaction::getNdbScanOperation()
or
NdbTransaction::getNdbIndexScanOperation()
separately for each operation.
Specify Search Conditions.
The search condition is used to select tuples. If no
search condition is specified, the scan will return
all rows in the table. The search condition can be an
NdbScanFilter (which
can be used on both
NdbScanOperation and
NdbIndexScanOperation)
or bounds (which can be used only on index scans; see
NdbIndexScanOperation::setBound()).
An index scan can use both
NdbScanFilter and
bounds.
When NdbScanFilter is used, each row is examined, whether or not it is actually returned. However, when using bounds, only rows within the bounds will be examined.
Specify Attribute Actions.
Next, it is necessary to define which attributes
should be read. As with transaction attributes, scan
attributes are defined by name, but it is also
possible to use the attributes' identities to define
them. As discussed elsewhere in this
document (see
Section 1.3.2.2, “Synchronous Transactions”),
the value read is returned by the
NdbOperation::getValue()
method as an
NdbRecAttr object.
Scanning can also be used to update or delete rows. This is performed as follows:
Scanning with exclusive locks using
NdbOperation::LM_Exclusive.
(When iterating through the result
set:) For each row, optionally calling either
NdbScanOperation::updateCurrentTuple()
or
NdbScanOperation::deleteCurrentTuple().
(If performing
NdbScanOperation::updateCurrentTuple():)
Setting new values for records simply by using
NdbOperation::setValue().
NdbOperation::equal()
should not be called in such cases, as the primary key
is retrieved from the scan.
The update or delete is not actually performed until the
next call to
NdbTransaction::execute()
is made, just as with single row operations.
NdbTransaction::execute()
also must be called before any locks are released; for
more information, see
Section 1.3.2.3.5, “Lock Handling with Scans”.
Features Specific to Index Scans.
When performing an index scan, it is possible to scan only
a subset of a table using
NdbIndexScanOperation::setBound().
In addition, result sets can be sorted in either ascending
or descending order, using
NdbIndexScanOperation::readTuples().
Note that rows are returned unordered by default unless
sorted is set to
true.
It is also important to note that, when using
NdbIndexScanOperation::BoundEQ (see
Section 2.3.23.1, “NdbIndexScanOperation::BoundType”) with
a partition key, only fragments containing rows will
actually be scanned. Finally, when performing a sorted scan,
any value passed as the
NdbIndexScanOperation::readTuples()
method's parallel argument will
be ignored and maximum parallelism will be used instead. In
other words, all fragments which it is possible to scan are
scanned simultaneously and in parallel in such cases.
Performing scans on either a table or an index has the
potential to return a great many records; however, Ndb locks
only a predetermined number of rows per fragment at a time.
The number of rows locked per fragment is controlled by the
batch parameter passed to
NdbScanOperation::readTuples().
In order to enable the application to handle how locks are
released,
NdbScanOperation::nextResult()
has a Boolean parameter
fetchAllowed. If
NdbScanOperation::nextResult()
is called with fetchAllowed equal
to false, then no locks may be released
as result of the function call. Otherwise the locks for the
current batch may be released.
This next example shows a scan delete that handles locks in an efficient manner. For the sake of brevity, we omit error-handling.
int check;
// Outer loop for each batch of rows
while((check = MyScanOperation->nextResult(true)) == 0)
{
do
{
// Inner loop for each row within the batch
MyScanOperation->deleteCurrentTuple();
}
while((check = MyScanOperation->nextResult(false)) == 0);
// When there are no more rows in the batch, execute all defined deletes
MyTransaction->execute(NoCommit);
}
For a more complete example of a scan, see Section 2.5.4, “NDB API Basic Scanning Example”.
Errors can occur either when operations making up a
transaction are being defined, or when the transaction is
actually being executed. Catching and handling either sort
of error requires testing the value returned by
NdbTransaction::execute(),
and then, if an error is indicated (that is, if this value
is equal to -1), using the following two
methods in order to identify the error's type and location:
NdbTransaction::getNdbErrorOperation()
returns a reference to the operation causing the most
recent error.
NdbTransaction::getNdbErrorLine()
yields the method number of the erroneous method in the
operation, starting with 1.
This short example illustrates how to detect an error and to use these two methods to identify it:
theTransaction = theNdb->startTransaction();
theOperation = theTransaction->getNdbOperation("TEST_TABLE");
if(theOperation == NULL)
goto error;
theOperation->readTuple(NdbOperation::LM_Read);
theOperation->setValue("ATTR_1", at1);
theOperation->setValue("ATTR_2", at1); // Error occurs here
theOperation->setValue("ATTR_3", at1);
theOperation->setValue("ATTR_4", at1);
if(theTransaction->execute(Commit) == -1)
{
errorLine = theTransaction->getNdbErrorLine();
errorOperation = theTransaction->getNdbErrorOperation();
}
Here, errorLine is 3,
as the error occurred in the third method called on the
NdbOperation object (in
this case, theOperation). If the result
of
NdbTransaction::getNdbErrorLine()
is 0, then the error occurred when the
operations were executed. In this example,
errorOperation is a pointer to the object
theOperation. The
NdbTransaction::getNdbError()
method returns an NdbError
object providing information about the error.
Transactions are not automatically
closed when an error occurs. You must call
Ndb::closeTransaction()
or
NdbTransaction::close()
to close the transaction.
See Section 2.3.16.2, “Ndb::closeTransaction()”, and Section 2.3.30.1, “NdbTransaction::close()”.
One recommended way to handle a transaction failure (that is, when an error is reported) is as shown here:
Roll back the transaction by calling
NdbTransaction::execute()
with a special
ExecType
value for the type parameter.
See Section 2.3.30.6, “NdbTransaction::execute()” and Section 2.3.30.5, “NdbTransaction::ExecType”, for more information about how this is done.
Close the transaction by calling
NdbTransaction::close().
If the error was temporary, attempt to restart the transaction.
Several errors can occur when a transaction contains
multiple operations which are simultaneously executed. In
this case the application must go through all operations and
query each of their
NdbError objects to find
out what really happened.
Errors can occur even when a commit is reported as
successful. In order to handle such situations, the NDB
API provides an additional
NdbTransaction::commitStatus()
method to check the transaction's commit status.
Abstract
This section covers the NDB Kernel, and discusses NDB Cluster transaction handling and transaction coordinators. It also describes NDB record structures and concurrency issues.
The NDB Kernel is the collection of data nodes belonging to an NDB Cluster. The application programmer can for most purposes view the set of all storage nodes as a single entity. Each data node is made up of three main components:
TC: The transaction coordinator.
ACC: The access manager, which handles hash indexes of primary keys.
TUP: The tuple manager, which handles tuple (record) storage.
When an application executes a transaction, it connects to one transaction coordinator on one data node. Usually, the programmer does not need to specify which TC should be used, but in some cases where performance is important, the programmer can provide “hints” to use a certain TC. (If the node with the desired transaction coordinator is down, then another TC will automatically take its place.)
Each data node has an ACC and a TUP which store the indexes and data portions of the database table fragment. Even though a single TC is responsible for the transaction, several ACCs and TUPs on other data nodes might be involved in that transaction's execution.
The default method is to select the transaction coordinator (TC) determined to be the “nearest” data node, using a heuristic for proximity based on the type of transporter connection. In order of nearest to most distant, these are:
SCI
SHM
TCP/IP (localhost)
TCP/IP (remote host)
If there are several connections available with the same proximity, one is selected for each transaction in a round-robin fashion. Optionally, you may set the method for TC selection to round-robin mode, where each new set of transactions is placed on the next data node. The pool of connections from which this selection is made consists of all available connections.
As noted in Section 1.3.3, “Review of NDB Cluster Concepts”, the application programmer can provide hints to the NDB API as to which transaction coordinator should be used. This is done by providing a table and a partition key (usually the primary key). If the primary key is the partition key, then the transaction is placed on the node where the primary replica of that record resides. Note that this is only a hint; the system can be reconfigured at any time, in which case the NDB API chooses a transaction coordinator without using the hint. For more information, see Section 2.3.2.4.11, “Column::getPartitionKey()”, and Section 2.3.16.34, “Ndb::startTransaction()”.
The application programmer can specify the partition key from SQL by using the following construct:
CREATE TABLE ... ENGINE=NDB PARTITION BY KEY (attribute_list);
For additional information, see Partitioning, and in particular KEY Partitioning, in the MySQL Manual.
The NDB storage engine used by
NDB Cluster is a relational database engine storing records in
tables as with other relational database systems. Table rows
represent records as tuples of relational data. When a new
table is created, its attribute schema is specified for the
table as a whole, and thus each table row has the same
structure. Again, this is typical of relational databases, and
NDB is no different in this regard.
Primary Keys. Each record has from 1 up to 32 attributes which belong to the primary key of the table.
Transactions. Transactions are committed first to main memory, and then to disk, after a global checkpoint (GCP) is issued. Since all data are (in most NDB Cluster configurations) synchronously replicated and stored on multiple data nodes, the system can handle processor failures without loss of data. However, in the case of a system-wide failure, all transactions (committed or not) occurring since the most recent GCP are lost.
Concurrency Control. NDB uses pessimistic concurrency control based on locking. If a requested lock (implicit and depending on database operation) cannot be attained within a specified time, a timeout error results.
Concurrent transactions as requested by parallel application programs and thread-based applications can sometimes deadlock when they try to access the same information simultaneously. Thus, applications need to be written in a manner such that timeout errors occurring due to such deadlocks are handled gracefully. This generally means that the transaction encountering a timeout should be rolled back and restarted.
Hints and Performance. Placing the transaction coordinator in close proximity to the actual data used in the transaction can in many cases improve performance significantly. This is particularly true for systems using TCP/IP. For example, a Solaris system using a single 500 MHz processor has a cost model for TCP/IP communication which can be represented by the formula
[30 microseconds] + ([100 nanoseconds] * [number of bytes])
This means that if we can concentrate traffic on a few “popular” links, we increase buffering and thus drastically reduce the cost of communication. The same system using SCI has a different cost model:
[5 microseconds] + ([10 nanoseconds] * [number of bytes])
This means that the efficiency of an SCI system is much less dependent on selection of transaction coordinators. Typically, TCP/IP systems spend 30 to 60% of their working time on communication, whereas for SCI systems this figure is in the range of 5 to 10%. Thus, employing SCI for data transport means that less effort from the NDB API programmer is required and greater scalability can be achieved, even for applications using data from many different parts of the database.
A simple example would be an application that uses many simple updates, where each transaction needs to update one record. This record has a 32-bit primary key, which also serves as the partitioning key. In this case, keyData is set to the address of the integer primary key, and keyLen is 4.
Abstract
Discusses the mechanics of transaction handling and transmission in NDB Cluster and the NDB API, and the objects used to implement these.
When transactions are sent using NdbTransaction::execute(), they are not immediately transferred to the NDB Kernel. Instead, transactions are kept in a special send list (buffer) in the Ndb object to which they belong. The adaptive send algorithm decides when transactions should actually be transferred to the NDB kernel.
The NDB API is designed as a multi-threaded interface, and so it is often desirable to transfer database operations from more than one thread at a time. The NDB API keeps track of which Ndb objects are active in transferring information to the NDB kernel and the expected number of threads to interact with the NDB kernel. Note that a given instance of Ndb should be used in at most one thread; different threads should not share the same Ndb object.
There are four conditions leading to the transfer of database operations from Ndb object buffers to the NDB kernel:
The NDB Transporter (TCP/IP, SCI, or shared memory) decides that a buffer is full and sends it off. The buffer size is implementation-dependent and may change between NDB Cluster releases. When TCP/IP is the transporter, the buffer size is usually around 64 KB. Since each Ndb object provides a single buffer per data node, the notion of a “full” buffer is local to each data node.
The accumulation of statistical data on transferred information may force sending of buffers to all data nodes (that is, when all the buffers become full).
Every 10 milliseconds, a special transmission thread checks whether or not any send activity has occurred. If not, then the thread will force transmission to all nodes. This means that 20 ms is the maximum amount of time that database operations are kept waiting before being dispatched. A 10-millisecond limit is likely in future releases of NDB Cluster; checks more frequent than this require additional support from the operating system.
For methods that are affected by the adaptive send algorithm (such as NdbTransaction::execute()), there is a force parameter that overrides the default behavior in this regard and forces immediate transmission to all nodes. See the individual NDB API class listings for more information.
The conditions listed above are subject to change in future releases of NDB Cluster.