WL#7069: Provide data dictionary information in serialized form
Affects: Server-8.0
Status: Complete
Priority: Medium
Executive summary
#################
Meta data must be provided in a new serialized form, since the current .FRM
files will be abandoned. The new serialized format will be used by NDB as a
replacement for ".FRM shipping", and by InnoDB to be stored in .IBD files to
support transportable tablespaces and recovery as a last resort.
To support this functionality, we need to provide an extension of the handler
and handlerton APIs with virtual functions for handling serialized meta data.
Each storage engine can handle this information according to its preference,
e.g.:
- NDB can ship the meta data across the cluster to other nodes for
  re-creation.
- InnoDB can store the meta data in tablespace files to make them self
  contained.
Other storage engines, e.g. MyISAM, can rely on the default implementation,
which will serialize the meta data and save it into a separate file, hence
providing the same benefits as the .FRM files for simple engines:
- Simple backup/restore by copying all files related to tables.
- Recovery from a severe crash by re-creating the dictionary based on the
  serialized dictionary data.
- Retrieval of needed table definitions from an offline server.
Support must also be provided for re-creating dictionary items based on
serialized meta data.
High level description
######################
With the new DD, there is still a need for serialized dictionary information
for the following purposes:
- Repair dictionary: If normal recovery does not succeed, it may still be
  possible to get hold of dictionary information and table data and import
  this into a new server instance.
- View and browse dictionary definitions: Dictionary definitions can be
  provided offline, independently of a running MySQL server.
- Simple (but unreliable) backup: Backup may be done by copying files, but is
  unreliable since transaction consistency is not guaranteed. Restore must be
  done by an explicit import command.
- Ship dictionary information: MySQL Cluster must be able to ship table meta
  data to a remote MySQL server within the same cluster.
Thus, we need to support the following core functionality:
1. Provide serialized meta data for various dictionary items, and support SE
   specific handling of the data. The default handling will be to write the
   serialized data to a single file. Specific engines may take other actions,
   e.g. store it in a tablespace file (InnoDB) or ship it to a different
   process (NDB).
2. Provide an internal API for (re-)creating a dictionary item based on
   serialized meta data. For external usage by an end user, re-creating
   dictionary items may be done by means of explicit commands for importing
   tables (or, for InnoDB, tablespaces).
Changes in current functionality: Auto discovery
================================================
Auto discovery is the mechanism which is invoked when a requested table is not
defined in the dictionary. In 5.7, this happens if a requested .FRM file is
not present. Then, the storage engines implementing the discover() function in
the handlerton API are invoked and asked to provide the requested .FRM file
somehow. As of now, two storage engines support the auto discovery mechanism:
- MySQL Cluster will try to retrieve the file from the NDB dictionary.
- Archive will try to retrieve the file from within the corresponding .ARZ
  data file.
For 8.0 and beyond, auto discovery is relevant when a requested table is not
present in the data dictionary; this is the situation equivalent to a missing
.FRM file.
Thus, MySQL Cluster will still depend on this mechanism for propagating meta
data. However, it is suggested to abandon the auto discovery functionality for
the Archive storage engine. DROP TABLE of an archive table will remove the
.ARZ file, so abandoning auto discovery will not break recoverability after
dropping a table.
Changes in current functionality: Populating the dictionary cache
=================================================================
In 5.7, .FRM files contain meta data. On a cache miss, the .FRM file is looked
up in the file system; this is the way the dictionary cache is populated as of
5.7. This has the side effect that for some storage engines, it has been
possible to "import" tables by copying files into the proper directory. Below,
we will refer to this side effect as "auto import".
With the new DD, this "auto import" side effect will disappear. The equivalent
of copying .FRM files into the proper directory would actually be to allow
inserting rows into the data dictionary tables. For obvious reasons, this
cannot be allowed. Thus, we suggest to abandon the "auto import" mechanism and
instead provide an explicit SQL command for importing tables.
Use Cases
=========
The following use cases may be relevant:
Detect inconsistency when opening a table or importing a tablespace
--------------------------------------------------------------------
Storing serialized dictionary information makes it possible to detect
inconsistencies when opening tables, and when importing tablespaces (see
below). Checking for consistency should be done by the SQL layer. Two types of
consistency checks are relevant: data vs. meta data, and meta data vs. meta
data (i.e., serialized dictionary information compared to the meta data stored
in the global data dictionary).
Transportable tablespaces (InnoDB)
----------------------------------
Transportable tablespaces as of 5.7 are supported using a .CFG file for
transferring required meta data. With the new DD, the .CFG file will not be
required when importing the data. Instead, the SDI may be used to re-create
table(s) or validate the schema.
One of the current restrictions of transportable tablespaces is that they are
only guaranteed to work if the import and export are done on the same server
version. Thus, directly importing a pre-8.0 tablespace into a server
supporting the new data dictionary is not supported, nor is import in the
other direction. With the new DD, the restriction that import and export are
supported only within the same server version will still apply for now.
However, in the future, this restriction may be relaxed.
When importing a tablespace where the meta data refers to contents in other
tablespaces, import may or may not be aborted, depending on the circumstances
and on the implemented support for multi tablespace operations. The important
thing in the context of this worklog is to keep the opportunity open in the
implementation to support multi tablespace operations in the future.
Simple backup/restore or export/import (MyISAM)
-----------------------------------------------
For simple engines, tables may be backed up by copying files, as mentioned
above. InnoDB will rely on transportable tablespaces (see above) for achieving
this functionality. The files that are copied may not be transaction
consistent. There is no special support for ensuring transaction consistency
in this scenario; this must be done by other means, e.g. by issuing FLUSH
TABLES WITH READ LOCK for the duration of the data copying.
Alternatively, an explicit export command may be provided. For "restoring" the
data, the files must be copied into the relevant location under the server's
data directory. Then, an explicit import command can be used to read the
serialized dictionary information and update the server's data dictionary.
Disaster recovery
-----------------
If the data dictionary and the data file where it is located are left readable
and consistent after a server crash, the server should start as usual without
any need for the serialized dictionary data. There will be a recovery
mechanism implemented within InnoDB, but this will not make use of the
serialized dictionary information.
However, if the data dictionary cannot be read in the usual way, parts of it
may still be repaired using the serialized dictionary information:
- If the server fails to initialize the data dictionary, it may be started
  without the system tablespace files, i.e., the files must be moved
  elsewhere, or the data directory must be changed, forcing the server to
  initialize a new system tablespace. This will be equivalent to starting a
  new server instance.
- Tables and tablespaces may then be imported from the old server to the new
  server instance.
If the serialized meta data in the definition files or the tablespace files is
corrupted, tool support may be needed to repair it. Such tool support is
beyond the scope of this worklog.
Table definition shipping
-------------------------
MySQL Cluster ships meta data between the different MySQL servers within the
cluster to re-create the dictionary items in order to have the same dictionary
contents available on each server. Shipping serialized dictionary information
should be quite similar to shipping the .FRM files. When a table is created,
its meta data should be serialized and shipped to the other MySQL servers in
the cluster. On the receiving nodes, the meta data should be submitted to a
call to the new dictionary API in order to create the dictionary entry for the
table. Thus, this should work pretty much in the same way as today. There are,
however, two situations adding some complexity: backup/restore and
upgrade/downgrade (see below).
Backup/restore
--------------
In addition to the simple file copy based backup/restore procedure mentioned
above, there are also other backup/restore solutions:
- MySQL Enterprise Backup: This is a commercial product that is not part of
  the MySQL server. It will be up to that team to adapt to the new data
  dictionary if any changes are needed.
- mysqldump: This utility performs a logical backup which dumps the contents
  of the server as SQL statements for re-creating the dictionary items, and
  also the data contents as insert statements. Thus, restore is done by
  executing the resulting SQL statements via the mysql client. Since this is a
  logical backup, it will most likely not be affected by the changes to the
  serialization format.
- MySQL Cluster: There is special backup/restore functionality that works only
  for MySQL Cluster. This functionality is implemented by a backup command
  that is part of the NDB management daemon, and a restore utility which is a
  separate executable. The NDB data dictionary will be part of the backup.
  With pre-8.0 server versions, this means that the .FRM files that are stored
  within the NDB data dictionary will also be part of the backup.
For MySQL Cluster using server versions 8.0 and later, the new serialized
dictionary information will be stored in the NDB data dictionary instead of
the .FRM files, and hence, the new serialized data will be in the backup files
too. This means that backup/restore between MySQL Cluster versions that do not
have the same type of data dictionary (i.e., one version is using the old data
dictionary and the other version is using the new data dictionary) is
supported only by using a workaround as described below.
In general, except when using mysqldump, restore of an old backup (from a
MySQL server not supporting the new data dictionary) directly into a server
supporting the new data dictionary is not guaranteed to work. The suggested
workaround is the following:
1. Start a MySQL server version compatible with the backup version (e.g. the
   same version as the backup was taken from).
2. Restore the backup into the MySQL server.
3. Upgrade the MySQL server.
4. Backup the server contents using the upgraded server version, now
   supporting the new data dictionary (or alternatively, export InnoDB
   tablespaces, copy MyISAM files, etc.).
5. Make the data available to the target server by restore, import, or another
   mechanism as appropriate.
Upgrade/downgrade
-----------------
Upgrade/downgrade between versions that both support the new dictionary should
not be very different from the current situation in 5.7. The new data
dictionary should not present any new problems, regardless of the storage
engine involved. No special problems should be present when running with MySQL
Cluster either. The important thing to consider is that new server versions
must be able to de-serialize and interpret old serialized meta data (please
note that we are talking about the old server still using the new serialized
format, but since this format may change over time, compatibility must be
ensured). This is required for crash recovery, importing tablespaces,
restoring old MySQL Cluster backups (backed up by a cluster version supporting
the new data dictionary), etc.
For upgrade/downgrade between versions that do not both support the new
dictionary, there are two situations to consider: upgrade with and without
MySQL Cluster (in either case, downgrade is not supported).
Without MySQL Cluster
.....................
This scenario is based on offline upgrade where the server is stopped, a
dedicated tool is used to support the upgrade by generating meta data in a
form that can be understood by the new server version, and then the new
version of the server is started.
With MySQL Cluster
..................
The upgrade procedure for MySQL Cluster is to first upgrade the management
server(s), then all the NDB processes, and finally the MySQL servers. In the
context we are discussing here, the server upgrade is the difficult issue. In
theory, we could do the upgrade offline if downtime is acceptable; however,
for MySQL Cluster, we also need to be able to support online upgrade. This
means online in terms of the cluster as a whole; each individual MySQL server
will of course be temporarily down while being restarted. Each MySQL server
process can be upgraded individually. However, in a cluster where there are
several servers running concurrently, there are additional problems:
- The upgrade tool will generate serialized dictionary information, which will
  then be imported into the new 8.0 server.
  The old MySQL servers will not understand this information if receiving it,
  so shipping this information must probably be disabled until upgrade is
  finished.
- For the same reason, it also makes sense to disable DDL operations for the
  duration of the upgrade.
- When importing the serialized meta data into the starting 8.0 server, the
  serialized information will also be added to the data dictionary stored in
  NDB. Since the NDB processes have already been upgraded, they will be able
  to support this, but they also need to be able to support the old MySQL
  servers until upgrade is finished. Hence, NDB may need to support holding
  both the old and new serialized dictionary information blobs at the same
  time. It is up to the MySQL Cluster team how to handle this.
The description above is just a very high level suggestion.
Responsibilities
================
This section defines the scope of this worklog. We will also outline what we
expect from other worklogs within the same release, and what we assume will be
relevant to implement in forthcoming releases.
Within the Scope of this Worklog
--------------------------------
Within the scope of this worklog (WL#7069), we will do the following:
- Define the format of the serialized dictionary information for tables and
  their strongly related dictionary information.
- Define which contents of the data dictionary objects should be included in
  the serialized meta data.
- Define and implement an API to be used for serializing dictionary
  information with the defined format and contents.
- Define a virtual extension of the handler API to be called when a table is
  changed (created, altered or deleted), providing the meta data as a
  parameter.
- Provide a default implementation of this API writing serialized information
  to a file.
- Define extensions of the handlerton API for handling tablespaces to support
  maintenance of serialized dictionary information in the tablespace
  dictionary.
- Extend the tablespace handling functions to also maintain the serialized
  dictionary information in the tablespace dictionary.
- Extend current functions that create and alter tables and table related
  items to also make use of the tablespace API and handler API extension.
- Define and implement an API for (re-)creating tables and their strongly
  related items based on serialized dictionary information.
Within the Scope of other Worklogs
----------------------------------
We expect the following from other worklogs within the same release:
- Runtime/InnoDB: Write serialized data to file (into an .IBD file), i.e.,
  extend the handlerton API to provide the necessary functions.
- Runtime/InnoDB: Implement support for importing and exporting tables and
  tablespaces.
- InnoDB: Provide support for extracting serialized dictionary information
  from tablespace files.
- Server general: Make sure mysqldump will handle the new data dictionary
  implementation. Being a logical backup, mysqldump is probably not affected
  by the changes in the new data dictionary.
- MySQL Cluster: Verify and implement support for backup/restore and
  upgrade/downgrade. Modify current usage of .FRM file based functions
  (writefrm etc.) to instead use functions in the new data dictionary API for
  storing the serialized meta data in the MySQL server.
Definitions and Abbreviations
=============================
- DO: Dictionary object, i.e., an object in the DD cache representing, e.g., a
  table.
- PDO: Persistent dictionary object: all dictionary objects that survive a
  system restart (i.e., not objects like those representing temporary tables).
- SDI: Serialized dictionary information, meta data that is stored in addition
  to the data dictionary itself.
- OID: Dictionary object identity, i.e., the "id" column from a DD table.
- Strongly related dictionary objects: Aggregation or object pointers are
  strong relations, e.g. table to columns.
- Weakly related dictionary objects: An object referring to another object by
  means of its id data member, name or similar has a weak relation to that
  object, e.g. table to schema.
Functional Requirements
=======================
Below, there are separate sections for storage engine specific requirements
(InnoDB and MySQL Cluster), followed by general requirements grouped by
overall functionality.
InnoDB requirements
-------------------
IR-01. For each tablespace, only SDIs related to the tables contained in the
tablespace are stored. If a table is stored in several tablespaces, the SDI is
also stored in the same tablespaces as the table data. If a tablespace is
spread over several files, an identical copy of the SDI may be stored in each
file. It is up to the storage engine to decide how to distribute the SDI among
the tablespace files.
IR-02. The SDI is stored in the tablespace files with the OID and object type
as primary key along with a BLOB containing the SDI.
IR-03. There must be one SDI blob for each table in each of the tablespaces
where the table data is stored, and this blob shall contain all required
information that is strongly related to the table, including foreign key
constraints, indexes, etc. Having one blob for the entire tablespace would be
too much, and one blob per index would be too scattered.
IR-04. In addition to the strongly related dictionary information specified in
IR-03, the containing schema SDI shall be serialized along with the table to
ease reconstruction on import. The tablespace SDI shall also be serialized and
stored in the tablespace dictionary.
MySQL Cluster requirements
--------------------------
CR-01. There shall be a function to read the SDI from the global data
dictionary. This function will replace the current "readfrm" function. The SDI
shall be in a format that can be stored (a sequence of bytes).
CR-02. There shall be a function for creating a new dictionary entry (the meta
data) using the SDI. This function will replace the current "writefrm"
function. [NDB does not really need a physical file containing the SDI, just
some way of creating the meta data (dictionary entry) using the SDI.]
CR-03. There shall be a function to compare two SDI blobs to determine whether
they are equal. This function will replace the current "cmpfrm" function.
CR-04. It shall be possible to check if a table exists, given the complete
name of the table (complete name = schema + table name). A table exists if
there is a dictionary entry for it.
CR-05. The SDI shall contain the same information as the old .FRM files.
Required information not present in the dictionary objects as explicit fields
may be represented in the se_private_data field as key=value pairs.
CR-06. It shall be possible to retrieve the table identity from the SDI. Table
identity is based on the complete table name (schema + table name), plus
engine ownership.
This means that the SDI must contain these pieces of information.
CR-07. The mechanism of checking for an NDB file to determine engine ownership
shall be replaced by explicitly representing engine ownership in the meta
data.
CR-08. It shall be possible to open an existing table using table identity
(complete name + engine ownership).
CR-09. It shall be possible to force the data dictionary to swap the meta data
of an old table with a new one.
CR-10. The SDI shall be architecture independent. The information is put into
the NDB backup files, and may be restored on an architecture with, e.g.,
different endianness. Architecture independence must be ensured for all
information, including, e.g., column defaults.
CR-11. There shall be support for iterating over existing tables. This is
needed during server start for checking for meta data staleness.
CR-12. There shall be support for iterating over databases/schemata. This is
needed during server start for checking for meta data staleness.
General requirements
--------------------
SQL statements and SDI Contents
...............................
FR-01. The following SQL statements shall have their implementation extended
to also manage SDI:
- InnoDB only, not implemented yet: CREATE/DROP/IMPORT/EXPORT TABLESPACE
- CREATE/ALTER/DROP/RENAME/TRUNCATE TABLE
- CREATE/DROP INDEX
Thus, the following PDOs shall be supported in terms of generating serialized
dictionary information: tables, indexes, and columns. Foreign keys will also
be included in the SDI when they are supported by the global DD. Additionally,
for tablespace operations, SDI must be managed, both for the tablespace
itself, and for the tables (and related information) present in the
tablespace.
FR-02. The following PDOs shall be supported in terms of (re-)creation: all
PDOs specified in FR-01.
FR-03. The following information shall be stored in an SDI: the set of
strongly related dictionary objects for the PDO. Additionally, schema and
tablespace information shall be stored for tables to support re-creation. The
server layer must ensure that all required fields are present both when
writing and after reading back the SDI. The SDI should start with:
- dd.version
- engine name
- DO type
Retrieving SDIs Internally
--------------------------
FR-04. In a running server, it shall be possible to retrieve the SDI for any
PDO given by FR-01 at any time (lock waits must be considered, though).
FR-05. The SQL layer shall provide functions for retrieving the SDI from a
PDO.
FR-06. When a PDO listed in FR-01 is created, changed or deleted, the PDO, its
associated tablespace PDO, and the resulting SDI shall be submitted as
parameters to a function pointer in the handlerton object, if provided by the
storage engine. The exact negotiation between the SQL layer and the storage
engine while allocating, creating and serializing the meta data will be
specified in the low level design further below.
Storing SDIs to File
--------------------
FR-07. If the function pointer variables in FR-06 have not been set by the
storage engine (have the value NULL), a default implementation will write the
SDI for the PDO to a file.
FR-08. The name of the file in FR-07 shall be generated by the server:
- Use the character conversions currently used for the .FRM file names, but
  restrict the conversion to e.g. the first 16 characters of the table name.
- Add a string representation of the OID of the dictionary object and a file
  name extension ".SDI".
  The OID ensures uniqueness, which is required since several tables may map
  to the same name.
- The relationship between the DD object name and the file name need not be an
  invariant, i.e., the names may be changed independently.
- The existing names may be kept on upgrade from 5.7 to 5.8. It will be
  preferable to avoid renaming.
Re-creating PDOs
----------------
FR-09. It shall be possible to repair table meta data using stored SDIs after
a server crash where the data dictionary is left corrupted, or in other ways
is unreadable. We must assume that the referenced bits in the tablespace
file(s) can be read. This repair mechanism is the last resort of crash
recovery if the InnoDB internal recovery mechanism cannot recover the
dictionary.
FR-10. The contents of the .SDI files may be edited to ensure correctness and
consistency. The SDI blobs stored in the tablespace files may not be edited.
Editing support is outside the scope of this worklog.
Upgrade
-------
FR-11. The SDI format must be versioned in a way that ensures backwards
compatibility for dictionary items where new functionality has not been
applied. If a server is upgraded, the SDI blobs must be upgraded to match the
format of the new version. The SDI format version number will be the same as
the DD version number stored in the dd.Version table.
FR-12. When upgrading from a previous MySQL server version not supporting the
new DD to a MySQL server version supporting it, an external offline tool must
be run to generate the new data dictionary. The tool will generate SDI
representing the old meta data, and the SDI may then be imported into the new
server.
Requirements for serialization software
=======================================
We may make use of third party software for generating the actual serialized
representation. The following requirements are relevant.
Top priority requirements
-------------------------
PR-01. The software shall support serializing an object structure, detecting
multiple inclusion of the same instance, cycles, etc. according to the needs
of the data dictionary structures, i.e., there is no need to implement
handling of situations that will not occur.
PR-02. The software shall generate a platform independent representation.
PR-03. De-serializing SDIs from a supported DD version shall be supported.
PR-04. Extending the schema definitions of the new data dictionary shall be
easy to support as far as SDI handling is concerned. Adding new function calls
for (de)serializing added members, and implementing new functions for
(de)serializing added classes, is acceptable.
Highly recommended
------------------
HR-01. Interfacing to external tools shall be supported and encouraged. It
shall be possible to interpret the information for usage in other tools, and
preferably even in other programming languages. Thus, either the
representation must be based on an open standard, or the software used for
(de)serialization must be available as a separate utility, library, or even in
a different programming language.
HR-02. The solution shall support querying object fields without
de-serializing into the complete data dictionary object structure. NDB may
need to retrieve e.g. table names to check for table existence; other engines
may want to retrieve the engine private data to verify that the required
capabilities are provided, etc.
HR-03. It should be possible to supply a non-default memory allocator, to
support allocating on a memroot rather than on the heap.
External interfaces (visible to users)
======================================
EI-1. New file format: serialized dictionary information in a JSON format,
replacing the .FRM files.
EI-2. Protocol extension for NDB (MySQL Cluster responsibility). May simply
continue using the same protocol, but with different serialized meta data.
EI-3. Error messages: new error messages will be required.
Internal interfaces
===================
II-1. Extension of the handlerton API: extensions to provide engine specific
handling of tablespace meta data changes, using the serialization API.
II-2. Extension of the handlerton API: we also need handlerton extensions for
engine specific handling of serialized meta data, regardless of object type
(e.g., InnoDB will store it in a tablespace, NDB will ship it to other
servers, etc.).
Serialization software
######################
Classes of alternatives
=======================
There are several fundamentally different approaches to this:
1. The classes to be serialized can be defined in a dedicated language, and C++
code (with class definitions, serialization support, etc.) can be generated for
these definitions. An example of this is the Google Protobuf implementation.
2. Existing class definitions can be extended with third party support for
serializing the objects directly. Thus, code must be implemented to explicitly
define which members should be serialized, but the actual serialization and
encoding is handled by the third party software. The Boost serialization API is
an example of this.
3. Existing class definitions can be extended with transformations into an
intermediate representation which has third party support for serialization into
a standardized format. Here, as in the previous item, own code must be
implemented to say which members to include in the serialization. However,
unlike the previous item, the transformation into an intermediate representation
may need type conversions, handling platform dependencies, etc. An example is
using a JSON based approach, e.g. rapidjson, or a slightly more general
mechanism, like the Boost property tree, which can generate JSON as well as XML.
Using alternative 1 to serialize the new data dictionary objects directly into a
binary representation will probably be unacceptable since it would mean that the
dictionary classes would be generated by an external tool. We could use the
external class definition language to define classes to be used in an
intermediate representation, but it would require some work to keep the
definitions open to support extensibility, and it would be of little advantage
compared to mechanism 2 or 3 above.
Alternative 2 provides good support for all the absolute requirements, but is
weaker on the recommended requirements. In particular, the stored representation
is not likely to follow an open standard, making it hard to interpret the
serialized data in external tools.
Alternative 3 will need special care in the implementation to support PR-01, but
is likely to provide good support for the other requirements. Thus, this is the
alternative we would like to choose.
We have chosen to use rapidjson due to its good performance and the
functionality provided (SAX + DOM).
Overall design
##############
The implementation can be divided into 3 primary tasks:
1. Implement a serializer which has overloaded serialize functions accepting
objects of various types, including:
- Tables
- Columns
- Indices
- Foreign keys
- Partitions
- Schema
- Tablespace
The serialize method will serialize the object itself and all its
strongly related dictionary objects. Each DO impl type that is to be serialized
will have virtual member functions for serializing and deserializing itself and
closely connected objects. This implies that the (de)serialize member
functions will not be
part of the DO interface.
Additional logic (setting up rapidjson etc.) will be placed in functions that
will call the virtual member functions.
2. Extend the handlerton API to support engine specific handling of serialized
meta data. For InnoDB, the handlerton API will provide the interface for
storing/retrieving SDI from tablespace files.
For other engines, specific behavior for storing and retrieving
the SDI may also be put in the handlerton interface.
The handlerton API for storing and retrieving sdi-blobs
should be private, meaning that client code using the new dictionary
should not have to know details of how the handlerton for a
particular storage engine manipulates sdi-blobs.
3. Change client code to use new API.
We will later:
- Implement new commands for importing MyISAM tables and InnoDB tablespaces.
- Implement new server command line options to scan for MyISAM files and InnoDB
tablespace files.
- Extend import to make use of the new SDI items for verification or
re-creation of the dictionary item.
Low Level Design
################
Definitions:
DO_TYPE_ID: This is used in this worklog as a placeholder for a
datatype that will be used to identify dictionary objects. Most of the
current dictionary code appears to be using a <dictionary table
name><sequence number> string for this. But this may change as part
of the new cache implementation.
Below, the overall items outlined above are explained in more detail.
Serialization
=============
We need to provide functions that accept an object in the global data
dictionary as a parameter and generate a JSON string
representing the object. All JSON strings must contain the dd version
which created it as a top-level attribute of type
<JsonNumber:-ulonglong>.
Scope
=====
1. All top level dictionary objects (Table, Tablespace, Schema) can be
the starting point for serialization. The table blob may refer to the
schema by means of the Object_id. The added information may be used
on de-serialization for validation or re-creation of the items.
2. All strongly related dictionary items will be transitively
serialized into a nested structure. Weakly related child objects are
represented by their Object_id.
3. The serialization may be extended to handle multiple references to
the same item, cycles, etc. depending on the needs.
4. Events, collations and charsets will not be serialized. But a
serialized schema will include their Object_id.
5. Triggers may be serialized in the future, but this is not included
in the scope of this worklog.
SDI data type
=============
It would be convenient to have a datatype for the sdi-blobs when
manipulating them in C++ code. The sdi-api uses a
(char*, uint64) pair for this. Since the SDI blobs will contain JSON,
a string-like type would ease parameter
passing and memory management. A UTF-8 encoded Unicode string should
be able to hold any JSON string we generate, and this can be stored in
an std::string (although not every position in that string is necessarily a
valid code point boundary). So an sdi-blob type could be as simple as:
typedef std::string sdi_t;
Identifying SDIs
================
The interface uses an SDI_KEY to identify an SDI, where SDI_KEY is defined
as follows:
/* The following structure is required to uniquely identify a
   Serialized Dictionary Info Object. */
struct sdi_key {
  uint64 id;
  uint32 type;
};
typedef struct sdi_key SDI_KEY;
Note that unless SDI_KEY and DO_TYPE_ID can be merged, there needs to
be a mapping between the two.
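As an illustration of such a mapping (the type codes, dictionary table names
and helper names below are hypothetical and not decided by this worklog), a
thin helper could translate the dictionary table name part of a DO_TYPE_ID
into the numeric type code of an SDI_KEY:
#include <cstdint>
#include <cstring>
typedef std::uint64_t uint64;  // MySQL-style typedefs, for this sketch only
typedef std::uint32_t uint32;
// SDI_KEY as defined above, repeated to keep the sketch self-contained.
struct sdi_key { uint64 id; uint32 type; };
typedef struct sdi_key SDI_KEY;
// Hypothetical mapping from a dictionary table name to a numeric type code.
static uint32 sdi_type_code(const char *dd_table_name)
{
  if (std::strcmp(dd_table_name, "tables") == 0)      return 1;
  if (std::strcmp(dd_table_name, "schemata") == 0)    return 2;
  if (std::strcmp(dd_table_name, "tablespaces") == 0) return 3;
  return 0;  // unknown/unsupported object type
}
static SDI_KEY make_sdi_key(uint64 oid, const char *dd_table_name)
{
  SDI_KEY key;
  key.id = oid;                             // dictionary object id (OID)
  key.type = sdi_type_code(dd_table_name);  // numeric object type code
  return key;
}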
Mapping Dictionary Objects to Tablespaces
=========================================
Each of the three top-level dictionary objects which can be the starting
point for serialization must (for InnoDB, at least) be stored in one
or more tablespaces:
- Tablespace SDIs are stored in the tablespace they describe.
- Table SDIs are stored in the same tablespace as the table itself.
- Schema SDIs must be stored in all tablespaces that contain a table
  which belongs to the schema. This is a bit tricky as we don't have a
  schema -> table reference in the dictionary. So to compute the set
  of tablespaces in which we must store a schema SDI, we need to (pseudo
  code):
  std::set<Tablespace*> tablespaces;
  for (auto t : dd::tables) {
    if (t.getSchema() == s) {
      tablespaces.insert(t.getTablespace());
    }
  }
  for (auto ts : tablespaces) {
    sdi_set(ts, schema_sdi, ...);
  }
Serializing Object Relationships
================================
There are five different kinds of object relationships among the dictionary
objects which will be serialized:
1) Owning (Parent-Child) Tight connections:
Table => Column
Table => Index
Table => Foreign Key
Table => Partition
Column => Column_element
Index => Index_element
Foreign_key => Foreign_key_element
Partition => Partition_value
Partition => Partition_index
Embedded in the serialized parent.
2) Non-owning Tight connections:
Partition_index --> Index
Foreign_key --> Index
Foreign_key --> Column
Index_element --> Column
Represented by raw pointer member variables.
These connections require special attention as the raw pointer cannot be
stored directly in the SDI, since the pointer value would be incorrect after
deserialization into new objects. Since the pointees always have an ordinal in
their parent's list which can be used as a logical pointer, it is sufficient to
include this ordinal position in the SDI in order to be able to obtain the (new)
pointer value during deserialization.
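A minimal sketch of this ordinal-position scheme (toy types only, not the real
dd:: classes): the pointer is replaced by the pointee's position in its
parent's collection, and resolved again against the newly created parent.
#include <cstddef>
#include <vector>
struct Column { int dummy; };         // toy stand-in for dd::Column
struct Index_element {
  const Column *column;               // raw pointer; cannot go into the SDI
};
// Serialization side: store the ordinal of the pointee in the parent's list.
static std::size_t column_ordinal(const std::vector<Column> &parent_columns,
                                  const Column *col)
{
  return static_cast<std::size_t>(col - parent_columns.data());
}
// Deserialization side: turn the stored ordinal back into a pointer into
// the newly created parent object.
static void hydrate_column_pointer(Index_element *element,
                                   const std::vector<Column> &new_columns,
                                   std::size_t stored_ordinal)
{
  element->column = &new_columns[stored_ordinal];
}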
3) Loose connections:
Abstract Table --> Schema
Table --> Tablespace
Partition --> Tablespace
Partition_index --> Tablespace
Index --> Tablespace
These connections are represented by the associated object's Object_id in the
dictionary object and the corresponding dictionary tables. In general, it is
unlikely that this Object_id is still a valid reference at the time of
deserialization. To mitigate this, the connection is instead represented by the
associated dictionary object's Entity_object::name() attribute. Note that this
does not guarantee that the identified object is "correct", as in identical to
the object which was referenced at the time of serialization. The code invoking
deserialization must make sure that the objects referenced by name exist and are
suitable.
Initially, deserialization will fail if the references cannot be resolved.
Falling back to defaults and/or explicitly provided alternatives may be added
later if there are use cases which require this.
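A minimal sketch of such name-based resolution (the resolver callback is
purely hypothetical; the real lookup mechanism is up to the code invoking
deserialization):
#include <functional>
#include <string>
struct Schema;  // toy forward declaration, stands in for dd::Schema
typedef std::function<const Schema *(const std::string &)> Schema_resolver;
// Resolve a by-name reference stored in the SDI; returning nullptr makes
// deserialization fail, as described above.
static const Schema *resolve_schema_reference(const std::string &schema_name,
                                              const Schema_resolver &resolver)
{
  return resolver(schema_name);  // nullptr if no suitable schema exists
}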
4) References to objects that are not serialized:
Table -> Collation
Schema -> Collation
Column -> Collation
Represented by Object_id. The assumption here is that the id of the
non-serialized object will not change. If new ids are added and an SDI
containing such new ids (e.g. new collations) is imported into a server version
which does not have those collation ids, a default id has to be chosen or an
error reported.
5) Back-references, i.e. references back to the closely connected
parent object:
Table <- Tablespace file
Table <- Partition
Table <- Foreign_key
Table <- Index
Table <- Column
Partition <- Partition_value
Partition <- Partition_index
Index <- Index_element
Column <- Column_type_element
Collection <- Column_type_element
These are omitted to avoid cycles, and are recreated during deserialization.
Representing Binary Values in Json
==================================
Binary values (where endianness matters) will be stored as base64
encoded strings in Json. The DD contains 3 such values:
- Partition_value_impl::value
- Column_type_element::name (*)
- Column_impl::default_value
In the following we use the notation <JsonString:-b64_string> for such values.
(*) This value is not really binary, but it is a string value which
can be in any character set, and for this reason it is stored as
VARBINARY in the dictionary. If we can be sure that any such string
can be safely stored and retrieved as a sequence of bytes, it may not be
necessary to use base64 encoding here.
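For illustration, the encoding step could look as follows (a standalone base64
sketch; the real implementation would reuse an existing utility, and the
writer calls in the trailing comment assume the rapidjson::Writer interface):
#include <cstddef>
#include <string>
// Minimal base64 encoder, illustration only.
static std::string base64_encode(const unsigned char *data, std::size_t len)
{
  static const char tbl[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  for (std::size_t i = 0; i < len; i += 3) {
    unsigned long block = static_cast<unsigned long>(data[i]) << 16;
    if (i + 1 < len) block |= static_cast<unsigned long>(data[i + 1]) << 8;
    if (i + 2 < len) block |= data[i + 2];
    out += tbl[(block >> 18) & 0x3F];
    out += tbl[(block >> 12) & 0x3F];
    out += (i + 1 < len) ? tbl[(block >> 6) & 0x3F] : '=';
    out += (i + 2 < len) ? tbl[block & 0x3F] : '=';
  }
  return out;
}
// Usage when serializing e.g. Column_impl::default_value:
//   writer.Key("default_value");
//   writer.String(base64_encode(default_value_ptr, default_value_len).c_str());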
Json Schema for Dictionary Objects
==================================
Sdi_header :=
"dd_version": <JsonNumber:-ulonglong>,
"dd_object_type": <JsonString:-std::string|JsonNumber:-uint32>, TBD
"dd_object": Schema | Table | Tablespace
Schema := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
// reference to non-serialized object
"default_collation": <JsonNumber:-Object_id>,
"created": <JsonNumber:-ulonglong>,
"last_altered": <JsonNumber:-ulonglong>
}
Tablespace_file := {
"ordinal_position": <JsonNumber:-uint>,
"filename": <JsonString:-std::string>,
"se_private_data": <JsonString:-Properties::raw_string>
}
Tablespace := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
"comment": <JsonString:-std::string>,
"options": Properties,
"se_priavate_data": <JsonString:-Properties::raw_string>,
"engine": <JsonString:-std::string>,
"files": [ Tablespace_file, ...]
}
Partition_value := {
"max_value" : <JsonBool:-bool>,
"null_value" : <JsonBool:-bool>,
"list_num" : <JsonNumber:-uint>,
"column_num" : <JsonNumber:-uint>,
"value" : <JsonString:-b64_string>,
//"partition" : back-reference to tightly connected owner omitted
}
Partition_index := {
"options" : <JsonString:-Properties::raw_string>,
"se_private_data" : <JsonString:-Properties::raw_string>,
//"partition" : back-reference to tightly connected owner omitted
"index" : <JsonNumber:-Object_id (from raw pointer)>
"tablespace" : <JsonNumber:-Object_id>,
}
Partition := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
"level": <JsonNumber:-uint>,
"number": <JsonNumber:-uint>,
"comment": <JsonString:-std::string>,
"options":<JsonString:-Properties::raw_string> ,
"se_private_data": Properties,
// "table": back-reference to tightly connected owner omitted
"values": [Partition_value, ...],
"indexes": [Partition_index,...],
"tablespace": <JsonNumber:-Object_id>
}
Foreign_key_element := {
// "foreign_key": back-reference to tightly connected owner omitted
"column": <JsonNumber:-Object_id (from raw pointer)>,
"ordinal_position": <JsonNumber:-uint>,
"referenced_column_name": <JsonString:-std::string>
}
Foreign_key := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
"match_option": <JsonNumber:-enum>,
"update_rule": <JsonNumber:-enum>,
"delete_rule": <JsonNumber:-enum>,
"unique_constraint": <JsonNumber:-Object_id (from raw pointer)>,
"referenced_table_catalog_name": <JsonString:-std::string>,
"referenced_table_schema_name": <JsonString:-std::string>,
"referenced_table_name": <JsonString:-std::string>,
// "table": back-reference to tightly connected owner omitted
"elements": [Foreign_key_element,...]
}
Index_element := {
"ordinal_position" : <JsonNumber:-uint>,
"length" : <JsonNumber:-uint>,
"order" : <JsonNumber:-enum>,
"options" : <JsonString:-Properties::raw_string>,
"hidden" : <JsonBool:-bool>,
//"index" : back-reference to tightly connected owner omitted
"column" : <JsonNumber:-Object_id (from raw pointer)>
}
Index := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
"hidden": <JsonBool:-bool>,
"is_generated": <JsonBool:-bool>,
"ordinal_position": <JsonNumber:-uint>,
"comment": <JsonString:-std:string>,
"options": <JsonString:-Properties::raw_string>,
"se_private_data": <JsonString:-Properties::raw_string>,
"type": <JsonNumber:-enum>
"algorithm": <JsonNumber:-enum>
// "table": back-reference to tightly connected owner omitted
"elements": [Index_element,...],
"tablespace": <JsonNumber:-Object_id>,
"engine": <JsonString:-std::string>
}
Column_type_element := {
"name" : <JsonString:-b64_string>, // Column_type_element_impl::m_name
"index" : <JsonNumber:-uint>,
//"column": back-reference to tightly connected owner omitted
//"collection": back-reference to containing collection omitted
}
Column := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
"type": <JsonNumber:-enum>,
"is_nullable": <JsonBool:-bool>,
"is_zerofill": <JsonBool:-bool>,
"is_auto_increment": <JsonBool:-bool>,
"hidden": <JsonBool:-bool>,
"ordinal_position": <JsonNumber:-uint>,
"char_length": <JsonNumber:-size_t>,
"numeric_precision": <JsonNumber:-uint>,
"numeric_scale" : <JsonNumber:-uint>,
"datetime_precision" : <JsonNumber:-uint>,
"default_value_null" : <Json:Number-uint>,
"default_value" : <JsonString:-b64_string>,
"default_option" : <JsonString:-std::string>,
"update_option" : <JsonString:-std::string>,
"comment" : <JsonString:-std::string>,
"options" : <JsonString:-Properties::raw_string>,
"se_private_data" : <JsonString:-Properties::raw_string>,
// "table": back-reference to tightly connected owner omitted
"enum_elements" : [Column_type_element,...],
"set_elements" : [Column_type_element,...]
}
Table := {
"name": <JsonString:-std::string>, // Entity_object_impl::m_name
// Abstract table members
"version" : <JsonNumber:-uint>,
"created" : <JsonNumber:-ulonglong>,
"last_altered" : <JsonNumber:-ulonglong>,
"options" : <JsonString:-Properties::raw_string>,
"columns" : [Column,...],
"schema" : <JsonNumber:-Object_id>,
// Table members
"hidden" : <JsonBool:-bool>,
"engine" : <JsonString:-std::string>,
"comment" : <JsonString:-std::string>,
"default_values" : <JsonString:-std::string>,
"se_private_data" : <JsonString:-Properties::raw_string>,
"partition_type" : <JsonNumber:-enum>,
"partition_expression" : <JsonString:-std::string>,
"default_partitioning" : <JsonNumber:-enum>,
"subpartition_type" : <JsonNumber:-enum>,
"subpartition_expression" : <JsonString:-std::string>,
"default_subpartitioning" : <JsonNumber:-enum>,
"indexes" : [Index,...],
"foreign_keys" : [Foreign_key,...],
"partitions" : [Partition,...],
// reference to non-serialized object
"collation" : <JsonNumber:-Objectid>,
"tablespace" : <JsonNumber:-Object_id>
}
Serialization Using Rapidjson
=============================
In rapidjson, the conversion of types to their JSON string representation
is handled by a rapidjson::Writer, which expects to send its output to
something that implements the rapidjson stream concept. Rapidjson
provides a StringBuffer, with UTF-8 encoding and the ability to use
custom allocators, which implements this concept.
This leads to the question of whether there needs to be a serializer
class (and object), or if free (overloaded) functions of the form
template <typename T>
void serialize(rapidjson::Writer<rapidjson::StringBuffer> *w, const T &t)
{
  w->Int(t.getX());
  w->Bool(t.getY());
  // ...
}
would be sufficient. When using rapidjson there does not seem
to be any state that would need to be captured by a Serializer object, as
that is already handled by the rapidjson::StringBuffer.
Rapidjson's allocator concept is simpler than that of standard C++ and, as
far as we understand, allows the use of stateful allocators, so writing a
wrapper for using a memroot should be doable.
In order to pass the resulting string to the sdi-api it is necessary to
materialize the whole sdi in memory.
Assuming that the average dictionary object is less than 1 KB,
that InnoDB imposes a limit of 1000 columns per table, and that the
number of indexes, foreign keys and partitions is not any greater,
we end up with a maximum size for the JSON string of around 4 MB, which
should not be a problem.
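For illustration, a driver function along the lines sketched below would set
up the StringBuffer and Writer and invoke the per-type serialize functions;
the function name and the DO::type_name() helper are placeholders, not part of
this worklog:
#include <string>
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"
typedef std::string sdi_t;
// Sketch of the non-virtual driver that sets up rapidjson and calls the
// per-object serialize() overloads.
template <typename DO>
sdi_t serialize_to_sdi(const DO &obj, unsigned long long dd_version)
{
  rapidjson::StringBuffer buf;
  rapidjson::Writer<rapidjson::StringBuffer> w(buf);
  w.StartObject();
  w.Key("dd_version");
  w.Uint64(dd_version);
  w.Key("dd_object_type");
  w.String(DO::type_name());     // assumed helper on the DO impl class
  w.Key("dd_object");
  serialize(&w, obj);            // overloaded per-type serialize function
  w.EndObject();
  return sdi_t(buf.GetString(), buf.GetSize());
}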
De-serialization Using Rapidjson
================================
When using Rapidjson to parse a json document, the result is a
rapidjson::Document which contains a graph (tree) of rapidjson value
objects. In order to create/populate a dictionary object based on this
information it will be necessary to traverse this structure and
identify the various dictionary objects described, and instantiate and
populate them based on the information in the json object graph. The
new dictionary objects are not connected or checked against the
dictionary, and the client which is invoking de-serialization is
responsible for merging the returned object graph with objects
already existing in the dictionary, if applicable.
template <typename T>
T *de_serialize(const std::string &json_doc)
{
  rapidjson::Document tdoc;
  // Use rapidjson to parse the json string
  if (tdoc.Parse<0>(json_doc.c_str()).HasParseError())
  {
    return nullptr; // error handling?
  }
  // traverse the object graph, create and populate the new object
  T *skeleton = dd::create_object<T>();
  skeleton->setX(tdoc["key"].GetInt());
  // [...]
  return skeleton;
}
Pointer-Hydrating after De-serialization
========================================
When a DO contains a raw pointer to another object which will be
serialized into the same SDI, it becomes necessary to somehow restore
those pointers when deserializing. The deserialization process will
re-create the objects being pointed to, but we are not guaranteed that
the referenced object has been created before we create the object
referencing it.
To overcome this we choose to include the Object_id for all objects
(even if this value is likely incorrect for the DD into which the objects
are being created) when serializing, and replace raw pointers with
the pointee's Object_id. This way we can create a mapping Object_id ->
Object during deserialization, and use this mapping to update the raw pointer
variables so that they refer to the correct new objects.
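A minimal sketch of this fix-up (toy types, illustration only): each
re-created object is registered under the Object_id recorded in the SDI, and
the raw pointer members are then filled in from that map.
#include <cstdint>
#include <map>
typedef std::uint64_t Object_id;
struct Index { Object_id sdi_id; };                          // toy stand-ins
struct Partition_index { Object_id stored_index_id; Index *index; };
// Register each re-created object under the Object_id found in the SDI ...
static void register_object(std::map<Object_id, Index *> *id_map, Index *idx)
{
  (*id_map)[idx->sdi_id] = idx;
}
// ... then replace the stored ids with pointers to the new objects.
static void hydrate_index_pointer(const std::map<Object_id, Index *> &id_map,
                                  Partition_index *pi)
{
  std::map<Object_id, Index *>::const_iterator it =
      id_map.find(pi->stored_index_id);
  pi->index = (it != id_map.end()) ? it->second : nullptr;
}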
Serialization Example (InnoDB)
==============================
ALTER SCHEMA CHARACTER SET = 'utf8';
-> Change dd::Schema_impl.default_collation
-> UPDATE SCHEMATA SET default_collation_id = <xx,utf-8> WHERE id = <>
-> Serialize dd::Schema_impl to schema_SDI
-> Iterate across tables in schema, and do
--> sdi_set(table.tablespace(), ..., schema_SDI, ...)
ALTER TABLE ADD COLUMN ... ;
-> Add a new dd::Column_impl object, including any associated Column_type_element
objects.
-> Add pointer to Column_impl object to dd::Table_impl
-> INSERT INTO COLUMN_TYPE_ELEMENTS VALUES (...)
-> INSERT INTO COLUMNS VALUES (...)
-> Serialize dd::Table_impl to table_SDI (will include the strongly connected
new Column_impl and Column_type_element objects)
-> sdi_set(t.tablespace(), ..., table_SDI, ...)
Handlerton extensions
=====================
The handlerton must be extended to provide storage engine
specific handling of the serialized data:
The following is a preliminary suggestion that will need to be adapted later:
bool (*sdi_set_schema)(const class dd::Tablespace *tablespace,
                       const class dd::Schema &schema, const dd::sdi_t sdi);
bool (*sdi_set_table)(const class dd::Tablespace *tablespace,
                      const class dd::Table &table, const dd::sdi_t sdi);
bool (*sdi_set_tablespace)(const class dd::Tablespace &tablespace,
                           const dd::sdi_t sdi);
bool (*sdi_delete_schema)(const class dd::Tablespace *tablespace,
                          const class dd::Schema &schema);
bool (*sdi_delete_table)(const class dd::Tablespace *tablespace,
                         const class dd::Table &table);
bool (*sdi_delete_tablespace)(const class dd::Tablespace &tablespace);
Usage of the SDIs will require functions to retrieve SDIs from InnoDB
tablespaces (proposal):
- std::string handlerton::dd_get_sdi(<?>)
  This needs the tablespace id and the SDI key, and has to wrap a call to
  sdi_get_len_by_id(), followed by an allocation, followed by a call to
  sdi_get_by_id(). As a result, it must be able to extract the tablespace id
  and sdi_key from its argument(s).
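As an illustration of this wrapping (the signatures of sdi_get_len_by_id() and
sdi_get_by_id() below are assumptions, as are the argument types of the
wrapper itself):
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>
typedef std::uint32_t uint32;   // MySQL-style typedefs, for this sketch only
typedef std::uint64_t uint64;
struct sdi_key { uint64 id; uint32 type; };
typedef struct sdi_key SDI_KEY;
// Assumed InnoDB-side primitives (signatures are placeholders):
bool sdi_get_len_by_id(uint32 space_id, const SDI_KEY *key, uint64 *len);
bool sdi_get_by_id(uint32 space_id, const SDI_KEY *key, void *buf, uint64 len);
// Sketch of a handlerton-level wrapper returning the SDI as a string;
// returns true on error, following the handlerton convention above.
static bool dd_get_sdi(uint32 space_id, const SDI_KEY &key, std::string *sdi)
{
  uint64 len = 0;
  if (sdi_get_len_by_id(space_id, &key, &len))
    return true;
  std::vector<char> buf(static_cast<std::size_t>(len));
  if (sdi_get_by_id(space_id, &key, buf.data(), len))
    return true;
  sdi->assign(buf.data(), static_cast<std::size_t>(len));
  return false;
}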
Changes to Existing Code
========================
Existing code will interact with serialization by storing and deleting
SDIs. This will happen in the following (member) functions defined in
table_share.cc:
Table_share_utils::create_dd_user_table(...)
dd_remove_table(...)
dd_rename_table(...)
There are some specific challenges related to storing SDIs in files:
* File operations are not transactional, so a "best effort" approach must be
  employed to try to avoid losing valuable information in case of a
  failure/abort. Typically this is achieved by writing to a temporary file and
  then moving/renaming this file to its proper name when it "looks certain"
  that the operation will succeed, the assumption being that the move/rename
  will not fail, or will at least be atomic.
* The naming of SDI files (as mandated by FR-08 above) implies that only those
  rename operations which modify the first 16 characters of the table name
  will actually store the new version of the SDI under a new file name (the
  OID does not change even if the object is later modified). In all other
  cases the file name is unchanged, so a plain write of the new SDI risks
  destroying the old version if the write were to fail. For DDL operations
  that actually change the name, the old SDI file will be intact, but will
  have to be removed when the new file has been successfully written to avoid
  having garbage SDI files left in the data directory.
Approach:
If the name of the SDI file does not change:
- When the final object version is available, create the SDI and write it to a
  temporary file name.
- When the transaction is "known to commit", move the temporary file to the
  correct name. This will then both remove the temporary file and replace the
  old version of the SDI file.
If the name of the SDI file DOES change:
- Store the name of the old SDI file somewhere (tied to the THD?).
- When the final object version is available, create the SDI and write it to
  the new file name.
- When the transaction is "known to commit", remove the old SDI file.
The current approach, where write_sdi() is called from dd::rename_table, will
require that the old SDI file name be stored, since dd::rename_table only sees
the old table name and the intermediate name, or the intermediate name and
the new name. Even if the old name gets stored, it is not clear that it is ok
to rename the SDI file and remove the old SDI file at the end of the last
dd::rename_table.
This approach also implies that it must be possible to identify the
intermediate tables, as there is no point in writing SDIs for intermediate
tables, since the name of the table is part of the SDI. I.e., an SDI for an
intermediate table has the wrong content, so it cannot simply be renamed to the
proper name at the end of the transaction.
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.