The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP. This section describes support for Unicode in MySQL. For information about the Unicode Standard itself, visit the Unicode Consortium Web site.
BMP characters have these characteristics:
Their code point values are between 0 and 65535 (or
U+0000 and U+FFFF).
They can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes).
They can be encoded in a fixed-length encoding using 16 bits (2 bytes).
They are sufficient for almost all characters in major languages.
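The byte counts above can be checked with a short Python sketch. It is not MySQL code: it assumes Python's built-in `utf-8` codec as the variable-length encoding and uses `utf-16-be` as a stand-in for the fixed 2-byte form, since the two produce the same bytes for BMP code points. The helper name `bmp_widths` is hypothetical.

```python
# Illustration only: Python codecs stand in for the encodings described above.

def bmp_widths(ch):
    """Return (variable_length_bytes, fixed_length_bytes) for one BMP character."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("not a BMP character: U+%04X" % cp)
    # UTF-8 uses 1 to 3 bytes for BMP code points.
    variable = len(ch.encode("utf-8"))
    # For BMP code points, UTF-16 big-endian is always exactly 2 bytes.
    fixed = len(ch.encode("utf-16-be"))
    return variable, fixed

# U+0041, U+00E9, U+20AC, U+65E5: all inside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\u65e5"]
widths = {ch: bmp_widths(ch) for ch in samples}

for ch, (variable, fixed) in widths.items():
    print("U+%04X: variable-length=%d byte(s), fixed-length=%d bytes"
          % (ord(ch), variable, fixed))
```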
Supplementary characters lie outside the BMP. Their code point
values are between U+10000 and U+10FFFF. Supporting them requires
a character set whose range extends beyond the BMP, and such
character sets can take more space per character than BMP-only
character sets.
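As a rough illustration of why supplementary characters need more space, the following Python sketch tests whether a character is supplementary and shows how it is represented in UTF-8 and UTF-16. It relies on Python's own codecs and on the standard UTF-16 surrogate-pair arithmetic, not on anything MySQL-specific. The function names are hypothetical.

```python
# Illustration only: standard Unicode arithmetic, not MySQL internals.

def is_supplementary(ch):
    """True if the character's code point lies in U+10000..U+10FFFF."""
    return 0x10000 <= ord(ch) <= 0x10FFFF

def utf16_code_units(ch):
    """Return the 16-bit code units UTF-16 uses for one character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]                       # BMP: a single 16-bit unit
    offset = cp - 0x10000                 # 20-bit value split into two halves
    high = 0xD800 + (offset >> 10)        # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)       # low (trail) surrogate
    return [high, low]

bmp_char = "\u65e5"            # U+65E5, inside the BMP
supp_char = "\U0001F600"       # U+1F600, a supplementary character

utf8_len_bmp = len(bmp_char.encode("utf-8"))     # 3 bytes
utf8_len_supp = len(supp_char.encode("utf-8"))   # 4 bytes
units = utf16_code_units(supp_char)              # two 16-bit units, 4 bytes
```

A BMP-only encoding that stops at three UTF-8 bytes, or at a single 16-bit unit, has no way to represent `supp_char`.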
MySQL supports these Unicode character sets:
utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character.
ucs2, the UCS-2 encoding of the Unicode
character set using two bytes per character.
utf8mb4, a UTF-8 encoding of the Unicode
character set using one to four bytes per character.
utf16, the UTF-16 encoding for the
Unicode character set using two or four bytes per character.
It is like ucs2 but adds an extension for
supplementary characters.
utf32, the UTF-32 encoding for the
Unicode character set using four bytes per character.
MySQL 5.5.3 and higher supports all Unicode character sets in
the preceding list. Prior to 5.5.3, MySQL supports only
utf8 and ucs2.
Table 10.2, “Unicode Character Set General Characteristics”, summarizes the general characteristics of Unicode character sets supported by MySQL.
Table 10.2 Unicode Character Set General Characteristics
| Character Set | Supported Characters | Required Storage Per Character |
|---|---|---|
| utf8 | BMP only | 1, 2, or 3 bytes |
| ucs2 | BMP only | 2 bytes |
| utf8mb4 | BMP and supplementary | 1, 2, 3, or 4 bytes |
| utf16 | BMP and supplementary | 2 or 4 bytes |
| utf32 | BMP and supplementary | 4 bytes |
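The table can be approximated in Python by pairing each MySQL character set name with a Python codec that produces the same byte widths. This mapping is an assumption made for illustration (for example, `ucs2` is modeled as `utf-16-be` restricted to the BMP); it does not query a MySQL server, and `storage_bytes` is a hypothetical helper.

```python
# Illustration only: Python codecs approximate the storage sizes in the table.

# MySQL character set name -> (Python codec, supports supplementary characters)
CHARSETS = {
    "utf8":    ("utf-8",     False),
    "ucs2":    ("utf-16-be", False),
    "utf8mb4": ("utf-8",     True),
    "utf16":   ("utf-16-be", True),
    "utf32":   ("utf-32-be", True),
}

def storage_bytes(ch, charset):
    """Bytes needed for one character, or None if the set cannot hold it."""
    codec, supports_supplementary = CHARSETS[charset]
    if ord(ch) > 0xFFFF and not supports_supplementary:
        return None                       # BMP-only character set
    return len(ch.encode(codec))

for name in CHARSETS:
    row = [storage_bytes(ch, name) for ch in ("A", "\u20ac", "\U0001F600")]
    print(name, row)
```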
Characters outside the BMP compare as REPLACEMENT CHARACTER and
convert to '?' when converted to a Unicode
character set that supports only BMP characters
(utf8 or ucs2).
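The conversion behavior described above can be mimicked outside the server. The Python sketch below only simulates the documented result (each supplementary character becomes '?'); it is not how MySQL implements the conversion, and `to_bmp_only` is a hypothetical name.

```python
# Simulation of the documented result, not MySQL's conversion code.

def to_bmp_only(text):
    """Replace every character outside the BMP with '?'."""
    return "".join(ch if ord(ch) <= 0xFFFF else "?" for ch in text)

original = "a\U0001F600b"          # 'a', U+1F600 (supplementary), 'b'
converted = to_bmp_only(original)  # the supplementary character is lost
```

Because the replacement is lossy, the original supplementary characters cannot be recovered from the converted value.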
If you use character sets that support supplementary characters
and thus are “wider” than the BMP-only
utf8 and ucs2 character
sets, there are potential incompatibility issues for your
applications; see Section 10.1.9.7, “Converting Between 3-Byte and 4-Byte Unicode Character Sets”.
That section also describes how to convert tables from
utf8 to the (4-byte)
utf8mb4 character set, and what constraints
may apply in doing so.
A similar set of collations is available for each Unicode
character set. For example, each has a Danish collation, the
names of which are ucs2_danish_ci,
utf16_danish_ci,
utf32_danish_ci,
utf8_danish_ci, and
utf8mb4_danish_ci. For information about
Unicode collations and their differentiating properties,
including collation properties for supplementary characters, see
Section 10.1.10.1, “Unicode Character Sets”.
Although many of the supplementary characters come from East Asian languages, what MySQL 5.5 adds is support for more Japanese and Chinese characters in Unicode character sets, not support for new Japanese and Chinese character sets.
The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, values must be converted when data is transferred between those systems and MySQL.
MySQL uses no BOM for UTF-8 values.
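When UTF-16 data arrives from a system that uses a BOM or little-endian order, the conversion could look like the following Python sketch. It assumes the input is raw UTF-16 bytes and that input without a BOM is little-endian unless the caller says otherwise. The function name and the `default` parameter are hypothetical.

```python
import codecs

def normalize_utf16(raw, default="utf-16-le"):
    """Return big-endian UTF-16 bytes without a BOM.

    `default` names the codec to assume when `raw` has no BOM
    (an assumption the caller must get right; it cannot be detected).
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[2:].decode("utf-16-be")    # strip BOM, already big-endian
    elif raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[2:].decode("utf-16-le")    # strip BOM, swap byte order
    else:
        text = raw.decode(default)            # no BOM: trust the caller
    return text.encode("utf-16-be")           # big-endian, no BOM added

little_endian_with_bom = b"\xff\xfeA\x00"
normalized = normalize_utf16(little_endian_with_bom)
```

Python's explicit `utf-16-be` codec never writes a BOM, which is why it is used for the output instead of the generic `utf-16` codec.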
Client applications that communicate with the server using
Unicode should set the client character set accordingly; for
example, by issuing a SET NAMES 'utf8'
statement. ucs2, utf16,
and utf32 cannot be used as a client
character set, which means that they do not work for
SET NAMES or
SET CHARACTER SET. (See
Section 10.1.4, “Connection Character Sets and Collations”.)
The following sections provide additional detail on the Unicode character sets in MySQL.