MySQL supports multiple Unicode character sets:
utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character.
ucs2, the UCS-2 encoding of the Unicode
character set using two bytes per character.
utf8mb4, a UTF-8 encoding of the
Unicode character set using one to four bytes per
character.
utf16, the UTF-16 encoding for the
Unicode character set using two or four bytes per
character. Like ucs2 but with an
extension for supplementary characters.
utf32, the UTF-32 encoding for the
Unicode character set using four bytes per character.
utf8 and ucs2 support
Basic Multilingual Plane (BMP) characters.
utf8mb4, utf16, and
utf32 support BMP and supplementary
characters. The utf8mb4,
utf16, and utf32
character sets were added in MySQL 5.5.3.
This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, see Section 10.1.9, “Unicode Support”.
Most Unicode character sets have a general collection
(indicated by _general in the name or by
the absence of a language specifier), a binary collation
(indicated by _bin in the name), and
several language-specific collations (indicated by language
specifiers). For example, for utf8,
utf8_general_ci and
utf8_bin are its general and binary
collations, and utf8_danish_ci is one of
its language-specific collations.
A language name shown in the following table indicates a language-specific collation. Unicode character sets may include collations for one or more of these languages.
Table 10.3 Unicode Collation Language Specifiers
| Language | Language Specifier |
| Classical Latin | roman |
| Czech | czech |
| Danish | danish |
| Esperanto | esperanto |
| Estonian | estonian |
| Hungarian | hungarian |
| Icelandic | icelandic |
| Latvian | latvian |
| Lithuanian | lithuanian |
| Persian | persian |
| Polish | polish |
| Romanian | romanian |
| Sinhala | sinhala |
| Slovak | slovak |
| Slovenian | slovenian |
| Modern Spanish | spanish |
| Traditional Spanish | spanish2 |
| Swedish | swedish |
| Turkish | turkish |
Danish collations may also be used for Norwegian.
For Classical Latin collations, I and
J compare as equal, and
U and V compare as
equal.
Spanish collations are available for modern and traditional
Spanish. For both, ñ (n-tilde) is a
separate letter between n and
o. In addition, for traditional Spanish,
ch is a separate letter between
c and d, and
ll is a separate letter between
l and m.
Traditional Spanish collations may also be used for Asturian and Galician.
Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
Ü = Y < Ö
For questions about particular language orderings, unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.
The
collations were added in MySQL 5.5.21. They preserve the
pre-5.1.24 ordering of the original
xxx_general_mysql500_ci
collations and permit upgrades for tables created before MySQL
5.1.24. For more information, see
Section 2.11.3, “Checking Whether Tables or Indexes Must Be Rebuilt”, and
Section 2.11.4, “Rebuilding or Repairing Tables or Indexes”.
xxx_general_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The
xxx_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported, and combining
marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo.
A combined character is considered different from the same
character written with a single unicode character in string
comparisons, and the two characters are considered to have a
different length (for example, as returned by the
xxx_unicode_ciCHAR_LENGTH() function or in
result set metadata).
MySQL implements language-specific Unicode collations if the ordering based only on UCA does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules.
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx_general_ci
collation. For example, comparisons for the
xxx_unicode_ciutf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason for this is
that utf8_unicode_ci supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages ß is equal to
ss. utf8_unicode_ci also
supports contractions and ignorable characters.
utf8_general_ci is a legacy collation that
does not support expansions, contractions, or ignorable
characters. It can make only one-to-one comparisons between
characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect of this in
comparisons or searches, see
Section 10.1.8.6, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
MySQL implements utf8 language-specific
collations if the ordering with
utf8_unicode_ci does not work well for a
language. For example, utf8_unicode_ci
works fine for German dictionary order and French, so there is
no need to create special utf8 collations.
utf8_general_ci also is satisfactory for
both German and French, except that ß is
equal to s, and not to
ss. If this is acceptable for your
application, you should use utf8_general_ci
because it is faster. Otherwise, use
utf8_unicode_ci because it is more
accurate.
For all Unicode collations except the binary
(_bin) collations, MySQL performs a table
lookup to find a character's collating weight. If a character
is not in the table (for example, because it is a
“new” character), collating weight determination
becomes more complex:
For BMP characters in general collations
(),
weight = code point.
xxx_general_ci
For BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm
applies:
xxx_unicode_ci
if (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
aaaa followed by
bbbb.
Thus, U+04cf CYRILLIC SMALL LETTER
PALOCHKA currently is, with all UCA collations,
greater than U+04c0 CYRILLIC LETTER
PALOCHKA. Eventually, after further collation
tuning, all palochkas will sort together.
For supplementary characters in general collations, the
weight is the weight for 0xfffd REPLACEMENT
CHARACTER. For supplementary characters in UCA
collations, their collating weight is
0xfffd. That is, to MySQL, all
supplementary characters are equal to each other, and
greater than almost all BMP characters.
An example with Deseret characters and
COUNT(DISTINCT):
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL
collations, the replacement character has a weight of
xxx_unicode_ci0x0dc6, whereas Deseret Bee and Deseret
Tee both have a weight of 0xfffd. (Were
the utf32_general_ci collation used
instead, the result is 1 because all three characters have
a weight of 0xfffd in that collation.)
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
There is a difference between “ordering by the
character's code value” and “ordering by the
character's binary representation,” a difference that
appears only with utf16_bin, because of
surrogates.
Suppose that utf16_bin (the binary
collation for utf16) was a binary
comparison “byte by byte” rather than
“character by character.” If that were so, the
order of characters in utf16_bin would
differ from the order in utf8_bin. For
example, the following chart shows two rare characters. The
first character is in the range
E000-FFFF, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point
value because 0xff9d <
0x10384. And they are in order by
utf8 value because 0xef
< 0xf0. But they are not in order by
utf16 value, if we use byte-by-byte
comparison, because 0xff >
0xd8.
So MySQL's utf16_bin collation is not
“byte by byte.” It is “by code
point.” When MySQL sees a supplementary-character
encoding in utf16, it converts to the
character's code-point value, and then compares. Therefore,
utf8_bin and utf16_bin
are the same ordering. This is consistent with the SQL:2008
standard requirement for a UCS_BASIC collation:
“UCS_BASIC is a collation in which the ordering is
determined entirely by the Unicode scalar values of the
characters in the strings being sorted. It is applicable to
the UCS character repertoire. Since every character repertoire
is a subset of the UCS repertoire, the UCS_BASIC collation is
potentially applicable to every character set. NOTE 11: The
Unicode scalar value of a character is its code point treated
as an unsigned integer.”
If the character set is ucs2, comparison is
byte-by-byte, but ucs2 strings should not
contain surrogates, anyway.