MySQL has two Unicode character sets:
ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per characterutf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 10.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx_danish_ciucs2_danish_ci and
utf8_danish_ci.
xxx_binxxx_czech_cixxx_danish_cixxx_esperanto_cixxx_estonian_ci(default)xxx_general_cixxx_hungarian_cixxx_icelandic_cixxx_latvian_cixxx_lithuanian_cixxx_persian_cixxx_polish_cixxx_roman_cixxx_romanian_cixxx_slovak_cixxx_slovenian_cixxx_spanish_cixxx_spanish2_cixxx_swedish_cixxx_turkish_cixxx_unicode_ci
The
collations were added in MySQL 5.0.13. The
xxx_esperanto_ci
collations were added in MySQL 5.0.19.
xxx_hungarian_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The
xxx_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. A combined character will be considered different
from the same character written with a single unicode
character in string comparisons, and the two characters are
considered to have a different length (for example, as
returned by the xxx_unicode_ciCHAR_LENGTH()
function or in result set metadata).
MySQL implements language-specific Unicode collations only if
the ordering with
does not work well for a language. Language-specific
collations are UCA-based. They are derived from
xxx_unicode_ci
with additional language tailoring rules.
xxx_unicode_ci
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx_general_ci
collation. For example, comparisons for the
xxx_unicode_ciutf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason for this is
that utf8_unicode_ci supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “ß”
is equal to “ss”.
utf8_unicode_ci also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 10.1.7.8, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering
with utf8_unicode_ci does not work well for
a language. For example, utf8_unicode_ci
works fine for German dictionary order and French, so there is
no need to create special utf8 collations.
utf8_general_ci also is satisfactory for
both German and French, except that
“ß” is equal to
“s”, and not to
“ss”. If this is acceptable
for your application, you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it
is more accurate.
includes Swedish rules. For example, in Swedish, the following
relationship holds, which is not something expected by a
German or French speaker:
xxx_swedish_ci
Ü = Y < Ö
The
and
xxx_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx_spanish2_ciñ” (n-tilde) is a separate
letter between “n” and
“o”. In addition, for
traditional Spanish, “ch” is a
separate letter between “c”
and “d”, and
“ll” is a separate letter
between “l” and
“m”
The
collations may also be used for Asturian and Galician.
xxx_spanish2_ci
The
collations may also be used for Norwegian.
xxx_danich_ci
In the
collations, xxx_roman_ciI and J
compare as equal, and U and
V compare as equal.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.