This section describes issues that you may face when
converting from the utf8 character set to
the utf8mb4 character set, or vice versa.
The discussion here focuses primarily on converting between
utf8 and utf8mb4, but
similar principles apply to converting between the
ucs2 character set and character sets
such as utf16 or
utf32.
The utf8 and utf8mb4
character sets differ as follows:
utf8 supports only characters in the
Basic Multilingual Plane (BMP). utf8mb4
additionally supports supplementary characters that lie
outside the BMP.
utf8 uses a maximum of three bytes per
character. utf8mb4 uses a maximum of
four bytes per character.
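The difference in byte length can be observed directly. The following is a minimal sketch, assuming a server that recognizes utf8mb4. The two byte sequences are the UTF-8 encodings of U+20AC (a BMP character) and U+1F600 (a supplementary character), chosen here only for illustration:

```sql
-- U+20AC EURO SIGN lies in the BMP and is encoded as three bytes (E2 82 AC).
SELECT CHAR_LENGTH(CONVERT(UNHEX('E282AC') USING utf8mb4)) AS chars,
       LENGTH(CONVERT(UNHEX('E282AC') USING utf8mb4)) AS bytes;

-- U+1F600 lies outside the BMP and is encoded as four bytes (F0 9F 98 80).
SELECT CHAR_LENGTH(CONVERT(UNHEX('F09F9880') USING utf8mb4)) AS chars,
       LENGTH(CONVERT(UNHEX('F09F9880') USING utf8mb4)) AS bytes;
```

Both statements report one character. The first reports three bytes and the second reports four.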
One advantage of converting from utf8 to
utf8mb4 is that this enables applications
to use supplementary characters. One tradeoff is that this may
increase data storage space requirements.
In most respects, converting from utf8 to
utf8mb4 should present few problems. These
are the primary potential areas of incompatibility:
For the variable-length character data types
(VARCHAR and the
TEXT types), the maximum
permitted length in characters is less for
utf8mb4 columns than for
utf8 columns.
For all character data types
(CHAR,
VARCHAR, and the
TEXT types), the maximum
number of characters that can be indexed is less for
utf8mb4 columns than for
utf8 columns.
Consequently, to convert tables from utf8
to utf8mb4, it may be necessary to change
some column or index definitions.
Tables can be converted from utf8 to
utf8mb4 by using ALTER
TABLE. Suppose that a table was originally defined
as follows:
CREATE TABLE t1 (
col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
col2 CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
) CHARACTER SET utf8;
The following statement converts t1 to use
utf8mb4:
ALTER TABLE t1
DEFAULT CHARACTER SET utf8mb4,
MODIFY col1 CHAR(10)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
MODIFY col2 CHAR(10)
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
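To confirm the result of the conversion, the table and column metadata can be inspected. This is a minimal sketch, assuming that t1 resides in the current default database:

```sql
-- Show the full table definition, including the table default character set.
SHOW CREATE TABLE t1;

-- Show the character set and collation of each column of t1.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
  FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME = 't1';
```

After the conversion, both columns should report utf8mb4 as the character set, with the collations utf8mb4_unicode_ci and utf8mb4_bin, respectively.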
In terms of table content, conversion from
utf8 to utf8mb4 presents
no problems:
For a BMP character, utf8 and
utf8mb4 have identical storage
characteristics: same code values, same encoding, same
length.
For a supplementary character, utf8
cannot store the character at all, whereas
utf8mb4 requires four bytes. Because
utf8 cannot store the character at all,
utf8 columns have no supplementary
characters and you need not worry about converting
characters or losing data when converting to
utf8mb4.
In terms of table structure, the catch when converting from
utf8 to utf8mb4 is that
the maximum length of a column or index key is unchanged in
terms of bytes. Therefore, it is smaller
in terms of characters because the
maximum length of a character is four bytes instead of three.
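The effect of a fixed byte limit on the character limit is simple integer division. The following sketch uses two byte limits that appear in the examples in this section (255 bytes for a TINYTEXT column and 767 bytes for an InnoDB index key with COMPACT or REDUNDANT row format). The column aliases are illustrative names only:

```sql
-- Maximum whole characters that fit in a byte limit, at 3 or 4 bytes per character.
SELECT 255 DIV 3 AS tinytext_utf8_chars,     -- 85
       255 DIV 4 AS tinytext_utf8mb4_chars,  -- 63
       767 DIV 3 AS index_utf8_chars,        -- 255
       767 DIV 4 AS index_utf8mb4_chars;     -- 191
```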
For the CHAR,
VARCHAR, and
TEXT data types, watch for
these issues when converting your MySQL tables:
Check all definitions of utf8 columns
and make sure they will not exceed the maximum length for
the storage engine.
Check all indexes on utf8 columns and
make sure they will not exceed the maximum length for the
storage engine. Sometimes the maximum can change due to
storage engine enhancements.
If a column or index definition would exceed the maximum length
after conversion, you must either reduce the defined length of
the column or index, or continue to use utf8 rather than
utf8mb4.
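One way to carry out the index check is to query INFORMATION_SCHEMA. The following is a sketch only, under two assumptions that may not hold for your installation: that the applicable limit is a 767-byte index key prefix (which permits 191 utf8mb4 characters), and that the server reports the 3-byte character set under the name utf8. Adjust the threshold for other storage engines or row formats:

```sql
-- List indexed utf8 columns whose indexed length exceeds 191 characters.
-- SUB_PART is the indexed prefix length in characters, or NULL if the
-- entire column is indexed, in which case the full column length applies.
SELECT s.TABLE_SCHEMA, s.TABLE_NAME, s.INDEX_NAME, s.COLUMN_NAME,
       COALESCE(s.SUB_PART, c.CHARACTER_MAXIMUM_LENGTH) AS indexed_chars
  FROM INFORMATION_SCHEMA.STATISTICS AS s
  JOIN INFORMATION_SCHEMA.COLUMNS AS c
    ON c.TABLE_SCHEMA = s.TABLE_SCHEMA
   AND c.TABLE_NAME   = s.TABLE_NAME
   AND c.COLUMN_NAME  = s.COLUMN_NAME
 WHERE c.CHARACTER_SET_NAME = 'utf8'
   AND COALESCE(s.SUB_PART, c.CHARACTER_MAXIMUM_LENGTH) > 191;
```

Each row returned identifies an index column that would need a shorter prefix before the column can be converted to utf8mb4 under that limit.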
Here are some examples where structural changes may be needed:
A TINYTEXT column can hold
up to 255 bytes, so it can hold up to 85 3-byte or 63
4-byte characters. Suppose that you have a
TINYTEXT column that uses
utf8 but must be able to contain more
than 63 characters. You cannot convert it to
utf8mb4 unless you also change the data
type to a longer type such as
TEXT.
Similarly, a very long
VARCHAR column may need to
be changed to one of the longer
TEXT types if you want to
convert it from utf8 to
utf8mb4.
InnoDB has a maximum index length of
767 bytes for tables that use
COMPACT
or
REDUNDANT
row format, so for utf8 or
utf8mb4 columns, you can index a
maximum of 255 or 191 characters, respectively. If you
currently have utf8 columns with
indexes longer than 191 characters, you must index a
smaller number of characters.
In an InnoDB table that uses
COMPACT
or
REDUNDANT
row format, these column and index definitions are legal:
col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))
To use utf8mb4 instead, the index must
be smaller:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
For InnoDB tables that use
COMPRESSED
or
DYNAMIC
row format, you can enable the
innodb_large_prefix
option to permit index
key prefixes longer than 767 bytes (up to 3072
bytes). (Creating such tables also requires the option
values
innodb_file_format=barracuda
and
innodb_file_per_table=true.)
In this case, enabling the
innodb_large_prefix
option enables you to index a maximum of 1024 or 768
characters for utf8 or
utf8mb4 columns, respectively. For
related information, see
Section 14.8.8, “Limits on InnoDB Tables”.
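The structural changes described in these examples can be sketched as follows. The tables t2 and t3 and the column name note are hypothetical, and the SET GLOBAL statements assume a server version that still provides these options and an account permitted to set global variables:

```sql
-- A TINYTEXT column that must hold more than 63 characters is changed
-- to TEXT as part of the conversion to utf8mb4.
CREATE TABLE t2 (note TINYTEXT CHARACTER SET utf8);
ALTER TABLE t2 MODIFY note TEXT CHARACTER SET utf8mb4;

-- Enable the options named in the text so that index key prefixes
-- longer than 767 bytes are permitted.
SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_file_per_table = ON;
SET GLOBAL innodb_large_prefix = ON;

-- 500 characters x 4 bytes = 2000 bytes, within the 3072-byte limit.
CREATE TABLE t3 (
  col1 VARCHAR(500) CHARACTER SET utf8mb4,
  INDEX (col1(500))
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;
```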
The preceding types of changes are most likely to be required
only if you have very long columns or indexes. Otherwise, you
should be able to convert your tables from
utf8 to utf8mb4 without
problems, using ALTER TABLE as
described previously.
The following items summarize other potential areas of incompatibility:
4-byte UTF-8 (utf8mb4) is slower than
3-byte UTF-8 (utf8).
If you do not want to incur this penalty, continue to use
utf8.
SET NAMES 'utf8mb4' causes use of the
4-byte character set for connection character sets. As
long as no 4-byte characters are sent from the server,
there should be no problems. Otherwise, applications that
expect to receive a maximum of three bytes per character
may have problems. Conversely, applications that expect to
send 4-byte characters must ensure that the server
understands them. More generally, applications cannot send
utf8mb4, utf16,
utf16le, or utf32
data to an older server that does not understand it:
utf8mb4, utf16,
and utf32 are not recognized before
MySQL 5.5.3.
utf16le is not recognized before
MySQL 5.6.1.
For replication, if character sets that support
supplementary characters are to be used on the master, all
slaves must understand them as well. If you attempt to
replicate from a newer master to an older slave,
utf8 data will be seen as
utf8 by the slave and should replicate
correctly. But you cannot send utf8mb4,
utf16, utf16le, or
utf32 data to an older slave that does
not understand it:
utf8mb4, utf16,
and utf32 are not recognized before
MySQL 5.5.3.
utf16le is not recognized before
MySQL 5.6.1.
Also, keep in mind the general principle that if a table
has different definitions on the master and slave, this
can lead to unexpected results. For example, the
differences in maximum index key length make it risky to
use utf8 on the master and
utf8mb4 on the slave.
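Before relying on any of these character sets, an application or administrator can check what a given server understands. This is a minimal sketch using standard statements. It should be run against each server involved, including every replication slave:

```sql
-- Report the server version.
SELECT VERSION();

-- List the Unicode character sets this server recognizes
-- (utf8mb4, utf16, utf16le, utf32 appear only if supported).
SHOW CHARACTER SET LIKE 'utf%';

-- Switch the connection character sets to utf8mb4 and verify the result.
SET NAMES 'utf8mb4';
SHOW VARIABLES LIKE 'character_set_%';
```

After SET NAMES 'utf8mb4', the character_set_client, character_set_connection, and character_set_results variables should all show utf8mb4.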
If you have converted to utf8mb4,
utf16, utf16le, or
utf32, and then decide to convert back to
utf8 or ucs2 (for
example, to downgrade to an older version of MySQL), these
considerations apply:
utf8 and ucs2 data
should present no problems.
The server must be recent enough to recognize definitions referring to the character set from which you are converting.
For object definitions that refer to the
utf8mb4 character set, you can dump
them with mysqldump prior to
downgrading, edit the dump file to change instances of
utf8mb4 to utf8, and
reload the file in the older server, as long as there are
no 4-byte characters in the data. The older server will
see utf8 in the dump file object
definitions and create new objects that use the (3-byte)
utf8 character set.
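Because this procedure is safe only when the data contains no 4-byte characters, it can be useful to check for them before editing the dump file. The following is one possible sketch, not an official procedure. It uses the t1 table and col1 column from the earlier example and must be repeated for each utf8mb4 column. It relies on the fact that, in valid UTF-8, only the lead byte of a 4-byte sequence has F as its high hexadecimal digit:

```sql
-- Count rows whose col1 value contains at least one 4-byte character.
-- HEX() yields two hex digits per byte, so '^(..)*F' matches an F that
-- begins a byte, which is the lead byte of a 4-byte sequence.
SELECT COUNT(*) AS rows_with_4byte_chars
  FROM t1
 WHERE HEX(col1) REGEXP '^(..)*F';
```

A count of zero for every converted column suggests that the data can be reloaded as utf8. The query scans every row, so it may be slow on large tables.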