How can I remove characters that are not supported by MySQL's utf8 character set?

Question

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, characters with four bytes, such as "𝜀", that are only supported by MySQL's utf8mb4 character set.

For example,

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰

should become

C = -2.4‰ ± 0.3‰; H = -57‰

I want to load a data file into a MySQL table that has CHARSET=utf8.

I hear what you are asking, but why not convert the column to CHARACTER SET utf8mb4? — Rick James, Jan 11 at 2:02

ikegami · Accepted Answer · 2017-01-10 17:19:23Z

up vote 9 down vote accepted

MySQL's utf8mb4 encoding is what the world calls UTF-8.

MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the BMP (meaning characters U+0000 to U+FFFF inclusive).

Reference

So, the following will match the unsupported characters in question:

 /[^\N{U+0000}-\N{U+FFFF}]/

You could use it as follows:

Remove unsupported characters:
```
s/[^\N{U+0000}-\N{U+FFFF}]//g;
```

Replace unsupported characters with U+FFFD:

s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;

Replace unsupported characters using a translation map:

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;

For example,

use utf8;                              # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;               # say, //
use charnames ':full';   # Not needed in 5.16+

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;

Output:

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰

edited Jan 10 at 17:19

answered Jan 10 at 15:36

ikegami

210k7138321

If I understand right, I have to convert the utf8 characters (e.g. the 4 byte char "𝜀") into BMP before applying your regex. Is there a function to do this? – Matthias Munz Jan 10 at 16:37

1

I don't know what you mean by converting "𝜀" into BMP. The BMP is a set of characters that doesn't include "𝜀". You could create a mapping of characters outside of the BMP to characters inside of the BMP. I've just added a demonstration of this to my answer. – ikegami Jan 10 at 16:47

thanks for your comment, but I meant something else: I would like to remove the 4 byte utf8 chars by using your regex s/[^\N{U+0000}-\N{U+FFFF}]//g; However when I apply it to my file (which contains a "𝜀" instead of its BMP representation) the 4 byte chars are not filtered out. – Matthias Munz Jan 10 at 16:53

Are you sure you have decoded text (as opposed to UTF-8)? – ikegami Jan 10 at 16:54

No, I think it's not decoded. How can I do this? For example my file contains this phrase "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰" – Matthias Munz Jan 10 at 16:55

| show 4 more comments

asked	14 days ago
viewed	79 times
active	14 days ago

current community

your communities

more stack exchange communities

How can I remove characters that are not supported by MySQL's utf8 character set?

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged mysql perl utf-8 utf8mb4 or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

How can I remove characters that are not supported by MySQL's utf8 character set?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged mysql perl utf-8 utf8mb4 or ask your own question.

Related

Hot Network Questions