Join the Stack Overflow Community
Stack Overflow is a community of 6.6 million programmers, just like you, helping each other.
Join them; it only takes a minute:
Sign up

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, characters with four bytes, such as "πœ€", that are only supported by MySQL's utf8mb4 character set.

For example,

πœ€C = -2.4‰ Β± 0.3‰; πœ€H = -57‰

should become

C = -2.4‰ Β± 0.3‰; H = -57‰

I want to load a data file into a MySQL table that has CHARSET=utf8.

share|improve this question
    
I hear what you are asking, but why not convert the column to CHARACTER SET utf8mb4? – Rick James Jan 11 at 2:02
up vote 9 down vote accepted

MySQL's utf8mb4 encoding is what the world calls UTF-8.

MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the BMP (meaning characters U+0000 to U+FFFF inclusive).

Reference

So, the following will match the unsupported characters in question:

 /[^\N{U+0000}-\N{U+FFFF}]/

You could use it as follows:

  • Remove unsupported characters:

    s/[^\N{U+0000}-\N{U+FFFF}]//g;
    
  • Replace unsupported characters with U+FFFD:

    s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;
    
  • Replace unsupported characters using a translation map:

    my %translations = (
       "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
       # ...
    );
    
    s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
    

For example,

use utf8;                              # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;               # say, //
use charnames ':full';   # Not needed in 5.16+

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

$_ = "πœ€C = -2.4‰ Β± 0.3‰; πœ€H = -57‰";
say;

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;

Output:

πœ€C = -2.4‰ Β± 0.3‰; πœ€H = -57‰
Ξ΅C = -2.4‰ Β± 0.3‰; Ξ΅H = -57‰
share|improve this answer
    
If I understand right, I have to convert the utf8 characters (e.g. the 4 byte char "πœ€") into BMP before applying your regex. Is there a function to do this? – Matthias Munz Jan 10 at 16:37
1  
I don't know what you mean by converting "πœ€" into BMP. The BMP is a set of characters that doesn't include "πœ€". You could create a mapping of characters outside of the BMP to characters inside of the BMP. I've just added a demonstration of this to my answer. – ikegami Jan 10 at 16:47
    
thanks for your comment, but I meant something else: I would like to remove the 4 byte utf8 chars by using your regex s/[^\N{U+0000}-\N{U+FFFF}]//g; However when I apply it to my file (which contains a "πœ€" instead of its BMP representation) the 4 byte chars are not filtered out. – Matthias Munz Jan 10 at 16:53
    
Are you sure you have decoded text (as opposed to UTF-8)? – ikegami Jan 10 at 16:54
    
No, I think it's not decoded. How can I do this? For example my file contains this phrase "πœ€C = -2.4‰ ± 0.3‰; πœ€H = -57‰" – Matthias Munz Jan 10 at 16:55

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.