MySQL's utf8mb4 encoding is what the world calls UTF-8.
MySQL's utf8 encoding (an alias for utf8mb3) is a subset of UTF-8 that only supports characters in the Basic Multilingual Plane (BMP), meaning code points U+0000 to U+FFFF inclusive.
Reference
So, the following will match the unsupported characters in question:
/[^\N{U+0000}-\N{U+FFFF}]/
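As a rough sketch of how that pattern could be used to find out which characters would be affected before touching the data: the helper name `non_bmp_chars` below is hypothetical, and the string is assumed to be already decoded into characters (not raw UTF-8 bytes), since the pattern matches on code points.

```perl
use strict;
use warnings;
use 5.010;   # say

# Hypothetical helper: returns the code points (as "U+XXXX" strings) of every
# character in a decoded string that lies outside the BMP.
sub non_bmp_chars {
    my ($str) = @_;
    return map { sprintf("U+%04X", ord($_)) }
           $str =~ /([^\N{U+0000}-\N{U+FFFF}])/g;
}

my $text  = "a\N{U+1D700}b\N{U+1F600}";   # contains two non-BMP characters
my @found = non_bmp_chars($text);
say "@found";   # U+1D700 U+1F600
```

An empty list means the string should fit in a utf8 (utf8mb3) column as far as the character repertoire is concerned.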
You could use it as follows:
Remove unsupported characters:
s/[^\N{U+0000}-\N{U+FFFF}]//g;
Replace unsupported characters with U+FFFD:
s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;
Replace unsupported characters using a translation map:
my %translations = (
"\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
# ...
);
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
For example,
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal and files use UTF-8.
use strict;
use warnings;
use 5.010; # say, //
use charnames ':full'; # Not needed in 5.16+
my %translations = (
"\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
# ...
);
$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;
Output:
𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
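If the substitution is needed in more than one place, it could be wrapped in a small function. This is only a sketch: the name `to_bmp` is hypothetical, the translation map holds just the one entry from the example above, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;
use 5.010;               # //
use charnames ':full';   # Not needed in 5.16+

# Translation map (illustrative; extend with whatever mappings you need).
my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
);

# Hypothetical helper: returns a copy of the string in which every non-BMP
# character is replaced by its mapped equivalent, or by U+FFFD if the map
# has no entry for it. The caller's string is left unmodified.
sub to_bmp {
    my ($str) = @_;
    $str =~ s{([^\N{U+0000}-\N{U+FFFF}])}
             { $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
    return $str;
}

my $safe = to_bmp("\N{U+1D700}C = -2.4");   # U+1D700 becomes U+03B5
```

Working on a copy keeps the original value available, for example for logging which rows were altered.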
CHARACTER SET utf8mb4? – Rick James Jan 11 at 2:02