Private-Use Characters, Noncharacters & Sentinels FAQ
Private-Use Characters
Q: What are private-use characters?
A: Private-use characters are code points whose interpretation is not
specified by a character encoding standard and whose use and interpretation may
be determined by private agreement among cooperating users. Private-use characters
are sometimes also referred to as user-defined characters (UDC) or vendor-defined characters (VDC).
Q: Does Unicode have private-use characters?
A: Yes. There are three ranges of private-use characters in
the standard. The main range in the BMP is U+E000..U+F8FF, containing 6,400 private-use
characters. That range is often referred to as the Private Use Area (PUA).
But there are also two large ranges of supplementary private-use characters,
consisting of most of the code points on Planes 15 and 16: U+F0000..U+FFFFD
and U+100000..U+10FFFD. Together those ranges allocate another 131,068
private-use characters. Altogether, then, there are 137,468 private-use
characters in Unicode.
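Putting the three ranges together, a membership test is straightforward. The following Python sketch (the function and variable names are illustrative, not from any standard API) checks whether a code point is private-use:

```python
def is_private_use(cp: int) -> bool:
    """Return True if cp is one of Unicode's 137,468 private-use code points."""
    return (0xE000 <= cp <= 0xF8FF            # BMP Private Use Area (6,400)
            or 0xF0000 <= cp <= 0xFFFFD       # Plane 15 (65,534)
            or 0x100000 <= cp <= 0x10FFFD)    # Plane 16 (65,534)

# The three ranges together cover 137,468 code points:
total = (0xF8FF - 0xE000 + 1) + (0xFFFD - 0xFFF0 + 0xFFFD - 0xFFF0) \
        if False else \
        (0xF8FF - 0xE000 + 1) + (0xFFFFD - 0xF0000 + 1) + (0x10FFFD - 0x100000 + 1)
```

Note that the last two code points of Planes 15 and 16 (U+FFFFE/F and U+10FFFE/F) are excluded: they are noncharacters, not private-use characters.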
Q: Why are there so many private-use characters in Unicode?
A: Unicode is a very large and inclusive character set, containing
many more standardized characters than any of the legacy character encodings.
Most users have little need for private-use characters, because the characters
they need are already present in the standard.
However, some implementations, particularly those interoperating with East
Asian legacy data, originally anticipated needing large numbers of private-use characters to enable round-trip
conversion to private-use definitions in that data. In most cases, 6,400
private-use characters is more than enough, but there can be occasions when 6,400 does
not suffice. Allocating a large number of private-use characters has the additional
benefit of allowing implementations to choose ranges for their
private-use characters that are less likely to conflict with ranges used by
others.
The allocation of two entire additional planes full of private-use
characters ensures that even the most extravagant implementation of private-use
character definitions can be fully accommodated by Unicode.
Q: Will the number of private-use characters in Unicode ever change?
A: No. The set of private-use characters is formally immutable. This is guaranteed
by a Unicode Stability Policy.
Q: So legacy character encodings also have private-use characters?
A: Yes. Private-use characters are commonly used in East Asia, particularly
in Japan, China, and Korea, to extend the available characters in various standards
and vendor character sets. Typically, such characters have been used
to add Han characters not included in the standard repertoire
of the character set. Such non-standard Han character extensions are often referred to as "gaiji"
in Japanese contexts.
Q: So other than interoperating with legacy CJK, why would I use private-use characters?
A: Some characters may never get standard encodings for one reason or another.
For example, they might be part of a constructed artificial script (ConScript) which
has no general community of use. Or a particular implementation may need to use
private-use characters for specific internal purposes. Private-use characters are also
useful for testing implementations of scripts or other sets of characters which may
be proposed for encoding in a future version of Unicode.
Q: How can private-use characters be input?
A: Some input method editors (IME) allow customizations whereby
an input sequence and resulting private-use character can be added to their
internal dictionaries.
Q: How are private-use characters displayed?
A: With common font technologies such as OpenType and AAT, private-use characters
can be added to fonts for display.
Q: What happens if definitions of private-use characters conflict?
A: The same code points in
the PUA may be given different meanings in different contexts, since
they are, after all, defined by users and are not standardized. For example, if text
comes from a legacy NEC encoding in Japan, the same
code point in the PUA may mean something entirely different if
interpreted on a legacy Fujitsu machine, even though both systems would
share the same private-use code points. For each given interpretation of a
private-use character one would have to pick the appropriate IME, user
dictionary and fonts to work with it.
Q: What about properties for private-use characters?
A: One should not expect the rest of an operating system to
override the character properties for private-use characters,
since private use characters can have different meanings, depending on
how they originated. In terms of line breaking, case conversions, and
other textual processes, private-use characters will typically be
treated by the operating system as otherwise undistinguished letters (or
ideographs) with no uppercase/lowercase distinctions.
Q: What does "private agreement among cooperating parties" mean?
A: A "private agreement" simply refers to the fact that agreement about
the interpretation of some set of private-use characters is done privately, outside
the context of the standard. The Unicode Standard does not specify any particular
interpretation for any private-use character. There is no implication that a private
agreement necessarily has any contractual or other legal status—it is simply
an agreement between two or more parties about how a particular set of private-use characters
should be interpreted.
Q: How would I define a private agreement?
A: One can share, or even publish, documentation containing particular
assignments for private-use characters, their glyphs, and other relevant information
about their interpretation. One can then ask others to use those private-use characters
as documented. One can create appropriate fonts and IMEs, or request that others do so.
Noncharacters
Q: What are noncharacters?
A: A "noncharacter" is a code point that is permanently reserved in the Unicode
Standard for internal use.
Q: How did noncharacters get that weird name?
A: Noncharacters are in a sense a kind of private-use character, because
they are reserved for internal (private) use. However, that internal use is intended
as a "super" private use, not normally interchanged with other users. Their
allocation status in Unicode differs from that of ordinary private-use characters.
They are considered unassigned to any abstract character, and they share the
General_Category value Cn (Unassigned) with unassigned reserved code points in
the standard. In this sense they are "less a character" than most characters in
Unicode, and the moniker "noncharacter" seemed appropriate to the UTC to express that unique aspect
of their identity.
In Unicode 1.0 the code points U+FFFE and U+FFFF
were annotated in the code charts as "Not character codes" and instead of having actual
names were labeled "NOT A CHARACTER". The term "noncharacter" in later versions of the standard
evolved from those early annotations and labels.
Q: How many noncharacters does Unicode have?
A: Exactly 66.
Q: Which code points are noncharacters?
A: The 66 noncharacters are allocated as follows:
- a contiguous range of 32 noncharacters: U+FDD0..U+FDEF in the BMP
- the last two code points of the BMP, U+FFFE and U+FFFF
- the last two code points
of each of the 16 supplementary planes: U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ... U+10FFFE, U+10FFFF
For convenient reference, the following table summarizes all of the noncharacters,
showing their representations in UTF-32, UTF-16, and UTF-8. (In this table, "#" stands for
either the hex digit "E" or "F".)
| UTF-32   | UTF-16    | UTF-8       |
| -------- | --------- | ----------- |
| 0000FDD0 | FDD0      | EF B7 90    |
| ...      | ...       | ...         |
| 0000FDEF | FDEF      | EF B7 AF    |
| 0000FFF# | FFF#      | EF BF B#    |
| 0001FFF# | D83F DFF# | F0 9F BF B# |
| 0002FFF# | D87F DFF# | F0 AF BF B# |
| 0003FFF# | D8BF DFF# | F0 BF BF B# |
| 0004FFF# | D8FF DFF# | F1 8F BF B# |
| ...      | ...       | ...         |
| 000FFFF# | DBBF DFF# | F3 BF BF B# |
| 0010FFF# | DBFF DFF# | F4 8F BF B# |
Q: Why are 32 of the noncharacters located in a block of Arabic characters?
A: The allocation of the range of noncharacters
U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A
block was mostly a matter of efficiency in the use of reserved code points
in the rather fully-allocated BMP. The Arabic Presentation Forms-A
block had a contiguous range of 32 unassigned code points, but
as of 2001, when the need for more BMP noncharacters became
apparent, it was already clear to the UTC that the encoding of
many more Arabic presentation forms similar to those already
in the Arabic Presentation Forms-A block would not be useful to
anyone. Rather than designate an entirely new block for noncharacters,
the UTC instead designated the unassigned range U+FDD0..U+FDEF for them.
Note that the range U+FDD0..U+FDEF for noncharacters is another
example of why it is never safe to simply assume from the name of
a block in the Unicode Standard that you know exactly what kinds
of characters it contains. The identity of any character is determined
by its actual properties in the Unicode Character Database. The
noncharacter code points in the range U+FDD0..U+FDEF share
none of their properties with other characters in the Arabic
Presentation Forms-A block; they certainly are not Arabic
script characters, for example.
Q: Will the set of noncharacters in Unicode ever change?
A: No. The set of noncharacters is formally immutable. This is guaranteed
by a Unicode Stability Policy.
Q: Are noncharacters intended for interchange?
A: No. They are intended explicitly for internal use. For example, they
might be used internally as a particular kind of object placeholder in a string. Or
they might be used in a collation tailoring as a target for a weighting
that comes between weights for "real" characters of different scripts, thus simplifying the support
of "alphabetic index" implementations.
Q: Are noncharacters prohibited in interchange?
A: This question has led to some controversy, because the Unicode Standard
has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of
"noncharacter" in the standard has always indicated that noncharacters "should never be
interchanged." That led some people to assume that the definition actually meant "shall not be
interchanged" and that therefore the presence of a noncharacter in any Unicode string
immediately rendered that string malformed according to the standard. But the intended
use of noncharacters requires the ability to exchange them in a limited context,
at least across APIs and even through data files and other means of "interchange", so
that they can be processed as intended. The choice of the word "should" in the original
definition was deliberate, and indicated that one should not try to interchange
noncharacters precisely because their interpretation is strictly internal to whatever
implementation uses them, so they have no publicly interchangeable semantics. But other
informative wording in the text of the core specification and in the character names list
was differently and more strongly worded, leading to contradictory interpretations.
Given this ambiguity of intent, in 2013 the UTC issued
Corrigendum #9, which
deleted the phrase "and that should never be interchanged" from the
definition of noncharacters, to make it clear that prohibition from interchange is
not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.
Q: Are noncharacters invalid in Unicode strings and UTFs?
A: Absolutely not. Noncharacters do not cause a Unicode string to be
ill-formed in any UTF. This can be seen explicitly in the table above,
where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8.
An implementation which converts noncharacter code points between one UTF representation and
another must preserve these values correctly. The fact that they are called
"noncharacters" and are not intended for open interchange does not mean that they
are somehow illegal or invalid code points which make strings containing them invalid.
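This can be demonstrated directly. In Python, for instance, noncharacters round-trip through every UTF, while a lone surrogate code point cannot be encoded at all:

```python
# A string containing the noncharacters U+FFFE and U+10FFFF:
s = "A\uFFFEB\U0010FFFFC"

# Noncharacters round-trip cleanly through every UTF:
for enc in ("utf-8", "utf-16", "utf-32"):
    assert s.encode(enc).decode(enc) == s

# Surrogate code points, by contrast, are not Unicode scalar values
# and cannot be represented in any UTF:
try:
    "\ud800".encode("utf-8")
    surrogate_encoded = True
except UnicodeEncodeError:
    surrogate_encoded = False   # lone surrogate: encoding fails
```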
Q: So how should libraries and tools handle noncharacters?
A: Library APIs, components, and tool applications (such as low-level text
editors) which handle all Unicode strings should also handle noncharacters. Often this
means simple pass-through, the same way such an API or tool would handle a reserved
unassigned code point. Such APIs and tools would not normally be expected to interpret
the semantics of noncharacters, precisely because the intended use of a noncharacter
is internal. But an API or tool should also not arbitrarily filter out, convert, or
otherwise discard the value of noncharacters, any more than they would do for private-use
characters or reserved unassigned code points.
Q: If my application makes specific, internal use of a noncharacter,
what should I do with input text?
A: In cases where the input text cannot be guaranteed to use the same interpretation
for the noncharacter as your program does, and the presence of that noncharacter would cause
internal problems, it is best practice to replace that particular
noncharacter on input by U+FFFD. Of course, such behavior should be clearly documented, so that
external clients know what to expect.
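For example, a program that reserves one particular noncharacter for internal use might sanitize its input like this (the choice of U+FDEF as the internal value is purely illustrative, as are the names):

```python
# Hypothetical: the noncharacter this program reserves for internal use.
INTERNAL_MARKER = "\uFDEF"

def sanitize_input(text: str) -> str:
    """Replace our internally-used noncharacter with U+FFFD on input,
    so that externally supplied text cannot collide with our internal use.
    Other noncharacters are passed through untouched."""
    return text.replace(INTERNAL_MARKER, "\uFFFD")
```

Note that only the one reserved noncharacter is replaced; all others pass through, per the guidance above.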
Q: What should I do if downstream clients depend on noncharacters
being passed through by my module?
A: In such a case, your module may need to use a more complicated mechanism to
preserve noncharacters for pass through, while not interfering with their specific internal use.
This behavior will prevent your downstream clients from breaking, at the cost of making your
processing marginally more complex. However, because of this additional complexity, if you anticipate that
a future version of your module may not pass through one or more noncharacters, it is best
practice to document the reservation of those values from the start. In that way, any
downstream client using your module can have clearly specified expectations regarding
which noncharacter values your module may replace.
Q: Can failing to replace noncharacters with U+FFFD lead to problems?
A: If your implementation has no conflicting internal definition and use for the
particular noncharacter in question, it is usually harmless to just leave noncharacters in
the text stream. They definitely will not be displayable and might break up text units or
create other "funny" effects in text, but these results are typically the same as
could be expected for an uninterpreted private-use character or even
a normal assigned character for which no display glyph is available.
Q: Can noncharacters simply be deleted from input text?
A: No. Doing so can lead to security problems. For more information, see
Unicode Technical Report #36, Unicode Security Guidelines.
Q: Can you summarize the basic differences between private-use
characters and noncharacters?
A: Private-use characters do not have any meanings assigned by the Unicode Standard,
but are intended to be interchanged among cooperating parties who share conventions
about what the private-use characters mean. Typically, sharing those conventions means that
there will also be some kind of public documentation about such use: for example, a website listing
a table of interpretations for certain ranges of private-use characters. As an example, see
the ConScript Unicode Registry—a private
group unaffiliated with the Unicode Consortium—which has extensive tables listing private-use character definitions for
various unencoded scripts. Or such public documentation might consist of the specification of
all the glyphs in a font distributed for the purpose of displaying certain ranges of private-use characters.
Of course, a group of cooperating users which have a private agreement about the
interpretation of some private-use characters is under no obligation to publish the details
of their agreement.
Noncharacters also do not have any meanings assigned by the Unicode Standard, but
unlike private-use characters, they are intended only for internal use, and are not
intended for interchange. Usually, there will be no public documentation available
about their use in particular instances, and fonts typically do not have glyphs for them.
Noncharacters and private-use characters also differ significantly in their default
Unicode character property values.
| Code Point Type | Use Type             | Properties          |
| --------------- | -------------------- | ------------------- |
| noncharacter    | private, internal    | gc=Cn, bc=BN, eaw=N |
| private use     | private, interchange | gc=Co, bc=L, eaw=A  |
Sentinels
Q: What is a sentinel?
A: A sentinel is a special numeric value typically used to signal an edge condition
of some sort. For text, in particular, sentinels are values stored with text but which are
not interpreted as part of the text, and which indicate some special status. For example, a null
byte is used as a sentinel in C strings to mark the end of the string.
Q: Is it safe to use a noncharacter as an end-of-string sentinel?
A: It is not recommended. The use of any Unicode code point U+0000..U+10FFFF
as a sentinel value (such as "end of text" in APIs) can cause problems when that code point actually
occurs in the text. It is preferable to use a true out-of-range value,
for example -1. This is parallel to the use of -1 as the sentinel end-of-file (EOF) value in the
standard C library, and is easy and fast to test for in code with a (result < 0) check. Alternatively,
a clearly out-of-range positive value such as 0x7FFFFFFF could also be used as a sentinel value.
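A minimal sketch of the out-of-range approach (the function and constant names are illustrative): the sentinel lies outside 0..0x10FFFF, so it can never collide with any code point in the text, noncharacters included.

```python
EOT = -1  # out-of-range sentinel: no Unicode code point is negative

def next_code_point(text: str, index: int) -> int:
    """Return the code point at index, or EOT when the text is exhausted."""
    if index >= len(text):
        return EOT
    return ord(text[index])

# A (result < 0) test is unambiguous: even U+FFFF in the text
# is returned as the positive value 0xFFFF, distinct from EOT.
```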
Q: How about using NULL as an end-of-string sentinel?
A: When using UTF-8 in C strings, implementations follow the same conventions
they would for any legacy 8-bit character encoding in C strings. The byte 0x00 marks the
end of the string, consistent with the C standard. Because the byte 0x00 in UTF-8 also
represents U+0000 NULL, a UTF-8 C string cannot have a NULL in its contents. This is precisely
the same issue as for using C strings with ASCII. In fact, an ASCII C string is formally
indistinguishable from a UTF-8 C string with the same character content.
It is also quite common for implementations which handle both UTF-8 and
UTF-16 data to implement 16-bit string handling analogously to C strings, using 0x0000 as
a 16-bit sentinel to indicate end of string for a 16-bit Unicode string. The rationale
for this approach and the associated problems completely parallel those for UTF-8
C strings.
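The consequence is easy to demonstrate: the UTF-8 encoding of a string containing U+0000 contains a literal 0x00 byte, so C-style string handling would silently truncate it. A Python illustration:

```python
s = "ab\u0000cd"            # a string containing U+0000 NULL
data = s.encode("utf-8")    # b'ab\x00cd': the NULL becomes byte 0x00

# C-style string functions (strlen, strcpy, ...) stop at the first
# 0x00 byte, so only the bytes before the NULL would be seen:
c_visible = data.split(b"\x00", 1)[0]
```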
Q: The Unicode Standard talks about U+FEFF BYTE ORDER MARK (BOM) being a signature. Is that the same as a sentinel?
A: No. A signature is a defined sequence of bytes used to identify an object. In the case of
Unicode text, certain encoding schemes use specific initial byte sequences to identify the byte order of
a Unicode text stream. See the BOM FAQ entries for more details.
Q: But the byte-swapped BOM, U+FFFE, is a noncharacter. Why?
A: U+FFFE was designated as a noncharacter to make it unlikely that normal, interchanged text
would begin with U+FFFE. The occurrence of U+FFFE as the initial character of a text
has the potential to confuse applications testing for the two initial signature bytes <FE FF ...> or <FF FE ...>
of a byte stream labeled as using the UTF-16 encoding scheme. That can interfere with checking for the presence of a BOM
which would indicate big-endian or little-endian order.
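A byte-order check of this kind might look like the following sketch (the function name is illustrative; it examines only the first two bytes of the stream):

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Check the first two bytes of a UTF-16 byte stream for a BOM."""
    if data[:2] == b"\xFE\xFF":
        return "big-endian"
    if data[:2] == b"\xFF\xFE":
        return "little-endian"
    # Per the UTF-16 encoding scheme, a stream with no BOM
    # is interpreted as big-endian.
    return "no BOM"
```

Because U+FFFE is a noncharacter, well-formed interchanged text should never begin with the bytes <FF FE> in big-endian order, which is what makes this test reliable.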
Q: I read somewhere that U+FFFE and U+FFFF were illegal in Unicode, and could be used as sentinels. Is that true?
A: Well, the short answer is no, that is not true—at least, not entirely true. U+FFFE and U+FFFF are noncharacters just like
the other 64 noncharacters in the standard, and are valid in Unicode strings. Because they are noncharacters,
nothing would prohibit a privately-defined internal use of either of them as a sentinel, but such use
is problematic in the same way that use of any valid character as a sentinel can be.
The claims about U+FFFE and U+FFFF being illegal in Unicode derive from the days
of Unicode 1.0 [1991], when the standard
was still architected as a pure 16-bit character encoding, before the invention
of UTF-16 and supplementary characters. In that version of the standard, U+FFFE and U+FFFF did have
an unusual status. The code charts were printed omitting the last two code points altogether, and
in the names list, the code points U+FFFE and U+FFFF were labeled "NOT A CHARACTER". They were also
annotated with notes like, "the value FFFF is guaranteed not to be a Unicode character at all".
Section 2.3, p. 14 of Unicode 1.0 contains the statement, "U+FFFE and U+FFFF are reserved and should not
be transmitted or stored," so it is clear that Unicode 1.0 intended that those values would
not occur in Unicode strings.
The block description for the Specials Block in Unicode 1.0 contained the following information:
U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value,
and should be taken as a signal that Unicode characters should be byte-swapped before interpretation.
U+FFFE should only be interpreted as an incorrectly byte-swapped version of U+FEFF.
U+FFFF. The 16-bit unsigned hexadecimal value U+FFFF is not a Unicode character value,
and can be used by an application as a [sic] error code or other non-character value. The specific
interpretation of U+FFFF is not defined by the Unicode standard, so it can be viewed as a kind of private-use
non-character.
It should be apparent that U+FFFF in Unicode 1.0 was the prototype for what later became
noncharacters in the standard—both in terms of how it was labeled and how its function
was described.
Unicode 2.0 [1996] formally changed
the architecture of Unicode, as a result of the merger with ISO/IEC 10646-1:1993 and the introduction
of UTF-16 and UTF-8 (both dating from Unicode 1.1 times [1993]). However, both Unicode 2.0 and Unicode 3.0 effectively
were still 16-bit standards, because no characters had been encoded beyond the BMP, and because implementations were
still mostly treating the standard as a de facto fixed-width 16-bit encoding.
The conformance wording about U+FFFE
and U+FFFF changed somewhat in Unicode 2.0, but these were still the only two code points with this unique status,
and there were no other "noncharacters" in the standard. The code charts switched to
the current convention of showing what we now know as "noncharacters" with black cells in
the code charts, rather than omitting the code points altogether. The names list annotations were unchanged
from Unicode 1.0, and the Specials Block description text was essentially unchanged as well.
Unicode 3.0 introduced the term "noncharacter" to describe U+FFFE and U+FFFF, not as a formal
definition, but simply as a subhead in the text.
The Chapter 2 language in Unicode 2.0 dropped the explicit prohibition
against transmission or storage of U+FFFE and U+FFFF, but instead added the language, "U+FFFF
is reserved for private program use as a sentinel or other signal." That statement effectively
blessed existing practice for Unicode 2.0 (and 3.0), where 16-bit implementations were taking
advantage of the fact that the very last code point in the BMP was reserved and conveniently could also
be interpreted as a (signed) 16-bit value of -1, to use it as a sentinel value in some string
processing.
Unicode 3.0 [1999] formalized the definition
of "transformations", now more widely referred to as UTFs. And there was one very important
addition to the text which makes it clear that U+FFFE and U+FFFF still had a special status and were
not considered "valid" Unicode characters. Chapter 3, p. 46 included the language:
To ensure that round-trip transcoding is possible, a UTF mapping must also map invalid
Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE₁₆,
FFFF₁₆, and unpaired surrogates.
That initial formulation of UTF mapping was erroneous. A lot of work was done
to correct and clarify the concepts of encoding forms and UTF mapping in the versions
immediately following Unicode 3.0, to correct various defects in the specification.
Unicode 3.1 [2001] was the watershed
for the development of noncharacters in the standard. Unicode 3.1 was the first version to add
supplementary characters to the standard. As a result, it also had to come to grips with the
fact that ISO/IEC 10646-2:2001 had reserved the last two code points of every plane
as "not a character", despite the fact that their code point values shared nothing with the
rationale for reserving U+FFFE and U+FFFF when the entire codespace was just 16 bits.
The Unicode 3.1 text formally defined noncharacters, and also designated the
code point range U+FDD0..U+FDEF as noncharacters, resulting in the 66 noncharacters
defined in the standard.
Unicode 4.0 [2003] finally corrected
the statement about mapping noncharacters and surrogate code points:
To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode
scalar values, including those corresponding to noncharacter code points and unassigned code points,
must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate
and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.
That correction results in the current situation for Unicode, where noncharacters are valid
Unicode scalar values, are valid in Unicode strings, and must be mapped through UTFs,
whereas surrogate code points are not valid Unicode scalar values, are not valid
in UTFs, and cannot be mapped through UTFs.
Unicode 4.0 also added an entire new informative section about noncharacters, which
recommended the use of U+FFFF and U+10FFFF "for internal purposes as sentinels." That new text
also stated that "[noncharacters] are forbidden for use in open interchange of Unicode text data," a claim
which was stronger than the formal definition. And it made a contrast between noncharacters
and "valid character value[s]", implying that noncharacters were not valid. Of course, noncharacters
could not be interpreted in open interchange, but the text in this section had not really
caught up with the implications of the change of wording in the conformance requirements for UTFs.
The text still echoed the sense of "invalid" associated with noncharacters in Unicode 3.0.
Because of this complicated history and confusing changes of wording in the standard
over the years regarding what are now known as noncharacters, there
is still considerable disagreement about their use and whether they should be considered "illegal" or
"invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it
should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode
16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations
as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.
For up-to-date Unicode implementations, however, one should use caution when choosing
sentinel values. U+FFFF and U+10FFFF still have interesting numerical properties which render them
likely choices for internal use as sentinels, but implementers should be aware of the fact that
those values, as for all noncharacters in the standard, are also valid in Unicode strings,
must be converted between UTFs, and may be encountered in Unicode data—not necessarily
used with the same interpretation as for one's own sentinel use. Just be careful out there!