12.5. Unicode

12.5.1. Basic Concepts

12.5.1.1. BMP

The BMP (Basic Multilingual Plane) is Plane 0 of the Unicode code space, covering the code points U+0000 to U+FFFF.

12.5.1.2. Code planes

Unicode code points are divided into 17 planes, each containing 2^16 (i.e. 65536) code points. The code points of the 17 planes can be written as U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10.
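
As a rough Python sketch (the function name is illustrative), the plane of a character is simply its code point divided by 0x10000:

    # Each plane holds 0x10000 code points, so integer division by 0x10000
    # gives the plane number (0 = BMP, 1 = SMP, ..., 16).
    def plane(ch):
        return ord(ch) // 0x10000

    print(plane("A"))     # 0  (U+0041, in the BMP)
    print(plane("😀"))    # 1  (U+1F600, Supplementary Multilingual Plane)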

12.5.1.3. Code Point

A code point (also called a code position) is any of the numerical values that make up the code space; in Unicode they range from U+0000 to U+10FFFF.

12.5.1.4. Code Unit

A code unit is the minimal unit a given Unicode encoding form uses to represent a code point: UTF-8 uses 8-bit code units (one byte), UTF-16 and UCS-2 use 16-bit code units (two bytes), and UCS-4/UTF-32 use 32-bit code units (four bytes). A single code point may require several code units, e.g. up to four bytes in UTF-8.
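
A small Python illustration of how many code units the same characters occupy under different encoding forms (the sample characters are arbitrary):

    # UTF-8 uses 8-bit units, UTF-16 uses 16-bit units, UTF-32 uses 32-bit units.
    for ch in ("A", "é", "中", "😀"):
        print(
            f"U+{ord(ch):04X}",
            len(ch.encode("utf-8")),           # UTF-8 code units (bytes)
            len(ch.encode("utf-16-le")) // 2,  # UTF-16 code units
            len(ch.encode("utf-32-le")) // 4,  # UTF-32 code units (always 1)
        )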

12.5.1.5. Surrogate Pair

Surrogate pairs are the mechanism that keeps UTF-16 backward compatible with UCS-2. The code points 0xD800-0xDBFF (high surrogates) and 0xDC00-0xDFFF (low surrogates) are reserved inside the UCS-2 range; a high surrogate followed by a low surrogate forms four bytes that represent a single character beyond the BMP. Each surrogate range contains 1024 code points, so a surrogate pair can express 1024 x 1024 = 1048576 = 0x100000 characters.
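
A minimal Python sketch of the surrogate-pair arithmetic (the helper name to_surrogate_pair is illustrative); the result matches the raw UTF-16 encoding:

    def to_surrogate_pair(code_point):
        assert 0x10000 <= code_point <= 0x10FFFF
        offset = code_point - 0x10000       # 20 bits remain after the offset
        high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
        low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
        return high, low

    high, low = to_surrogate_pair(0x1F600)  # U+1F600 (😀)
    print(hex(high), hex(low))              # 0xd83d 0xde00
    print("😀".encode("utf-16-be").hex())   # d83dde00 -- the same pair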

12.5.1.6. Combining Character

For example, the string He̊llö contains accented characters. Assigning a separate code point to every possible combination of base letter and accent would require a huge number of code points, so such characters can instead be built by combining a base character with one or more combining marks.
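
A rough Python illustration: a base letter followed by a combining mark renders as one visible character but consists of two code points:

    import unicodedata

    s = "e\u030a"                        # e̊ : "e" + combining ring above
    print(len(s))                        # 2 code points, one perceived character
    print(unicodedata.name("\u030a"))    # COMBINING RING ABOVE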

12.5.1.7. BOM

A byte-order mark (BOM) is the Unicode character U+FEFF used with a special meaning: when a string is encoded in UTF-16 or UTF-32, this character is placed at the start of the text to indicate its endianness. It is also commonly used as a signature to indicate that a text stream is UTF-encoded.
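
A short Python illustration (the byte order produced by the endian-unspecified 'utf-16' codec is shown for a little-endian build):

    import codecs

    print(codecs.BOM_UTF16_LE.hex())       # fffe
    print(codecs.BOM_UTF16_BE.hex())       # feff
    print("A".encode("utf-16").hex())      # fffe4100 -- BOM prepended automatically
    print("\ufeff".encode("utf-8").hex())  # efbbbf -- the UTF-8 "BOM" signature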

12.5.2. Encoding

12.5.2.1. UCS-2

UCS-2 (2-byte Universal Character Set) is a fixed-length encoding: every code point is represented by exactly one 16-bit value, so only code points in the range 0 to 0xFFFF (the BMP) can be encoded.

12.5.2.2. UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and a prefix code. It can encode every valid code point in the Unicode character set in one to four bytes and is part of the Unicode standard. (The table below also lists the five- and six-byte sequences of the original design; RFC 3629 restricts UTF-8 to at most four bytes, i.e. code points up to U+10FFFF.) The encoding is as follows:

Bits  First code point  Last code point  Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+0000            U+007F           1      0xxxxxxx
11    U+0080            U+07FF           2      110xxxxx  10xxxxxx
16    U+0800            U+FFFF           3      1110xxxx  10xxxxxx  10xxxxxx
21    U+10000           U+1FFFFF         4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+200000          U+3FFFFFF        5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+4000000         U+7FFFFFFF       6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
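
A quick Python check of the three-byte row, using the euro sign U+20AC as the sample character:

    # 0x20AC needs 16 bits, so it is encoded with the pattern
    # 1110xxxx 10xxxxxx 10xxxxxx.
    print(" ".join(f"{b:08b}" for b in "€".encode("utf-8")))
    # 11100010 10000010 10101100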

12.5.2.3. UTF-16

UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2. It uses one or two 16-bit code units per code point and can encode every code point from 0 to 0x10FFFF; code points beyond the BMP are encoded as surrogate pairs.

12.5.3. Equivalence problems

12.5.3.1. Introduction

Unicode contains many special characters, and to remain compatible with existing standards some of them are functionally equivalent to other characters or to sequences of characters. Unicode therefore defines certain sequences of code points to be equivalent.

Unicode provides two notions of equivalence: canonical equivalence and compatibility equivalence, the former being a subset of the latter. For example, the character n followed by the combining tilde (U+0303) is equivalent, both canonically and compatibly, to the single character ñ, whereas the ligature ﬀ is only compatibly equivalent to the two-character sequence ff.

Unicode normalization is a form of text normalization: sequences that are equivalent to one another are converted into a single representative sequence, called a normal form in the Unicode standard.

For each notion of equivalence, Unicode defines two normal forms, one fully composed and one fully decomposed, giving four forms in total: NFC, NFD, NFKC and NFKD. Normalization is important for software that processes Unicode text, because it affects the outcome of comparison, searching and sorting.
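
A minimal Python sketch of canonical equivalence using the standard unicodedata module (the sample character ñ matches the example above):

    import unicodedata

    composed = "\u00f1"       # ñ (precomposed)
    decomposed = "n\u0303"    # n + combining tilde

    print(composed == decomposed)                                 # False
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True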

12.5.3.2. Canonical Equivalence

Canonical equivalence in Unicode is based on the interplay of character composition and decomposition. Composition combines simple characters into fewer precomposed characters, for example the character n plus the combining tilde composes into ñ; decomposition is the reverse process, turning precomposed characters back into their components.

Canonical equivalence preserves both visual and functional equivalence. For example, a letter carrying a diacritic is canonically equivalent to the decomposed sequence of the base letter and the combining diacritic: the precomposed character 'ü' and the sequence of 'u' followed by the combining diaeresis (U+0308) are canonically equivalent. Similarly, Unicode unifies some Greek punctuation marks with identical-looking ASCII punctuation, such as the Greek question mark and the semicolon.

12.5.3.3. Compatibility Equivalence

Compatibility equivalence is broader than canonical equivalence: any canonically equivalent sequence is also compatibly equivalent, but not the other way around. Compatibility equivalence is concerned more with plain-text equivalence and may merge forms that are semantically distinct.

For example, a superscript digit and the plain digit it is based on are compatibility equivalents but not canonical equivalents. The rationale is that, although superscript and subscript forms sometimes carry different meanings, it is reasonable (if visually lossy) for applications to treat them as the same character, which lets superscripts and subscripts be represented in plain Unicode text in a less cumbersome way (see the next section).

Full-width and half-width katakana forms are likewise compatibility equivalents but not canonical equivalents, as are ligatures and the sequences of their component letters. The difference is purely visual rather than semantic: authors do not normally claim that using a ligature conveys one meaning and not using it another; it is simply a typographic choice.

Software that searches or sorts Unicode strings must take equivalence into account; otherwise users would be unable to find glyphs that are visually indistinguishable from what they typed.

Unicode provides a standard normalization algorithm that produces a unique sequence for all equivalent sequences. The equivalence criterion can be canonical (NF) or compatibility (NFK). Since the representative of an equivalence class could in principle be chosen arbitrarily, more than one normal form is possible for each criterion; Unicode defines two per criterion: NFC and NFKC for composition, and NFD and NFKD for decomposition. In both the composed and the decomposed form, combining marks are placed in a canonical order, which makes each normal form unique.

12.5.3.4. Normalization

To compare or search Unicode strings, software can use either the composed or the decomposed form; which one does not matter, as long as all the strings being compared or searched are in the same form. The choice of equivalence notion, however, does affect the result. For example, the ligature ﬃ (U+FB03), the Roman numeral Ⅸ (U+2168) and even the superscript digit ⁵ (U+2075) have their own Unicode code points. Canonical normalization leaves them unchanged, but compatibility normalization decomposes ﬃ into f, f, i, so a search for U+0066 (f) succeeds under NFKC but fails under NFC. The same goes for searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ, and '⁵' likewise becomes '5'.
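
The following Python sketch reproduces this search behaviour with the standard unicodedata module:

    import unicodedata

    ligature = "\ufb03"                                      # ﬃ
    print("f" in unicodedata.normalize("NFC", ligature))     # False
    print("f" in unicodedata.normalize("NFKC", ligature))    # True

    print(unicodedata.normalize("NFKC", "\u2075"))           # 5  (from ⁵)
    print(unicodedata.normalize("NFKC", "\u2168"))           # IX (from Ⅸ)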

For a browser, converting superscripts into baseline digits may be undesirable, because the superscript information is lost. To allow for this difference, the Unicode character database records a compatibility formatting tag alongside each compatibility decomposition: for ligatures the tag is simply <compat>, while for superscripts it is <super>. Rich-text formats such as HTML can preserve the distinction instead; for example, HTML uses its own markup to place a '5' in superscript position.

12.5.3.5. Normal Forms

  • NFD (Normalization Form Canonical Decomposition): characters are decomposed by canonical equivalence.

  • NFC (Normalization Form Canonical Composition): characters are decomposed by canonical equivalence and then recomposed by canonical equivalence. Because of singletons, the recomposed result may differ from the original character.

  • NFKD (Normalization Form Compatibility Decomposition): characters are decomposed by compatibility equivalence.

  • NFKC (Normalization Form Compatibility Composition): characters are decomposed by compatibility equivalence and then recomposed by canonical equivalence (see the sketch after this list).
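
A minimal sketch of the four forms applied to one sample string (the sample characters, the ligature ﬁ and a decomposed é, are arbitrary):

    import unicodedata

    s = "\ufb01 e\u0301"     # "ﬁ" (U+FB01) + space + "e" + combining acute
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        out = unicodedata.normalize(form, s)
        print(form, [f"U+{ord(c):04X}" for c in out])
    # NFC/NFD keep the ligature; NFKC/NFKD expand it to "f" + "i".
    # NFC/NFKC compose e + U+0301 into U+00E9; NFD/NFKD leave it decomposed.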

12.5.4. Tricks

  • In some languages the reported string length is not the number of characters: a language whose strings are made of UTF-16 code units counts a single character outside the BMP as two.

  • Some languages mishandle string reversal when multi-unit encodings such as UTF-16 surrogate pairs (or combining sequences) are involved. Both pitfalls are illustrated in the sketch below.
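
A rough Python illustration (Python's len counts code points, so the UTF-16 code-unit count is derived from the encoded length):

    # 1. "Length" depends on what is counted: one emoji is one code point
    #    but two UTF-16 code units (as in Java or JavaScript).
    s = "😀"
    print(len(s))                             # 1 code point
    print(len(s.encode("utf-16-le")) // 2)    # 2 UTF-16 code units

    # 2. Naive reversal by code point breaks combining sequences:
    t = "He\u030allo"                         # He̊llo
    print(t[::-1])                            # the combining ring now sits on an 'l'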

12.5.5. Security Issues

12.5.5.1. Visual Spoofing

For example, bаidu.com (where the a is U+0430, CYRILLIC SMALL LETTER A) and baidu.com (where the a is U+0061) are visually identical but point to two different domains.

baidu.com written with a fullwidth ａ (U+FF41) and baidu.com with an ASCII a (U+0061) look slightly different but resolve to the same domain, because IDNA processing folds the fullwidth letter to its ASCII form.
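
A small Python illustration of the first pair (the strings are only compared locally, nothing is resolved):

    import unicodedata

    real = "baidu.com"                   # ASCII a (U+0061)
    spoof = "b\u0430idu.com"             # Cyrillic а (U+0430) -- looks identical

    print(real == spoof)                 # False: different code points, different domain
    print(unicodedata.name(spoof[1]))    # CYRILLIC SMALL LETTER A
    print(len(real), len(spoof))         # 9 9 -- nothing visibly suspicious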

This phenomenon can lead to spoofing or WAF-bypass problems.

12.5.5.2. Best Fit

If text is converted between two encodings and a character in the source has no exact counterpart in the target encoding, the conversion routine may substitute the character it considers the best fit instead of failing.

When wide characters are narrowed to single-byte characters in this way, the character value can silently change.
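
Best-fit mapping itself is platform- and API-specific (for example, Windows WideCharToMultiByte picking the closest narrow character), so the sketch below is only a rough analogue: NFKC folding, which collapses fullwidth characters to their ASCII counterparts in a similar spirit.

    import unicodedata

    print(unicodedata.normalize("NFKC", "\uff07"))   # ' (U+FF07 -> U+0027)
    print(unicodedata.normalize("NFKC", "\uff1c"))   # < (U+FF1C -> U+003C)
    # If a filter inspects the wide form but a later stage narrows it,
    # the ASCII metacharacter only appears after the filter has run.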

This phenomenon may enable WAF bypasses.

12.5.5.3. Syntax Spoofing

A URL can be constructed so that it is syntactically a single domain name, while the character that appears to separate its path components is not a real '/' but U+2044 ( ⁄ , FRACTION SLASH). This can cause UI-spoofing problems.
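
A hedged illustration (the host names below are made up for the example, and no claim is made about how any particular browser parses them):

    # U+2044 (FRACTION SLASH) looks like "/", so the string below reads as
    # "example.com" followed by a path, while the slash-looking characters
    # are actually part of one hostname-like token ending in evil.example.
    url = "http://example.com\u2044login\u2044index.html.evil.example"
    print(url)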

12.5.5.4. Punycode Spoofs

Some browsers display the Punycode (xn--) form of an internationalized domain name directly, but UI spoofing can still be achieved through this mechanism.
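
A hedged sketch using Python's built-in IDNA (2003) codec; the exact xn-- label is not spelled out here:

    # The spoofed name from the previous section converts to an ACE
    # ("xn--") form, which is what such browsers would display.
    spoof = "b\u0430idu.com"              # Cyrillic а instead of ASCII a
    print(spoof.encode("idna"))           # e.g. b'xn--bidu-....com'
    print("baidu.com".encode("idna"))     # b'baidu.com' -- pure ASCII is unchanged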

12.5.5.5. Buffer Overflows

During case conversion and similar transformations, some characters expand into several characters, e.g. Fluß → FLUSS (uppercasing) and fluss (case folding). If a buffer was sized from the original string length, this expansion can lead to a buffer overflow.
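
A short Python illustration of the length change:

    s = "Fluß"
    print(len(s), len(s.upper()), len(s.casefold()))   # 4 5 5
    print(s.upper(), s.casefold())                     # FLUSS fluss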

12.5.6. Common Payloads

12.5.6.1. URL

  • ‥ (U+2025 TWO DOT LEADER)

  • ︰ (U+FE30 PRESENTATION FORM FOR VERTICAL TWO DOT LEADER)

  • 。 (U+3002 IDEOGRAPHIC FULL STOP)

  • ⓪ (U+24EA CIRCLED DIGIT ZERO)

  • ／ (U+FF0F FULLWIDTH SOLIDUS)

  • ｐ (U+FF50 FULLWIDTH LATIN SMALL LETTER P)

  • ʰ (U+02B0 MODIFIER LETTER SMALL H)

  • ª (U+00AA FEMININE ORDINAL INDICATOR)

12.5.6.2. SQL Injection

  • ＇ (U+FF07 FULLWIDTH APOSTROPHE)

  • ＂ (U+FF02 FULLWIDTH QUOTATION MARK)

  • ﹣ (U+FE63 SMALL HYPHEN-MINUS)

12.5.6.3. XSS

  • ＜ (U+FF1C FULLWIDTH LESS-THAN SIGN)

  • ＂ (U+FF02 FULLWIDTH QUOTATION MARK)

12.5.6.4. Command Injection

  • ＆ (U+FF06 FULLWIDTH AMPERSAND)

  • ｜ (U+FF5C FULLWIDTH VERTICAL LINE)

12.5.6.5. Template Injection

  • ﹛ (U+FE5B SMALL LEFT CURLY BRACKET)

  • ［ (U+FF3B FULLWIDTH LEFT SQUARE BRACKET)