12.5. Unicode

12.5.1. Basic Concepts

12.5.1.1. BMP

The BMP (Basic Multilingual Plane) is Plane 0 of the Unicode code space, covering the code points U+0000 to U+FFFF.

12.5.1.2. Code planes

Unicode code points are divided into 17 planes, each containing 2^16 (i.e. 65536) code points. The code points of the 17 planes can be written as U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10.
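
As a rough Python sketch (the function name is illustrative), the plane of a character is simply its code point divided by 0x10000:

    # Each plane holds 0x10000 code points, so integer division by 0x10000
    # gives the plane number (0 = BMP, 1 = SMP, ..., 16).
    def plane(ch):
        return ord(ch) // 0x10000

    print(plane("A"))     # 0  (U+0041, in the BMP)
    print(plane("😀"))    # 1  (U+1F600, Supplementary Multilingual Plane)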

12.5.1.3. Code Point

A code point (also called a code position) is any of the numerical values that make up the code space; in Unicode they range from U+0000 to U+10FFFF.

12.5.1.4. Code Unit

A code unit is the minimal unit a given Unicode encoding form uses to represent a code point: UTF-8 uses 8-bit code units (one byte), UTF-16 and UCS-2 use 16-bit code units (two bytes), and UCS-4/UTF-32 use 32-bit code units (four bytes). A single code point may require several code units, e.g. up to four bytes in UTF-8.
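
A small Python illustration of how many code units the same characters occupy under different encoding forms (the sample characters are arbitrary):

    # UTF-8 uses 8-bit units, UTF-16 uses 16-bit units, UTF-32 uses 32-bit units.
    for ch in ("A", "é", "中", "😀"):
        print(
            f"U+{ord(ch):04X}",
            len(ch.encode("utf-8")),           # UTF-8 code units (bytes)
            len(ch.encode("utf-16-le")) // 2,  # UTF-16 code units
            len(ch.encode("utf-32-le")) // 4,  # UTF-32 code units (always 1)
        )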

12.5.1.5. Surrogate Pair

Surrogate pairs are the mechanism that keeps UTF-16 backward compatible with UCS-2. The code points 0xD800-0xDBFF (high surrogates) and 0xDC00-0xDFFF (low surrogates) are reserved inside the UCS-2 range; a high surrogate followed by a low surrogate forms four bytes that represent a single character beyond the BMP. Each surrogate range contains 1024 code points, so a surrogate pair can express 1024 x 1024 = 1048576 = 0x100000 characters.
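
A minimal Python sketch of the surrogate-pair arithmetic (the helper name to_surrogate_pair is illustrative); the result matches the raw UTF-16 encoding:

    def to_surrogate_pair(code_point):
        assert 0x10000 <= code_point <= 0x10FFFF
        offset = code_point - 0x10000       # 20 bits remain after the offset
        high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
        low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
        return high, low

    high, low = to_surrogate_pair(0x1F600)  # U+1F600 (😀)
    print(hex(high), hex(low))              # 0xd83d 0xde00
    print("😀".encode("utf-16-be").hex())   # d83dde00 -- the same pair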

12.5.1.6. Combining Character

For example, the string He̊llö contains accented characters. Assigning a separate code point to every possible combination of base letter and accent would require a huge number of code points, so such characters can instead be built by combining a base character with one or more combining marks.
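
A rough Python illustration: a base letter followed by a combining mark renders as one visible character but consists of two code points:

    import unicodedata

    s = "e\u030a"                        # e̊ : "e" + combining ring above
    print(len(s))                        # 2 code points, one perceived character
    print(unicodedata.name("\u030a"))    # COMBINING RING ABOVE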

12.5.1.7. BOM

A byte-order mark (BOM) is the Unicode character U+FEFF used with a special meaning: when a string is encoded in UTF-16 or UTF-32, this character is placed at the start of the text to indicate its endianness. It is also commonly used as a signature to indicate that a text stream is UTF-encoded.
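
A short Python illustration (the byte order produced by the endian-unspecified 'utf-16' codec is shown for a little-endian build):

    import codecs

    print(codecs.BOM_UTF16_LE.hex())       # fffe
    print(codecs.BOM_UTF16_BE.hex())       # feff
    print("A".encode("utf-16").hex())      # fffe4100 -- BOM prepended automatically
    print("\ufeff".encode("utf-8").hex())  # efbbbf -- the UTF-8 "BOM" signature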

12.5.2. Encoding

12.5.2.1. UCS-2

UCS-2 (2-byte Universal Character Set) is a fixed-length encoding: every code point is represented by exactly one 16-bit value, so only code points in the range 0 to 0xFFFF (the BMP) can be encoded.

12.5.2.2. UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and a prefix code. It can encode every valid code point in the Unicode character set in one to four bytes and is part of the Unicode standard. (The table below also lists the five- and six-byte sequences of the original design; RFC 3629 restricts UTF-8 to at most four bytes, i.e. code points up to U+10FFFF.) The encoding is as follows:

Bits  First code point  Last code point  Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+0000            U+007F           1      0xxxxxxx
11    U+0080            U+07FF           2      110xxxxx  10xxxxxx
16    U+0800            U+FFFF           3      1110xxxx  10xxxxxx  10xxxxxx
21    U+10000           U+1FFFFF         4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+200000          U+3FFFFFF        5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+4000000         U+7FFFFFFF       6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
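
A quick Python check of the three-byte row, using the euro sign U+20AC as the sample character:

    # 0x20AC needs 16 bits, so it is encoded with the pattern
    # 1110xxxx 10xxxxxx 10xxxxxx.
    print(" ".join(f"{b:08b}" for b in "€".encode("utf-8")))
    # 11100010 10000010 10101100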

12.5.2.3. UTF-16

UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2. It uses one or two 16-bit code units per code point and can encode every code point from 0 to 0x10FFFF; code points beyond the BMP are encoded as surrogate pairs.

12.5.3. Equivalence problems

12.5.3.1. Introduction

Unicode contains many special characters, and to remain compatible with existing standards some of them are functionally equivalent to other characters or to sequences of characters. Unicode therefore defines certain sequences of code points to be equivalent.

Unicode provides two notions of equivalence: canonical equivalence and compatibility equivalence, the former being a subset of the latter. For example, the character n followed by the combining tilde (U+0303) is equivalent, both canonically and compatibly, to the single character ñ, whereas the ligature ﬀ is only compatibly equivalent to the two-character sequence ff.

Unicode normalization is a form of text normalization: sequences that are equivalent to one another are converted into a single representative sequence, called a normal form in the Unicode standard.

For each notion of equivalence, Unicode defines two normal forms, one fully composed and one fully decomposed, giving four forms in total: NFC, NFD, NFKC and NFKD. Normalization is important for software that processes Unicode text, because it affects the outcome of comparison, searching and sorting.
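
A minimal Python sketch of canonical equivalence using the standard unicodedata module (the sample character ñ matches the example above):

    import unicodedata

    composed = "\u00f1"       # ñ (precomposed)
    decomposed = "n\u0303"    # n + combining tilde

    print(composed == decomposed)                                 # False
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True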

12.5.3.2. Canonical Equivalence

Canonical equivalence in Unicode is based on the interplay of character composition and decomposition. Composition combines simple characters into fewer precomposed characters, for example the character n plus the combining tilde composes into ñ; decomposition is the reverse process, turning precomposed characters back into their components.

Canonical equivalence preserves both visual and functional equivalence. For example, a letter carrying a diacritic is canonically equivalent to the decomposed sequence of the base letter and the combining diacritic: the precomposed character 'ü' and the sequence of 'u' followed by the combining diaeresis (U+0308) are canonically equivalent. Similarly, Unicode unifies some Greek punctuation marks with identical-looking ASCII punctuation, such as the Greek question mark and the semicolon.

12.5.3.3. Compatibility Equivalence

Compatibility equivalence is broader than canonical equivalence: any canonically equivalent sequence is also compatibly equivalent, but not the other way around. Compatibility equivalence is concerned more with plain-text equivalence and may merge forms that are semantically distinct.

For example, a superscript digit and the plain digit it is based on are compatibility equivalents but not canonical equivalents. The rationale is that, although superscript and subscript forms sometimes carry different meanings, it is reasonable (if visually lossy) for applications to treat them as the same character, which lets superscripts and subscripts be represented in plain Unicode text in a less cumbersome way (see the next section).

Full-width and half-width katakana forms are likewise compatibility equivalents but not canonical equivalents, as are ligatures and the sequences of their component letters. The difference is purely visual rather than semantic: authors do not normally claim that using a ligature conveys one meaning and not using it another; it is simply a typographic choice.

Software that searches or sorts Unicode strings must take equivalence into account; otherwise users would be unable to find glyphs that are visually indistinguishable from what they typed.

Unicode provides a standard normalization algorithm that produces a unique sequence for all equivalent sequences. The equivalence criterion can be canonical (NF) or compatibility (NFK). Since the representative of an equivalence class could in principle be chosen arbitrarily, more than one normal form is possible for each criterion; Unicode defines two per criterion: NFC and NFKC for composition, and NFD and NFKD for decomposition. In both the composed and the decomposed form, combining marks are placed in a canonical order, which makes each normal form unique.

12.5.3.4. Normalization

To compare or search Unicode strings, software can use either the composed or the decomposed form; which one does not matter, as long as all the strings being compared or searched are in the same form. The choice of equivalence notion, however, does affect the result. For example, the ligature ﬃ (U+FB03), the Roman numeral Ⅸ (U+2168) and even the superscript digit ⁵ (U+2075) have their own Unicode code points. Canonical normalization leaves them unchanged, but compatibility normalization decomposes ﬃ into f, f, i, so a search for U+0066 (f) succeeds under NFKC but fails under NFC. The same goes for searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ, and '⁵' likewise becomes '5'.
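
The following Python sketch reproduces this search behaviour with the standard unicodedata module:

    import unicodedata

    ligature = "\ufb03"                                      # ﬃ
    print("f" in unicodedata.normalize("NFC", ligature))     # False
    print("f" in unicodedata.normalize("NFKC", ligature))    # True

    print(unicodedata.normalize("NFKC", "\u2075"))           # 5  (from ⁵)
    print(unicodedata.normalize("NFKC", "\u2168"))           # IX (from Ⅸ)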

For a browser, converting superscripts into baseline digits may be undesirable, because the superscript information is lost. To allow for this difference, the Unicode character database records a compatibility formatting tag alongside each compatibility decomposition: for ligatures the tag is simply <compat>, while for superscripts it is <super>. Rich-text formats such as HTML can preserve the distinction instead; for example, HTML uses its own markup to place a '5' in superscript position.

12.5.3.5. Normal Forms

  • NFD (Normalization Form Canonical Decomposition): characters are decomposed by canonical equivalence.

  • NFC (Normalization Form Canonical Composition): characters are decomposed by canonical equivalence and then recomposed by canonical equivalence. Because of singletons, the recomposed result may differ from the original character.

  • NFKD (Normalization Form Compatibility Decomposition): characters are decomposed by compatibility equivalence.

  • NFKC (Normalization Form Compatibility Composition): characters are decomposed by compatibility equivalence and then recomposed by canonical equivalence (see the sketch after this list).
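
A minimal sketch of the four forms applied to one sample string (the sample characters, the ligature ﬁ and a decomposed é, are arbitrary):

    import unicodedata

    s = "\ufb01 e\u0301"     # "ﬁ" (U+FB01) + space + "e" + combining acute
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        out = unicodedata.normalize(form, s)
        print(form, [f"U+{ord(c):04X}" for c in out])
    # NFC/NFD keep the ligature; NFKC/NFKD expand it to "f" + "i".
    # NFC/NFKC compose e + U+0301 into U+00E9; NFD/NFKD leave it decomposed.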

12.5.4. Tricks

  • In some languages the reported string length is not the number of characters: a language whose strings are made of UTF-16 code units counts a single character outside the BMP as two.

  • Some languages mishandle string reversal when multi-unit encodings such as UTF-16 surrogate pairs (or combining sequences) are involved. Both pitfalls are illustrated in the sketch below.
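
A rough Python illustration (Python's len counts code points, so the UTF-16 code-unit count is derived from the encoded length):

    # 1. "Length" depends on what is counted: one emoji is one code point
    #    but two UTF-16 code units (as in Java or JavaScript).
    s = "😀"
    print(len(s))                             # 1 code point
    print(len(s.encode("utf-16-le")) // 2)    # 2 UTF-16 code units

    # 2. Naive reversal by code point breaks combining sequences:
    t = "He\u030allo"                         # He̊llo
    print(t[::-1])                            # the combining ring now sits on an 'l'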

12.5.5. Security Issues

12.5.5.1. Visual Spoofing

For example, bаidu.com (where the a is U+0430, CYRILLIC SMALL LETTER A) and baidu.com (where the a is U+0061) are visually identical but point to two different domains.

baidu.com written with a fullwidth ａ (U+FF41) and baidu.com with an ASCII a (U+0061) look slightly different but resolve to the same domain, because IDNA processing folds the fullwidth letter to its ASCII form.
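
A small Python illustration of the first pair (the strings are only compared locally, nothing is resolved):

    import unicodedata

    real = "baidu.com"                   # ASCII a (U+0061)
    spoof = "b\u0430idu.com"             # Cyrillic а (U+0430) -- looks identical

    print(real == spoof)                 # False: different code points, different domain
    print(unicodedata.name(spoof[1]))    # CYRILLIC SMALL LETTER A
    print(len(real), len(spoof))         # 9 9 -- nothing visibly suspicious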

This phenomenon can lead to spoofing or WAF-bypass problems.

12.5.5.2. Best Fit

If text is converted between two encodings and a character in the source has no exact counterpart in the target encoding, the conversion routine may substitute the character it considers the best fit instead of failing.

When wide characters are narrowed to single-byte characters in this way, the character value can silently change.
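
Best-fit mapping itself is platform- and API-specific (for example, Windows WideCharToMultiByte picking the closest narrow character), so the sketch below is only a rough analogue: NFKC folding, which collapses fullwidth characters to their ASCII counterparts in a similar spirit.

    import unicodedata

    print(unicodedata.normalize("NFKC", "\uff07"))   # ' (U+FF07 -> U+0027)
    print(unicodedata.normalize("NFKC", "\uff1c"))   # < (U+FF1C -> U+003C)
    # If a filter inspects the wide form but a later stage narrows it,
    # the ASCII metacharacter only appears after the filter has run.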

This phenomenon may enable WAF bypasses.

12.5.5.3. Syntax Spoofing

A URL can be constructed so that it is syntactically a single domain name, while the character that appears to separate its path components is not a real '/' but U+2044 ( ⁄ , FRACTION SLASH). This can cause UI-spoofing problems.
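
A hedged illustration (the host names below are made up for the example, and no claim is made about how any particular browser parses them):

    # U+2044 (FRACTION SLASH) looks like "/", so the string below reads as
    # "example.com" followed by a path, while the slash-looking characters
    # are actually part of one hostname-like token ending in evil.example.
    url = "http://example.com\u2044login\u2044index.html.evil.example"
    print(url)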

12.5.5.4. Punycode Spoofs

Some browsers display the Punycode (xn--) form of an internationalized domain name directly, but UI spoofing can still be achieved through this mechanism.
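
A hedged sketch using Python's built-in IDNA (2003) codec; the exact xn-- label is not spelled out here:

    # The spoofed name from the previous section converts to an ACE
    # ("xn--") form, which is what such browsers would display.
    spoof = "b\u0430idu.com"              # Cyrillic а instead of ASCII a
    print(spoof.encode("idna"))           # e.g. b'xn--bidu-....com'
    print("baidu.com".encode("idna"))     # b'baidu.com' -- pure ASCII is unchanged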

12.5.5.5. Buffer Overflows

During case conversion and similar transformations, some characters expand into several characters, e.g. Fluß → FLUSS (uppercasing) and fluss (case folding). If a buffer was sized from the original string length, this expansion can lead to a buffer overflow.
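
A short Python illustration of the length change:

    s = "Fluß"
    print(len(s), len(s.upper()), len(s.casefold()))   # 4 5 5
    print(s.upper(), s.casefold())                     # FLUSS fluss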

12.5.6. Common Payloads

12.5.6.1. URL

  • ‥ (U+2025 TWO DOT LEADER)

  • ︰ (U+FE30 PRESENTATION FORM FOR VERTICAL TWO DOT LEADER)

  • 。 (U+3002 IDEOGRAPHIC FULL STOP)

  • ⓪ (U+24EA CIRCLED DIGIT ZERO)

  • ／ (U+FF0F FULLWIDTH SOLIDUS)

  • ｐ (U+FF50 FULLWIDTH LATIN SMALL LETTER P)

  • ʰ (U+02B0 MODIFIER LETTER SMALL H)

  • ª (U+00AA FEMININE ORDINAL INDICATOR)

12.5.6.2. SQL Injection

  • ＇ (U+FF07 FULLWIDTH APOSTROPHE)

  • ＂ (U+FF02 FULLWIDTH QUOTATION MARK)

  • ﹣ (U+FE63 SMALL HYPHEN-MINUS)

12.5.6.3. XSS

  • ＜ (U+FF1C FULLWIDTH LESS-THAN SIGN)

  • ＂ (U+FF02 FULLWIDTH QUOTATION MARK)

12.5.6.4. Command Injection

  • ＆ (U+FF06 FULLWIDTH AMPERSAND)

  • ｜ (U+FF5C FULLWIDTH VERTICAL LINE)

12.5.6.5. Template Injection

  • ﹛ (U+FE5B SMALL LEFT CURLY BRACKET)

  • ［ (U+FF3B FULLWIDTH LEFT SQUARE BRACKET)