UTF-16
UTF-16 is a character encoding defined by Unicode in which each character is encoded as a sequence of one or two 16-bit words.
The encoding was originally defined in report 17 to the Unicode standard. That annex has since become obsolete, because UTF-16 is now an integral part of the Unicode standard, in its chapter 3, Conformance, which defines it very strictly.
UTF-16 is not the same as UCS-2, which is the simpler encoding of each character on two bytes. Both are nevertheless often called Unicode, because the two encodings are identical as long as one does not use the range U+D800 to U+DFFF (reserved in principle) or the code points beyond U+FFFF (little used in the West).
Description
The number of each character (its code point) is given by the Unicode standard. The code points that can be represented must lie in the valid interval U+0000 to U+10FFFF, and must not be assigned to a noncharacter. All possible characters in Unicode have such code points.
Any code point that is not a noncharacter and whose value fits in 2 bytes (16 bits), i.e. all code points from U+0000 to U+D7FF and from U+E000 to U+FFFD, is stored in a single 16-bit word, whose 5 most significant bits cannot be equal to 11011 (since the reserved range U+D800 to U+DFFF is excluded).
In all other cases, the character is a code point of a supplementary plane (thus between U+10000 and U+10FFFD, and whose 16 least significant bits must not equal 0xFFFE or 0xFFFF); it is then stored in 2 successive 16-bit words (code units), whose values correspond to the code points reserved in the surrogate half-zones allocated in the Basic Multilingual Plane of the Unicode and ISO/IEC 10646 standards:
- the first word has its 6 most significant bits equal to 110110 and therefore lies in the interval 0xD800 to 0xDBFF (in hexadecimal notation); this word contains in its 10 least significant bits the 10 most significant bits of the difference (represented on 20 bits) between the code point to be stored and the first supplementary code point U+10000;
- the second word has its 6 most significant bits equal to 110111 and therefore lies in the interval 0xDC00 to 0xDFFF (in hexadecimal notation); this word contains in its 10 least significant bits the 10 least significant bits of the code point to be stored.
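The rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `encode_code_point` is hypothetical, and the noncharacter checks described in the text are left out for brevity.

```python
def encode_code_point(cp: int) -> list:
    """Return the 16-bit words (code units) that encode one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("the range U+D800..U+DFFF cannot be encoded")
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: a single 16-bit word.
        return [cp]
    # Supplementary plane: 20-bit difference from U+10000.
    diff = cp - 0x10000
    first = 0xD800 | (diff >> 10)     # 110110 followed by the 10 high bits
    second = 0xDC00 | (diff & 0x3FF)  # 110111 followed by the 10 low bits
    return [first, second]
```

For instance, `encode_code_point(0x1D11E)` yields the two words 0xD834 and 0xDD1E, while `encode_code_point(0x41)` yields the single word 0x0041.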
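Then, depending on how the 16-bit words are stored in an ordered stream of bytes, two schemes are possible for the final encoding:
- big-endian (UTF-16BE): the most significant byte of each word is stored first;
- little-endian (UTF-16LE): the least significant byte of each word is stored first.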
The type of encoding used (the byte order) can be implicit in the protocol used, or specified explicitly by that protocol (for example by indicating the reserved names "UTF-16BE" or "UTF-16LE" in a MIME charset header). If the protocol does not make it possible to specify the byte order, and if it allows either alternative, the UTF-16 encoding of the valid code point U+FEFF can be placed at the head of the data stream as an indicator (because swapping its bytes when the stream is read yields the code point U+FFFE, which is valid in Unicode but assigned to a noncharacter and thus prohibited in any UTF-16 stream). This code point, so represented (called the byte order mark, abbreviated BOM), is encoded only at the beginning of the data stream, and makes it possible to know how the stream was encoded:
- the bytes 0xFE 0xFF indicate big-endian encoding;
- the bytes 0xFF 0xFE indicate little-endian encoding.
If one of the two two-byte sequences is present at the head of the stream, the type of encoding is deduced from it and the sequence is removed from the stream: it does not represent any character of the text stored in the data stream. If neither sequence appears at the head of the data stream, the Unicode standard specifies that the stream must be decoded as big-endian (UTF-16BE).
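As a rough illustration of this detection rule, the following Python sketch inspects the first two bytes of a stream, strips a BOM if one is found, and falls back to big-endian otherwise. The names `detect_byte_order` and `to_code_units` are hypothetical helpers, not part of any standard library.

```python
def detect_byte_order(data: bytes):
    """Return (byte_order, payload); byte_order is 'big' or 'little'."""
    if data[:2] == b"\xFE\xFF":
        return "big", data[2:]      # BOM U+FEFF stored big-endian
    if data[:2] == b"\xFF\xFE":
        return "little", data[2:]   # BOM U+FEFF stored little-endian
    return "big", data              # no BOM: big-endian, as stated above


def to_code_units(data: bytes) -> list:
    """Split a byte stream into 16-bit words, honoring an initial BOM."""
    order, payload = detect_byte_order(data)
    if len(payload) % 2:
        raise ValueError("odd number of bytes in a UTF-16 stream")
    return [int.from_bytes(payload[i:i + 2], order)
            for i in range(0, len(payload), 2)]
```

With this sketch, the streams `b"\xFF\xFEA\x00"`, `b"\xFE\xFF\x00A"` and `b"\x00A"` all produce the single word 0x0041. A U+FEFF found later in the payload is kept as an ordinary word.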
Elsewhere than at the beginning of the stream (including after an initial BOM), these sequences are not recognized as encoding a BOM, and decoding continues with a single type of encoding; thus, if these sequences appear after the beginning, then:
- either the text does contain the character U+FEFF (zero-width no-break space, abbreviated ZWNBSP), if the 16-bit word is encoded in the correct byte order,
- or the stream is invalid and does not contain text conforming to the Unicode and ISO/IEC 10646 standards.
Likewise, the stream must be regarded as invalid, and as not containing text conforming to Unicode, if it contains a 16-bit word between 0xD800 and 0xDBFF not immediately followed by a word between 0xDC00 and 0xDFFF, or if it contains a 16-bit word between 0xDC00 and 0xDFFF not immediately preceded by a word between 0xD800 and 0xDBFF, or if decoding reveals the code point of any other noncharacter.
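These validity rules can be sketched as a small decoder over a list of 16-bit words. The function name `decode_code_units` is hypothetical, and the noncharacter test below is an assumption that covers only the code points ending in 0xFFFE or 0xFFFF mentioned in this article.

```python
def decode_code_units(units):
    """Decode 16-bit words into code points, rejecting invalid streams."""
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # First word of a pair: a second word must follow immediately.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("first word of a pair without its second word")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("second word of a pair without its first word")
        else:
            cp = u
            i += 1
        # Assumption: only the U+xxFFFE / U+xxFFFF noncharacters are checked.
        if (cp & 0xFFFE) == 0xFFFE:
            raise ValueError("noncharacter in stream")
        code_points.append(cp)
    return code_points
```

Under these assumptions, `decode_code_units([0xD834, 0xDD1E])` returns the single code point 0x1D11E, a U+FEFF in the middle of the list is kept as a character, and a word 0xFFFE (a byte-swapped BOM) raises an error.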
See also
Internal links
- UTF-8, UTF-32
- Unicode, ISO/IEC 10646
- ISO 646, ASCII, ISO 8859, ISO 8859-1
- Endianness
External links
- "Conformance", The Unicode Standard 4.0, chapter 3.2 pp. 60-61, chapter 3.4 pp. 64-65 and chapter 3.9 pp. 73-81, August 2003, ISBN 0-321-18578-1.