Unicode is a computing standard, developed by the Unicode Consortium, which aims to give every character of every writing system a name and a numeric identifier, in a unified way, regardless of the computing platform or the software.
Now fully compatible with ISO/IEC 10646, the Unicode standard adds to it a complete model of text representation and processing: it gives each character a set of normative or informative properties, describes precisely the semantic relations that can exist between successive characters of a text, and standardizes processing algorithms that preserve as far as possible the semantics of the transformed texts, while extending the interoperability of the representation of these texts across heterogeneous systems.
It can be said today that the Unicode standard fully includes ISO/IEC 10646 as a subset, since the latter standardizes only the individual characters, assigning each a name, a normative number and a very limited informative description, but no processing and no specification or recommendation for their use in writing real languages, which only the Unicode standard defines precisely. ISO/IEC 10646 nevertheless gives Unicode the status of an approved international standard for the encoding of texts; Unicode is also a de facto standard for the processing of these texts, and today serves as the basis for many other standards.
These code pages did indeed pose some problems. For example, when a “currency symbol” character was provided for, the same text authorizing an expenditure in dollars in the United States could, once transmitted by email to the United Kingdom, authorize the same expenditure in pounds sterling, without anything in the text having been modified!
In practice, not all writing systems are present yet, because documentary research with specialists may still prove necessary for rare characters or little-known writing systems (because they have disappeared, for example).
However, the most widely used writing systems in the world are represented, along with rules on the semantics of the characters, their composition and the way to combine these different systems: for example, how to insert a right-to-left script into a left-to-right script (bidirectional text).
Unicode does, however, address the questions of case, alphabetical sorting, and the combination of accents and characters. Since version 1.1 of Unicode and in all later versions, characters have the same identifiers as in ISO/IEC 10646: the repertoires are maintained in parallel and are identical at the time of their final standardization, the two standards being updated simultaneously. The Unicode (since version 1.1) and ISO/IEC 10646 standards guarantee full upward compatibility: any text conforming to an earlier version must remain conforming in later versions.
Thus the characters of Unicode version 3.0 are those of ISO/IEC 10646:2000. Unicode version 3.2 catalogued 95,221 characters, symbols and directives.
Version 4.1 of Unicode, updated in November 2005, contains:
Some problems do seem to remain, however, in the encoding of Chinese characters, because of the unification of the ideographic sets used in different languages, with slightly different graphic forms and sometimes meanings; they are being resolved by Unicode, which has defined variation selectors and opened a registry of standardized sequences that use them.
Version 5.0 was published in July 2006.
For example, the character Ç is named "Latin capital letter C with cedilla".
This definition is completely identical to that of ISO/IEC 10646, which approves any extension of the repertoire. Note that Unicode includes only the normative English names in the text of its standard, whereas ISO/IEC 10646 is published in two equally normative languages. Thus the French and English names are both standardized.
In practice, any extension of the repertoire is now carried out jointly by working group WG2 of ISO/IEC 10646 (whose voting members are exclusively national standardization bodies from countries around the world, or their official representatives) and the Unicode Technical Committee, UTC (whose voting members may be any private or public-interest organization, or even a government, that has joined and pays an annual fee entitling it to take part in these decisions).
This number, the code point, is written U+xxxx, where xxxx is in hexadecimal, and has 4 digits for all code points of the Basic Multilingual Plane (thus between U+0000 and U+FFFF), 5 digits for the following 15 planes (between U+10000 and U+FFFFF), or 6 digits for the last plane (between U+100000 and U+10FFFF).
Thus, the character named "Latin capital letter C with cedilla" has the index U+00C7.
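As an illustration of this notation, the following Python sketch formats an integer as U+xxxx and derives its plane number. The helper names format_code_point and plane_of are hypothetical, chosen for this example; only ord and unicodedata.name come from the Python standard library.

```python
import unicodedata

def format_code_point(cp):
    """Return the U+xxxx notation: at least 4 uppercase hexadecimal digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the range U+0000..U+10FFFF")
    return "U+%04X" % cp

def plane_of(cp):
    """Return the plane number (0 to 16); each plane holds 0x10000 code points."""
    return cp >> 16

c_cedilla = "\u00c7"
# Prints: U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
print(format_code_point(ord(c_cedilla)), unicodedata.name(c_cedilla))
# Prints: U+10FFFF is in plane 16
print(format_code_point(0x10FFFF), "is in plane", plane_of(0x10FFFF))
```

The name returned by unicodedata.name is the normative English name; the French names standardized by ISO/IEC 10646 are not available through this module.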
Note that all code points between U+0000 and U+10FFFF are valid, even though some are reserved and not yet assigned to characters, and some are assigned to noncharacters (for example U+FFFE and U+FFFF), whose use is forbidden in text, or are reserved so that any text can be encoded in conformity with one of the standard Unicode transformation formats (see UTF-16, below).
Note also that Unicode (or ISO/IEC 10646) has assigned many code points to characters that are valid but whose semantics are unknown because they are for private use (for example, the last two planes, between U+F0000 and U+10FFFF, are entirely dedicated to this use, except for the two code points at the end of each plane, which are noncharacters forbidden in conforming text).
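The two categories described above can be tested with simple range checks. The sketch below is illustrative only: the function names are hypothetical, and the ranges U+E000 to U+F8FF (private use in the basic plane) and U+FDD0 to U+FDEF (additional noncharacters) are taken from the Unicode standard rather than from the text above.

```python
def is_private_use(cp):
    """True for private-use code points (basic plane area and planes 15 and 16)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def is_noncharacter(cp):
    """True for the noncharacters: U+FDD0..U+FDEF and the last two code points of each plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # (cp & 0xFFFE) == 0xFFFE matches U+nFFFE and U+nFFFF in every plane n.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

print(is_private_use(0xF0000), is_noncharacter(0xFFFFE))
print(sum(is_noncharacter(cp) for cp in range(0x110000)), "noncharacters in total")
```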
Here again, the standardization of the encoding, that is, the assignment of code points to the characters of the common repertoire, is a joint decision shared between the Unicode and ISO/IEC 10646 standards. Every character of the repertoire has a unique code point (even if, for certain languages or for Unicode, certain characters are considered equivalent).
Note that although the repertoire of characters is extensible, it is so only within the upper limit of the code points that the encoding allows to be assigned to coded characters. A large majority of the possible code points are not assigned to any particular character, but may become so at any time.
These still-unassigned code points are therefore not considered invalid; they represent as many abstract characters (not yet specified, and temporarily reserved). These abstract characters (like the private-use characters) complement the coded character set of the standardized repertoire to form a single set called the “universal coded character set” (Universal Coded Character Set, often abbreviated UCS), which contains all the coded character sets of the repertoires of every past, present and future version of ISO/IEC 10646 and Unicode (since version 1.1 only).
It is also here that the byte order mark (BOM) is specified, which makes it possible to indicate at the beginning of a file whether it is big-endian or little-endian. On the Internet it is seldom used; an explicit label is preferred (“charset=UTF-16BE” in MIME, for example, to indicate a big-endian data stream, where BE stands for Big Endian).
There can also be a further layer of encoding, as with LDAP, which specifies that Unicode strings must be encoded in UTF-8 and then re-encoded in Base64.
Where ASCII formerly used 7 bits and ISO 8859-1 used 8 bits (like most national code pages), Unicode, which brings together the characters of every code page, needed more than the 8 bits of one byte. The limit was initially set at 16 bits for the first versions of Unicode, and at 32 bits for the first versions of ISO/IEC 10646.
The current limit is now set between 20 and 21 bits per code point assigned to the characters standardized in the two standards, which are now mutually compatible:
These transformations were originally created for the internal representation and the encoding schemes of the code points of ISO 10646, which at first could define code points on 31 bits. Since then, ISO/IEC 10646 has been amended so that the three forms are fully compatible with one another and can encode all code points (because UTF-16 can represent only the code points of the 17 planes).
Unicode has also standardized very strictly these three transformation formats for all valid code points (U+0000 to U+D7FF and U+E000 to U+10FFFF), and only those, whether the text is represented as sequences of code points assigned to valid characters, of reserved code points, or of code points assigned to noncharacters. Note that the code points assigned to the surrogate areas (U+D800 to U+DFFF), used only in UTF-16, are invalid in isolation, since they serve to represent, as a pair of two 16-bit code units, the code points of the 16 supplementary planes.
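The two valid ranges can be checked directly, and the behaviour can be compared with an existing implementation. The following is a minimal sketch: is_valid_for_utf and encodes_in_all_forms are hypothetical helper names, and the second function relies on the fact that Python's built-in utf-8, utf-16-le and utf-32-le codecs refuse to encode an isolated surrogate.

```python
def is_valid_for_utf(cp):
    """True if cp lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def encodes_in_all_forms(cp):
    """True if Python's codecs round-trip chr(cp) in all three transformation formats."""
    ch = chr(cp)
    try:
        for codec in ("utf-8", "utf-16-le", "utf-32-le"):
            if ch.encode(codec).decode(codec) != ch:
                return False
    except UnicodeEncodeError:
        # Raised for isolated surrogates (U+D800..U+DFFF).
        return False
    return True

for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
    print("U+%04X" % cp, is_valid_for_utf(cp), encodes_in_all_forms(cp))
```

The two functions agree on every sampled value: the 2,048 surrogate code points are the only ones excluded, which leaves 1,112,064 encodable code points.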
UTF-8 also provides, and this is its main advantage, compatibility with simple handling of ASCII strings in programming languages. Thus programs written in C can often work without modification.
Initially, UTF-8 could encode any code point between U+0000 and U+7FFFFFFF (thus up to 31 bits). This usage is obsolete, and ISO/IEC 10646 has been amended to support only the valid code points of the 17 planes, excluding those of the surrogate area corresponding to the code units used in UTF-16 to represent, on two code units, the code points of the 16 supplementary planes. The longest UTF-8 sequences therefore require at most 4 bytes, instead of 6 previously.
Moreover, UTF-8 was amended, first by Unicode and then by ISO/IEC 10646, to accept only the shortest representation of each code point.
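The 1 to 4 byte layout and the shortest-form rule can be sketched as follows. This is an illustrative encoder, not a replacement for a library codec: utf8_encode and decodes_as_utf8 are hypothetical names, and the result is compared with Python's built-in codec, which rejects overlong sequences.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8, always using the shortest form."""
    if not (0 <= cp <= 0x10FFFF) or (0xD800 <= cp <= 0xDFFF):
        raise ValueError("code point cannot be encoded in UTF-8")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def decodes_as_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(utf8_encode(0x00C7).hex())      # c387
print(decodes_as_utf8(b"/"))          # True: shortest form of U+002F
print(decodes_as_utf8(b"\xc0\xaf"))   # False: overlong 2-byte form of U+002F
```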
Its advantage over UTF-16 is that there is only one possible encoding scheme for transmitting byte sequences over a network of heterogeneous systems. Most standardized exchange protocols today use this transformation because it does not depend on the order of the bytes that make up a longer integer.
In addition, UTF-8 is fully compatible with the transmission of texts over protocols based on the ASCII character set, and can be made compatible (at the cost of a multi-byte transformation of non-ASCII characters) with exchange protocols supporting 8-bit coded character sets (whether based on ISO 8859 or on the many other 8-bit coded character sets defined by particular national standards or proprietary systems).
Its drawback is its highly variable encoding length (1 byte for the code points assigned to ASCII/ISO 646 characters, 2 to 4 bytes for the other code points), even though it is possible to find the start of the encoding of a code point from an arbitrary position in a UTF-8 text (by reading at most 3 of the preceding code units). For this reason, this transformation is seldom used for the internal processing of texts, for which UTF-16, and sometimes UTF-32, is often preferred.
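The backward search mentioned above works because every continuation byte in UTF-8 has the bit pattern 10xxxxxx, while a leading byte never does. A minimal sketch, assuming the input is well-formed UTF-8 (start_of_code_point is a hypothetical name):

```python
def start_of_code_point(data, pos):
    """Return the index of the first byte of the sequence that contains data[pos].

    Steps back over continuation bytes (10xxxxxx), at most 3 times,
    which is enough for any well-formed UTF-8 sequence.
    """
    steps = 0
    while steps < 3 and pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
        steps += 1
    return pos

# "a" (1 byte), U+20AC (3 bytes), U+1D11E (4 bytes): 8 bytes in total.
data = "a\u20ac\U0001d11e".encode("utf-8")
print(start_of_code_point(data, 3))  # 1: last byte of U+20AC, which starts at index 1
print(start_of_code_point(data, 7))  # 4: last byte of U+1D11E, which starts at index 4
```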
Note that one valid Unicode character, the null control character U+0000, is encoded in UTF-8 as a single null byte. This null byte causes problems with the string-processing libraries of the C language, which treat it as the end-of-string marker.
However, the Java platform also uses an additional, specific compact 8-bit format (close to UTF-8, but distinct, since the code point U+0000 is represented there by a sequence of two non-null bytes, a sequence that is normally invalid in standard UTF-8) for certain native exchanges with the C libraries of the supported platform; this format is also used internally in compiled class files, since this alternative (but portable) format likewise does not depend on the internal order of the bytes making up an integer of more than 8 bits, and can thus represent all valid Unicode texts (as well as other, invalid sequences).
However, the code points of the 16 supplementary planes require a transformation into two 16-bit code units:
It is possible to find the start of the encoding sequence from an arbitrary point in a UTF-16 text by reading at most one additional code unit, and only when the current code unit is in the low surrogate area. This form is more economical and faster to process than UTF-8 for representing texts that contain few ASCII characters (U+0000 to U+007F).
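The transformation of a supplementary-plane code point into two 16-bit code units can be sketched as below, following the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two groups of 10. The function names are hypothetical; the result is checked against Python's utf-16-be codec.

```python
def to_surrogates(cp):
    """Split a code point of planes 1 to 16 into (high, low) 16-bit code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need two code units")
    v = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 | (v >> 10)        # top 10 bits, in U+D800..U+DBFF
    low = 0xDC00 | (v & 0x3FF)       # bottom 10 bits, in U+DC00..U+DFFF
    return high, low

def from_surrogates(high, low):
    """Recombine a (high, low) pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1D11E)
print("%04X %04X" % (high, low))                 # D834 DD1E
print("U+%04X" % from_surrogates(high, low))     # U+1D11E
print("\U0001d11e".encode("utf-16-be").hex())    # d834dd1e
```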
However, this transformation has two incompatible encoding schemes, which depend on the byte order used to represent 16-bit integers. To resolve this ambiguity and allow transmission between heterogeneous systems, it is necessary to attach information indicating the encoding scheme used (UTF-16BE or UTF-16LE), or to prefix the encoded text with the representation of the valid code point U+FEFF (assigned to the character “zero width no-break space”, a character now reserved for this sole use as a byte order mark), since the byte-swapped code point U+FFFE is a noncharacter, forbidden in texts conforming to Unicode and ISO/IEC 10646.
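A receiver can use this prefix to choose the encoding scheme. The sketch below assumes the data is already known to be UTF-16 and falls back to big-endian when no mark is present; both the function names and that default are assumptions of this example. The constants codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE come from the Python standard library.

```python
import codecs

def detect_utf16_scheme(data):
    """Return 'utf-16-be' or 'utf-16-le' from the leading U+FEFF, or None if absent."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None

def decode_utf16(data, default="utf-16-be"):
    """Decode UTF-16 data, dropping the byte order mark when there is one."""
    scheme = detect_utf16_scheme(data)
    if scheme is None:
        return data.decode(default)            # assumed default when unmarked
    return data[2:].decode(scheme)

print(decode_utf16(b"\xff\xfeh\x00i\x00"))     # hi (little-endian, marked)
print(decode_utf16(b"\xfe\xff\x00h\x00i"))     # hi (big-endian, marked)
```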
The other drawback of UTF-16 is that a text transformed with it and transmitted in either of the two encoding schemes contains a large number of bytes that are null or whose value conflicts with byte values reserved by certain exchange protocols.
It is notably the encoding that the Java platform uses internally, as does Windows for its Unicode-compatible APIs (with the WCHAR type).
The advantage of the standardized UTF-32 transformation is that all code units have the same size. It is therefore not necessary to read additional code units to find the start of the representation of a code point.
However, this format is particularly uneconomical (including in memory), since it needlessly “wastes” at least one byte (always null) per character. The memory footprint of a text affects performance negatively, since it requires more disk reads and writes when physical memory is saturated, and it also reduces the effectiveness of the processors' cache memory.
For texts written in current modern languages (apart from certain rare characters of the supplementary ideographic plane), which therefore use only code points of the Basic Multilingual Plane, this transformation doubles the amount of memory required compared with UTF-16.
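The size difference is easy to measure. The following sketch (encoded_sizes is a hypothetical helper) counts the bytes produced by Python's codecs for the same text; the little-endian variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in each transformation format."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII text: 1, 2 and 4 bytes per character respectively.
print(encoded_sizes("Unicode"))        # {'utf-8': 7, 'utf-16-le': 14, 'utf-32-le': 28}
# Greek letters (basic plane): UTF-8 and UTF-16 both need 2 bytes each, UTF-32 needs 4.
print(encoded_sizes("\u03b1\u03b2\u03b3"))
```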
Like UTF-16, UTF-32 has several encoding schemes that depend on the order of the bytes making up an integer of more than 8 bits (two UTF-32 encoding schemes are standardized, UTF-32BE and UTF-32LE). It is therefore also necessary to specify this encoding scheme, or to determine it by prefixing the text with the UTF-32 representation of the code point U+FEFF. As with UTF-16, the presence of null bytes in the standardized encoding schemes of UTF-32 makes it incompatible with many exchange protocols between heterogeneous systems.
For this reason, this format is generally used only very locally, for certain processing steps, as an intermediate form that is easier to handle; UTF-16 is often preferred, being frequently more efficient for processing and storing large amounts of text, and conversion between the two is very simple to carry out and inexpensive in terms of processing complexity.
In fact, a great many text-processing libraries are written only for UTF-16 and are more efficient than in UTF-32, even when the texts contain characters of the supplementary planes (because this situation remains rare in the vast majority of cases).
Note, however, that the transformation into UTF-32 uses 32-bit code units, a great many values of which cannot represent any valid code point (values outside the two ranges of valid code points, U+0000 to U+D7FF and U+E000 to U+10FFFF), and therefore no valid or reserved character (any information they might contain therefore cannot be text in the Unicode sense). Transmitting texts that use these invalid code unit values in one of the standardized UTF-32 encoding schemes is forbidden for any Unicode-conforming system (private-use code points should be used instead), since it would be impossible to represent them in another UTF transformation, the three standardized UTFs being bijectively compatible with one another.
Moreover, unlike with an ASCII or traditional Latin-1 font, the selection of a glyph by a code is not unique and is often contextual, and the same glyph may also be displayed for different codes. Thus the French character “é” can be described in two ways: either by using directly the number corresponding to “é”, or by following the number of “e” with that of the non-spacing acute accent. Whichever option is chosen, the same glyph will be displayed. The first character is said to be precomposed, and the second a composition (two characters forming a single glyph made up of both). This is permitted and even strongly recommended, because the different encoding forms are classified by Unicode as “canonically equivalent”, which means that two equivalent encoding forms should be treated identically.
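Canonical equivalence can be observed with Python's standard unicodedata module. In this sketch, canonically_equivalent is a hypothetical helper that compares two strings after normalizing both to the decomposed form NFD; the normalization forms NFC and NFD are defined by the Unicode standard itself and are not described in the text above.

```python
import unicodedata

precomposed = "\u00e9"    # é as a single code point, U+00E9
decomposed = "e\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

def canonically_equivalent(a, b):
    """True if a and b are identical once both are fully decomposed (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(precomposed == decomposed)                                  # False: different code points
print(canonically_equivalent(precomposed, decomposed))            # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True: recomposed form
```

A plain string comparison sees two different sequences, which is why software that must treat equivalent forms identically usually normalizes text before comparing or indexing it.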
Many composite characters are in this situation and can be encoded in these two ways (or more, since some composite characters can be decomposed in several ways, notably when they carry several diacritics). Generally, the precomposed character is preferable for encoding text, when it exists (this is the case for polytonic Greek, for example, which, encoded in decomposed form, may not be graphically satisfactory: depending on the font, the various components of the glyph are sometimes badly positioned and hard to read). However, not all composite characters have a single code point for their precomposed form.
Similarly, certain scripts, such as Devanagari or Arabic, require complex handling of ligatures: the graphemes change shape according to their position and/or their neighbours (cf. contextual variant and joined letter). Selecting the correct glyph requires processing that determines which contextual form to select in the font, even though all contextual forms are encoded identically in Unicode.
It is thus clear that the term “Unicode font” must be used with great caution. Having a font that represents some or all of the written forms obtainable with Unicode is not enough; the display system must also have the appropriate rendering mechanisms (what is called the rendering engine), capable of handling the ligatures, contextual variants and joined forms of certain scripts. Conversely, a font that represents only certain characters but displays them properly better deserves the term “Unicode font”. Finally, it must be recognized that there are technical constraints in font formats that prevent them from supporting the whole repertoire, and in practice it is impossible today to find a single font that supports the entire repertoire.
A Unicode font is thus merely a font that can directly display text encoded in all the forms permitted by Unicode, and that supports a coherent subset suited to one or more languages in order to support one or more scripts. No Unicode font can “work on its own”: full support for a script requires support for it in a rendering engine able to detect equivalent encoding forms, look up contextual forms in the text, and select the various glyphs of a Unicode-encoded font, using where needed the mapping tables included in the font itself.
The types to use for storing Unicode variables are the following:
Notes:
The C language's wchar_t type cannot always encode all Unicode characters, because the language standard does not guarantee a sufficient minimum size for this standard type. However, many compilers define wchar_t on 32 bits (or even 64 bits in environments that handle standard integers on 64 bits), which is enough to store any standardized Unicode code point. Other compilers represent wchar_t on 16 bits (notably under Windows in 16- or 32-bit environments), or even on only 8 bits (notably in embedded environments without a general-purpose operating system), because wchar_t may use the same representation as the char type, which has a minimum of 8 bits.
In Java, char is only an unsigned 16-bit integer. To handle standardized characters outside the Basic Multilingual Plane, a pair of code units must be used, each holding a value equal to one of the two code units defined by the UTF-16 form. Thus String objects or sequences of char values are suited to representing a Unicode character. Since Java 5, the standard library provides complete Unicode support through the native int type (a 32-bit integer) and the static methods of the standard Character class (although an instantiated object of this Character type, just like the native char type, cannot store every code point).
[...] a 32-bit char type, which must support all the code points of the 17 standardized planes.
These two languages do not support explicit typing of variables, the type being defined dynamically by the values assigned to them (so several internal representations are possible, their differences normally being transparent to the programmer).
Unicode nevertheless still suffers from weak support for regular expressions in some software, even though libraries such as ICU and Java can support them.
Note that the old Unicode 1.0 standard is obsolete and incompatible with ISO 10646 and with the Unicode 1.1 standard and all its later versions; the main incompatibility concerns the blocks of Hangul characters used for writing Korean, which changed position and whose old code points have since been assigned to other blocks. The table below is compatible with ISO 10646 (all versions) and Unicode 1.1 (or later).
The private-use areas indicated by the symbol ☒ do not contain the same glyphs from one font to another and must therefore be avoided when encoding texts intended for exchange between heterogeneous systems. However, these private-use code points are valid and may be used in any automated processing conforming to the Unicode and ISO 10646 standards, including between different systems if there is a private mutual agreement concerning their use.
In the absence of an agreement between the two parties, systems that use these characters may reject texts containing them, because the processing they apply to them might not work correctly or might cause security problems; other systems, which assign no special function to these characters, must on the other hand accept them as valid and preserve them as an integral part of the texts, as if they were graphic symbols, even if they cannot display them correctly.
The noncharacters listed are valid code points, but they are not (and never will be) assigned to standardized characters. Their use in encoding texts transmitted between systems (even identical ones) is forbidden, because it is impossible to make them compatible with the standardized universal transformation formats (including UTF-8, UTF-16 and UTF-32), the corresponding encoding schemes, and other standardized encodings compatible with Unicode and ISO 10646 (BOCU-1, SCSU, various versions of the Chinese standard GB18030, etc.). Some systems nevertheless generate and use them locally, but only for strictly internal processing intended to simplify the implementation of text-processing algorithms that use the other, standardized characters.
Among these noncharacters are the valid but reserved code points of the surrogate areas (private or not). These code points cannot be used individually to encode a character. They serve only in the UTF-16 universal transformation format (and the corresponding encoding schemes) to represent, as two code units (of 16 bits each), valid code points in one of the 16 supplementary planes (some combinations of code units correspond to valid characters of these planes, standard or private; other combinations cannot represent any valid character because they would correspond to noncharacters of these supplementary planes, and are therefore forbidden in texts conforming to the standard).
The other free areas (not assigned to a named standardized block, or code points left free and reserved within existing named blocks) are reserved for later use in future versions of Unicode and ISO 10646, but are valid. Any system processing texts that contain these reserved code points must accept them without filtering them out. Unicode defines default properties for the hypothetical corresponding characters, in order to preserve the compatibility of (Unicode-conforming) systems with future conforming texts that contain them. No conforming application may assign a character or special semantics to them (the private-use areas are intended for that purpose).