Unicode is a computing standard, developed by the Unicode Consortium, whose goal is to give every character of every writing system of every language a name and a numeric identifier, in a unified way, regardless of computing platform or software.
Now fully compatible with ISO/IEC 10646, the Unicode standard adds to it a complete model of representation and text processing: it assigns each character a set of normative or informative properties, describes precisely the semantic relations that can exist between successive characters of a text, and standardizes processing algorithms that preserve as far as possible the semantics of the transformed texts, while extending the interoperability of the representation of those texts across heterogeneous systems.
One can say today that the Unicode standard contains the ISO/IEC 10646 standard entirely as a subset, since the latter standardizes only the individual characters, assigning them a name, a normative number, and a very limited informative description, but no processing, specification, or recommendation for their use in writing real languages, which only the Unicode standard defines precisely. However, ISO/IEC 10646 confers on Unicode the status of an approved international standard for the encoding of texts; Unicode is also a de facto standard for the processing of these texts, and today serves as a basis for many other standards.
Goal
Unicode, first published in 1991, was developed with the aim of replacing the use of national code pages.
These code pages had several problems. For example, when a "currency symbol" character was provided, the same text authorizing an expenditure in dollars in the United States could, once sent by email to the United Kingdom, authorize the same expenditure in pounds sterling, without anything in the text being modified.
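The same kind of code-page ambiguity can be sketched with Python's standard codecs (Python serves here only as a test bench, and the ¤/€ example with ISO 8859-1 versus ISO 8859-15 is an analogous illustration, not the historical dollar/pound case): the same byte means a different character depending on which code page the receiver assumes.

```python
# One byte, two meanings: 0xA4 is the generic currency sign in
# ISO 8859-1, but was reassigned to the euro sign in ISO 8859-15.
raw = b"\xa4"
assert raw.decode("iso8859-1") == "\u00a4"   # CURRENCY SIGN (¤)
assert raw.decode("iso8859-15") == "\u20ac"  # EURO SIGN (€)
```

A Unicode code point, by contrast, identifies the character itself, independently of any byte-level convention.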
In practice, not all writing systems are present yet, because documentary research with specialists can still prove necessary for rare characters or little-known writing systems (because they have disappeared, for example).
However, the most widely used writing systems in the world are represented, as well as rules on the semantics of characters, their compositions, and the way these different systems can be combined; for example, how to insert a right-to-left writing system into a left-to-right one (bidirectional text).
Standards and versions
Work on Unicode is parallel to and synchronized with work on the ISO/IEC 10646 standard, whose goals are the same. ISO/IEC 10646, an international standard published in English and French, however specifies neither rules for composing characters nor the semantic properties of characters.
Unicode, by contrast, addresses questions of case, alphabetical ordering, and the combination of accents and characters. Since Unicode version 1.1 and in all subsequent versions, characters have the same identifiers as those of ISO/IEC 10646: the repertoires are maintained in parallel and kept identical at final standardization, the two standards being updated simultaneously. The two standards (Unicode since version 1.1, and ISO/IEC 10646) guarantee full upward compatibility: any text conforming to an earlier version must remain conforming in later versions.
Thus the characters of Unicode version 3.0 are those of ISO/IEC 10646:2000. Unicode version 3.2 listed 95,221 characters, symbols, and directives.
Unicode version 4.1, updated in November 2005, contains:
- 137,468 private-use characters (assigned in all versions of Unicode, and sufficient for all uses);
- more than 97,755 distinct letters or syllables, digits or numbers, symbols, diacritics, and punctuation marks, among them:
- more than 70,207 ideographic characters, and
- among those, 11,172 precomposed Hangul syllables; as well as
- 8,258 permanently reserved code points, forbidden for the encoding of text (assigned in all versions of Unicode); and
- several hundred control characters or special modifiers.
Some problems do seem to remain regarding the encoding of Chinese characters, because of the unification of the ideographic sets used in different languages, drawn slightly differently and sometimes carrying different meanings; but these are in the course of resolution by Unicode, which has defined variation selectors and opened a registry of standardized sequences that use them.
Version 5.0 was published in July 2006.
Layers of Unicode
Unicode is defined according to a layered model (Unicode Technical Note #17). Earlier standards typically made no distinction between the character set and its physical representation. The layers are presented here starting from the highest (the furthest from the machine).
Abstract character repertoire (Abstract Character Repertoire)
The highest layer is the definition of the character set. For example, Latin-1 has a set of 256 characters, while Unicode currently standardizes more than 120,000 characters. Moreover, Unicode gives each of them a name. Drawing up the list of characters and giving them names is thus the first layer of Unicode.
For example, the character Ç is named "Latin capital letter C with cedilla".
This definition is completely identical to that of ISO/IEC 10646, which approves every extension of the repertoire. Note that Unicode includes in the text of its standard only the normative English names, but that ISO/IEC 10646 is published in two equally normative languages, so the French and English names are both standardized.
In fact, any extension of the repertoire is nowadays done jointly between working group WG2 of ISO/IEC 10646 (whose voting members are only the national standardization bodies of the countries of the world, or their official representatives) and the Unicode Technical Committee (UTC) (whose voting members can be any private or public-interest organization, or even a government, that has joined and pays an annual fee allowing it to take part in these decisions).
Coded character set (Coded Character Set)
Here, a numeric index is added to the preceding table. Note that this is not yet a representation in memory, just a number.
This number, the code point, is written U+xxxx, where xxxx is hexadecimal, with 4 digits for the code points of the Basic Multilingual Plane (thus between U+0000 and U+FFFF), 5 digits for the 15 following planes (between U+10000 and U+FFFFF), or 6 digits for the last plane (between U+100000 and U+10FFFF).
Thus, the character named "Latin capital letter C with cedilla" has the code point U+00C7.
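These first two layers, the named repertoire and the numbering, can be inspected with Python's unicodedata module (used here purely as an illustration of the standard's data, not as part of either standard):

```python
import unicodedata

# The abstract repertoire gives the character a name; the coded
# character set gives it a number, conventionally written U+XXXX.
ch = "\u00c7"  # Ç
assert unicodedata.name(ch) == "LATIN CAPITAL LETTER C WITH CEDILLA"
assert f"U+{ord(ch):04X}" == "U+00C7"
```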
Note that all code points between U+0000 and U+10FFFF are valid, even if some are reserved and not yet assigned to characters, or if certain code points are assigned to noncharacters (for example U+FFFE or U+FFFF) whose use in a text is forbidden, or are reserved to permit the encoding of any conforming text in one of the standard Unicode transformation formats (see UTF-16, below).
Note also that Unicode (and ISO/IEC 10646) have assigned many code points to characters that are valid but whose semantics are unknown because they are for private use (for example, the last two planes, between U+F0000 and U+10FFFF, are entirely dedicated to this use, except for the two code points at the end of each plane, which are noncharacters forbidden in a conforming text).
Here again, the standardization of the coding, that is, the assignment of code points to the characters of the common repertoire, is a joint decision shared between the Unicode and ISO/IEC 10646 standards. All characters of the repertoire have a unique code point (even if, for certain languages or in Unicode, certain characters are regarded as equivalent).
Note that while the character repertoire is extensible, it is so only within the limits permitted by the coding of the assignable code points. A large majority of the possible code points are not assigned to any particular character, but may become so at any time.
So these still-free code points are not regarded as invalid, but do represent abstract characters (not yet specified, and temporarily reserved). These abstract characters (as well as the private-use characters) complete the coded character set of the standardized repertoire to form a single set called the "universal coded character set" (Universal Coded Character Set, often abbreviated UCS), which contains all the coded character sets of the repertoires of every past, present, and future version of ISO/IEC 10646 and Unicode (since version 1.1 only).
Encoding form (Character Encoding Form)
This time we arrive at a representation in memory: this layer specifies which storage units (code units), whether bytes, 16-bit units, or 32-bit units, will represent a character, or more exactly a code point.
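The three standard encoding forms can be compared with Python's codecs (a minimal sketch; the "-be" variants are chosen so that no byte order mark inflates the counts, and U+10000, the first code point outside the Basic Multilingual Plane, is an arbitrary example):

```python
# One code point, three encoding forms with different code units.
cp = "\U00010000"  # first supplementary-plane code point
assert len(cp.encode("utf-8")) == 4      # four 8-bit code units
assert len(cp.encode("utf-16-be")) == 4  # two 16-bit code units
assert len(cp.encode("utf-32-be")) == 4  # one 32-bit code unit
```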
Character serialization scheme (Character Encoding Scheme)
This layer handles serializing the storage units defined by the layer above. It is here that the opposition between big-endian (most significant byte first) and little-endian (least significant byte first) is dealt with.
It is also here that the byte order mark (BOM) is specified, which makes it possible to indicate at the start of a file whether it is big-endian or little-endian. On the Internet, it is seldom used, explicit labeling being preferred ("charset=UTF-16BE" in MIME, for example, to indicate a big-endian data stream, where BE means Big Endian).
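The two byte orders and the BOM can be observed directly (Python again as a test bench; which BOM the generic "utf-16" codec emits depends on the machine's native byte order, so both possibilities are checked):

```python
# The same 16-bit code unit, serialized in the two byte orders.
text = "A"  # U+0041
assert text.encode("utf-16-be") == b"\x00\x41"  # big-endian
assert text.encode("utf-16-le") == b"\x41\x00"  # little-endian

# The generic codec prefixes a byte order mark (U+FEFF) instead.
bom_encoded = text.encode("utf-16")
assert bom_encoded.startswith((b"\xff\xfe", b"\xfe\xff"))
```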
Transfer encoding (Transfer Encoding Syntax)
Here, compression or further encoding mechanisms optionally intervene.
The limits of the byte
Unlike earlier standards, Unicode separates the definition of the character set (the list of characters, their names, and their code points) from that of the encoding. One therefore cannot speak of "the size" of a Unicode character, since it depends on the chosen encoding.
Where ASCII formerly used 7 bits and ISO 8859-1 used 8 bits (like most national code pages), Unicode, which gathers the characters of every code page, needed more than the 8 bits of one byte. The limit was initially fixed at 16 bits for the first versions of Unicode, and at 32 bits for the first versions of ISO/IEC 10646.
The current limit is now placed between 20 and 21 bits per code point assigned to standardized characters in the two standards, which are now mutually compatible:
- the international ISO working group standardizes the assignment of code points to characters and their official names, and maintains the blocks of code points used by each script or group of scripts. It also documents a possible (indicative) glyph for each character (this glyph being, where possible, unambiguous thanks to the placement of the standardized characters in the code blocks appropriate to a limited number of scripts).
- the Unicode Consortium working group standardizes more precisely (in the Unicode standard) their semantics for automated processing, through tables of character properties and the development of standard algorithms using those properties.
- the two standardization bodies collaborate to keep their standardized repertoires permanently synchronized in official, mutually referenced versions, and work together on amendments (a version becoming official only once both bodies have each approved and completely defined the additions of new characters).
- in practice, for most application developers, ISO 10646 appears as a subset of the more complete Unicode standard, but it has the same code points for exactly the same set of characters as the Unicode standard (which is why the Unicode standard is better known, being better suited to computerized processing, and also more accessible, since it can be consulted freely on the Internet).
UTF (Universal Transformation Format)
Unicode and ISO/IEC 10646 accept several universal transformation formats for representing a valid code point. Among them:
- UTF-8;
- UTF-16;
- UTF-32.
These transformations were originally created for the internal representations and encoding schemes of the code points of ISO 10646, which at the beginning could define code points of up to 31 bits. Since then, ISO/IEC 10646 has been amended so that the three formats are fully compatible with one another and can encode all code points (because UTF-16 can represent only the code points of the first 17 planes).
Unicode also standardizes, very strictly, these three transformation formats for all valid code points (U+0000 to U+D7FF and U+E000 to U+10FFFF) and only those, whether they represent text as sequences of code points, or code points assigned to valid characters, or reserved ones, or ones assigned to noncharacters. The code points assigned to the surrogate half-zones (U+D800 to U+DFFF), used only in UTF-16, are invalid in isolation, since they serve to represent, by a pair of two 16-bit code units, the code points of the 16 supplementary planes.
UTF-8
UTF-8, specified first in RFC 2279 and now in RFC 3629, is the most common format for Unix and Internet applications. Its variable-size encoding allows it to be, on average, less costly in memory use. But this clearly slows down operations involving substring extraction, because it is necessary to count the characters from the beginning of the string to know where the first character to extract is located.
UTF-8 also ensures, and this is its main advantage, compatibility with simple ASCII string handling in programming languages. Thus, programs written in C can often run without modification.
Initially, UTF-8 could encode any code point between U+0000 and U+7FFFFFFF (thus up to 31 bits). This use is obsolete: ISO/IEC 10646 was amended to support only the valid code points of the first 17 planes, except those of the surrogate half-zone corresponding to the code units used in UTF-16 for the two-unit representation of code points of the 16 supplementary planes. Accordingly, the longest UTF-8 sequences require at most 4 bytes, instead of 6 previously.
Moreover, UTF-8 was amended, first by Unicode and then by ISO/IEC 10646, to accept only the shortest representation of each code point.
Its advantage over UTF-16 is that there is only one possible encoding scheme for transmitting sequences of bytes over a network of heterogeneous systems. Most standardized exchange protocols today use this transformation because it is independent of the ordering of the bytes composing a wider integer.
In addition, UTF-8 is fully compatible for the transmission of texts by protocols based on the ASCII character set, or can be made compatible (at the cost of a multi-byte transformation of non-ASCII characters) with exchange protocols supporting 8-bit coded character sets (whether based on ISO 8859 or on the many other 8-bit coded character sets defined by national standards or proprietary systems).
Its defect is its highly variable encoding length (1 byte for code points assigned to ASCII/ISO 646 characters, 2 to 4 bytes for the other code points), even though it is possible to determine the start of the encoding of a code point from an arbitrary position in a UTF-8 text (by reading at most 3 of the preceding code units). For this reason, this transformation is seldom used for the internal processing of texts, and UTF-16, or sometimes UTF-32, is often preferred to it.
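Both the variable length and the shortest-form rule can be checked with Python's UTF-8 codec (a sketch; U+1D11E, the musical G clef, is an arbitrary supplementary-plane example, and 0xC0 0xAF is the classic overlong encoding of "/"):

```python
# Variable length: one byte for ASCII, up to four bytes elsewhere.
assert len("A".encode("utf-8")) == 1           # U+0041, ASCII
assert len("é".encode("utf-8")) == 2           # U+00E9
assert len("€".encode("utf-8")) == 3           # U+20AC
assert len("\U0001D11E".encode("utf-8")) == 4  # beyond the BMP

# Shortest-form rule: overlong sequences must be rejected.
try:
    b"\xc0\xaf".decode("utf-8")  # overlong encoding of "/"
    raise AssertionError("overlong sequence accepted")
except UnicodeDecodeError:
    pass
```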
Note that one valid Unicode character, the null control character U+0000, is encoded in UTF-8 as a single null byte. This null byte poses problems for the string-handling libraries of the C language, which assign it the role of string terminator.
The Java platform, however, also uses an additional platform-specific compact 8-bit format (close to UTF-8, but distinct, since the code point U+0000 is represented there by a sequence of two non-null bytes, a sequence normally invalid in standard UTF-8) for certain native exchanges with the C libraries of the supported platform; this format is also used internally in compiled class files, since this alternative (but portable) format does not depend on the internal ordering of the bytes composing an integer of more than 8 bits, and thus makes it possible to represent all valid Unicode texts (as well as certain otherwise invalid sequences).
UTF-16
UTF-16 is a good compromise when memory space is not too restricted, because the large majority of the Unicode characters assigned for the writing of modern languages (including the most frequently used characters) lie in the Basic Multilingual Plane and can therefore be represented on 16 bits. (The French version of ISO/IEC 10646 names these 16-bit units "seizets".)
However, the code points of the 16 supplementary planes require a transformation into two 16-bit units:
- the first unit, taken from the high half-zone (0xD800 to 0xDBFF), represents the 10 most significant bits of the difference between the supplementary code point and the first code point outside the Basic Multilingual Plane;
- the second unit, taken from the low half-zone (0xDC00 to 0xDFFF), represents the 10 least significant bits of the supplementary code point.
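The two-unit transformation just described can be sketched directly (Python used only as a test bench; the helper name and the U+1D11E example are illustrative choices, verified against Python's own UTF-16 encoder):

```python
# Surrogate-pair arithmetic: subtract 0x10000, put the high 10 bits
# in the 0xD800 half-zone and the low 10 bits in the 0xDC00 half-zone.
def to_surrogates(cp: int) -> tuple[int, int]:
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# U+1D11E (MUSICAL SYMBOL G CLEF) becomes the pair D834 DD1E.
high, low = to_surrogates(0x1D11E)
assert (high, low) == (0xD834, 0xDD1E)
assert "\U0001D11E".encode("utf-16-be") == bytes(
    [high >> 8, high & 0xFF, low >> 8, low & 0xFF]
)
```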
It is possible to determine the start of an encoding sequence from an arbitrary point in a UTF-16 text by performing at most one additional reading, needed only if the code unit read lies in the low half-zone. This form is more economical and faster to process than UTF-8 for representing texts containing few ASCII characters (U+0000 to U+007F).
However, this transformation has two incompatible encoding schemes, depending on the ordering of the bytes in the representation of 16-bit integers. To resolve this ambiguity and allow transmission between heterogeneous systems, it is necessary either to attach information indicating the encoding scheme used (UTF-16BE or UTF-16LE), or to prefix the encoded text with the representation of the valid code point U+FEFF (assigned to the character "zero-width no-break space", a character now reserved solely for this use as a byte order mark), since the byte-reversed counterpart U+FFFE, though a valid code point, is a noncharacter, forbidden in texts conforming to Unicode and ISO/IEC 10646.
The other defect of UTF-16 is that a text transformed with it and transmitted in either of the two encoding schemes contains a large number of bytes that are null, or whose value conflicts with byte values reserved by certain exchange protocols.
This is notably the encoding that the Java platform uses internally, as does Windows for its Unicode-compatible APIs (with the WCHAR type).
UTF-32
UTF-32 is used when memory space is not a problem and characters need to be accessed directly, without change of size (hieroglyphs, for example).
The advantage of this standardized transformation is that all its code units have the same size. It is thus not necessary to read additional code units to determine the start of the representation of a code point.
However, this format is particularly uneconomical (including in memory), since it needlessly "wastes" at least one byte (always null) per character. The memory size of a text affects performance negatively, since it requires more reads and writes to disk when physical memory is saturated, and since it also decreases the effectiveness of processor memory caches.
For texts written in current modern languages (apart from certain rare characters of the supplementary ideographic plane), which thus use only code points of the Basic Multilingual Plane, this transformation doubles the amount of memory needed compared with UTF-16.
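This doubling can be checked concretely (a sketch; the sample string and the "-be" codecs, chosen to keep byte order marks out of the counts, are arbitrary):

```python
# For BMP-only text, UTF-32 takes exactly twice the space of UTF-16,
# and four bytes per character.
text = "Unicode et ISO/CEI 10646" * 10
assert len(text.encode("utf-32-be")) == 2 * len(text.encode("utf-16-be"))
assert len(text.encode("utf-32-be")) == 4 * len(text)
```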
Like UTF-16, UTF-32 has several encoding schemes depending on the ordering of the bytes composing an integer of more than 8 bits (two encoding schemes of UTF-32 are standardized, UTF-32BE and UTF-32LE). It is therefore likewise necessary to specify the encoding scheme, or to determine it by prefixing the text with the UTF-32 representation of the code point U+FEFF. Like UTF-16, the presence of null bytes in the standardized encoding schemes of UTF-32 makes it incompatible with many protocols for exchange between heterogeneous systems.
Consequently, this format is generally used only very locally, for certain processing steps, as an intermediate form that is easier to handle; UTF-16 is often preferred to it, being usually faster for processing and storing significant quantities of text, the conversion between the two being very simple to perform and very cheap in processing complexity.
In fact, a great many text-processing libraries are written only for UTF-16 and are faster than in UTF-32, even when the texts contain supplementary-plane characters (because that case remains rare in the very large majority of situations).
Note, however, that the UTF-32 transformation uses 32-bit code units, a great many of whose values represent no valid code point (values outside the two ranges of valid code points, U+0000 to U+D7FF and U+E000 to U+10FFFF), hence no valid or reserved character (any information contained there cannot, therefore, be text in the Unicode sense). The transmission of texts using these invalid code-unit values in one of the standardized UTF-32 encoding schemes is forbidden for any Unicode-conforming system (private-use code points must be used instead), since it would be impossible to represent them in another UTF transformation, with which the three standardized UTFs are bijectively compatible.
GB18030
This is a transformation of Unicode defined not by the Unicode Consortium, but by the standardization administration of China, where its support is mandatory in applications. Historically it was a coded character set, which was extended to cover the whole of the UCS repertoire by an algorithmic transformation, supplemented by a large code-to-code mapping table.
Unicode fonts
To speak of a Unicode font, an essential principle must first be well understood: saying that Unicode encodes characters amounts to saying that it assigns a number to abstract symbols, according to a principle of logical encoding. Unicode does not, on the other hand, encode the graphical representations of characters, the glyphs. There is therefore no bijection between the representation of a character and its number, since all the graphic variants of style are unified.
Moreover, unlike a traditional ASCII or Latin-1 font, the selection of a glyph by a code is not unique and is often contextual; the same glyph may also be displayed for different codes. Thus, the French character "é" can be described in two ways: either directly by the number corresponding to "é", or by the number of "e" followed by that of the combining acute accent. Whichever option is chosen, the same glyph is displayed. The first form is said to be precomposed, the second a composition (two characters forming a single glyph composed of both). This is permitted and even highly recommended, because the various encoded forms are classified by Unicode as "canonically equivalent", which means that two equivalent encoded forms should be treated identically.
Many composite characters are in this situation and can be encoded in these two ways (or more, certain composite characters being decomposable in several ways, notably when they bear several diacritics). Generally, the precomposed character is preferable for encoding text, when it exists (this is the case for polytonic Greek, for example, which, encoded in decomposed form, may be graphically unsatisfactory: depending on the font, the various components of the glyph can be badly positioned and barely readable). However, not all composite characters have a unique code point for their precomposed form.
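Canonical equivalence can be demonstrated with Python's unicodedata module (used here only as an illustration of the normalization forms the standard defines):

```python
import unicodedata

# "é" precomposed (U+00E9) and decomposed ("e" + U+0301, the combining
# acute accent) are distinct code point sequences, but canonically
# equivalent: they normalize to the same forms.
precomposed = "\u00e9"
decomposed = "e\u0301"
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```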
Likewise, certain writing systems, such as Devanagari or the Arabic characters, require complex handling of ligatures: the graphemes change shape according to their position and/or their neighbors (cf. contextual variants and joined letters). Selecting the correct glyph requires processing able to determine which contextual form to select in the font, even though all contextual forms are encoded identically in Unicode.
For these reasons, the term Unicode font must be used with great caution. Having a font that covers some, or even all, of the glyphs obtainable with Unicode is not sufficient; the display system must also have the appropriate rendering mechanisms (what is called the rendering engine), able to manage the ligatures, contextual variants, and joined forms of certain writing systems. Conversely, a font that covers only some characters but knows how to display them well better deserves the term Unicode font. Finally, it must be acknowledged that technical constraints in font formats prevent them from supporting the entire repertoire; in practice, it is today impossible to find a single font supporting the whole repertoire.
A Unicode font is therefore only a font that allows the direct display of text encoded in any of the forms authorized by Unicode, and that supports a coherent subset suited to one or more languages and one or more writing systems. No Unicode font can function "on its own": full support of a writing system requires support in a rendering engine able to detect equivalent encoded forms, find the contextual forms in the text, and select the various glyphs of a Unicode-encoded font, helped where needed by correspondence tables included in the font itself.
Software libraries
The cross-platform ICU library makes it possible to manipulate Unicode data. Platform-specific Unicode support (not source-code compatible across platforms) is also provided by modern systems (Java, MFC, GNU/Linux).
The types to use for storing Unicode variables are the following:
- In the C and C++ languages, note that the wchar_t type does not always make it possible to encode all Unicode characters, because the standard of these languages does not require a sufficient minimum size for this standard type. Many compilers define wchar_t on 32 bits (or even 64 bits on environments handling standard 64-bit integers), which is enough to store any standardized Unicode code point. But other compilers represent wchar_t on 16 bits (notably under Windows, in 16- or 32-bit environments), or even on only 8 bits (notably in embedded environments without a general-purpose operating system), because wchar_t may use the same representation as the char type, which counts a minimum of 8 bits.
- In a way similar to C and C++, the Java language has a unit type for encoding characters on 16 bits, which cannot encode a code point of arbitrary value (the native char type is a 16-bit-only positive integer). To manipulate standardized characters outside the first plane, a pair of code units must be used, each containing a value equal to one of the two code units defined by the UTF-16 form; so the char types alone are not suited to representing an arbitrary Unicode character. Since Java 1.4.1, the standard library has provided complete Unicode support thanks to the native int type (an integer defined on 32 bits) and to the static methods of the standard Character class (an instantiated object of this Character type, however, no more allows the storage of an arbitrary code point than the native char type does).
- Some other languages have a character type of at least 32 bits, able to hold any code point of the 17 standardized planes. These languages do not support explicit typing of variables, the type being defined dynamically by the values assigned to them (so several internal representations are possible, their differences being normally transparent to the programmer).
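The practical consequence of a 16-bit character type can be sketched in Python (used here only as a neutral test bench; U+10400 is an arbitrary supplementary-plane example): one code point outside the Basic Multilingual Plane is a single character logically, but two UTF-16 code units, which is exactly what a Java char or a 16-bit wchar_t actually holds.

```python
# One code point, two 16-bit code units: what a 16-bit "char" sees.
c = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP
assert ord(c) == 0x10400                     # one code point
assert len(c.encode("utf-16-be")) // 2 == 2  # two 16-bit code units
```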
Unicode still suffers, however, from weak support for regular expressions in some software, even though libraries such as ICU and Java can support them.
Partitioning
An up-to-date partition can be found on the official Unicode site. However, given the important role of Unicode (and of ISO 10646), the principal blocks of characters are described here. The French names are the official names of ISO/IEC 10646, the bilingual international standard that takes up the same characters as Unicode. They are as official as the English names.
Note that the old Unicode 1.0 standard is obsolete and incompatible with ISO 10646 and with Unicode 1.1 and all its later versions; the principal incompatibility concerns the blocks of Hangul characters used for writing the Korean language, which changed position, their old code points having since been assigned to other blocks. The table below is compatible with ISO 10646 (all versions) and Unicode 1.1 (or later).
The private-use zones indicated by the symbol ☒ do not contain the same glyphs from one font to another, and must therefore be avoided for encoding texts intended for exchange between heterogeneous systems. However, these private-use code points are valid and can be used in any automated processing conforming to the Unicode and ISO 10646 standards, including between different systems, if there is a private mutual agreement concerning their use.
In the absence of an agreement between the two parties, systems using these characters may reject the texts containing them, because the processing applied to them could malfunction or cause security issues; other systems, which assign no special function to these characters, must on the other hand accept them as valid and preserve them as an integral part of the texts, as if they were graphic symbols, even if they cannot display them correctly.
The listed noncharacters are valid code points, but they are not (and never will be) assigned to standardized characters. Their use for encoding texts transmitted between systems (even identical ones) is prohibited, because it is impossible to make them compatible with the standardized universal transformation formats (including UTF-8, UTF-16, and UTF-32) and their corresponding encoding schemes, or with other standardized encodings compatible with Unicode and ISO 10646 (BOCU-1, SCSU, the various versions of the Chinese standard GB18030, etc.). However, certain systems do generate and use them locally, but only for strictly internal processing intended to ease the implementation of text-processing algorithms using the other standardized characters.
Alongside these noncharacters are the code points that are valid but reserved for the half-zones (private or not). These code points cannot be used individually to encode a character. They are used only in the UTF-16 universal transformation format (and the corresponding encoding schemes) to represent, on two code units (of 16 bits each), valid code points of one of the 16 supplementary planes (certain combinations of code units correspond to valid characters of those planes, standard or private; other combinations cannot represent any valid character, because they would correspond to noncharacters of those supplementary planes, and are therefore forbidden in standard-conforming texts).
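The rule that half-zone code points are invalid in isolation is enforced by the standard transformation formats, which Python's strict codecs illustrate (a sketch; U+D800, the first high surrogate, is the example):

```python
# A lone surrogate is not a Unicode scalar value: every standard
# UTF must refuse to encode it on its own.
lone = "\ud800"  # first code point of the high half-zone
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    try:
        lone.encode(codec)
        raise AssertionError(f"{codec} accepted a lone surrogate")
    except UnicodeEncodeError:
        pass  # rejected, as the standards require
```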
The other free zones (not assigned to a standardized named block, or code points left free and reserved within existing named blocks) are reserved for later use in future versions of Unicode and ISO 10646, but are valid. Any system processing texts containing these reserved code points must accept them without filtering them out. Unicode defines default properties for the hypothetical corresponding characters, in order to preserve the compatibility of conforming systems with future conforming texts that might contain them. No conforming application may assign them a special character or special semantics (the private zones are intended for that use).
Coincidence
A Unicode character can have a code point of up to 0x10FFFF in hexadecimal, that is, 1,114,111 in decimal. This number is a palindromic prime.
- Help: Unicode, Help: Special characters, Wikipedia: Unicode/Test
- Table of Unicode characters
- ISO/IEC 10646
- ISO 646, ASCII
- Asian Characters
- APL/APL2 characters in Unicode
- Duplication of Unicode characters
- Home page of the Unicode Consortium.
- Official French translation of ISO/IEC 10646 and Unicode.
- Tables of Unicode characters annotated with equivalences.
- Standardized Unicode collation (for classification or sorting).
- RFC 3718, RFC 3629, RFC 3492, RFC 2482, RFC 1642, RFC 1641.
- decodeunicode, Unicode-Wiki, all 98,884 characters of Unicode in images.
- Site listing the various blocks of Unicode, with test pages, advice, and links to fonts capable of displaying the blocks in question.
- Table of the Unicode characters from 1 to 65535.
- Chapters 2, 3, and 4 of the book Fonts and encodings.
- Presentation of the encoding of phonetic characters using Unicode, for the use of beginners.
- Unicode, writing of the world? (vol. 6 (2003) of the journal Document numérique, 364 pages). Of interest: critical points of view (typographers, computer scientists, Egyptologists, etc.) and an interview with Ken Whistler, technical director of the Unicode Consortium.
- The Gallery of Unicode Fonts: inventory of 1,239 fonts (08/2007) and the characters they cover.
Use of Unicode
- Example of the use of Unicode, with tests.
- How to use Unicode on free GNU/Linux or compatible systems.
- A list of free software supporting Unicode.
Utilities for working with Unicode
- BabelMap (freeware): viewing and searching the characters of Unicode fonts, their properties, and their UTF-8, UTF-16, and UTF-32 encodings.
- The Unicode Sliderule, an online tool for entering Unicode characters.
Articles and discussions
- UTF-8 and Unicode FAQ by Markus Kuhn, a very complete article.
- Article "UniHan" by Otfried Cheong, on the problems of the unification of sinograms with UniHan in Unicode.