UTF-8
UTF-8 (UCS Transformation Format, 8 bits) is a character encoding for the Universal Character Set (UCS) of Unicode. Each character is encoded as a sequence of one to four bytes. UTF-8 was designed to be compatible with software originally written to handle single-byte characters only.
UTF-8 is standardized in RFC 3629 (UTF-8, a transformation format of ISO 10646). The encoding was also defined in report 17 of the Unicode Standard. It is now an integral part of the standard, in its chapter 3, Conformance, and is also approved by the International Organization for Standardization (ISO), the Internet Engineering Task Force (IETF) and most national standards bodies.
The IETF requires that UTF-8 be supported by Internet communication protocols that exchange text.
Description
The number of each character (its code point) is given by the Unicode standard. Characters numbered 0 to 127 are encoded in a single byte whose most significant bit is always 0.
Characters numbered above 127 are encoded in several bytes. In that case, the high-order bits of the first byte form a run of 1s whose length equals the number of bytes used to encode the character, and each following byte has 10 as its two high-order bits.
This scheme could be extended to six bytes per character, but UTF-8 sets the limit at four. The scheme would also make it possible to use more bytes than necessary to encode a character, but UTF-8 forbids this.
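The bit layout described above can be sketched in a few lines of Python. This is only an illustration: `utf8_encode` is a hypothetical helper name, not a library function, and the rejection of surrogate code points (U+D800 to U+DFFF) is a rule from RFC 3629 that the text above does not spell out.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:
        # One byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Because each branch picks the shortest length that can hold the code point, this sketch never produces a sequence longer than necessary.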
In any UTF-8 character string, one can observe that:
- any byte whose most significant bit is 0 encodes a US-ASCII character in a single byte;
- any byte whose two high-order bits are 11 is the first byte of a character encoded in several bytes;
- any byte whose two high-order bits are 10 is inside a character encoded in several bytes.
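The three observations above are enough to classify any byte without looking at its neighbours. The following sketch shows one way to do this; the function names `classify_byte` and `count_characters` are illustrative assumptions, and the code assumes its input is already valid UTF-8.

```python
def classify_byte(b):
    """Classify one byte of a UTF-8 string by its high-order bits."""
    if (b & 0x80) == 0x00:
        return "ascii"         # 0xxxxxxx: a complete US-ASCII character
    if (b & 0xC0) == 0xC0:
        return "lead"          # 11xxxxxx: first byte of a multi-byte character
    return "continuation"      # 10xxxxxx: inside a multi-byte character


def count_characters(data):
    """Count characters by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if classify_byte(b) != "continuation")
```

Since every character starts with exactly one non-continuation byte, counting those bytes gives the number of characters, and a program can resynchronize on a character boundary from any position in the string.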
Advantages
- Universality:
- Compatibility with US-ASCII:
- Efficiency:
- Reusability:
- Reliability:
Disadvantages
- Variable size:
- Efficiency:
- Invalid sequences: for example, the sequence “/../” is encoded in ASCII as “2F 2E 2E 2F” in hexadecimal notation. An invalid way to encode this sequence in UTF-8 would be “2F C0 AE 2E 2F”, known in English as an overlong form. If software is not carefully written to reject this string, for example by first putting it in canonical form, a potential security hole is opened. This attack is called directory traversal.
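The risk of overlong forms can be illustrated with a short Python sketch. `naive_decode_2byte` is a deliberately careless, hypothetical decoder written only for this example; `is_valid_utf8` relies on Python's built-in strict UTF-8 decoder, which rejects the bytes C0 AE.

```python
def naive_decode_2byte(lead, cont):
    """Careless two-byte decoder: extracts payload bits without checking
    that the result actually needs two bytes (illustration only)."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def is_valid_utf8(data):
    """Return True if data is well-formed UTF-8 according to Python's
    strict decoder, which refuses overlong forms."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


overlong = bytes.fromhex("2FC0AE2E2F")   # overlong spelling of "/../"
canonical = bytes.fromhex("2F2E2E2F")    # the only valid spelling
```

Here `naive_decode_2byte(0xC0, 0xAE)` yields 0x2E, the code of ".", which is how a careless decoder could let a disguised "/../" through a filter that only scans for the byte 2E. A strict decoder avoids this by rejecting the sequence before any path checks are made.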
History
UTF-8 was invented by Kenneth Thompson during a dinner with Rob Pike around September 1992. It was immediately used in the Plan 9 operating system, on which they were working. One constraint to satisfy was that the null character and “/” be encoded as in ASCII, and that no byte used to encode another character have the same value. UNIX operating systems could thus continue to search for these two characters in a string without any software changes.
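The constraint on the null character and “/” can be checked with a small sketch. The helper names and the sample path below are hypothetical, chosen only for illustration; the point is that byte-oriented code can split a UTF-8 path on the byte 2F without decoding it first.

```python
def split_path_bytes(raw):
    """Split an encoded path on the byte 0x2F, as byte-oriented code would."""
    return raw.split(b"/")


def multibyte_bytes_are_high(text):
    """Check that every byte of every non-ASCII character is 0x80 or above,
    so none of them can be mistaken for the null byte or for '/'."""
    return all(b >= 0x80
               for ch in text if ord(ch) > 127
               for b in ch.encode("utf-8"))


sample = "donn\u00e9es/\u00e9t\u00e9/\u20ac.txt"   # hypothetical path
parts = [p.decode("utf-8") for p in split_path_bytes(sample.encode("utf-8"))]
```

Each element of `parts` decodes cleanly because the split can never land inside a multi-byte character.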
Support
- Web browsers: UTF-8 support started to become widespread from 1998.
  - older web browsers that do not support UTF-8 still correctly display the ASCII characters (numbers 0 to 127).
  - Netscape Navigator supports UTF-8 from version 4 (June 1997).
  - Microsoft Internet Explorer supports UTF-8 from version 4, for Microsoft Windows (October 1997) and Mac OS (January 1998).
  - browsers based on the Gecko rendering engine (launched in 1998) support UTF-8: Mozilla, Mozilla Firefox, SeaMonkey, etc.
  - Opera supports UTF-8 from version 6 (November 2001).
  - Konqueror supports UTF-8.
  - Safari on Macintosh and Windows supports UTF-8.
  - OmniWeb on Macintosh supports UTF-8.
- Files and file names: increasingly common on GNU/Linux systems, not very well supported under Windows.
- Email clients
  - Thunderbird supports UTF-8.
  - Si.Mail supports UTF-8.
  - some webmail services, such as imp.free.fr, are not compatible with this standard.
See also
Internal links
- UTF-16, UTF-32, CESU-8
- Unicode, ISO/IEC 10646
- ISO 646, ASCII
- ISO 8859, ISO 8859-1
- parser
- Wikipedia help on special characters
External links
- Conversion form for UTF-8, UTF-16, UTF-32
- “Conformité”, French translation of the Unicode Standard.
- “Conformance”, The Unicode Standard 4.0, chapter 3.2 pp. 60-61, chapter 3.4 pp. 64-65 and chapter 3.9 pp. 73-81, August 2003, ISBN 0-321-18578-1.
- RFC 3629, UTF-8, a transformation format of ISO 10646, November 2003 (standard, fully compatible with Unicode);
- RFC 2279, UTF-8, a transformation format of ISO 10646, January 1998 (old revision, obsolete);
- RFC 2044, UTF-8, a transformation format of Unicode and ISO 10646, October 1996 (initial version approved by the ISO, obsolete);
- Original paper on UTF-8 by Rob Pike and Ken Thompson (informative, obsolete).
- RFC 2277, IETF Policy on Character Sets and Languages, January 1998.
- History of the creation of UTF-8, by Rob Pike.