UTF-8 (UCS Transformation Format, 8 bits) is a character encoding for the Unicode character set (UCS). Each character is encoded as a sequence of one to four bytes. UTF-8 was designed to be compatible with certain software originally written to handle single-byte characters.

UTF-8 is standardized in RFC 3629 (UTF-8, a transformation format of ISO 10646). The encoding was also defined in Technical Report 17 of the Unicode Standard. It is now an integral part of the standard, in its chapter 3, Conformance, and is also approved by the International Organization for Standardization (ISO), the Internet Engineering Task Force (IETF) and most national standards bodies.

The IETF requires that UTF-8 be supported by Internet communication protocols that exchange text.

Description

The number (code point) of each character is given by the Unicode standard.

Characters numbered 0 to 127 are encoded in one byte whose most significant bit is always zero.

Characters numbered above 127 are encoded in several bytes. In this case, the most significant bits of the first byte form a run of 1s whose length equals the number of bytes used to encode the character, and each following byte has 10 as its most significant bits.

This principle could be extended up to six bytes per character, but UTF-8 sets the limit at four. The principle also makes it possible to use more bytes than necessary to encode a character, but UTF-8 prohibits this.
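
The rules above can be sketched in Python. This is only an illustration: the function name encode_code_point is hypothetical, and the range boundaries (0x80, 0x800, 0x10000) and the exclusion of surrogates follow RFC 3629 rather than anything stated in this article.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    # Assumption taken from RFC 3629: surrogates and values above
    # U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, encode_code_point(0x20AC) (the euro sign) gives the three bytes E2 82 AC, the same result as Python's built-in "€".encode("utf-8"). Always choosing the shortest form is what rules out the longer encodings that the bit pattern would otherwise allow.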

In any UTF-8 character string, one can observe that:

  • any byte whose most significant bit is 0 encodes a US-ASCII character in one byte;
  • any byte whose most significant bits are 11 is the first byte of a character encoded in several bytes;
  • any byte whose most significant bits are 10 lies inside a character encoded in several bytes.
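
These three observations translate directly into bit masks. The sketch below is illustrative only: the names classify and count_characters are hypothetical, and the input is assumed to be valid UTF-8.

```python
def classify(byte: int) -> str:
    """Classify one byte of a UTF-8 string by its most significant bits."""
    if (byte & 0x80) == 0x00:   # 0xxxxxxx
        return "ascii"
    if (byte & 0xC0) == 0xC0:   # 11xxxxxx
        return "lead"
    return "continuation"       # 10xxxxxx


def count_characters(data: bytes) -> int:
    """Count characters by counting the bytes that start one.

    Assumes `data` is valid UTF-8; malformed input is not detected.
    """
    return sum(1 for b in data if classify(b) != "continuation")
```

With data = "aé€".encode("utf-8"), which is six bytes long, count_characters(data) returns 3: one ASCII byte and two lead bytes.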

Advantages

  • Universality:

This encoding can represent the thousands of characters of Unicode.
  • Compatibility with US-ASCII:
a text in US-ASCII is encoded identically in UTF-8. Because a character is split into a sequence of bytes (and not into multi-byte words), there is no endianness problem, unlike with encodings such as UTF-16 and UTF-32.
  • Efficiency:
For languages that mostly use US-ASCII characters, UTF-8 requires fewer bytes than UTF-16 or UTF-32.
  • Reusability:
Many programming techniques that are valid for characters uniformly encoded in one byte remain valid with UTF-8, in particular:
* the way of locating the end of a C character string, because the byte 00000000 in a string encoded in UTF-8 is always the null character;
* the way of searching for a substring is identical.
  • Reliability:

It is a self-synchronizing encoding (by reading a single byte one knows whether or not it is the first byte of a character).
* a sequence describing a character never appears inside a longer sequence describing another character (as happens with Shift-JIS).
* There is no “escape” code that changes the interpretation of the rest of a byte sequence.
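
Self-synchronization can be illustrated with a short Python sketch. The helper name char_start is hypothetical, and the input is assumed to be valid UTF-8: starting from an arbitrary byte offset, stepping back over bytes of the form 10xxxxxx lands on the first byte of the character.

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the character covering data[pos].

    Assumes `data` is valid UTF-8 and 0 <= pos < len(data).
    """
    while pos > 0 and (data[pos] & 0xC0) == 0x80:   # 10xxxxxx: inside a character
        pos -= 1
    return pos


data = "a€b".encode("utf-8")       # bytes 61 | E2 82 AC | 62
start = char_start(data, 3)        # offset 3 falls on the last byte of "€"
```

Here start is 1, the offset of the lead byte E2. In valid UTF-8 at most three steps back are ever needed, because a character is never longer than four bytes.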

Disadvantages

  • Variable size:
characters are represented in UTF-8 by byte sequences of variable size, which makes certain operations on character strings more complicated: counting the number of characters, seeking to a given offset in a text file and, in general, any operation requiring access to the character at position N in a string.
  • Efficiency:
For languages using many characters outside US-ASCII, UTF-8 takes up noticeably more space. For example, the common ideograms used in texts in Asian languages such as Chinese, Korean or Japanese (Kanji, for example) use 3 bytes in UTF-8 against 2 bytes in UTF-16. In general, scripts using many characters of value U+0800 or higher take up more memory than if they were encoded in UTF-16.
  • Invalid sequences:
Because of its encoding scheme, it is possible to represent a code point in several ways in UTF-8, which can pose a security issue: a badly written program may accept a number of UTF-8 representations, normally invalid according to RFC 3629 but not according to the original specification, and convert them to one and the same character. As a result, software that detects certain character strings (to prevent SQL injection, for example) can fail at its task.
Let us take an example drawn from a real case of a virus attacking HTTP web servers in 2001. A sequence to be detected could be “/../”, represented in ASCII (and therefore in UTF-8) by the bytes “2F 2E 2E 2F” in hexadecimal notation.
However, a malformed way of encoding this string in UTF-8 would be “2F C0 AE 2E 2F”, also called an overlong form. If the software is not carefully written to reject this string, for example by putting it into canonical form, a potential security hole is opened. This attack is called directory traversal.
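
The overlong form in this example can be reproduced in Python. This is a sketch under stated assumptions: naive_two_byte and contains_traversal are hypothetical names, the first imitating a careless decoder and the second a filter that decodes strictly before inspecting the text. It relies on Python's built-in UTF-8 decoder, which rejects overlong forms.

```python
MALFORMED = bytes.fromhex("2F C0 AE 2E 2F")   # "/../" with an overlong first dot


def naive_two_byte(lead: int, cont: int) -> int:
    """Careless decoding of a 2-byte sequence: payload bits only, no range check."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def contains_traversal(raw: bytes) -> bool:
    """Hypothetical filter: decode strictly first, then look for '/../'."""
    text = raw.decode("utf-8")   # raises UnicodeDecodeError on overlong forms
    return "/../" in text


# A byte-level comparison misses the malformed sequence ...
missed = b"/../" not in MALFORMED
# ... although a careless decoder would turn C0 AE into "." (0x2E).
dot = naive_two_byte(0xC0, 0xAE)
```

Here missed is True and dot equals 0x2E, which is exactly the mismatch an attacker exploits. Calling contains_traversal(MALFORMED) raises UnicodeDecodeError instead of returning a misleading answer, while contains_traversal(b"/../") returns True.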

History

UTF-8 was invented by Kenneth Thompson during a dinner with Rob Pike around September 1992. It was immediately used in the Plan 9 operating system on which they were working. One constraint to be solved was to encode the null character and “/” as in ASCII, and to ensure that no byte encoding another character has the same code. UNIX operating systems could thus continue to search for these two characters in a string without software changes.

Support

  • Web browsers: UTF-8 support started to become widespread from 1998.
    • old web browsers that do not support UTF-8 still display the ASCII characters (numbers 0 to 127) correctly.
    • the Netscape Navigator browser has supported UTF-8 since version 4 (June 1997).
    • the Microsoft Internet Explorer browser has supported UTF-8 since version 4, for Microsoft Windows (October 1997) and Mac OS (January 1998).
    • browsers based on the Gecko rendering engine (launched in 1998) support UTF-8: Mozilla, Mozilla Firefox, SeaMonkey, etc.
    • the Opera browser has supported UTF-8 since version 6 (November 2001).
    • the Konqueror browser supports UTF-8.
    • the Safari browser on Macintosh and Windows supports UTF-8.
    • the OmniWeb browser on Macintosh supports UTF-8.
  • Files and file names: increasingly common on GNU/Linux systems, not very well supported on Windows.
  • Mail clients
    • Thunderbird supports UTF-8
    • Si.Mail supports UTF-8
    • some webmail services, such as imp.free.fr, are not compatible with this standard.

See also

External links

  • Conversion form for UTF-8, UTF-16, UTF-32
  • “Conformité”, French translation of the Unicode Standard.
  • “Conformance”, The Unicode Standard 4.0, chapter 3.2 pp. 60-61, chapter 3.4 pp. 64-65 and chapter 3.9 pp. 73-81, August 2003, ISBN 0-321-18578-1.
  • RFC 3629, UTF-8, a transformation format of ISO 10646, November 2003 (standard, fully compatible with Unicode);
    • RFC 2279, UTF-8, a transformation format of ISO 10646, January 1998 (old revision, obsolete);
    • RFC 2044, UTF-8, a transformation format of Unicode and ISO 10646, October 1996 (initial version approved by the ISO, obsolete);
    • original paper on UTF-8 by Rob Pike and Ken Thompson (informative, obsolete).
  • RFC 2277, IETF Policy on Character Sets and Languages, January 1998.
  • History of the creation of UTF-8, by Rob Pike.

© 2007-2008 speedlook.com; article text available under the terms of GFDL, from fr.wikipedia.org