GB 18030

GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC).

History

This character set was formerly called "Chinese National Standard GB 18030-2000: Information technology -- Chinese ideograms coded character set for information interchange -- Extension for the basic set". GB is the abbreviation of Guojia Biaozhun (国家标准), which means "national standard" in Chinese.

The standard was published by the China Standard Press in Beijing (the capital of the PRC) on March 17, 2000 and updated on November 20, 2000. Since September 1, 2001, support for this character set has been officially mandatory in the PRC for all software sold to end customers there.

Description

GB18030 can be regarded as a Unicode Transformation Format (UTF), i.e. an encoding of all Unicode code points, that maintains compatibility with a legacy character set. In other words, it is a Chinese counterpart of UTF-8 (which maintains compatibility with ASCII). Like UTF-8, GB18030 is a superset of ASCII and can represent the whole range of Unicode code points. Because of this equivalence with Unicode, GB18030 supports both simplified and traditional Chinese characters.
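As a rough illustration of these two properties, the sketch below relies on the `gb18030` codec bundled with Python's standard library (an assumption about the reader's environment, not something the standard prescribes). It checks that ASCII text keeps the same bytes and that a handful of sample code points, including two outside the BMP, survive a round trip.

```python
# Assumes CPython's built-in "gb18030" codec; the sample characters are arbitrary.
samples = ["A", "中", "\u00de", "\U00020000", "\U0010ffff"]


def round_trips(ch: str) -> bool:
    """Encode to GB18030 and decode back; True if the character is preserved."""
    return ch.encode("gb18030").decode("gb18030") == ch


ascii_text = "Hello, GB18030!"
# ASCII text should be byte-for-byte identical in both encodings.
ascii_unchanged = ascii_text.encode("gb18030") == ascii_text.encode("ascii")
all_round_trip = all(round_trips(ch) for ch in samples)
# Number of bytes each sample character occupies in GB18030.
lengths = {f"U+{ord(ch):04X}": len(ch.encode("gb18030")) for ch in samples}

print(ascii_unchanged, all_round_trip, lengths)
```

The sample is deliberately small; it shows the one-, two- and four-byte forms but is not a conformance test.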

GB18030 also preserves compatibility with GBK, the pre-existing character standard in the PRC, in order to simplify updating data and software to use GB18030. The one exception is the euro sign, which is encoded as the single byte 0x80 in the latest Microsoft version of GBK and as the two-byte code A2 E3 in GB18030. Part of the mapping data comes from a lookup table (as in GBK). The rest is calculated algorithmically. Unfortunately, GB18030 also inherits the undesirable aspects of the legacy standards on which it is based (in particular, a special algorithm is needed to search for ASCII characters within GB18030 sequences).
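A small check of this compatibility, assuming Python's built-in `gbk` and `gb18030` codecs are a fair stand-in for the two standards: text that GBK can encode should yield the same bytes in GB18030, and the euro sign should come out as the two-byte code quoted above.

```python
# Assumes CPython's built-in "gbk" and "gb18030" codecs; the sample text is arbitrary.
text = "中文 GBK"
same_bytes = text.encode("gbk") == text.encode("gb18030")

# The euro sign (U+20AC) as GB18030 encodes it.
euro = "\u20ac".encode("gb18030")

print(same_bytes, euro.hex())
```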

Many software companies have already standardized on some version of Unicode for their internal data representation and system calls. However, most support only the code points of the BMP (Basic Multilingual Plane) originally defined in Unicode 1.0, which supported only 65,536 code positions and was often encoded in 16 bits as UCS-2.

In a change of historic importance for Unicode-conformant software, the People's Republic of China decided to require support for some code points outside the BMP. This means that software can no longer assume that all characters are fixed-size 16-bit entities (UCS-2). Consequently, it must either process data with variable-width characters (such as UTF-8 or UTF-16), which are the most common choices, or move to a larger fixed width (such as UCS-4 or UTF-32). Microsoft carried out the migration from UCS-2 to UTF-16 with Windows 2000.
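To make the difference concrete, the following sketch counts 16-bit code units using Python's standard `utf-16-le` and `utf-32-le` codecs. The helper name `utf16_units` and the two sample characters are illustrative choices only.

```python
bmp_char = "\u4e2d"        # an ideograph inside the BMP
supp_char = "\U00020000"   # an ideograph outside the BMP (CJK Extension B)


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


units_bmp = utf16_units(bmp_char)             # fits in a single 16-bit unit
units_supp = utf16_units(supp_char)           # needs a surrogate pair
utf32_bytes = len(supp_char.encode("utf-32-le"))  # fixed four bytes per character

print(units_bmp, units_supp, utf32_bytes)
```

Code that indexes strings by 16-bit unit, as UCS-2 software does, would treat the second character as two separate entities.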

The SimSun 18030 font allows GB 18030 characters to be displayed. It covers all the characters of Unicode 2.1 plus the new characters in the Unicode block CJK Unified Ideographs Extension A.

Technical details

The four-byte scheme can be thought of as consisting of two units of two bytes each. Each unit has a format similar to a two-byte GBK character, but with the second byte in the range 0x30-0x39 (the ASCII codes of the decimal digits). The first byte is in the range 0x81 to 0xFE, as before. This means that a string-search routine that is safe for GBK should also be safe for GB18030 (in the same way that a byte-oriented search routine is reasonably safe for EUC).
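These byte ranges also show why looking for an ASCII digit needs some care: the bytes 0x30-0x39 occur inside four-byte sequences, so a plain byte search can report a match in the middle of a character. The sketch below walks the data one character at a time instead. `find_ascii` is a hypothetical helper, and it assumes the input is well-formed GB18030, treating any non-ASCII lead byte that does not start a four-byte sequence as the start of a two-byte one.

```python
def find_ascii(data: bytes, target: bytes) -> list[int]:
    """Return the offsets where the single ASCII byte `target` is a character.

    Hypothetical helper; assumes `data` is well-formed GB18030.
    """
    wanted = target[0]
    hits = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # One-byte (ASCII) character.
            if b == wanted:
                hits.append(i)
            i += 1
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            # Second byte is a digit: this is a four-byte sequence.
            i += 4
        else:
            # Otherwise assume a two-byte (GBK-style) sequence.
            i += 2
    return hits


# U+00DE (81 30 89 37) followed by a real ASCII "0".
data = bytes.fromhex("81 30 89 37") + b"0"
naive = data.find(b"0")        # plain byte search
safe = find_ascii(data, b"0")  # character-aware search

print(naive, safe)
```

Here the plain search stops at offset 1, inside the four-byte sequence, while the character-aware walk reports only the genuine digit at offset 4.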

That gives a total of 1,587,600 (126 × 10 × 126 × 10) possible four-byte sequences, which is currently sufficient to cover the 1,114,112 (17 × 65,536) code points of Unicode.
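The arithmetic can be reproduced directly from the byte ranges given earlier; the variable names are illustrative only.

```python
lead_bytes = range(0x81, 0xFF)    # 0x81..0xFE inclusive: 126 values
digit_bytes = range(0x30, 0x3A)   # 0x30..0x39 inclusive: 10 values

# Two units, each made of one lead byte and one digit byte.
four_byte_sequences = (len(lead_bytes) * len(digit_bytes)) ** 2
unicode_code_points = 17 * 0x10000   # 17 planes of 65,536 code points
spare = four_byte_sequences - unicode_code_points

print(four_byte_sequences, unicode_code_points, spare)
```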

Unfortunately, matters are complicated by the fact that there is no simple rule for converting four-byte sequences to their corresponding code points. Instead, the sequences are allocated sequentially, in big-endian order (the first byte being the most significant), and only to those Unicode code points that are not mapped in any other way. For example:

U+00DE (Þ) → 81 30 89 37
U+00DF (ß) → 81 30 89 38
U+00E0 (à) → A8 A4
U+00E1 (á) → A8 A2
U+00E2 (â) → 81 30 89 39
U+00E3 (ã) → 81 30 8A 30
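The four-byte codes in these examples can be read as a mixed-radix number whose digits have radices 126, 10, 126 and 10, most significant first. The sketch below (the function names `linear_index` and `from_linear` are hypothetical) computes that sequential number and shows that the four examples are consecutive, skipping U+00E0 and U+00E1 because they already have two-byte codes. Mapping the sequential number back to a code point still requires the standard's range table, which is not reproduced here.

```python
def linear_index(seq: bytes) -> int:
    """Sequential (big-endian) number of a four-byte GB18030 sequence."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def from_linear(n: int) -> bytes:
    """Inverse of linear_index: rebuild the four bytes from the number."""
    n, d4 = divmod(n, 10)
    n, d3 = divmod(n, 126)
    d1, d2 = divmod(n, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


# The four-byte examples from the text, keyed by code point.
examples = {
    0x00DE: "81 30 89 37",
    0x00DF: "81 30 89 38",
    0x00E2: "81 30 89 39",
    0x00E3: "81 30 8A 30",
}
indices = {cp: linear_index(bytes.fromhex(h)) for cp, h in examples.items()}

for cp, idx in indices.items():
    print(f"U+{cp:04X} -> {idx}")
```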

See also

Related articles

  • GBK
  • Guobiao codes
  • CJK
  • Chinese character encoding
  • Comparison of Unicode encodings

External links

  • IANA Charset Registration for GB18030
  • English language summary of GB 18030-2000
  • Introduction to GB18030 including evolution from GB2312 and GBK
  • Authoritative mapping table between GB18030 and Unicode (large XML file; seems to hang Firefox)
  • ICU Converter Explorer: GB18030
  • Unicode CJK Unified Ideographs Extension A (pdf, 1.5 MB)
  • Unicode CJK Unified Ideographs Extension B (pdf, 13 MB)
  • GB18030 Support Package for Windows 2000/XP, including Chinese, Tibetan, Yi, Mongolian and Thai fonts, by Microsoft
  • SIL's freeware fonts, editors and documentation
