UTF-EBCDIC
UTF-EBCDIC is a character encoding used to represent Unicode characters. It is designed to be compatible with EBCDIC, so that existing EBCDIC applications on mainframes can accept and process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to those of UTF-8 for ASCII-based systems. The details of the UTF-EBCDIC transformation are defined in Unicode Technical Report #16 (UTR #16).
Transformation of a Unicode code point into a UTF-EBCDIC sequence
Intermediate UTF-8-MOD transformation
To produce the UTF-EBCDIC encoding of a sequence of Unicode code points, an intermediate transformation similar to UTF-8 (called UTF-8-MOD in the specification) is applied first. The main difference between this intermediate transformation and UTF-8 is that it represents the code points U+0080 to U+009F (the C1 control characters) in a single byte, so that they can continue to be used as EBCDIC control codes. For that purpose, the bit pattern 101xxxxx is used instead of 10xxxxxx for the trail bytes of a multi-byte sequence representing a single code point. Since this leaves only 5 significant bits per trail byte instead of 6, UTF-EBCDIC often produces a slightly longer result than UTF-8 for the same input data.
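The bit layout of this intermediate step can be sketched in Python. This is a minimal sketch, not a reference implementation. The function name `utf8_mod_encode` is hypothetical, and the range limits and lead-byte markers reflect the bit distribution of UTR #16 as understood here, so they should be checked against the report before any real use.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode code point as a UTF-8-MOD byte sequence (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x9F:
        # ASCII, C0 controls and C1 controls: a single byte, value unchanged.
        return bytes([cp])
    # (upper limit, sequence length, lead-byte marker); assumed from UTR #16.
    for limit, length, marker in (
        (0x3FF, 2, 0xC0),      # 110yyyyy + 1 trail byte
        (0x3FFF, 3, 0xE0),     # 1110zzzz + 2 trail bytes
        (0x3FFFF, 4, 0xF0),    # 11110www + 3 trail bytes
        (0x10FFFF, 5, 0xF8),   # 1111100w + 4 trail bytes
    ):
        if cp <= limit:
            break
    out = bytearray(length)
    # Fill the trail bytes from the end: 101xxxxx carries 5 bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0xA0 | (cp & 0x1F)
        cp >>= 5
    out[0] = marker | cp  # remaining high bits go into the lead byte
    return bytes(out)
```

With this sketch, U+0400 takes three bytes, whereas `chr(0x400).encode("utf-8")` takes two, which illustrates the length penalty of the 5-bit trail bytes. Surrogate code points are deliberately not rejected here.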
UTR #16 stipulates that the result of this first UTF-8-MOD transformation should not be used for interchange between systems.
Final byte permutation from ASCII-compatible to EBCDIC-compatible form
The intermediate transformation leaves the data in an ASCII-based format, so a reversible byte permutation is then applied to the intermediate byte sequences, by means of a lookup table, to bring them as close as possible to EBCDIC. However, this is possible only for the invariant positions of EBCDIC: the permutation table is based directly on the reversible conversion from the US version of ISO 646 (commonly called ASCII) to the US version of EBCDIC. UTF-EBCDIC neither defines nor uses any other conversion table for the other national versions of ISO 646 and EBCDIC.
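The mechanism of a reversible byte permutation can be illustrated with a short Python sketch. The table used here is a PLACEHOLDER (a simple rotation of byte values), not the permutation table of UTR #16, which must be taken from the report itself. The names `invert_permutation`, `permute`, `PLACEHOLDER_TO_EBCDIC` and `PLACEHOLDER_FROM_EBCDIC` are hypothetical.

```python
def invert_permutation(table: bytes) -> bytes:
    """Build the inverse of a 256-entry byte permutation table."""
    if sorted(table) != list(range(256)):
        raise ValueError("table must be a permutation of the 256 byte values")
    inverse = bytearray(256)
    for source, target in enumerate(table):
        inverse[target] = source
    return bytes(inverse)


def permute(data: bytes, table: bytes) -> bytes:
    """Apply a 256-entry permutation table to every byte of data."""
    return data.translate(table)


# PLACEHOLDER ONLY: rotates each byte value by 0x40. This is NOT the
# UTR #16 table; it merely stands in for it to show the mechanism.
PLACEHOLDER_TO_EBCDIC = bytes((b + 0x40) & 0xFF for b in range(256))
PLACEHOLDER_FROM_EBCDIC = invert_permutation(PLACEHOLDER_TO_EBCDIC)
```

Because the table is a permutation, applying `PLACEHOLDER_FROM_EBCDIC` after `PLACEHOLDER_TO_EBCDIC` restores the original bytes. The same property is what makes the real second stage reversible.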
Reverse transformation from UTF-EBCDIC to a Unicode code point
These two stages can easily be reversed to recover the Unicode code points.
- The second stage is reversed first, using a second, inverse permutation table, to produce byte sequences in UTF-8-MOD.
- The first stage is then reversed algorithmically.
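The algorithmic reversal of the first stage can be sketched as follows, assuming the input has already been mapped back to UTF-8-MOD by the inverse permutation table. The function name `utf8_mod_decode` is hypothetical, and the lead-byte ranges are an assumption drawn from the bit distribution of UTR #16 as understood here. Overlong forms are not rejected in this sketch.

```python
def utf8_mod_decode(data: bytes) -> list[int]:
    """Decode a UTF-8-MOD byte string into a list of code points (sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x9F:            # 0xxxxxxx or 100xxxxx: single byte
            length, cp = 1, lead
        elif lead <= 0xBF:          # 101xxxxx: trail byte out of place
            raise ValueError(f"unexpected trail byte at offset {i}")
        elif lead <= 0xDF:          # 110yyyyy
            length, cp = 2, lead & 0x1F
        elif lead <= 0xEF:          # 1110zzzz
            length, cp = 3, lead & 0x0F
        elif lead <= 0xF7:          # 11110www
            length, cp = 4, lead & 0x07
        elif lead <= 0xF9:          # 1111100w
            length, cp = 5, lead & 0x01
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated sequence")
        for byte in tail:
            if (byte & 0xE0) != 0xA0:
                raise ValueError("expected a trail byte (101xxxxx)")
            cp = (cp << 5) | (byte & 0x1F)  # each trail byte adds 5 bits
        code_points.append(cp)
        i += length
    return code_points
```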
Detecting lead bytes in text containing UTF-EBCDIC sequences
A drawback of UTF-EBCDIC is that it is not simple to detect, in UTF-EBCDIC text, which bytes delimit each sequence. They are scattered among the 256 possible byte values, and the standard technique requires a lookup table to determine whether a given byte represents a character by itself (apart from the C0 and C1 control codes, grouped between 0x00 and 0x3F, and the control code at 0xFF, a control position in all versions of EBCDIC), is a trail byte, or is a lead byte indicating the actual length of the sequence.
This lookup table is described in UTR #16 and contains flags (shadow flags) for each possible byte value. Its cost, both algorithmically and in terms of performance, is considerable, and ultimately similar to that of the permutation table used in the second stage of the transformation from UTF-EBCDIC. Its benefit remains very limited in terms of performance, since the trail bytes still need special handling (they are all identified by the same flag, because their relative position in the sequence cannot be determined from their byte value alone), and additional read-and-test loops are needed to find the first byte of the sequence.
Consequently, many UTF-EBCDIC implementations make do with the inverse byte permutation table from UTF-EBCDIC to UTF-8-MOD and dispense with the flag table. They then perform a simple value test, knowing that in UTF-8-MOD all trail bytes satisfy a condition that is very simple to test (written here in the syntax of C, C++, Java or C#):
- (byte & 0xE0) == 0xA0
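The same test happens to be valid Python, which makes it easy to sketch how the start of a sequence can be found by stepping backwards over trail bytes. The helper names `is_trail_byte` and `sequence_start` are hypothetical, the input is assumed to be UTF-8-MOD (that is, already passed through the inverse permutation), and the limit of four trail bytes is an assumption based on the longest form described in UTR #16. The sketch does not validate the data.

```python
def is_trail_byte(byte: int) -> bool:
    """True if the UTF-8-MOD byte has the trail pattern 101xxxxx."""
    return (byte & 0xE0) == 0xA0


def sequence_start(data: bytes, index: int) -> int:
    """Return the offset of the first byte of the sequence holding data[index]."""
    start = index
    # Step back while we are on a trail byte; a sequence is assumed to
    # contain at most 4 trail bytes, so the walk is bounded.
    while start > 0 and is_trail_byte(data[start]) and index - start < 4:
        start -= 1
    return start
```

For the bytes `41 E1 A0 A0 42`, calling `sequence_start` on either trail byte (offsets 2 and 3) returns offset 1, the position of the lead byte `E1`.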
Use of UTF-EBCDIC
In general, this encoding is seldom used, even on the EBCDIC-based mainframes for which it was designed. IBM's EBCDIC-based mainframe systems, such as z/OS or MVS, now generally use UTF-16 for full Unicode support: for example, DB2 UDB, COBOL, PL/I, Java and the IBM XML Toolkit all support UTF-16 on IBM systems.
32-bit extension for internal use
The UTF-EBCDIC transformation can sometimes be extended to simplify internal processing, by noting that UTF-EBCDIC sequences limited to 4 bytes can encode any code point up to the end of supplementary plane 3 (that is, up to U+3FFFF). It is thus possible to represent (internally only) all the code points of the Basic Multilingual Plane (BMP) in a form comparable to UTF-16, by also representing the UTF-16 surrogate code points. This readily yields one 32-bit code unit, which remains compatible with EBCDIC, for each code point of the BMP, and two 32-bit code units for each code point of the supplementary planes.
This alternative representation should not be used for interchange between systems, but only to simplify and optimize internal application programming interfaces where EBCDIC characters are exchanged as 32-bit code units (in memory). This limits the number of value tests and avoids systematic recourse to the permutation tables to test character ranges during complex or large-scale text processing (systematic use of the permutation tables is expensive in terms of performance compared with a simple test based on value ranges of the 32-bit code units).
This internal representation currently has no well-defined official name, even though some call it UTF-16-MOD or UTF-16-EBCDIC (unsuitable names, because this transformation produces 32-bit code units, each representing one 16-bit code unit of UCS-2).
Its advantage over the intermediate UTF-8-MOD representation is that no table is needed to determine whether a code unit is the first of a sequence or one of the trailing code units. Indeed, the UTF-16 surrogate code units (which indicate whether a code unit is the first or the second of a sequence) are also represented by contiguous ranges of 32-bit code units in this representation. A simple arithmetic test therefore suffices to determine whether a 32-bit code unit is the first or the second of a pair representing a supplementary code point, or whether it stands alone and represents a code point of the BMP. In addition, this internal representation also preserves the values of all invariant EBCDIC characters.
However, transforming a UTF-EBCDIC sequence into this 32-bit internal representation requires knowing which bytes delimit a UTF-EBCDIC sequence, which requires a flag table (the shadow flags) to interpret UTF-EBCDIC correctly. The reverse, on the other hand, is immediate and requires no table: the reverse transformation is done with simple bit shifts and tests for zero values to determine whether one or more bytes must be emitted in UTF-EBCDIC.
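The shift-and-test idea can be sketched as follows. Since this representation has no official definition, the packing order is an assumption made for illustration: the bytes of one sequence (at most 4) are stored big-endian in the low-order end of the 32-bit unit, so that a single-byte EBCDIC character keeps its value. The names `pack_unit` and `unpack_unit` are hypothetical, and the sequence boundaries are assumed to be already known.

```python
def pack_unit(sequence: bytes) -> int:
    """Pack one UTF-EBCDIC sequence of 1 to 4 bytes into a 32-bit code unit."""
    if not 1 <= len(sequence) <= 4:
        raise ValueError("sequence must be 1 to 4 bytes long")
    return int.from_bytes(sequence, "big")


def unpack_unit(unit: int) -> bytes:
    """Emit the UTF-EBCDIC bytes of a 32-bit code unit using shifts only."""
    if not 0 <= unit <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit value")
    out = bytearray()
    for shift in (24, 16, 8):
        byte = (unit >> shift) & 0xFF
        # Skip leading zero bytes; once a byte has been emitted, keep going.
        if byte or out:
            out.append(byte)
    out.append(unit & 0xFF)  # the last byte is always emitted (NUL included)
    return bytes(out)
```

Under this assumption, the EBCDIC letter A (0xC1) is stored as the unit 0x000000C1, and unpacking needs no table. The round trip relies on multi-byte sequences never starting with a zero byte.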
If storage is not a concern and only limited quantities of characters are involved, this representation will be faster (for example, as an intermediate step for converting UTF-EBCDIC text to upper case when EBCDIC-based tables or algorithms are available, as an intermediate step in computing collation keys based on EBCDIC or UTF-EBCDIC, or internally in lexical analyzers processing text encoded in EBCDIC or UTF-EBCDIC).
On the other hand, its main drawback is obviously its size, twice that of UTF-16 (which is why databases prefer to index or store text and search keys using the more compact UTF-16).
See also
External links
- Unicode Technical Report #16: the definition of UTF-EBCDIC