Hauptseminar "i18n and l10n", World Wide Web Internationalization and Localization (SS 2002)
Coach: Prof. Dr. François Bry, Dr. Slim Abdennadher, Dr. Norbert Eisinger
Session Date: May 6, 2002
Report on the Topic: Legacy Character Models and an Introduction to Unicode
Author: Oliver M. Bolzer

Legacy Character Models and an Introduction to Unicode

  1. Introduction
  2. The Chaos of Legacy Character Models
  3. Unicode: The Universal Character Set
  4. References

1. Introduction

Computers can only handle numbers; in essence, they are merely very fast calculators. But in order to be useful for more than calculation, computers must be able to handle characters. For this purpose, characters need to be coded as numbers, which requires clear rules on how to map characters to numbers. Three basic concepts need to be defined before character representation in computers can be discussed.

character repertoire
A collection of characters
coded character set
A finite, ordered set of characters that is complete for a given purpose. Each character is assigned a unique and specific number
character encoding
A scheme for the binary representation of the characters in a character set

In the early days of computing, the first character sets included only a small number of characters, mainly due to the limited storage resources available at the time and because English, the language of the computer pioneers, only required a relatively small number of characters to be fully representable. With the world-wide spread of computers, the need to handle more characters from different languages quickly emerged. While many languages can be represented with several dozen or several hundred characters, others need several thousand characters to fully express themselves. Heterogeneous requirements for character sets for different countries and languages led to the development of many incompatible character sets with greatly differing designs.

This report tries to give an overview of the various types of character codes used today and the motivation behind their development, and to give an introduction to Unicode, a new standard trying to solve many of the problems of today's character codes. It does not try to be complete.

Warning:" This report includes characters from many different languages, ranging from Latin to Arabian, Hebrew and East Asian Ideographs as well as Unicode's advance features such as bi-directional text and dynamic composition. Display and printing of this report is heavily dependent on the viewer's browser and locally available fonts. With all neccersary fonts present, the most accurate rendering has been achieved on Mozilla 1.0 RC1 for X11 and the most accurate printing using the same browser on Windows. All other major combinations of browsers and graphical environments were unable to correctly render this report at the time of this writing.

2. The Chaos of Legacy Character Models

2.1 ASCII

The most famous character set, one almost every computer user will have heard of, is ASCII, the American Standard Code for Information Interchange. ASCII was developed in the 1960s by the ASA (American Standards Association), today known as ANSI, the American National Standards Institute, as a character set for telecommunication and computers. There had been other character codes prior to ASCII, like EBCDIC used on IBM mainframes and codes used for teletypes dating back to Morse code, but ASCII was adopted on the PC and spread with it. Today, ASCII is the base for all major character sets, and all of the character sets discussed in this paper retain some form of compatibility with ASCII.

At the time ASCII was developed, computational resources, especially memory, were very expensive, and for its intended use in telecommunication the code was restricted to the bare minimum. The first version of ASCII, defined in 1963 [ASCII63], only had capital letters. Of course, this short-coming was quickly noticed, and ASCII was extended in 1967 by the ECMA, the European Computer Manufacturers Association, as [ECMA6] and later adopted as [ASCII67], containing 94 visible characters, the space character and 32 control characters (like delete, escape and line feed) encoded in 7 bits, as shown in Figure 1.

ASCII table
Fig. 1: ASCII graphical characters,
(c)Roman Czyborra, http://czyborra.com/charsets/iso646.html

Because the only languages that could be fully written using the characters found in ASCII were Latin, Swahili, Hawaiian and American English, [ECMA6] defines the ASCII-compatible character set as the International Reference Version (IRV) and allows the lesser used code points of ASCII (# $ @ [ \ ] ^ ` { | } ~) to be substituted in national variants. For example, the German variant, DIN 66003, substitutes @ with the section sign (§) and seven other code points with the "Umlaut" characters (ä ö ü Ä Ö Ü) and the Eszett character (ß), keeping the # and $ characters (Fig. 2). The Japanese variant, JIS X 0201, on the other hand, for which the possible substitutions were insufficient for Japanese anyway, replaced only the BACKSLASH character with the YEN currency symbol and the TILDE with an OVERLINE.

ASCII table
Fig. 2: ISO-646-DE (DIN 66003) graphical characters,
(c)Roman Czyborra, http://czyborra.com/charsets/iso646.html

[RFC1345] lists 25 such variants of ASCII. In 1972, the International Organization for Standardization (ISO) collected the major national variants and published them as [ISO646]. The IRV was adopted as ISO-646-IRV and the national variants were called ISO-646-xx, like ISO-646-DE for German and ISO-646-JP for Japanese.

By using the national variants, the basic demands of many countries and languages using the Latin alphabet could be fulfilled, but the situation was far from satisfactory. Using the variants, usually only one language could be properly represented at a time, and there was no standard way to indicate which variant a specific text was using.

[JENNINGS] has a thorough explanation of the history of character codes up to ASCII, including the historical reason for the included characters and their layout.

2.2 European Languages: ISO-8859

In order to handle even just the West European languages like German, French, Italian and British English, with all their accented characters, at the same time, the 128 code points of 7-bit ASCII were obviously not enough. Fortunately, computers handle data in units of 8 bits, called bytes. ASCII being a 7-bit code, the 8th bit was sometimes used for control purposes but was usually set to zero, because it was very inconvenient, from a programmer's view, to squeeze 8 7-bit ASCII characters into 7 bytes. So, if this 8th bit was utilized, the number of characters a single byte could represent would double from 128 to 256.

In the mid-1980s, the ECMA started to design character sets that would be 8-bit and be able to represent multiple European languages. A basic design principle was to be compatible to ASCII, so it was decided that code points 0x00 - 0x7f would be bit-by-bit compatible with ASCII and code points 0x80 - 0xff would contain additional characters. This way, systems that could handle the new character sets could also transparently handle legacy documents containing only ASCII characters.

[ECMA94] defined 4 character sets, one each for West, Central (East), South and North Europe, called Latin alphabet No. 1, 2, 3 and 4. Again, as with [ECMA6] and [ISO646], ISO adopted the various Latin alphabets as [ISO8859]. Later, more alphabets with the same design philosophy were added. As the West European Latin alphabet, called Latin-1 or ISO-8859-1, found wide adoption in West European countries, it became apparent that some characters were missing from it and others were rarely needed. So Latin alphabet No. 9 (ISO-8859-15) (Fig. 3) was defined as the successor of Latin-1, replacing some characters and adding others, especially the € sign, which didn't exist at the time [ECMA94] was created.

ISO-8859-15 table
Fig. 3: ISO-8859-15, additional graphical characters,
(c)Roman Czyborra, http://czyborra.com/charsets/iso8859.html

Fig. 4 gives an overview of the Latin alphabets, their ISO-8859 numbers and their coverage. They cover most of Europe, but still, several of the character sets can't be inter-mixed without special handling by applications. Also, many systems, especially on the Internet, can't handle 8-bit characters cleanly, as they had been designed only with 7-bit ASCII in mind. If an 8-bit text passes through such a system, usually the 8th bit is zeroed and all characters originally having a code value higher than 0x7f are mangled. To prevent such damage, a character encoding scheme must be applied to the 8-bit text. For ISO-8859, it is common to encode characters from the 0x80 - 0xff range using the "quoted-printable" encoding [RFC1521]. In this scheme, the character to be encoded is represented by three characters from the ASCII character set: the = (equal) sign, signalling the beginning of an encoded character, followed by the character's hexadecimal code value. For this to work, the = sign used as the encoding delimiter needs to be encoded, too. (Fig. 5; a short demonstration in code follows the figure.)

ISO-8859-1:  Western, West European
ISO-8859-2:  Central European, East European
ISO-8859-3:  South European, Esperanto
ISO-8859-4:  North European
ISO-8859-5:  Cyrillic
ISO-8859-6:  Arabic
ISO-8859-7:  Greek
ISO-8859-8:  Hebrew
ISO-8859-9:  Turkish
ISO-8859-10: Nordic
ISO-8859-11: Thai
ISO-8859-13: Baltic Rim
ISO-8859-14: Celtic
ISO-8859-15: Euro symbol, revision of ISO-8859-1
ISO-8859-16: Romanian

Fig. 4: Major ISO-8859 Latin alphabets and their coverage

Character | ISO-8859-1 code | encoded
ä         | 0xe4            | =E4
ß         | 0xdf            | =DF
=         | 0x3d            | =3D

Fig. 5: The Quoted-Printable Character Encoding
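
The scheme can be sketched in a few lines of Python (the language is this report's choice for illustration; the quopri module ships with modern Python interpreters, which postdate this report). The byte values are those of Fig. 5:

    import quopri

    # Encode the characters from Fig. 5 as ISO-8859-1 bytes first, then
    # apply the quoted-printable scheme to make them safe for 7-bit channels.
    for ch in ("ä", "ß", "="):
        raw = ch.encode("iso-8859-1")                  # e.g. "ä" -> byte 0xE4
        qp = quopri.encodestring(raw).decode("ascii")  # 0xE4 -> "=E4"
        print(ch, hex(raw[0]), qp)

    # Expected output:
    # ä 0xe4 =E4
    # ß 0xdf =DF
    # = 0x3d =3D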

2.3 CJK: Chinese-Japanese-Korean

CJK is an acronym for Chinese, Japanese and Korean, commonly used in the context of internationalization. Why these 3 East Asian languages? They all use ideographic characters. Ideographic characters differ from phonetic characters, like the Latin alphabet, in that they not only have a pronunciation but also carry a meaning themselves (Fig. 6 shows some examples). The ideographs used in all 3 countries date back to the Chinese Han dynasty but have since evolved separately. In Japanese they are called Kanji, in Chinese Hanzi and in Korean Hanja. Note the similarity of the names, though the spoken languages are totally different today. In addition to the ideographs, Japanese and Korean also have their own phonetic scripts. Because the ideographs are like words themselves, there are many more of them than the 256 that would fit into an 8-bit character set. It is said that you need to know roughly 2,000 ideographs to be able to read a Japanese newspaper. In order to properly write people's names and names of places, computer systems need to handle more than 20,000 ideographs at the minimum. In Chinese, the number is even higher. [UNIHAN] is a large database of these ideographs together with sample glyphs and their meanings.

[table of three ideographs with the meanings "dog", "water" and "friendly"]

Fig. 6: CJK-Ideographs and their meanings

In order to handle the large number of characters, multiple bytes were needed to represent a single character. In 1978, JIS X 0208 (originally published as JIS C 6226) became the first standardized character set that used 16 bits (= 2 bytes) for each character. It was designed for Japanese and contained, besides the Japanese Hiragana and Katakana alphabets and the most important Kanji, the basic Latin, Greek and Cyrillic alphabets as well as many symbols. The idea was to make the character set convenient for every-day use by Japanese businesses.

Though JIS X 0208 could theoretically have contained 2^16 = 65,536 characters, in order to maintain compatibility with ASCII it was organized in 94 rows of 94 cells each, so that they could be mapped over the 94 graphical characters of ASCII. So actually only 94 x 94 = 8,836 characters are included. Later, additional characters were added as extensions to JIS X 0208. Fig. 7 shows a single row of JIS X 0208.

JIS X 0208
Fig. 7: a small part of JIS X 0208

For Chinese, mainland China defined GB 2312 for Simplified Chinese characters in the same 94x94 layout as JIS X 0208. Taiwan, on the other hand, defined its own character set, CNS 11643 Plane 1, for Traditional Chinese characters. To make matters worse, another character set for Traditional Chinese, Big5, an industry standard, is widely used. (see [WITTERN] for an overview of Chinese character codes)

Now, for a byte stream containing text in a multi-byte character code, the text needs to be specially encoded to be recognized as such. Otherwise it would not be possible to decide where character boundaries are, or whether a specific byte is part of a multi-byte character or a single ASCII character. Very often multiple encodings exist for a single character set. For JIS X 0208 alone, there are three widely used encodings: raw JIS, EUC-JP and Shift-JIS. In the raw JIS encoding, the first byte holds the raw row value and the second byte the raw cell value of the character; distinction from a stream of ASCII characters is impossible. EUC-JP, or Extended Unix Code, simply sets the 8th bit that is not used by ASCII to distinguish multi-byte characters. As the name implies, EUC-JP is widely used in UNIX environments. Another encoding scheme is Shift-JIS, also known as MS-Kanji, because it was developed by Microsoft for use in its operating systems. In order to add an extra 64 characters, the characters from JIS X 0208 are "shifted" 64 code points and reorganized as 47 rows with 188 cells each. [PING] gives a simple overview of these encodings. Korea also has two character sets in wide use: the national standard KS C 5601, which again has been modeled after JIS X 0208, and UHC, the Unified Hangul Code. Though UHC was designed as a superset of KS C 5601, the two have multiple, totally different encoding schemes. (The sketch below shows the same text in several of these encodings.)
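
The multiplicity of encodings for the same characters can be observed with the codecs that ship with a modern Python interpreter; a minimal sketch (the printed byte values come from the codec data, not from this report):

    # One Japanese string, three byte-level encodings of JIS X 0208 characters.
    text = "漢字"  # "kanji", two ideographs

    for codec in ("euc_jp", "shift_jis", "iso2022_jp"):
        print(codec, text.encode(codec))

    # euc_jp sets the 8th bit of both bytes of each character,
    # shift_jis rearranges the 94x94 grid into 47 rows of 188 cells,
    # iso2022_jp stays pure 7-bit and uses escape sequences (see Section 2.4).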

Confused? Then imagine the nightmare of deciding which encoding a byte stream of text is in and converting between different encodings. Or what happens when you don't even know which language, and thus which potential encoding, a text is supposed to be in? Though there exist methods to identify character sets and encodings ([RFC2278]) as well as ways to specify them, such as in [RFC1521], Section 7.1.1, or [XML], Section 2.12, today's information, especially on the web, is poorly labeled, and a mixture of wild guessing and heuristics is used to determine a document's language and encoding, not even elaborating on the difficulty of creating multi-lingual documents.

2.4 ISO-2022

With the wide-spread use of 8 and 16-bit character sets, most languages could be represented. But still, handling of multiple languages was limited to those included in a single character set. In order to use two or more languages that didn't share a common character set simultaneously, a new and (most likely) incompatible character set had to be created.

Instead of creating dozens of new character sets, ISO-2022 ([ISO2022]) defines a mechanism to use multiple character sets simultaneously by switching between them using escape sequences. Again, there is an identical ECMA standard, [ECMA35].

ISO-2022 divides the 256 code points of a single byte into 4 areas: CL (Control Left, primary control characters), GL (Graphical Left, graphical characters), CR (Control Right, secondary control characters) and GR (Graphical Right, graphical characters), mapped over ASCII's layout of control and graphical characters (Fig. 8). At the beginning of an ISO-2022 encoded text, up to 4 character sets are designated as G0, G1, G2 and G3, using escape sequences, each starting with the ESCAPE control character followed by one or more bytes. Then two of the character sets are assigned to the GL and GR areas as needed. Depending on whether a byte has its 8th bit set or not, it is clear to which character set it belongs. A text can also be encoded into pure 7 bits for transfer over the Internet by utilizing only the GL area and switching the character set assigned to GL. Fig. 9 shows an example text containing US-ASCII and JIS X 0208 characters encoded in ISO-2022. (A code sketch follows the figure.)

[0x00 - 0x1F: CL | 0x20 - 0x7F: GL | 0x80 - 0x9F: CR | 0xA0 - 0xFF: GR]
Fig. 8: ISO-2022 code point organization

byte sequence:  0x1B 0x28 0x42   0x1B 0x24 0x29 0x42   0x0F        0x41 0x42   0x0E        0x41 0x42
as characters:  ESC ( B          ESC $ ) B             SI          A B         SO          (two G1 bytes)
meaning:        G0 = US-ASCII    G1 = JIS X 0208       assign G0   "AB" in     assign G1   one JIS X 0208 character
                                                       to GL       US-ASCII    to GL       (row 0x41, cell 0x42)

Fig. 9: ISO-2022 encoded text
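
Escape sequences like those of Fig. 9 become visible when encoding a mixed ASCII/Japanese string with a modern Python interpreter's iso2022_jp codec, which implements the simplified ISO-2022-JP profile of [RFC1468] discussed below (switching only via escape sequences, not via SO/SI); a minimal sketch:

    text = "AB 漢字 CD"
    data = text.encode("iso2022_jp")
    print(data)   # ESC $ B ... ESC ( B appear around the Japanese part

    # The whole stream stays 7-bit clean, as required for e-mail transport.
    assert all(byte < 0x80 for byte in data)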

Additionally, the current character set can explicitly be switched, either only for the next character or until switched again. Should more than 4 character sets be needed, the Gx areas can be reassigned. The character sets are identified in the escape sequences by codes assigned to them in [ISOESC].

ISO-2022 was supposed to become the one encoding to rule them all, the many character sets, but it isn't used as widely as had been expected at the time of its design. ISO-2022's complexity is the main reason for its failure. It is a stateful encoding, switching from one character set to another, so processing ISO-2022 encoded byte sequences is non-trivial. In order to know which character set is used for a given byte, the whole sequence up to that byte must be analyzed. To make matters worse, if a byte or even a single bit of an ISO-2022 encoded text gets lost or corrupted, the whole meaning of the text can change, especially if an escape sequence is damaged. Additionally, applications must have knowledge of all character sets they expect to encounter in an ISO-2022 encoded text.

ISO-2022 found use on the Internet for the transmission of e-mail messages, but only in simplified forms like ISO-2022-JP ([RFC1468]) and ISO-2022-KR ([RFC1557]), which specify the usable character sets and their fixed assignment to G0 - G3, and restrict switching to specific methods. Among the few large-scale applications known to fully utilize ISO-2022 are newer versions of Emacs and XEmacs, which include multilingual capabilities.

3. Unicode: The Universal Character Set

3.1 Unicode Goals

As the world globalizes more and more, users often need to handle multiple languages and scripts simultaneously. But the multitude and heterogeneity of character sets and encoding schemes, as described in Section 2, can't meet the needs of the beginning multilingual era. Also, with the global spread of personal computers, software manufacturers came under more and more stress. Software was usually developed in English, with only English and 8-bit character sets in mind. After the release of the English version, it often took several months to a year for a company to provide localized versions of its software, because it not only had to translate all the text of the application, but also change the operating semantics of the software according to local needs. And parallel localization work in multiple countries cost real money, very often even more than the initial development of the software. And that for every new version. As a result, customers in non-English countries felt discriminated against, because their software was more expensive, out-of-date and usually badly localized because of limits in the original software.

In the late 1980s, the idea of a Universal Character Set started to emerge at the research centers of various computer firms. In 1987, the term "Unicode" was first used in the course of these discussions. The main difficulty in creating such a character set was that the requirements were not clearly known. After thorough research on the world's characters, the "Unicode 1.0" standard was published by the Unicode Consortium with 4 main design principles, learning from the legacy character sets ([UNICODE3], Section 1.2):

Universal
Enough characters must be included so that all characters used in day-to-day operations world-wide can be represented.
Efficient
The character code must be simple enough that computer systems can easily implement it.
Uniform
The character code shall be uniform, so that sorting, searching, displaying and editing of text can be done efficiently without special exception rules.
Unambiguous
Any given code value always represents the same character.

At roughly the same time, ISO started developing an international standard with the same goals, which would later become ISO/IEC 10646. After the release of Unicode 1.0 in 1991, both efforts realized that having two different, incompatible but universal character sets would be senseless, and merged their work, so that both standards share the same repertoire of characters using identical code numbers. Today the standards are so strongly linked that in technical documentation, very often one or the other is used as reference.

After the merger with ISO/IEC 10646 and the release of Unicode 1.0.1 and ISO/IEC-10646-1:1993, both standards continued to evolve together towards their goal of a Universal Character Set, adding more characters and clarifying various issues such as algorithms and encodings as needed. Today, the newest revisions are ISO/IEC-10646-2:2000 and Unicode 3.2, the latter including more than 90,000 characters. Fig. 10 summarizes the history of Unicode and Fig. 11 shows the number of characters included in major releases of Unicode.

1986-87:  ideas about a unified character set at Xerox
Dec 1987: first use of the term "Unicode"
Feb 1988: basic architecture of Unicode developed at Apple
1988-89:  Kanji unification work
Jan 1991: founding of Unicode Inc.
Oct 1991: Unicode 1.0
Jun 1993: Unicode 1.1 - merger with ISO/IEC 10646
Jul 1996: Unicode 2.0 - extension of the character space by surrogate pairs
1998:     Unicode 2.1 - adds the Euro symbol
Sep 1999: Unicode 3.0 - adds precise encoding and property definitions
Mar 2001: Unicode 3.1 - adds 44,946 new characters
Mar 2002: Unicode 3.2

Fig. 10: history and major revisions of Unicode ([UNIHIST], [UNIVERS])

# of characters
Fig. 11: Numbers of characters in Unicode ([UNICODE3],[UNICODE31],[UNICODE32])

3.2 The Unicode Consortium

When the Unicode Standard started to take form, the formation of the Unicode Consortium was announced; it was incorporated shortly thereafter as the non-profit organization Unicode, Inc. in 1991. The consortium's mission is to define Unicode characters and their relationship to each other, and to provide technical information and guidelines to implementers of the Unicode Standard.

The consortium funds itself through sales of the standard in printed form and fees from its members, which include prominent computer hardware and software manufacturers like IBM, Hewlett-Packard, Oracle, SAP, Adobe, Apple and Microsoft, just to name a few. Individuals can join either as Specialist or Individual members, neither having voting rights but the former having full access to all members-only documents. ([UNIMEMBER] contains the full list of consortium members, and [UNIJOIN] an overview of membership types and benefits.)

Additionally, the consortium has liaison relationships with other national and international standardization bodies like the Internet Engineering Task Force (IETF), the World Wide Web Consortium (W3C), the High Council of Informatics of Iran and the several joint working groups of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) working on internationalization. Especially the relationship with ISO/IEC JTC1/SC2/WG2 is important, as that group works closely with Unicode on the ISO/IEC 10646 standard.

3.3 Unicode Design

In order to realize the ambitious idea of an easily implementable Universal Character Set, the creators of Unicode made a simple but well-thought-out set of design decisions. They were not so arrogant as to ignore existing legacy character sets; instead they embraced compatibility with them to ease migration. For this purpose, some decisions were made against the ultimate goals mentioned in Section 3.1, but paving the road for fast adoption of Unicode was given priority.

3.3.1 Characters, not glyphs

Unicode does not encode glyphs but characters. Fig. 12 shows some glyphs. They are all visually different from each other, but semantically they all represent the character "A".

LATIN CAPITAL LETTER A
Fig. 12: Various glyphs of the character "A"

In Unicode, each semantically distinct character is given a unique name, by which the characters are distinguished. Reusing the above example, the character "A" has been given the name LATIN CAPITAL LETTER A. Most characters are given descriptive names like GREEK SMALL LETTER PSI (ψ) or HIRAGANA LETTER KA (か), but the many ideographs are named according to their number, like CJK UNIFIED IDEOGRAPH-5C71 (山), because giving them distinguishing names would have been too complex.
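
These formal names can be looked up programmatically; a minimal sketch in Python, whose standard unicodedata module exposes the Unicode character database:

    import unicodedata

    # The formal Unicode names of the examples from the text.
    for ch in ("A", "ψ", "か", "山"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

    # U+0041 LATIN CAPITAL LETTER A
    # U+03C8 GREEK SMALL LETTER PSI
    # U+304B HIRAGANA LETTER KA
    # U+5C71 CJK UNIFIED IDEOGRAPH-5C71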

3.3.2 16-bit fixed-length

Each character in Unicode's character repertoire is assigned a unique number, making Unicode a coded character set. To have enough space for all the characters in the world, each character's code is 16 bits long, allowing for a total of 65,536 code values. A character's value is noted in its hexadecimal form with the 'U+' prefix. For example, the code of LATIN CAPITAL LETTER A is U+0041.

Deciding on a multiple of 8 was logical in the context of byte-oriented computer systems, and a 24-bit code seemed unnecessary and a hindering factor in Unicode adoption, requiring three times as much storage space as legacy codes from an American/European point of view. The 16-bit decision was made despite heavy protests from East Asian countries, which knew from the experience with their own local 16-bit character sets that 16 bits would be insufficient even for their own needs.

Unsurprisingly, after the first implementations of Unicode were introduced into the market and the products could not be decently used in East Asian markets, it finally became apparent to Unicode's designers that the 16-bit code space was indeed insufficient. Beginning with Unicode 2.0, an extension mechanism was introduced that allows an additional 16*2^16 characters to be added to Unicode by using two 16-bit values, called a surrogate pair, to represent an additional character. This way, Unicode still is a 16-bit character set.

The new design of Unicode divides the expanded code points U+000000 - U+10FFFF into 17 "planes" of 2^16 code points each. Plane 00, containing the original Unicode code points U+0000 through U+FFFF, is called the Basic Multilingual Plane (BMP); the others are called supplementary planes. Characters in the BMP are special, because they can be represented by a single 16-bit value. Unicode 3.1 was the first Unicode Standard to include characters outside the BMP, and as of Unicode 3.2, 44,944 characters out of a total of 95,156 characters are located in the supplementary planes. (The plane arithmetic is sketched below.)
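
The plane of a code point is plain bit arithmetic; a minimal sketch in Python:

    def plane(code_point: int) -> int:
        """Plane of a code point: 0 is the BMP, 1-16 are supplementary."""
        return code_point >> 16

    print(plane(0x0041))    # 0  -- LATIN CAPITAL LETTER A lies in the BMP
    print(plane(0x10FFFF))  # 16 -- the very last supplementary code point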

3.3.3 Universal Character Set

To be universal, Unicode has to include as many characters of as many languages as possible. But what languages are used in a heterogeneous world? What kinds of scripts do they use? How should they be prioritized? Are they all equal, or are some more equal than others? The Unicode Consortium was faced with many difficult decisions, as languages and characters are often matters of national pride, and the omission of a single character could lead to the boycott of the standard in a whole region.

Based on thorough research on the topic, Unicode was divided into blocks of different sizes and the scripts of various languages were allocated to specific blocks. This way, characters from the same script would logically be grouped together and would still have space to add new characters without disturbing their grouping. Fig. 13 shows the major blocks of Unicode and their code regions.

Name                       | Region           | Description
General Scripts            | U+0000 - U+1FFF  | "normal" characters like the Latin alphabet, Greek, Hebrew, Thai, Tibetan. Smaller scripts are included here.
Symbols                    | U+2000 - U+2DFF  | punctuation, numbers, chemical symbols, arrows etc. Also contains OCR and Braille characters.
CJK Syllables and Symbols  | U+2E00 - U+33FF  | CJK phonetic characters and symbols
CJK Unified Ideographs     | U+3400 - U+9FFF  | CJK ideographs unified into one repertoire
Yi                         | U+A000 - U+A4CF  | phonetic characters used in South China
Hangul                     | U+AC00 - U+D7A3  | pre-composed Hangul (Korean) characters
Surrogates                 | U+D800 - U+DFFF  | used to represent characters in the supplementary planes
Private Use                | U+E000 - U+F8FF  | to be used freely for private purposes, no compatibility guaranteed
Compatibility and Specials | U+F900 - U+FFFD  | characters needed for compatibility with legacy character sets and characters with special meanings in Unicode

Fig. 13: Major blocks and their size in Unicode

Each block is further divided into smaller sub-blocks for specific scripts and contains unused regions, reserved for future use. [UNICODE3] gives a detailed description of all blocks and the characters contained within, including a sample glyph for each character. U+FFFE and U+FFFF are not considered characters and are not and will not be used in future revisions of Unicode. To discover the endianness of a system, U+FFFE is reserved as the byte-swapped form of U+FEFF (ZERO WIDTH NO-BREAK SPACE), also called the Byte-Order-Mark (BOM). U+FFFF can be used by applications to signal errors or a non-character value. (A serialized example of the BOM follows.)
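
How the BOM discovers endianness is easiest to see in serialized form; a minimal sketch in Python using its standard UTF-16 codecs:

    # Prepending U+FEFF to a text yields the Byte-Order-Mark on the wire.
    for codec in ("utf-16-be", "utf-16-le"):
        print(codec, ("\ufeff" + "A").encode(codec))

    # utf-16-be b'\xfe\xff\x00A' -- a reader seeing FE FF knows big-endian;
    # utf-16-le b'\xff\xfeA\x00' -- the byte-swapped FF FE (U+FFFE, a
    #                               non-character) signals little-endian.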

3.3.4 Logical Order

In Unicode, characters are exclusively stored in logical order. That is the order in which the characters are read, not necessarily the order in which they are displayed on screen or printed. Some scripts, like Latin, Greek and Cyrillic, are written left-to-right, while others, like Arabic and Hebrew, are written right-to-left. Each Unicode character has its writing direction as a property to aid in proper graphical rendering. Additionally, there are invisible control characters that explicitly mark a direction change in cases of bi-directional text where the direction change might be ambiguous.

Display Order: ABCD:אבגד

Logical Order: ABCD:אבגד

Fig. 14: Bi-directional text: storage and display

Storing characters in their logical order complicates graphical rendering but dramatically simplifies operations like searching, sorting and editing.

3.3.5 Character Properties

Unicode characters have well-defined semantics that are specified through character properties. The properties support operations like parsing and sorting, as well as other algorithms that need semantic knowledge about the characters. Some properties are normative and some are only informative. For normative properties, applications conforming to the Unicode standard must react if they encounter a character having such a property. For informative properties, it is up to the application whether to honor them or not. Below is a small list of the most important properties and their descriptions (a short demonstration follows the list). It does not include all normative properties; [UNICODE3], Chapter 4, lists all properties and their status as well as full descriptions.

Alphabetic
Set if a character is phonetic and not ideographic.
Case
Some phonetic alphabets have two variants of the same character, like "A" and "a".
Directionality
Text direction for characters before and after this character, as mentioned in 3.3.4.
Numeric Value
Characters representing numbers have a value so that they can be used for arithmetic purposes.
Surrogate
Whether a character is part of a surrogate pair.
Decomposition
If the same character can be represented by the use of two other characters, specifies its decomposition rule (see 3.3.6).
Mirrored
The character looks different depending on whether it appears in a left-to-right or right-to-left context, like brackets.
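
Several of these properties are exposed by Python's standard unicodedata module; a minimal sketch:

    import unicodedata

    # A few of the properties listed above, queried per character.
    print(unicodedata.bidirectional("A"))   # 'L' -- left-to-right
    print(unicodedata.bidirectional("א"))   # 'R' -- right-to-left
    print(unicodedata.numeric("٣"))         # 3.0 -- ARABIC-INDIC DIGIT THREE
    print(unicodedata.mirrored("("))        # 1   -- mirrored in RTL context
    print(unicodedata.decomposition("Ç"))   # '0043 0327' -- C + combining cedilla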

3.3.6 Dynamic Composition and Decomposition

Unicode not only includes accented characters like Ü, Ç and ǻ, it also has a mechanism to dynamically create such composed characters by combining a single base character with an arbitrary number of combining characters. Not only can accented characters that are already included in Unicode be emulated this way, but new characters can be composed as well. Fig. 15 shows some examples.

a + ̈ ⇒ ä
C + ̧ ⇒ Ç
a + ̊ + ́ ⇒ ǻ
Fig. 15: Dynamic Composition

Now, with Dynamic Composition, there are several ways to encode a single character. For example, the above-mentioned ǻ character could be encoded as ǻ, as å + ́ or even as a + ̊ + ́, making searching and sorting of text very difficult. In order to solve this ambiguity, characters that can be dynamically composed have a decomposition mapping, defining how a character can be taken apart into its basic parts. Using the decomposed, canonical form as the in-memory representation, searching and sorting become simple again. (The sketch below demonstrates both directions.)
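
Python's unicodedata module implements these canonical mappings as the normalization forms NFD (decompose) and NFC (compose); a minimal sketch:

    import unicodedata

    composed = "\u00E4"      # 'ä' as one precomposed character
    decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

    print(composed == decomposed)                                # False: different code points
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True: canonical decomposition
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True: canonical composition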

3.3.7 Unification

All West European languages as well as some African and South Asian languages use the Latin alphabet as a common script together with their individual extensions, usually accented characters. The same visual character might be pronounced differently in those languages, but it is still the same character. To reduce the number of characters and the redundancy, characters with the same appearance have been unified and allocated only a single code point.

Unification is obvious in the case of the Latin alphabet, but there are many uncertain unification candidates. For example, the comma character is mainly used as a thousands-separator in English but as a decimal-separator in French, yet there is only one COMMA character. Unicode does not differentiate on usage but only on appearance. As another example, the unit symbols for arc minutes and feet and the mathematical prime have been unified as the PRIME character (′). But there are also exceptions, like the Greek Omega character and the Ohm symbol for electrical resistance. These haven't been unified, because of legacy compatibility and their totally different semantics.

The area in most need of unification was the CJK ideographs. Sharing common roots, many of them have a similar or even identical visual appearance. With unification, the more than 130,000 ideographs present in legacy character sets have been reduced to less than 30,000.

But sometimes the characters have evolved differently and look slightly different. Fig. 16 shows two such ideographs, U+6D77 ("ocean") and U+76F4 ("straight"), and their appearances in Traditional Chinese, Simplified Chinese, Korean and Japanese. For users of the Latin alphabet the differences seem subtle, but for the actual users of the languages the difference is very big. The difference is not a glyph problem, but one of the actual shape of the character. If a Japanese student wrote the Korean variant of an ideograph in an ideograph exam, he'd fail it. In some cases, like U+6D77, Japanese readers would probably be able to guess the meaning of the Chinese and Korean variants, but in other cases, like U+76F4, the Chinese variant would be impossible to understand for a Japanese reader, and vice versa. Still, they have been aggressively unified, the only exceptions being those cases where a legacy character set differentiated between variants as separate characters.

U+6D77
U+76F4
Fig. 16: unified ideographs and their possible visual appearances, [KUBOTA]

As Unicode does not include a mechanism to specify the language of a text, applications have to depend on higher-level protocols to help them decide on the correct rendering of the characters, such as the "lang" attribute in XML ([XML], 2.12). Because of this problem with variants, there also can't be a single Unicode font that covers all characters and languages. Fig. 17 shows the same Unicode characters rendered differently according to the lang attribute of HTML (browser and local font dependent).

Locale | Language          | U+5E73 | U+76F4
ja     | Japanese          | 平     | 直
ko     | Korean            | 平     | 直
zh-TW  | Chinese (Taiwan)  | 平     | 直
zh-CN  | Chinese (China)   | 平     | 直

Fig. 17: language-dependent character rendering

Unification has been a source of much grief and chaos since the early days of Unicode. The first revisions of Unicode were practically useless for East Asian countries because of overzealous unification to fit as much as possible into 16 bits. Newer versions of Unicode provide room for more characters, so that variants as well as previously missing ideographs are continuously added to Unicode's repertoire.

3.3.8 Surrogate Pairs

Though the number of characters had been greatly reduced through the unification work, Unicode's designers soon had to realize that the number of characters that could be included in Unicode was simply not enough. Even the vast number of 65,536 characters was insufficient. But with Unicode being a 16-bit fixed-length code, how could characters with numbers above U+FFFF be represented without fundamental changes?

For this purpose, 2048 code points have been reserved as "surrogates" beginning with Unicode 2.0. 1024 are designated "high-surrogates" and another 1024 "low-surrogates". These are not characters themselves, but by combining one high and one low surrogate into a "surrogate pair", they together represent a single Unicode character. A simple algorithm, defined in [UNICODE3], Sec. 3.7, is used to calculate the actual Unicode character number from the surrogate pair; it amounts to the arithmetic sketched below.
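
The calculation is an offset plus two 10-bit halves; a minimal sketch in Python:

    def from_surrogate_pair(high: int, low: int) -> int:
        """Code point represented by a high/low surrogate pair
        ([UNICODE3], Sec. 3.7)."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    print(hex(from_surrogate_pair(0xD800, 0xDC00)))  # 0x10000, first supplementary character
    print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff, last possible code point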

This method is clearly a violation of Unicode's basic philosophy of simplicity, because the surrogates must be handled specially. Still, the scheme is very well thought out to minimize the negative effects, based on experiences with legacy character encodings. Depending on application support, a surrogate pair is either shown as two unknown characters, if the application knows nothing about surrogates, or as a single character, should the application be aware of surrogates. Because a high-surrogate is always followed by a low-surrogate and the encoded character does not depend on any other values before or after the pair, character boundaries are obvious in a sequence of pairs. At the same time, should a character stream be interrupted, the maximum damage is limited to a single character.

With the introduction of surrogate pairs, the potential number of characters that can be included in Unicode increased 17-fold.

3.4 Unicode and ISO/IEC 10646

The Unicode Standard and the ISO-10646 standard are strongly interlinked. As ISO-10646's formal name, "Information technology -- Universal Multiple-Octet Coded Character Set (UCS)", reflects, ISO-10646 has the same basic goal as Unicode: to create a Universal Character Set. (An "octet" is ISO's term for an 8-bit byte.) The two standards have agreed to share the same character repertoire and character numbering, so that both standards are character-by-character equal in their character sets.

Usually, Unicode revisions are published more frequently due to the administrative overhead of ISO standards, but both organizations have agreed to synchronize as often as possible. The advantage of the collaboration for Unicode is that many national standards don't allow industry standards like Unicode to be referenced, but do allow ISO standards. For ISO, on the other hand, Unicode has the computing industry's support, and compatibility with it guarantees industry acceptance and feedback.

Unlike Unicode, ISO-10646 doesn't limit itself to 16 bits. ISO-10646 is a 4-octet (32-bit) character set capable of including more than 2*10^9 characters (the highest bit is not used). It is organized into 128 groups, each containing 256 planes that in turn contain 256 rows with 256 cells each. Plane 0x00 of Group 0x00 is called the Basic Multilingual Plane (BMP) and has exactly the same size and code points as Unicode's BMP. Planes 0x01 to 0x10 contain the characters from Unicode's 16 supplementary planes, which are represented using surrogates in Unicode. Though ISO-10646 could contain many more characters than Unicode, the additional groups and planes are currently reserved for future use, and no characters can be defined there, in order to maintain compatibility with Unicode as long as possible.

ISO-10646's canonical representations are UCS-4 and UCS-2. In UCS-4, a character's number is encoded using one octet each for the group, plane, row and cell number. Should a text only contain characters from the BMP, the group and plane octets can be omitted; this 2-octet representation is called UCS-2. Beware that ISO-10646 doesn't have the surrogate mechanism: if a Unicode text is interpreted as UCS-2, all characters above U+FFFF will be lost. (The sketch below shows both forms.)
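
Since UTF-32-BE shares UCS-4's octet layout for valid code points, and for BMP characters UTF-16-BE coincides with UCS-2, both forms can be inspected with a modern Python interpreter; a minimal sketch:

    # A character's UCS-4 form spells out group, plane, row and cell directly.
    ch = "漢"                                    # U+6F22, a BMP character
    group, plane, row, cell = ch.encode("utf-32-be")
    print(group, plane, hex(row), hex(cell))     # 0 0 0x6f 0x22

    # Dropping the two zero octets of group and plane yields UCS-2:
    print(ch.encode("utf-16-be"))                # b'o"' == bytes([0x6F, 0x22])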

4. References


[JENNINGS]
"World Power Systems:Texts:Annotated history of character codes", Tom Jennings, 2001, http://www.wps.com/texts/codes/
[ASCII63]
X3.4-1963, AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE, American Standards Association (ASA), June 1963, http://www.wps.com/texts/codes/X3.4-1963/index.html
[ASCII67]
ANSI X3.4-1967, AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE, American Standards Association (ASA), 1967
[ECMA6]
Standard ECMA-6 7-bit Coded Character Set, 6th edition (Dec.1991), ECMA, http://www.ecma.ch/ecma1/STAND/ECMA-006.HTM
[ISO646]
"ISO/IEC 646 Information technology -- ISO 7-bit coded character set for information interchange", ISO
[RFC1345]
"Character Mnemonics & Character Sets", RFC1345, IETF Network Working Group, Jun 1992, http://www.ietf.org/rfc/rfc1345.txt
[ECMA94]
Standard ECMA-94 8-Bit Single-Byte Coded Graphic Character Sets - Latin Alphabets No. 1 to No. 4, 2nd edition, ECMA, http://www.ecma.ch/ecma1/STAND/ECMA-094.HTM
[ISO8859]
ISO/IEC 8859 Information technology -- 8-bit single-byte coded graphic character sets, International Organization for Standardization
[RFC1521]
"MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies",RFC1521, IETF Network Working Group, Sep 1993, http://www.ietf.org/rfc/rfc1521.txt
[PING]
"Japanese text encoding, stuff I've learned through creating Shodouka", Ping, 1996, http://web.lfw.org/text/jp.html
[RFC2278]
"IANA Charset Registration Procedures", RFC2278, The Internet Society, Jan 1998, http://www.ietf.org/rfc/rfc2278.txt
[XML]
"Extensible Markup Language (XML) 1.0 (Second Edition)", World Wide Web Consortium, Oct 2000, http://www.w3.org/TR/2000/REC-xml-20001006
[UNIHAN]
"Unihan Database", Unicode Inc., http://www.unicode.org/charts/unihan.html
[WITTERN]
"Chinese character codes: an update", Christian Wittern, May 1995, http://www.iijnet.or.jp/iriz/irizhtml/multling/codes.htm
[ISO2022]
"ISO/IEC 2022 Information technology -- Character code structure and extension techniques", ISO
[ECMA35]
"Standard ECMA-35 Character Code Structure and Extension Techniques", ECMA, http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM
[ISOESC]
"ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences", Information Processing Society of Japan/Information Technology Standards Commission of Japan, http://www.itscj.ipsj.or.jp/ISO-IR/
[RFC1468]
"Japanese Character Encoding for Internet Messages", RFC1468, RFC1468, Murai et al., December 1993, http://ftp.ietf.org/rfc/rfc1468.txt
[RFC1557]
"Korean Character Encoding for Internet Messages", RFC1557, Choi et al., December 1993, http://ftp.ietf.org/rfc/rfc1557.txt
[UNICODE3]
"The Unicode Standard, Version 3.0", ISBN 0-201-61633-5, The Unicode Consortium, 2000
[UNICODE31]
"Unicode Standard Annex #27 Unicode 3.1", The Unicode Consortium, May 2001, http://www.unicode.org/unicode/reports/tr27/
[UNICODE32]
"Unicode Standard Annex #28 Unicode 3.2", he Unicode Consortium, Mar 2002, http://www.unicode.org/unicode/reports/tr28/
[UNIHIST]
"Chronology of Unicode Version 1.0", Unicode Inc., http://www.unicode.org/unicode/history/
[UNIVERS]
"Enumerated Versions of The Unicode Standard", Unicode Inc., http://www.unicode.org/unicode/standard/versions/enumeratedversions.html
[UNIMEMBER]
"The Unicode Consortium Members", Unicode Inc., http://www.unicode.org/unicode/consortium/memblogo.html
[UNIJOIN]
"Joining the Unicode Consortium", Unicode Inc., http://www.unicode.org/unicode/consortium/join.html
[KANJIBUKURO]
"Unicode contradictions, Kanjibukuro Ideograph Variant Database", http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/kanjibukuro/unicode.html
[KUBOTA]
"Han Unification", Tomohiro Kubota, Apr 2002, http://www.debian.or.jp/~kubota/unicode-symbols-unihan.html



(c) Oliver M. Bolzer <bolzer@informatik.uni-muenchen.de>, 2002, All rights reserved.
