Hauptseminar | "i18n and l10n", World Wide Web Internationalization and Localization (SS 2002) |
Coach: | Prof. Dr. François Bry, Dr. Slim Abdennadher, Dr. Norbert Eisinger |
Session Date: | May 6, 2002 |
Report on the Topic: | Legacy Character Models and an Introduction to Unicode |
Authors: | Oliver M. Bolzer |
Computers can only handle numbers; in essence they are nothing but very fast calculators. But in order to be usable for more, computers must be able to handle characters. For this purpose, characters need to be encoded as numbers, which requires clear rules on how to map characters to numbers. Three basic concepts need to be defined before character representation in computers can be discussed.
In the early days of computing, the first character sets included only a small number of characters, mainly due to the limited storage resources available at the time and because English, the language of the computer pioneers, required only a relatively small number of characters to be fully representable. With the world-wide spread of computers, the need to handle more characters from different languages quickly emerged. While many languages can be represented with several dozen or several hundred characters, others need several thousand characters to be fully expressed. Heterogeneous requirements for character sets in different countries and languages led to the development of many incompatible character sets with greatly differing designs.
This report tries to give an overview of the various types of character codes used today and the motivation behind their development, and to give an introduction to Unicode, a new standard trying to solve many of the problems of today's character codes. It does not aim to be exhaustive.
Warning:" This report includes characters from many different languages, ranging from Latin to Arabian, Hebrew and East Asian Ideographs as well as Unicode's advance features such as bi-directional text and dynamic composition. Display and printing of this report is heavily dependent on the viewer's browser and locally available fonts. With all neccersary fonts present, the most accurate rendering has been achieved on Mozilla 1.0 RC1 for X11 and the most accurate printing using the same browser on Windows. All other major combinations of browsers and graphical environments were unable to correctly render this report at the time of this writing.
The most famous character set, which almost every computer user will have heard of, is ASCII, the American Standard Code for Information Interchange. ASCII was developed in the 1960s by the ASA (American Standards Association), today known as ANSI or American National Standards Institute, as a character set for telecommunication and computers. There had been other character codes prior to ASCII, like EBCDIC used on IBM mainframes and codes used for teletypes dating back to Morse code, but ASCII was adopted on the PC and spread with it. Today, ASCII is the basis of all major character sets, and all of the character sets discussed in this paper retain some form of compatibility with ASCII.
At the time ASCII was developed, computational resources, especially memory, were very expensive, and for its intended use in telecommunication the code was restricted to the bare minimum. The first version of ASCII, defined in 1963 [ASCII63], contained only capital letters. Of course this shortcoming was quickly noticed, and ASCII was extended in 1967 by the ECMA, the European Computer Manufacturers Association, as [ECMA6] and later adopted as [ASCII67], containing 94 visible characters, the space character and 32 control characters (like delete, escape and line feed) encoded in 7 bits, as shown in Figure 1.
Fig. 1:ASCII graphical characters,
(c)Roman Czyborra, http://czyborra.com/charsets/iso646.html
Because the only languages that could be fully written using the characters found in ASCII were Latin, Swahili, Hawaiian and American English, [ECMA6] defines the ASCII-compatible character set as the International Reference Version (IRV) and allows 10 of the lesser-used code points of ASCII (# $ @ [ \ ] ^ ` { | } ~) to be substituted in national variants. For example, the German variant, DIN 66003, substitutes these with the umlaut characters (ä ö ü Ä Ö Ü) and the eszett character (ß), keeping the $, # and @ characters (Fig. 2). The Japanese variant, JIS X 0201, for which the possible substitutions were in any case insufficient for writing Japanese, replaced only the BACKSLASH character with the YEN SIGN and the TILDE with an OVERLINE.
Fig. 2: ISO-646-DE (DIN 66003) graphical characters,
(c)Roman Czyborra, http://czyborra.com/charsets/iso646.html
[RFC1345] lists 25 such variants of ASCII. In 1972, the International Organization for Standardization (ISO) collected the major national variants and published them as [ISO646]. The IRV was adopted as ISO-646-IRV and the national variants were called ISO-646-xx, like ISO-646-DE for German and ISO-646-JP for Japanese.
By using the national variants, the basic demands of many countries and languages using the Latin alphabet could be fulfilled, but the situation was far from satisfactory. With the variants, usually only one language could be properly represented at a time, and there was no standard way to indicate which variant a specific text was using.
[JENNINGS] has a thorough explanation of the history of character codes up to ASCII, including the historical reason for the included characters and their layout.
In order to handle even only West European languages like German, French, Italian and British English, with all their accented characters, at the same time, the 128 code points of 7-bit ASCII were obviously not enough. Fortunately, computers handle data in units of 8 bits, called bytes. ASCII being a 7-bit code, the 8th bit was sometimes used for control purposes but was usually set to zero, because it was very inconvenient, from a programmer's view, to squeeze eight 7-bit ASCII characters into 7 bytes. So, if this 8th bit was utilized, the number of characters a single byte could represent would double from 128 to 256.
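The inconvenience of such packing is easy to see in code. The following Python sketch (the helper names pack7 and unpack7 are ours, purely for illustration) squeezes eight 7-bit ASCII characters into 7 bytes:

```python
def pack7(chars):
    """Pack eight 7-bit ASCII characters into 7 bytes."""
    assert len(chars) == 8
    n = 0
    for c in chars:
        n = (n << 7) | (ord(c) & 0x7F)  # append 7 bits per character
    return n.to_bytes(7, "big")

def unpack7(data):
    """Recover the eight characters from 7 packed bytes."""
    n = int.from_bytes(data, "big")
    return "".join(chr((n >> (7 * i)) & 0x7F) for i in range(7, -1, -1))

packed = pack7("ASCII-7b")
print(len(packed))        # 7 bytes instead of 8
print(unpack7(packed))    # ASCII-7b
```

The bit-shifting required for every read and write explains why implementers preferred to waste the 8th bit instead.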
In the mid-1980s, the ECMA started to design character sets that would be 8-bit and able to represent multiple European languages. A basic design principle was compatibility with ASCII, so it was decided that code points 0x00 - 0x7f would be bit-by-bit identical with ASCII and code points 0x80 - 0xff would contain the additional characters. This way, systems that could handle the new character sets could also transparently handle legacy documents containing only ASCII characters.
[ECMA94] defined 4 character sets, one each for Western, Central (Eastern), Southern and Northern Europe, called Latin alphabet No. 1, 2, 3 and 4. Again, as with [ECMA6] and [ISO646], ISO adopted the various Latin alphabets as [ISO8859]. Later, more alphabets following the same design philosophy were added. As the Western European Latin alphabet, called Latin-1 or ISO-8859-1, found wide adoption in West European countries, it became apparent that some characters were missing from it while others were less needed. So Latin alphabet No. 9 (ISO-8859-15) (Fig. 3) was defined as a successor of Latin-1, replacing some characters and adding others, most notably the € sign, which didn't exist at the time [ECMA94] was created.
Fig. 3: ISO-8859-15, additional graphical characters,
(c)Roman Czyborra, http://czyborra.com/charsets/iso8859.html
Fig. 4 gives an overview of the Latin alphabets, their ISO-8859 numbers and their coverage. They cover most of Europe, but still, several of the character sets can't be inter-mixed without special handling by applications. Also, many systems, especially on the Internet, can't handle 8-bit characters cleanly, as they were designed with only 7-bit ASCII in mind. If an 8-bit text passes through such a system, the 8th bit is usually zeroed and all characters originally having a code value higher than 0x7f are mangled. To prevent such damage, a character encoding scheme must be applied to the 8-bit text. For ISO-8859, it is common to encode characters from the 0x80 - 0xff range using the "quoted-printable" encoding [RFC1521]. In this scheme, the character to be encoded is represented by three characters from the ASCII character set: the = (equals) sign, signalling the beginning of an encoded character, followed by the character's hexadecimal code value. For this to work, the = sign used as encoding delimiter needs to be encoded, too. (Fig. 5)
ISO-8859-1 | Western, West European |
ISO-8859-2 | Central European, East European |
ISO-8859-3 | South European, Esperanto |
ISO-8859-4 | North European |
ISO-8859-5 | Cyrillic |
ISO-8859-6 | Arabic |
ISO-8859-7 | Greek |
ISO-8859-8 | Hebrew |
ISO-8859-9 | Turkish |
ISO-8859-10 | Nordic |
ISO-8859-11 | Thai |
ISO-8859-13 | Baltic Rim |
ISO-8859-14 | Celtic |
ISO-8859-15 | Euro symbol, revision of ISO-8859-1 |
ISO-8859-16 | Romanian |
Fig. 4: Major ISO-8859 Latin alphabets and their coverage
Character | ISO-8859-1 code | encoded |
---|---|---|
ä | 0xe4 | =E4 |
ß | 0xdf | =DF |
= | 0x3d | =3D |
Fig. 5: The Quoted-Printable Character Encoding
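For illustration, the encoding of Fig. 5 can be reproduced with Python's standard quopri module (assuming ISO-8859-1 as the underlying 8-bit character set):

```python
import quopri

for ch in ("ä", "ß", "="):
    raw = ch.encode("latin-1")          # the ISO-8859-1 byte value
    encoded = quopri.encodestring(raw)  # quoted-printable form, pure ASCII
    print(ch, hex(raw[0]), encoded.decode("ascii").strip())
# ä 0xe4 =E4
# ß 0xdf =DF
# = 0x3d =3D
```

The output consists only of 7-bit ASCII characters and therefore survives transfer through 7-bit systems.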
CJK is an acronym for Chinese, Japanese and Korean, commonly used in the context of internationalization. Why these 3 East Asian languages? They all use ideographic characters. Ideographic characters differ from phonetic characters, like the Latin alphabet, in that they not only have a pronunciation but also carry a meaning themselves (Fig. 6 shows some examples). The ideographs used in all 3 countries date back to the Chinese Han dynasty but have since evolved separately. In Japanese they are called Kanji, in Chinese Hanzi and in Korean Hanja. Note the similarity of the names, though the spoken languages are totally different today. In addition to the ideographs, Japanese and Korean also have phonetic scripts of their own. Because the ideographs are like words themselves, there are far more of them than the 256 that would fit into an 8-bit character set. It is said that you need to know roughly 2,000 ideographs to be able to read a Japanese newspaper. In order to properly write people's names and names of places, computer systems need to handle more than 20,000 ideographs at the minimum. In Chinese, the number is even higher. [UNIHAN] is a large database of these ideographs together with sample glyphs and their meanings.
Ideograph | 犬 | 水 | 優 |
---|---|---|---|
Meaning | dog | water | friendly |
Fig. 6: CJK-Ideographs and their meanings
In order to handle this large number of characters, multiple bytes are needed to represent a single character. In 1978, JIS X 0208 was the first standardized character set that used 16 bits (= 2 bytes) for each character. It was designed for Japanese, and besides the Japanese Hiragana and Katakana alphabets and the most important Kanji, it also contained the basic Latin, Greek and Cyrillic alphabets as well as many symbols. The idea was to make the character set convenient for every-day use by Japanese businesses.
Though JIS X 0208 could theoretically contain 2^16 = 65,536 characters, in order to maintain compatibility with ASCII, JIS X 0208 was organized in 94 rows of 94 cells each, so that rows and cells could be mapped onto the 94 graphical characters of ASCII. So actually only 94 x 94 = 8,836 characters are included. Later, additional characters were added as extensions to JIS X 0208. Fig. 7 shows a single row of JIS X 0208.
Fig. 7: a small part of JIS X 0208
For Chinese, mainland China defined GB 2312 for Simplified Chinese characters in the same 94x94 layout as JIS X 0208. Taiwan, on the other hand, defined its own character set, CNS 11643 Plane 1, for Traditional Chinese characters. To make matters worse, another character set for Traditional Chinese, Big5, an industry standard, is in wide use. (See [WITTERN] for an overview of Chinese character codes.)
Now, for a byte stream containing text in a multi-byte character code, the text needs to be specially encoded to be recognized as such. Otherwise it would not be possible to decide where character boundaries lie, or whether a specific byte is part of a multi-byte character or a single ASCII character. Very often, multiple encodings exist for a single character set. For JIS X 0208 alone, there are three widely used encodings: raw JIS, EUC-JP and Shift-JIS. In the raw JIS encoding, the first byte holds the raw row value and the second byte the raw cell value of the character; distinguishing it from a stream of ASCII characters is impossible. EUC-JP, or Extended Unix Code, simply sets the 8th bit that is not used by ASCII to mark multi-byte characters. As the name implies, EUC-JP is widely used in UNIX environments. Another encoding scheme is Shift-JIS, also known as MS-Kanji because it was developed by Microsoft for use in its operating systems. In order to add an extra 64 characters, the characters from JIS X 0208 are "shifted" 64 code points and reorganized into 47 rows of 188 cells each. [PING] gives a simple overview of these encodings. Korea also has two character sets in wide use: the national standard KS C 5601, which again was modeled after JIS X 0208, and UHC, the Unified Hangul Code. Though UHC was designed as a superset of KS C 5601, the two have totally different encoding schemes.
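The relationship between these encodings can be made visible with Python's built-in codecs (a sketch; the specific character chosen is arbitrary):

```python
text = "海"  # a common Kanji, "ocean"

# The same character, three different byte sequences:
for codec in ("euc_jp", "shift_jis", "iso2022_jp"):
    print(codec, text.encode(codec).hex())

# EUC-JP is simply the raw JIS value with the 8th bit set on both bytes:
euc = text.encode("euc_jp")
raw_jis = bytes(b & 0x7F for b in euc)  # strip the 8th bit again
# The same two raw JIS bytes appear between the escape sequences of ISO-2022-JP:
print(raw_jis == text.encode("iso2022_jp")[3:5])  # True
```

The last comparison demonstrates the claim above: EUC-JP differs from raw JIS only in the 8th bit.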
Confused? Then imagine the nightmare of deciding which encoding a byte stream of text is in and of converting between different encodings. Or what happens when you don't even know which language, and thus which potential encoding, a text is supposed to be in? Though there exist methods to identify character sets and encodings ([RFC2278]), as well as ways to specify them, such as [RFC1521], Section 7.1.1, or [XML], Section 2.12, today's information, especially on the web, is poorly labeled, and a mixture of wild guessing and heuristics is used to determine a document's language and encoding, to say nothing of the difficulty of creating multi-lingual documents.
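The guessing problem can be demonstrated in a few lines of Python: the same bytes, decoded under a wrong assumption, silently turn into different text:

```python
data = "日本語".encode("euc_jp")  # Japanese text ("the Japanese language"), EUC-JP encoded

print(data.decode("euc_jp"))     # correct guess: the original three characters
print(data.decode("latin-1"))    # wrong guess: six Latin-1 characters of garbage
```

Nothing in the byte stream itself says which interpretation is the intended one; without labeling, both decodings are equally "valid".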
With the widespread use of 8- and 16-bit character sets, most languages could be represented. But still, handling of multiple languages was limited to those included in a single character set. In order to use two or more languages that didn't share a common character set simultaneously, a new and (most likely) incompatible character set had to be created.
Instead of creating dozens of new character sets, ISO-2022 ([ISO2022]) defines a mechanism to use multiple character sets simultaneously by switching between them using escape sequences. Again, there is an identical ECMA standard, [ECMA35].
ISO-2022 divides the 256 code points of a single byte into 4 areas: CL (Control Left, primary control characters), GL (Graphical Left, graphical characters), CR (Control Right, secondary control characters) and GR (Graphical Right, graphical characters), mapped over ASCII's layout of control and graphical characters (Fig. 8). At the beginning of an ISO-2022 encoded text, up to 4 character sets are designated as G0, G1, G2 and G3 using escape sequences, each starting with the ESCAPE control character followed by one or more bytes. Then two of the character sets are assigned to the GL and GR areas as needed. Depending on whether a byte has its 8th bit set or not, it is clear to which character set it belongs. A text can also be encoded into 7 bits for transfer over the Internet by utilizing only the GL area and switching the character set assigned to GL. Fig. 9 shows an example text containing US-ASCII and JIS X 0208 characters encoded in ISO-2022.
Fig. 8: ISO-2022 code point organization
| byte sequence | 0x1B 0x28 0x42 | 0x1B 0x24 0x29 0x42 | 0x0F | 0x41 | 0x42 | 0x0E | 0x41 0x42 |
|---|---|---|---|---|---|---|---|
| as ASCII | ESC ( B | ESC $ ) B | SI | A | B | SO | A B |
| meaning | designate G0 = US-ASCII | designate G1 = JIS X 0208 | assign G0 to GL | G0 0x41 | G0 0x42 | assign G1 to GL | G1 0x41 0x42 |
| GL state | ??? | ??? | G0 (US-ASCII) | | | G1 (JIS X 0208) | |
| rendered | | | | A | B | | 疎 |
Fig. 9: ISO-2022 encoded text
Additionally, the current character set can be explicitly switched, either only for the next character or until switched again. Should more than 4 character sets be needed, the Gx areas can be reassigned. The character sets are identified in the escape sequences by codes assigned to them in [ISOESC].
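Python's iso2022_jp codec produces exactly the kind of byte stream shown in Fig. 9; the designation escape sequences are plainly visible in the encoded output (a sketch):

```python
text = "AB海AB"
data = text.encode("iso2022_jp")

print(data.hex(" "))
print(b"\x1b$B" in data)   # True: ESC $ B designates JIS X 0208
print(b"\x1b(B" in data)   # True: ESC ( B returns to US-ASCII
print(data.decode("iso2022_jp") == text)  # True: round trip through the stateful encoding
```

Note that the decoder must track the designation state across the whole stream; a damaged escape sequence would change the interpretation of every following byte.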
ISO-2022 was supposed to become the one encoding to rule them all, the many character sets, but it isn't used as widely as had been expected at the time of its design. ISO-2022's complexity is the main reason for its failure. Being a stateful encoding, switching from one character set to another, processing of ISO-2022 encoded byte sequences is non-trivial. In order to know which character set is used for a given byte, the whole preceding sequence must be analyzed. To make matters worse, if a byte or even a single bit of an ISO-2022 encoded text gets lost or corrupted, the whole meaning of the text can change, especially if an escape sequence is damaged. Additionally, applications must have knowledge of all character sets they expect to encounter in an ISO-2022 encoded text.
ISO-2022 found use on the Internet for the transmission of e-mail messages, but only in simplified forms like ISO-2022-JP ([RFC1468]) and ISO-2022-KR ([RFC1557]), which specify the usable character sets and their fixed assignment to G0 - G3, and restrict switching to specific methods. Among the few large-scale applications known to fully utilize ISO-2022 are newer versions of Emacs and XEmacs, which include multilingual capabilities.
As the world globalizes more and more, users often need to handle multiple languages and scripts simultaneously. But the multitude and heterogeneity of character sets and encoding schemes, as described in Section 2, can't meet the needs of the beginning multilingual era. Also, with the global spread of personal computers, software manufacturers came under more and more pressure. Software was usually developed in English, with only English and 8-bit character sets in mind. After the release of the English version, it often took several months to a year for a company to provide localized versions of its software, because it not only had to translate all the text of the application, but also change the operational semantics of the software according to local needs. And parallel localization work in multiple countries cost real money, often even more than the initial development of the software. And that for every new version. As a result, customers in non-English-speaking countries felt discriminated against, because their software was more expensive, out of date and usually badly localized because of limits in the original software.
In the late 1980s, the idea of a Universal Character Set started to emerge at the research centers of various computer firms. In 1987, the term "Unicode" was first used in the course of discussions. The main difficulty in creating such a character set was that the requirements were not clearly known. After thorough research on the world's characters, the "Unicode 1.0" standard was published by the Unicode Consortium with 4 main design principles, learning from legacy character sets. ([UNICODE3], Section 1.2)
At roughly the same time, ISO started developing an international standard with the same goals, which would later become ISO/IEC 10646. After the release of Unicode 1.0 in 1991, both efforts realized that having two different, incompatible universal character sets would be senseless, and merged their work, so that both standards share the same repertoire of characters using identical code numbers. Today the standards are so strongly linked that in technical documentation, one or the other is often used as reference.
After the merge with ISO/IEC 10646, and the release of Unicode 1.0.1 and ISO/IEC-10646-1:1993, both standards continued to evolve together towards their goal of a Universal Character Set, adding more characters and clarifying various issues such as algorithms and encodings as needed. Today, the newest revisions are ISO/IEC-10646-2:2000 and Unicode 3.2, the latter including more than 90,000 characters. Fig. 10 summarizes the history of Unicode and Fig. 11 shows the number of characters included in major releases of Unicode.
1986-87 | ideas about an unified character set @ Xerox |
Dec. 1987 | first use of the term "Unicode" |
Feb. 1988 | basic architecture of Unicode developed @ Apple |
1988-89 | Kanji-Unification work |
Jan 1991 | founding of Unicode Inc. |
Oct 1991 | Unicode 1.0 |
Jun 1993 | Unicode 1.1 - merger with ISO/IEC 10646 |
Jul 1996 | Unicode 2.0 - extension of character space by surrogate pairs |
1998 | Unicode 2.1 - add Euro symbol |
Sep 1999 | Unicode 3.0 - add precise encoding and property definition |
Mar 2001 | Unicode 3.1 - add 44,946 new characters |
Mar 2002 | Unicode 3.2 |
Fig. 10: history and major revisions of Unicode ([UNIHIST], [UNIVERS])
Fig. 11: Numbers of characters in Unicode ([UNICODE3],[UNICODE31],[UNICODE32])
When the Unicode Standard started to take form, the formation of the Unicode Consortium was announced, and shortly thereafter it was incorporated as the non-profit organization Unicode, Inc. in 1991. The consortium's mission is to define Unicode characters and their relationships to each other, and to provide technical information and guidelines to implementers of the Unicode Standard.
The consortium funds itself through sales of the standard in printed form and fees from its members, which include prominent computer hardware and software manufacturers like IBM, Hewlett-Packard, Oracle, SAP, Adobe, Apple and Microsoft, to name just a few. Individuals can join either as Specialist or Individual members; neither has voting rights, but the former have full access to all members-only documents. ([UNIMEMBER] contains the full list of consortium members, and [UNIJOIN] an overview of membership types and benefits.)
Additionally, the consortium has liaison relationships with other national and international standardization bodies like the Internet Engineering Task Force (IETF), the World Wide Web Consortium (W3C), the High Council of Informatics of Iran and several joint working groups of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) working on internationalization. Especially the relationship with ISO/IEC JTC1/SC2/WG2 is important, as that group works closely with Unicode on the ISO/IEC 10646 standard.
In order to realize the ambitious idea of an easily implementable Universal Character Set, the creators of Unicode made a simple but well-thought-out set of design decisions. They were not so arrogant as to ignore existing legacy character sets, but chose to embrace compatibility with them and ease migration. For this purpose, some decisions were made against the ultimate goals mentioned in Section 3.1, but paving the road for fast adoption of Unicode was given priority.
Unicode does not encode glyphs but characters. Fig. 12 shows some glyphs. They are all visually different from each other, but semantically they all represent the character "A".
Fig. 12: Various glyphs of the character "A"
In Unicode, each semantically distinct character is given a unique name, with which the characters are distinguished. Reusing the above example, the character "A" has been given the name LATIN CAPITAL LETTER A. Most characters are given descriptive names like GREEK SMALL LETTER PSI (ψ) or HIRAGANA LETTER KA (か) but the many ideographs are named according to their number like CJK UNIFIED IDEOGRAPH-5C71 (山), because giving them distinguished names would have been too complex.
Each character in Unicode's character repertoire is assigned a unique number, making Unicode a character set. To have enough space for all the characters in the world, each character's code was chosen to be 16 bits long, allowing for a total of 65,536 code values. A character's code value is noted in hexadecimal form with the 'U+' prefix. For example, the code of LATIN CAPITAL LETTER A is U+0041.
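Both the character names and the U+ notation are directly accessible from Python's unicodedata module, which can serve as a quick reference (a sketch):

```python
import unicodedata

for ch in ("A", "ψ", "か", "山"):
    # the U+ notation is simply the code value in 4-digit hexadecimal
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+03C8 GREEK SMALL LETTER PSI
# U+304B HIRAGANA LETTER KA
# U+5C71 CJK UNIFIED IDEOGRAPH-5C71
```

Note how the ideograph's name is derived algorithmically from its code value, exactly as described above.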
Deciding on a multiple of 8 bits was logical in the context of byte-oriented computer systems, and a 24-bit code, requiring three times as much storage as legacy codes from an American/European point of view, seemed unnecessary and a hindrance to Unicode adoption. The 16-bit decision was made despite heavy protests from East Asian countries, which knew from the experience with their own local 16-bit character sets that 16 bits would be insufficient even for their own needs.
Unsurprisingly, after the first implementations of Unicode were introduced into the market and the products could not be used decently in East Asian markets, it finally became apparent to Unicode's designers that the 16-bit code space was indeed insufficient. Beginning with Unicode 2.0, an extension mechanism was introduced that allowed an additional 16*2^16 characters to be added to Unicode by using two 16-bit values, called a surrogate pair, to represent each additional character. This way, Unicode formally remains a 16-bit character set.
The new design of Unicode divides the expanded code points U+000000 - U+10FFFF into 17 "planes" of 2^16 code points each. Plane 00, containing the original Unicode code points U+0000 through U+FFFF, is called Basic Multilingual Plane (BMP), the others supplementary planes. Characters in the BMP are special, because they can be represented by a single 16-bit value. Unicode 3.1 was the first Unicode Standard to include characters outside the BMP and as of Unicode 3.2, 44,944 characters out of a total of 95,156 characters are located in the supplementary planes.
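The surrogate-pair arithmetic can be written out in a few lines of Python (to_surrogates is our own helper name; U+20000 is the first code point of plane 2):

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20 bits remain after subtracting the BMP
    high = 0xD800 + (v >> 10)        # upper 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # lower 10 bits -> low surrogate
    return high, low

high, low = to_surrogates(0x20000)
print(f"U+{0x20000:06X} -> U+{high:04X} U+{low:04X}")  # U+020000 -> U+D840 U+DC00

# Cross-check against Python's UTF-16 encoder:
print(chr(0x20000).encode("utf-16-be").hex())  # d840dc00
```

Since both halves of a pair fall into the reserved U+D800 - U+DFFF block, a lone 16-bit value can never be confused with half of a supplementary-plane character.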
To be universal means to include as many characters for as many languages as possible. But which languages are used in a heterogeneous world? What kinds of scripts do they use? How should they be prioritized? Are they all equal, or are some more equal than others? The Unicode Consortium was faced with many difficult decisions, as language and characters are often matters of national pride, and the omission of a single character could lead to the boycott of the standard in a whole region.
Based on thorough research on the topic, Unicode was divided into blocks of different sizes and the scripts of various languages were allocated to specific blocks. This way, characters from the same script would logically be grouped together and would still have space to add new characters without disturbing their grouping. Fig. 13 shows the major blocks of Unicode and their code regions.
Name | Region | Description |
---|---|---|
General Scripts | U+0000 - U+1FFF | "normal" characters like the Latin alphabet, Greek, Hebrew, Thai, Tibetan. Smaller scripts are included here. |
Symbols | U+2000 - U+2DFF | punctuation, numbers, chemical symbols, arrows etc. Also contains OCR and Braille characters |
CJK Syllables and Symbols | U+2E00 - U+33FF | CJK phonetic characters and symbols |
CJK Unified Ideographs | U+3400 - U+9FFF | CJK Ideographs unified into one repertoire |
Yi | U+A000 - U+A4CF | Phonetic characters used in South China |
Hangul | U+AC00 - U+D7A3 | Pre-composed Hangul (Korean) characters |
Surrogates | U+D800 - U+DFFF | used to represent characters in supplementary planes |
Private Use | U+E000 - U+F8FF | to be used freely for private purposes, no compatibility guaranteed |
Compatibility and Specials | U+F900 - U+FFFD | characters needed for compatibility with legacy character sets and characters with special meanings in Unicode |
Fig. 13: Major blocks and their size in Unicode
Each block is further divided into smaller sub-blocks for specific scripts and contains unused regions reserved for future use. [UNICODE3] gives a detailed description of all blocks and the characters contained within, including a sample glyph for each character. U+FFFE and U+FFFF are not considered characters and are not and will not be used in future revisions of Unicode. To discover the endianness of a system, U+FFFE is reserved as the byte-swapped form of U+FEFF (ZERO WIDTH NO-BREAK SPACE), also called the Byte-Order-Mark (BOM). U+FFFF can be used by applications to signal errors or a non-character value.
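Byte-order detection with the BOM amounts to a two-branch check (byte_order is our own helper name, a sketch):

```python
def byte_order(data):
    """Guess the byte order of 16-bit Unicode data from a leading BOM."""
    if data.startswith(b"\xfe\xff"):   # U+FEFF read in big-endian order
        return "big-endian"
    if data.startswith(b"\xff\xfe"):   # U+FEFF byte-swapped: little-endian data
        return "little-endian"
    return "unknown"

print(byte_order("\ufeffA".encode("utf-16-be")))  # big-endian
print(byte_order("\ufeffA".encode("utf-16-le")))  # little-endian
```

The trick works precisely because the byte-swapped value U+FFFE is guaranteed never to be a character.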
In Unicode, characters are exclusively stored in logical order. That is the order in which the characters are read, not necessarily the order in which they are displayed on screen or printed. Some scripts, like Latin, Greek and Cyrillic, are written left-to-right, while others, like Arabic and Hebrew, are written right-to-left. Each Unicode character has its writing direction as a property to aid in proper graphical rendering. Additionally, there are invisible control characters that explicitly mark a direction change in bi-directional text where the direction change might be ambiguous.
Display Order: ABCD:אבגד
Logical Order: | A | B | C | D | : | א | ב | ג | ד |
Fig. 14: Bi-directional text: storage and display
Storing characters in their logical orders complicates graphical rendering but simplifies operations like searching, sorting and editing dramatically.
Unicode characters have well-defined semantics that are specified through character properties. The properties support operations like parsing and sorting, as well as other algorithms that need semantic knowledge about the characters. Some properties are normative and some are only informative. Applications conforming to the Unicode standard must respect normative properties when they encounter a character having such a property. For informative properties, it is up to the application whether to honor them or not. Below is a small list of the most important properties and their descriptions. It does not include all normative properties; [UNICODE3], Chapter 4, lists all properties and their status as well as full descriptions.
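Several of these properties can be queried through Python's unicodedata module; a sketch showing the general category, the bidirectional class and the canonical combining class:

```python
import unicodedata

# a letter, a digit, a Hebrew letter and COMBINING DIAERESIS
for ch in ("A", "5", "א", "\u0308"):
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),       # general category, e.g. Lu = uppercase letter
          unicodedata.bidirectional(ch),  # bidi class: L, R, EN, NSM, ...
          unicodedata.combining(ch))      # canonical combining class (0 = base character)
# U+0041 Lu L 0
# U+0035 Nd EN 0
# U+05D0 Lo R 0
# U+0308 Mn NSM 230
```

The R class of the Hebrew letter is what drives the bi-directional rendering described above, and the non-zero combining class marks the diaeresis as a combining character.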
Unicode not only includes accented characters like Ü, Ç and ǻ, it also has a mechanism to dynamically create such composed characters by combining a single base character with an arbitrary number of combining characters. Not only can accented characters already included in Unicode be emulated this way; new characters can be composed as well. Fig. 15 shows some examples.
a + ̈ ⇒ ä
C + ̧ ⇒ Ç
a + ̊ + ́ ⇒ ǻ
Fig. 15: Dynamic Composition
Now, with dynamic composition, there are several ways to encode a single character. For example, the above-mentioned ǻ character could be encoded as ǻ, as å + ́ or even as a + ̊ + ́, making searching and sorting of text very difficult. In order to resolve this ambiguity, characters that can be dynamically composed have a decomposition mapping, defining how a character can be taken apart into its basic parts. Using the decomposed, canonical form as the in-memory representation, searching and sorting become simple again.
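Python's unicodedata.normalize implements these decomposition mappings; all three possible encodings of ǻ mentioned above map to the same canonical forms (a sketch):

```python
import unicodedata

forms = ["\u01FB",         # precomposed ǻ
         "\u00E5\u0301",   # å + COMBINING ACUTE ACCENT
         "a\u030A\u0301"]  # a + COMBINING RING ABOVE + COMBINING ACUTE ACCENT

for s in forms:
    nfd = unicodedata.normalize("NFD", s)  # fully decomposed canonical form
    nfc = unicodedata.normalize("NFC", s)  # fully composed canonical form
    print([f"U+{ord(c):04X}" for c in nfd], "->", [f"U+{ord(c):04X}" for c in nfc])
# every line prints the same NFD sequence (U+0061 U+030A U+0301)
# and the same NFC result (U+01FB)
```

Once all strings are brought into the same normalization form, a plain byte comparison suffices for searching and sorting.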
All West European languages, as well as some African and South Asian languages, use the Latin alphabet as a common script together with their individual extensions, usually accented characters. The same visual character might be pronounced differently in those languages, but it is still the same character. To reduce the number of characters and the redundancy, characters with the same appearance have been unified and allocated only a single code point.
Unification is obvious in the case of the Latin alphabet, but there are many uncertain unification candidates. For example, the comma character is mainly used as a thousands-separator in English but as a decimal-separator in French, yet there is only one COMMA character: Unicode does not differentiate by usage but only by appearance. As another example, the unit symbols for minutes and feet, and the mathematical prime, have been unified as the PRIME character (′). But there are also exceptions, like the Greek Omega character and the Ohm symbol for electrical resistance. These haven't been unified because of legacy compatibility and their totally different semantics.
The area in most need of unification were the CJK ideographs. Sharing common roots, many of them have a similar or even identical visual appearance. With unification, the more than 130,000 ideographs present in legacy character sets have been reduced to fewer than 30,000.
But sometimes the characters have evolved differently and look slightly different. Fig. 16 shows two such ideographs, U+6D77 ("ocean") and U+76F4 ("straight"), and their appearances in Traditional Chinese, Simplified Chinese, Korean and Japanese. For users of the Latin alphabet the differences seem subtle, but for the actual users of these languages the difference is very big. It is not a glyph problem, but a difference in the actual shape of the character. If a Japanese student wrote the Korean variant of an ideograph in an ideograph exam, he would fail it. In some cases, such as U+6D77, Japanese readers would probably be able to guess the meaning of the Chinese and Korean variants, but in other cases, such as U+76F4, the Chinese variant would be impossible to understand for a Japanese reader and vice versa. Still, they have been aggressively unified, the only exceptions being those cases where legacy character sets differentiated between variants as separate characters.
Fig. 16: unified ideographs and their possible visual appearances,
[KUBOTA]
As Unicode does not include a mechanism to specify the language of a text, applications have to depend on higher-level protocols, such as the xml:lang attribute in XML ([XML], 2.12), to help them decide on the correct rendering of the characters. Because of this problem with variants, there also cannot be a single Unicode font covering all characters and languages. Fig. 17 shows the same Unicode characters rendered differently according to the lang attribute of HTML (browser and local font dependent).
Locale | Language | U+5E73 | U+76F4 |
---|---|---|---|
ja | Japanese | 平 | 直 |
ko | Korean | 平 | 直 |
zh-TW | Chinese (Taiwan) | 平 | 直 |
zh-CN | Chinese (China) | 平 | 直 |
Fig. 17: language-dependent character rendering
Unification has been a source of much grief and chaos since the early days of Unicode. The first revisions of Unicode were practically useless for East Asian countries because of overzealous unification intended to fit as much as possible into the 16 bits. Newer versions of Unicode provide room for more characters, so that variants as well as previously missing ideographs are being continuously added to Unicode's repertoire.
Though the number of characters had been greatly reduced through unification work, the Unicode designers soon had to realize that the number of characters Unicode could include was simply not enough. Even the vast number of 65,536 characters was insufficient. But with Unicode being a 16-bit fixed-length encoding, how could characters with numbers above U+FFFF be represented without fundamental changes?
For this purpose, 2,048 code points have been reserved as "surrogates" beginning with Unicode 2.0. 1,024 are designated "high surrogates", and another 1,024 are "low surrogates". These are not characters themselves, but by combining one high and one low surrogate into a "surrogate pair", they together represent a single Unicode character. A simple algorithm, defined in [UNICODE3], Sec. 3.7, is used to calculate the actual Unicode character number from the surrogate pair.
This method is clearly a violation of Unicode's basic philosophy of simplicity, because the surrogates must be handled specially. Still, the scheme is well thought out to minimize the negative effects, based on experiences with legacy character encodings. Depending on application support, a surrogate pair is either shown as two unknown characters, if the application knows nothing about surrogates, or as a single character, should the application be aware of them. Because a high surrogate is always followed by a low surrogate and the encoded character does not depend on any other values before or after the pair, character boundaries are obvious in a sequence of pairs. At the same time, should a character stream be interrupted, the maximum damage is limited to a single character.
With the introduction of surrogate pairs, the potential number of characters that can be included in Unicode increased 17-fold.
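The surrogate-pair arithmetic from [UNICODE3], Sec. 3.7 can be sketched in Python as follows (the function names are illustrative, not part of any standard API):

```python
# A character above U+FFFF is split into a high surrogate (U+D800..U+DBFF)
# and a low surrogate (U+DC00..U+DFFF), each carrying 10 of its 20 bits.

def encode_surrogate_pair(code_point):
    """Split a character above U+FFFF into a (high, low) surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def decode_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10000, the first character beyond the BMP
assert encode_surrogate_pair(0x10000) == (0xD800, 0xDC00)
assert decode_surrogate_pair(0xD800, 0xDC00) == 0x10000
```

Because 16 additional planes of 2^20 / 16 characters each become reachable on top of the BMP, this yields the 17-fold increase mentioned above.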
The Unicode Standard and the ISO-10646 standard are strongly interlinked. As ISO-10646's formal name "Information technology -- Universal Multiple-Octet Coded Character Set (UCS)" reflects, ISO-10646 has the same basic goal as Unicode: to create a Universal Character Set. An "octet" is ISO's term for an 8-bit byte. The two standards have agreed to share the same character repertoire and character numbering, so that both standards are character-by-character equal in their character sets.
Unicode revisions are usually published more frequently due to the administrative overhead of ISO standards, but both organizations have agreed to synchronize as often as possible. The advantage of the collaboration for Unicode is that many national standards do not allow industry standards like Unicode to be referenced, but do allow ISO standards. For ISO, on the other hand, Unicode has the computing industry's support, and compatibility with it guarantees industry acceptance and feedback.
Unlike Unicode, ISO-10646 does not limit itself to 16 bits. ISO-10646 is a 4-octet (32-bit) character set capable of including more than 2×10^9 characters (the highest bit is not used). It is organized into 128 groups, each containing 256 planes, which in turn contain 256 rows of 256 cells each. Plane 0x00 of group 0x00 is called the Basic Multilingual Plane (BMP) and has exactly the same size and code points as Unicode's BMP. Planes 0x01 to 0x10 contain the characters from Unicode's additional 16 planes, which are represented using surrogates in Unicode. Though ISO-10646 could contain far more characters than Unicode, the additional groups and planes are currently reserved for future use, and no characters may be defined there, in order to maintain compatibility with Unicode as long as possible.
ISO-10646's canonical representations are UCS-4 and UCS-2. In UCS-4, a character's number is encoded as one octet each for its group, plane, row and cell number. Should a text contain only characters from the BMP, the group and plane octets can be omitted; this 2-octet representation is called UCS-2. Beware that ISO-10646 does not have the surrogate mechanism: if a Unicode text is interpreted as UCS-2, all characters above U+FFFF will be lost.
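As an illustrative sketch (the helper names are hypothetical), the UCS-4 octet layout and the lossiness of UCS-2 beyond the BMP can be written out in Python:

```python
def to_ucs4(code_point):
    """Encode a code point as four octets: group, plane, row, cell."""
    group = (code_point >> 24) & 0x7F   # highest bit unused
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF
    return bytes([group, plane, row, cell])

def to_ucs2(code_point):
    """Encode a BMP code point as two octets (row, cell).

    UCS-2 has no surrogate mechanism, so anything above U+FFFF is
    simply unrepresentable.
    """
    if code_point > 0xFFFF:
        raise ValueError("UCS-2 cannot represent characters above U+FFFF")
    return bytes([code_point >> 8, code_point & 0xFF])

assert to_ucs4(0x6D77) == b"\x00\x00\x6d\x77"   # U+6D77 lies in the BMP
assert to_ucs2(0x6D77) == b"\x6d\x77"           # group/plane octets dropped
assert to_ucs4(0x10000) == b"\x00\x01\x00\x00"  # plane 0x01, beyond UCS-2
```

For BMP characters, UCS-2 is just UCS-4 with the two leading zero octets omitted, which is why the two representations coexist so easily.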
(c) Oliver M. Bolzer <bolzer@informatik.uni-muenchen.de>, 2002, All rights reserved.