Overview

Character Set and Character Encoding

Character Repertoire
a finite set of characters
Coded Character Set
a finite set of characters, where each character is assigned a unique number
Character Encoding
a scheme for encoding a sequence of characters as a binary data stream
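
The difference between a coded character set and a character encoding can be sketched in a few lines of Python. This is only an illustration: the sample string and the two codecs (`iso-8859-1` and `utf-16-be`) are choices made for this example, not taken from the slides.

```python
# Coded character set: every character has a unique number (its code point).
text = "A\u00e9"                              # "A" followed by "é"
code_points = [ord(ch) for ch in text]        # [65, 233]

# Character encoding: the same code points become different byte streams
# depending on the encoding scheme that is applied.
latin1_bytes = text.encode("iso-8859-1")      # one byte per character
utf16_bytes = text.encode("utf-16-be")        # two bytes per character

print(code_points)
print(latin1_bytes.hex(" "))                  # 41 e9
print(utf16_bytes.hex(" "))                   # 00 41 00 e9
```

The numbers assigned to the characters stay the same in both cases; only the byte representation changes.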

Legacy Character Set: ASCII (1)

Legacy Character Set: ASCII (2)

[Figure: ASCII characters]

Legacy Character Set: ISO-646 - the international ASCII

Legacy Character Set: ISO-8859 for Europe (1)

Legacy Character Set: ISO-8859 for Europe (2)

Legacy Character Set: ISO-8859 for Europe (3)


ISO-8859-15

Legacy Character Set: ISO-8859 for Europe (4)

Legacy Character Sets: Chinese, Japanese, Korean (1)

Legacy Character Sets: Chinese, Japanese, Korean (2)

[Figure: JIS X 0208 characters (partial table)]

Legacy Character Sets: Chinese, Japanese, Korean (3)

Legacy Character Encoding: ISO-2022 (1)

Legacy Character Encoding: ISO-2022 (2)

Byte sequence          Characters   Meaning              Text
0x1B 0x28 0x42         ESC ( B      G0 = US-ASCII
0x1B 0x24 0x29 0x42    ESC $ ) B    G1 = JIS X 0208
0x0F                   SI           assign G0 to GL
0x41 0x42              A B          G0 0x41, G0 0x42     AB
0x0E                   SO           assign G1 to GL
0x41 0x42              A B          G1 0x41, G1 0x42
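
The byte sequence above can be traced with a small state machine. The following is a minimal sketch, not a full ISO-2022 implementation: the function name `decode_iso2022` is hypothetical, only the two designation sequences from the example are recognised, and the JIS X 0208 lookup is borrowed from Python's `euc_jp` codec by setting the high bit on both bytes.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Designation sequences handled by this sketch (only those in the example).
DESIGNATIONS = {
    b"\x1b(B": ("G0", "US-ASCII"),
    b"\x1b$)B": ("G1", "JIS X 0208"),
}

def decode_iso2022(data: bytes) -> str:
    """Decode a byte stream using G0/G1 designation and SI/SO shifting."""
    sets = {"G0": "US-ASCII", "G1": None}   # what each set currently holds
    gl = "G0"                               # which set is invoked into GL
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            for seq, (slot, charset) in DESIGNATIONS.items():
                if data.startswith(seq, i):
                    sets[slot] = charset
                    i += len(seq)
                    break
            else:
                raise ValueError(f"unsupported escape sequence at offset {i}")
        elif b == SI:                       # shift in: assign G0 to GL
            gl = "G0"
            i += 1
        elif b == SO:                       # shift out: assign G1 to GL
            gl = "G1"
            i += 1
        elif sets[gl] == "US-ASCII":        # one byte per character
            out.append(chr(b))
            i += 1
        elif sets[gl] == "JIS X 0208":      # two bytes per character
            pair = bytes([data[i] | 0x80, data[i + 1] | 0x80])
            out.append(pair.decode("euc_jp"))
            i += 2
        else:
            raise ValueError(f"no character set designated for {gl}")
    return "".join(out)

data = bytes.fromhex("1b2842 1b242942 0f 4142 0e 4142")
text = decode_iso2022(data)
print(text)
```

The same two bytes 0x41 0x42 yield the two letters "AB" while G0 is active and a single JIS X 0208 character while G1 is active, which is why a decoder has to carry state across the whole stream.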

Legacy Character Encoding: ISO-2022 (3)

Chaotic World

Unicode: The Universal Character Set

Universal
Cover most characters world-wide
Efficient
Must be simple to implement and efficient to process
Uniform
Sorting, searching and display without special exception rules
Unambiguous
Any given code value always represents the same character

Unicode: History

1986-88    basic design of Unicode at Xerox, Apple
Dec 1987   first use of the term "Unicode"
1988-89    Kanji unification work
1990       major players like Sun, IBM join
Jan 1991   founding of Unicode, Inc.
Oct 1991   Unicode 1.0
Jun 1993   Unicode 1.1 - merge with ISO/IEC 10646
Jul 1996   Unicode 2.0 - extension of character space
May 1998   Unicode 2.1.2 - add Euro symbol
Sep 1999   Unicode 3.0 - clarify encodings and properties
Mar 2001   Unicode 3.1 - add many more characters
Mar 2002   Unicode 3.2 - tighten conformance

Unicode: Number of Characters

[Figure: number of Unicode characters]

Unicode: characters, not glyphs

[Figure: different glyphs of the character LATIN CAPITAL LETTER A]
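
The idea that Unicode identifies a character by number and name, independent of how it is drawn, can be checked with Python's standard `unicodedata` module. This is a small illustrative sketch; the variable names are chosen for this example.

```python
import unicodedata

ch = "A"
name = unicodedata.name(ch)        # the character's Unicode name
code = f"U+{ord(ch):04X}"          # the character's code point

# The name and code point are the same whatever font renders the glyph.
print(code, name)                  # U+0041 LATIN CAPITAL LETTER A
```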

Unicode: 16-bit fixed-length

Unicode: Universal Character Set (1)

[Figure: major blocks of Unicode]

Unicode: Universal Character Set (2)

Unicode: Universal Character Set (3)