Unicode for Dummies

Benoit Averty - @Kaidjin

What Is Unicode?

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Unicode is about...

Characters (or "graphemes")

A x 3

é ç

$ @ %

ඣ

👻

Characters are encoded with...

Code Points

U+0041 LATIN CAPITAL LETTER A

Code point: U+0041 (a value between 0x0 and 0x10FFFF)

Name: LATIN CAPITAL LETTER A

U+0DA3 SINHALA LETTER MAHAAPRAANA JAYANNA (ඣ)

U+1F47B GHOST (👻)

U+0040 COMMERCIAL AT (@)
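
The code point and name pairs above can be looked up from Java. This is a minimal sketch: the class and method names are made up for illustration, and `Character.getName` returns null for unassigned code points and depends on the Unicode version shipped with the JDK.

```java
final class CodePointNames {
    // Formats a code point the way the slides do, e.g. "U+0041 LATIN CAPITAL LETTER A".
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // A, @, the Sinhala letter and the ghost (written as a surrogate pair).
        String text = "A@\u0DA3\uD83D\uDC7B";
        // codePoints() walks the string code point by code point, not char by char.
        text.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```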

Code points are documented in...

The Unicode Character Database

http://www.unicode.org/Public/UCD/latest

128,237 characters

26 data files

109 properties

Name (UnicodeData.txt)
Block (Blocks.txt)
General_Category (UnicodeData.txt)
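
The JDK exposes part of this database through `java.lang.Character`. The sketch below reads the three properties listed above for one code point. The class and method names are illustrative, and the values reflect the Unicode version bundled with the JDK, not necessarily the latest UCD.

```java
final class UcdProperties {
    // Block property (Blocks.txt) as exposed by the JDK.
    static Character.UnicodeBlock block(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    // General_Category property (UnicodeData.txt) as one of the Character.* type constants.
    static int generalCategory(int codePoint) {
        return Character.getType(codePoint);
    }

    public static void main(String[] args) {
        int a = 0x0041;
        System.out.println("Name: " + Character.getName(a));
        System.out.println("Block: " + block(a));
        // UPPERCASE_LETTER corresponds to the General_Category value "Lu".
        System.out.println("Is Lu: " + (generalCategory(a) == Character.UPPERCASE_LETTER));
    }
}
```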

A Character Is Not A Character!

There is no bijection between graphemes and code points

À can be encoded as one code point:

U+00C0 LATIN CAPITAL LETTER A WITH GRAVE

or as two:

U+0041 LATIN CAPITAL LETTER A + U+0300 COMBINING GRAVE ACCENT

⚠ Normalization is needed to compare or search Unicode text
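
A minimal sketch of that normalization step in Java, using `java.text.Normalizer`. The helper name `canonicallyEqual` is an assumption for illustration, and NFC is only one of the available forms. Which one fits depends on the use case.

```java
import java.text.Normalizer;

final class NormalizeBeforeCompare {
    // Compares two strings after converting both to NFC (canonical composition).
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0";  // U+00C0
        String decomposed = "A\u0300";  // U+0041 + U+0300
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```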

What's the length of a Unicode string?

À la claire fontaine

20 characters ("graphemes")
20 or 21 code points, depending on whether À is stored as U+00C0 or as U+0041 + U+0300

String foo = "À la claire fontaine";
System.out.println(foo.length());

Java quiz: which one does 'length()' return?

Neither: it returns the number of UTF-16 code units
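
The three ways of counting can be compared side by side. This is a sketch: `graphemeCount` is a hypothetical helper built on `java.text.BreakIterator`, which approximates user-perceived characters and may not follow the latest grapheme cluster rules on every JDK.

```java
import java.text.BreakIterator;

final class StringLengths {
    // Counts user-perceived characters with the JDK's character BreakIterator.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Decomposed form: the À is U+0041 followed by U+0300.
        String foo = "A\u0300 la claire fontaine";
        System.out.println(foo.length());                        // UTF-16 code units: 21
        System.out.println(foo.codePointCount(0, foo.length())); // code points: 21
        System.out.println(graphemeCount(foo));                  // graphemes: 20

        // The ghost is one code point but two UTF-16 code units.
        String ghost = "\uD83D\uDC7B";
        System.out.println(ghost.length());                          // 2
        System.out.println(ghost.codePointCount(0, ghost.length())); // 1
    }
}
```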

Serialization Into Bytes

Easiest: UTF-32 (32-bit code units)

Code Point: U+0041

Bytes: 00000000 00000000 00000000 01000001

Code Point: U+1F47B

Bytes: 00000000 00000001 11110100 01111011

Useless?

Completely useless
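
UTF-32 is simple enough to sketch by hand: each code point is written as one 32-bit integer. The example below assumes big-endian byte order (UTF-32BE), and the class and helper names are illustrative.

```java
import java.nio.ByteBuffer;

final class Utf32Sketch {
    // UTF-32BE: one code point becomes one 32-bit big-endian integer.
    static byte[] encode(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Renders bytes as space-separated groups of 8 bits, like the slides.
    static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a full 8 bits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(encode(0x0041)));  // 00000000 00000000 00000000 01000001
        System.out.println(toBinary(encode(0x1F47B))); // 00000000 00000001 11110100 01111011
    }
}
```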

Better space efficiency: UTF-16 (16-bit code units)

Code Point: U+0041

Bytes: 00000000 01000001

Code Point: U+1F47B

Bytes: 11011000 00111101 - 11011100 01111011

Only one useless byte

High surrogate code point: U+D83D

Low surrogate code point: U+DC7B
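
The surrogate pair for U+1F47B can be derived with a little arithmetic: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits. A minimal sketch with an illustrative method name. In real code, `Character.toChars` does the same job.

```java
final class SurrogateSketch {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F47B);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]); // U+D83D U+DC7B
    }
}
```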

Most common: UTF-8 (8-bit code units)

Code Point: U+0041

Bytes: 01000001

Code Point: U+2F840

Bytes: 11110000 - 10101111 - 10100001 - 10000000

(F0 - AF - A1 - 80)

Only what's necessary. Compatible with ASCII

⚠ UTF-8 is also the most complex

Range	Code point (binary)	Code units (1-4 x 8 bits)
U+0000..U+007F	0xxxxxxx	0xxxxxxx
U+0080..U+07FF	00000yyy yyxxxxxx	110yyyyy 10xxxxxx
U+0800..U+D7FF, U+E000..U+FFFF	zzzzyyyy yyxxxxxx	1110zzzz 10yyyyyy 10xxxxxx
U+10000..U+10FFFF	000uuuzz zzzzyyyy yyxxxxxx	11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
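
The table translates almost line for line into code. The following is a sketch of a hand-written encoder for a single code point, with illustrative class and method names. Production code would simply call `String.getBytes(StandardCharsets.UTF_8)`.

```java
final class Utf8Sketch {
    // Encodes one code point following the four ranges of the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point");
        }
        if (cp <= 0x7F) {       // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {      // 110yyyyy 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {     // 1110zzzz 10yyyyyy 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {     // 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Renders bytes as hex pairs separated by " - ", like the slides.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(" - ");
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x0041)));  // 41
        System.out.println(hex(encode(0x2F840))); // F0 - AF - A1 - 80
    }
}
```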

è (U+00E8) => C3 A8 (UTF-8) => Ã¨ (when the bytes are misread as ISO-8859-1)
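
The garbling above appears when UTF-8 bytes are decoded with a single-byte charset. A minimal sketch that reproduces it, assuming the reader decodes as ISO-8859-1 (strict ASCII defines nothing for bytes above 7F). The names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeSketch {
    // Encodes text as UTF-8, then wrongly decodes the bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E8 becomes the two bytes C3 A8, which ISO-8859-1 reads as U+00C3 U+00A8.
        String garbled = misread("\u00E8");
        garbled.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```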

⚠ UTF-8 can use more space than UTF-16!

U+26A0 WARNING SIGN

UTF-16: 00100110 10100000 (26 A0)

UTF-8: 11100010 - 10011010 - 10100000 (E2 - 9A - A0)
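
The size difference is easy to check from Java. In this sketch (illustrative names), UTF-16BE is used rather than plain UTF-16 so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizes {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE avoids the 2-byte byte order mark that the plain UTF_16 charset writes.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String warning = "\u26A0"; // U+26A0 WARNING SIGN
        System.out.println(utf8Length(warning));  // 3
        System.out.println(utf16Length(warning)); // 2
        // For ASCII text the comparison goes the other way.
        System.out.println(utf8Length("A"));      // 1
        System.out.println(utf16Length("A"));     // 2
    }
}
```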

Thank you!