Unicode for Dummies

Benoit Averty - @Kaidjin

What Is Unicode?

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Characters (or "graphemes")

A x 3

é ç

$ @ %

ඣ

👻

Unicode is about...

Code Points

Characters are encoded with...

U+0041 LATIN CAPITAL LETTER A

Code point

between 0x0 and 0x10FFFF

Name

U+0DA3 SINHALA LETTER MAHAAPRAANA JAYANNA (ඣ)

U+1F47B GHOST (👻)

U+0040 COMMERCIAL AT (@)
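
As a rough illustration of the code point / name pairing, Java treats a code point as a plain `int` and can look up its name with `Character.getName`. The class `CodePointNames` and the helper `describe` are made up for this sketch, and the names returned depend on the Unicode version bundled with the JDK.

```java
class CodePointNames {
    // Formats a code point the way the slides do: "U+XXXX NAME".
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // The four examples from the slides.
        int[] samples = {0x0041, 0x0DA3, 0x1F47B, 0x0040};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```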

The Unicode Character Database

Code points are documented in...

http://www.unicode.org/Public/UCD/latest

128 237 characters

26 data files

109 Properties

Name (UnicodeData.txt)
Block (Blocks.txt)
General_Category (UnicodeData.txt)
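
Some of these properties can be queried without parsing the data files. The sketch below reads Block and General_Category through the JDK's `Character` class; `CodePointProperties` and its two helpers are hypothetical names, and the JDK only exposes a subset of the properties in the database.

```java
class CodePointProperties {
    // Block property (Blocks.txt in the UCD).
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    // True when General_Category is Lu (uppercase letter).
    static boolean isUppercaseLetter(int codePoint) {
        return Character.getType(codePoint) == Character.UPPERCASE_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(blockOf(0x0DA3));            // block of the Sinhala letter
        System.out.println(isUppercaseLetter(0x0041));  // LATIN CAPITAL LETTER A
        System.out.println(isUppercaseLetter(0x0040));  // COMMERCIAL AT
    }
}
```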

A Character Is Not a Character!

There is no bijection between graphemes and code points

U+00C0

LATIN CAPITAL LETTER A WITH GRAVE

U+0041

LATIN CAPITAL LETTER A

U+0300

COMBINING GRAVE ACCENT

À

⚠ Normalization is needed to compare or search Unicode text
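
A minimal sketch of what that means in practice, using `java.text.Normalizer` from the JDK. The class `NormalizedCompare` and the helper `sameText` are illustrative names, and NFC is only one of the available normalization forms; which one to use depends on the application.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Compares two strings after bringing both to NFC (composed form).
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00C0";     // LATIN CAPITAL LETTER A WITH GRAVE
        String decomposed = "A\u0300";  // LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT
        // Same grapheme on screen, different code point sequences.
        System.out.println(composed.equals(decomposed));
        System.out.println(sameText(composed, decomposed));
    }
}
```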

What's the length of a Unicode string?

À la claire fontaine

20 characters ("graphemes")
20 or 21 code points (À may be U+00C0, or U+0041 followed by U+0300)

String foo = "À la claire fontaine";
System.out.println(foo.length());

Java quiz: which one does 'length()' return?

It returns the number of UTF-16 code units
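
To make the three counts concrete, here is a sketch that measures the same sentence in code units, code points and graphemes. `StringLengths` and its helpers are hypothetical names; the grapheme count relies on the JDK's character `BreakIterator`, which is assumed here to be a good enough approximation for this text.

```java
import java.text.BreakIterator;

class StringLengths {
    // Counts user-perceived characters with the JDK's character BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Counts code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String composed = "\u00C0 la claire fontaine";
        String decomposed = "A\u0300 la claire fontaine";
        for (String s : new String[] {composed, decomposed}) {
            System.out.println(s.length() + " code units, "
                    + codePoints(s) + " code points, "
                    + graphemes(s) + " graphemes");
        }
    }
}
```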

Serialization Into Bytes

Easiest: UTF-32 (32-bit code units)

Code Point: U+0041

Bytes: 00000000 00000000 00000000 01000001

Code Point: U+1F47B

Bytes: 00000000 00000001 11110100 01111011

Useless?

Completely useless
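
The 32-bit layout is simple enough to write by hand: the code point is stored as one 32-bit integer. The sketch below assumes big-endian byte order (UTF-32BE); `Utf32Sketch`, `encode` and `bits` are names invented for the example.

```java
import java.nio.ByteBuffer;

class Utf32Sketch {
    // UTF-32BE: the code point itself, stored as one big-endian 32-bit integer.
    static byte[] encode(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Renders bytes as space-separated groups of 8 bits.
    static String bits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(s.length())).append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(encode(0x0041)));
        System.out.println(bits(encode(0x1F47B)));
    }
}
```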

Better space efficiency: UTF-16 (16-bit code units)

Code Point: U+0041

Bytes: 00000000 01000001

Code Point: U+1F47B

Bytes: 11011000 00111101 - 11011100 01111011

Only one useless byte

High surrogate code point

U+D83D

Low surrogate code point

U+DC7B
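
The two surrogates are derived arithmetically from the code point: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits. A small sketch, with `SurrogatePair`, `high` and `low` as made-up names; the JDK's own `Character.highSurrogate` and `Character.lowSurrogate` are printed alongside for comparison.

```java
class SurrogatePair {
    // Top 10 bits of (codePoint - 0x10000), offset into the D800..DBFF range.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Bottom 10 bits of (codePoint - 0x10000), offset into the DC00..DFFF range.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int ghost = 0x1F47B;
        System.out.printf("manual: U+%04X U+%04X%n", (int) high(ghost), (int) low(ghost));
        System.out.printf("JDK:    U+%04X U+%04X%n",
                (int) Character.highSurrogate(ghost), (int) Character.lowSurrogate(ghost));
    }
}
```

This only applies to code points above U+FFFF; anything in the Basic Multilingual Plane fits in a single 16-bit code unit.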

Most common: UTF-8 (8-bit code units)

Code Point: U+0041

Bytes: 01000001

Code Point: U+2F840

Bytes: 11110000 - 10101111 - 10100001 - 10000000

(F0 - AF - A1 - 80)

Only what's necessary. Compatible with ASCII.

⚠ UTF-8 is also the most complex

Range	Code point (binary)	Code units (1 to 4 x 8 bits)
U+0000..U+007F	0xxxxxxx	0xxxxxxx
U+0080..U+07FF	00000yyy yyxxxxxx	110yyyyy 10xxxxxx
U+0800..U+D7FF, U+E000..U+FFFF	zzzzyyyy yyxxxxxx	1110zzzz 10yyyyyy 10xxxxxx
U+10000..U+10FFFF	000uuuzz zzzzyyyy yyxxxxxx	11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
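
The table translates almost line for line into code. The encoder below is a sketch for illustration only (`Utf8Sketch` and `encode` are invented names); real code would normally just call `String.getBytes(StandardCharsets.UTF_8)`, which `main` uses as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point following the four rows of the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x0041, 0x00E8, 0x26A0, 0x2F840}) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X matches the JDK encoder: %b%n",
                    cp, Arrays.equals(encode(cp), expected));
        }
    }
}
```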

è (U+00E8) => C3 A8 (UTF-8) => Ã¨ (when misread as ISO-8859-1)
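
This kind of garbling is easy to reproduce: encode with UTF-8, then decode the bytes with a single-byte charset. The sketch assumes ISO-8859-1 as the wrong decoder, since that is the charset in which C3 and A8 map to Ã and ¨ (ASCII itself stops at 7F); `Mojibake` and `misread` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Encodes as UTF-8, then decodes each byte as one ISO-8859-1 character.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misread("\u00E8"); // LATIN SMALL LETTER E WITH GRAVE
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```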

U+26A0 WARNING SIGN

⚠ UTF-8 can use more space than UTF-16!

UTF-16: 00100110 10100000

(26A0)

UTF-8: 11100010 - 10011010 - 10100000

(E2 - 9A - A0)
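
The size difference can be checked directly by counting encoded bytes. A short sketch, with `EncodedSizes` and its helpers as hypothetical names; UTF-16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16BE writes no byte order mark, so only the code units are counted.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String warning = "\u26A0"; // WARNING SIGN
        System.out.println("UTF-8:  " + utf8Length(warning) + " bytes");
        System.out.println("UTF-16: " + utf16Length(warning) + " bytes");
    }
}
```

For text that is mostly ASCII the comparison goes the other way: "A" takes one byte in UTF-8 and two in UTF-16.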

Thank you!
