Unicode for Dummies

Benoit Averty - @Kaidjin

What Is Unicode?

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Unicode is about characters (or "graphemes")

A x 3

é ç

$ @ %

👻

Characters are encoded with code points

U+0041 LATIN CAPITAL LETTER A

  • Code point: U+0041 (a number between 0x0 and 0x10FFFF)
  • Name: LATIN CAPITAL LETTER A

U+0DA3 SINHALA LETTER MAHAAPRAANA JAYANNA (ඣ)

U+1F47B GHOST (👻)

U+0040 COMMERCIAL AT (@)
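To make the code point / name pairing concrete, here is a minimal Java sketch that looks up the name of a code point with the JDK's `Character.getName`. The class name `CodePointNames` and the helper `describe` are made up for this illustration, and the names returned depend on the Unicode version bundled with your JDK.

```java
class CodePointNames {
    /** Formats a code point the way the slides do, e.g. "U+0041 LATIN CAPITAL LETTER A". */
    static String describe(int codePoint) {
        // Character.getName returns null when the code point is unassigned.
        String name = Character.getName(codePoint);
        return String.format("U+%04X %s", codePoint, name == null ? "<unassigned>" : name);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041));
        System.out.println(describe(0x0040));
        System.out.println(describe(0x1F47B));
        // Code points range from 0x0 to 0x10FFFF inclusive.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```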

Code points are documented in the Unicode Character Database

128,237 characters

26 data files

109 properties

A Character Is Not a Character!

There is no bijection between graphemes and code points

À can be encoded in two ways:

  • U+00C0 LATIN CAPITAL LETTER A WITH GRAVE (one code point)
  • U+0041 LATIN CAPITAL LETTER A followed by U+0300 COMBINING GRAVE ACCENT (two code points)

Normalization is needed to compare or search Unicode text
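As a rough illustration of why normalization matters, the sketch below compares the two encodings of À with and without `java.text.Normalizer`. The class `NormalizationDemo` and the helper `equalsNormalized` are hypothetical names, and NFC is just one reasonable choice of normalization form.

```java
import java.text.Normalizer;

class NormalizationDemo {
    /** Compares two strings after bringing both to NFC (canonical composition). */
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0"; // one code point: U+00C0
        String decomposed = "A\u0300"; // two code points: U+0041 + U+0300

        // A plain comparison sees two different sequences of code units.
        System.out.println(precomposed.equals(decomposed)); // false
        // After normalization, both strings hold the same code points.
        System.out.println(equalsNormalized(precomposed, decomposed)); // true
    }
}
```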

What's the length of a Unicode string?

À la claire fontaine

  • 20 characters ("graphemes")
  • an unknown number of code points (20 if À is precomposed, 21 if it is decomposed)
String foo = "À la claire fontaine";
System.out.println(foo.length());

Java quiz: which one does 'length()' return?

It returns the number of code units (in Java, UTF-16 code units, one per char)
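The three notions of length can be put side by side in a small Java sketch. `StringLengths` and `graphemeCount` are illustrative names; the grapheme count relies on the JDK's `BreakIterator`, which is only an approximation of user-perceived characters and may split complex emoji sequences on older JDKs.

```java
import java.text.BreakIterator;

class StringLengths {
    /** Counts user-perceived characters with the JDK's character BreakIterator. */
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    static void report(String label, String s) {
        System.out.println(label
                + ": code units=" + s.length()
                + ", code points=" + s.codePointCount(0, s.length())
                + ", graphemes=" + graphemeCount(s));
    }

    public static void main(String[] args) {
        report("precomposed", "\u00C0 la claire fontaine"); // À as U+00C0
        report("decomposed", "A\u0300 la claire fontaine"); // À as U+0041 U+0300
        report("ghost", "\uD83D\uDC7B");                     // U+1F47B as a surrogate pair
    }
}
```

With the decomposed spelling, `length()` and `codePointCount` both report 21 while the text still shows 20 characters; for the ghost, `length()` is 2 although there is a single code point.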

Serialization Into Bytes

Easiest: UTF-32 (32-bit code units)

Code point: U+0041

Bytes: 00000000 00000000 00000000 01000001

Code point: U+1F47B

Bytes: 00000000 00000001 11110100 01111011

Useless?

Completely useless
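UTF-32 is simple enough to sketch by hand: each code point is written as one 32-bit integer. The class `Utf32Sketch` and its helpers are hypothetical, and the sketch assumes big-endian byte order without a byte order mark.

```java
import java.nio.ByteBuffer;

class Utf32Sketch {
    /** Encodes one code point as 4 big-endian bytes (UTF-32BE, no byte order mark). */
    static byte[] encode(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (!Character.isValidCodePoint(codePoint) || surrogate) {
            throw new IllegalArgumentException("Not an encodable code point: " + codePoint);
        }
        // ByteBuffer is big-endian by default, so putInt writes the high byte first.
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    /** Renders bytes as space-separated groups of 8 bits. */
    static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(encode(0x0041)));  // 00000000 00000000 00000000 01000001
        System.out.println(toBinary(encode(0x1F47B))); // 00000000 00000001 11110100 01111011
    }
}
```

Since code points never exceed 0x10FFFF, the first byte is always zero, which is the wasted space the slide points at.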

Better space efficiency: UTF-16 (16-bit code units)

Code point: U+0041

Bytes: 00000000 01000001

Code point: U+1F47B

Bytes: 11011000 00111101 - 11011100 01111011

Only one useless byte

High surrogate code point

U+D83D

Low surrogate code point

U+DC7B
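The surrogate pair for U+1F47B can be derived with a little arithmetic: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits. The sketch below does this by hand and cross-checks against the JDK's `Character.highSurrogate` and `Character.lowSurrogate`; the class `SurrogatePairs` and the method `toSurrogates` are illustrative names.

```java
class SurrogatePairs {
    /** Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + codePoint);
        }
        int v = codePoint - 0x10000;              // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F47B);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]); // U+D83D U+DC7B

        // Cross-check against the JDK's own computation.
        System.out.println(pair[0] == Character.highSurrogate(0x1F47B)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x1F47B));  // true
    }
}
```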

Most common: UTF-8 (8-bit code units)

Code point: U+0041

Bytes: 01000001

Code point: U+2F840

Bytes: 11110000 - 10101111 - 10100001 - 10000000

(F0 - AF - A1 - 80)

Only what's necessary, and compatible with ASCII

UTF-8 is also the most complex

Range                            Code point (binary)           Code units (1-4 x 8 bits)
U+0000..U+007F                   0xxxxxxx                      0xxxxxxx
U+0080..U+07FF                   00000yyy yyxxxxxx             110yyyyy 10xxxxxx
U+0800..U+D7FF, U+E000..U+FFFF   zzzzyyyy yyxxxxxx             1110zzzz 10yyyyyy 10xxxxxx
U+10000..U+10FFFF                000uuuzz zzzzyyyy yyxxxxxx    11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
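The table translates almost line for line into code. The following is a hand-rolled sketch for illustration only (real code would simply call `getBytes(StandardCharsets.UTF_8)`); `Utf8Sketch`, `encode` and `hex` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    /** Encodes one code point into 1 to 4 UTF-8 bytes, one branch per row of the table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {   // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {  // 110yyyyy 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) { // 1110zzzz 10yyyyyy 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] { // 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    /** Renders bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x0041)));  // 41
        System.out.println(hex(encode(0x00E8)));  // C3 A8
        System.out.println(hex(encode(0x26A0)));  // E2 9A A0
        System.out.println(hex(encode(0x2F840))); // F0 AF A1 80

        // Cross-check one value against the JDK's UTF-8 encoder.
        byte[] jdk = new String(Character.toChars(0x2F840)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(jdk, encode(0x2F840))); // true
    }
}
```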

è (U+00E8) => C3 A8 (UTF-8) => Ã¨ (when misread as a single-byte "extended ASCII" encoding such as ISO-8859-1)
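This kind of garbling is easy to reproduce: encode with UTF-8, then decode the same bytes with a single-byte charset. The sketch below assumes ISO-8859-1 as the wrong charset; `MojibakeDemo` and `misdecode` are made-up names.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then wrongly decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00E8"); // è, stored in UTF-8 as the bytes C3 A8

        // Each byte became its own character: U+00C3 (Ã) and U+00A8 (¨).
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X%n", (int) garbled.charAt(i));
        }
    }
}
```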

U+26A0 WARNING SIGN

UTF-8 can use more space than UTF-16!

UTF-16 : 00100110 10100000

(26A0)

UTF-8 : 11100010 - 10011010 - 10100000

(E2 - 9A - A0)
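A quick way to check the size claim is to encode the same string both ways and count bytes. In this sketch, `EncodedSizes` and its helpers are illustrative names; UTF-16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Size(String s) {
        // UTF_16BE writes no byte order mark, unlike StandardCharsets.UTF_16.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String warning = "\u26A0";     // U+26A0 WARNING SIGN
        String letterA = "A";          // U+0041
        String ghost = "\uD83D\uDC7B"; // U+1F47B GHOST

        System.out.println(utf8Size(warning) + " vs " + utf16Size(warning)); // 3 vs 2
        System.out.println(utf8Size(letterA) + " vs " + utf16Size(letterA)); // 1 vs 2
        System.out.println(utf8Size(ghost) + " vs " + utf16Size(ghost));     // 4 vs 4
    }
}
```

Code points in U+0800..U+FFFF are where UTF-8 loses: three bytes against two in UTF-16.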

Thank you!
