Unicode pour les nuls
Benoit Averty - @Kaidjin
What Is Unicode?
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
Characters (or "graphemes")
A x 3
é ç
$ @ %
ඣ
👻
Unicode is about...
Code Points
Characters are encoded with...
U+0041 LATIN CAPITAL LETTER A
Code point
between 0x0 and 0x10FFFF
Name
U+0DA3 SINHALA LETTER MAHAAPRAANA JAYANNA (ඣ)
U+1F47B GHOST (👻)
U+0040 COMMERCIAL AT (@)
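As a small illustration of the code point / name pairs above, here is a sketch in Java. The class and method names (`CodePointDemo`, `notation`, `describe`) are made up for this example; `Character.getName` and `Character.toChars` are standard JDK methods, and the names returned depend on the Unicode version bundled with the running JDK.

```java
class CodePointDemo {
    // Formats a code point in the usual U+XXXX notation (at least 4 hex digits).
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Combines the notation with the character name known to the running JDK.
    static String describe(int codePoint) {
        return notation(codePoint) + " " + Character.getName(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x0040, 0x0DA3, 0x1F47B};
        for (int cp : samples) {
            // Character.toChars turns a code point into one or two Java chars.
            String glyph = new String(Character.toChars(cp));
            System.out.println(describe(cp) + " (" + glyph + ")");
        }
        // The valid range of code points is 0x0 to 0x10FFFF.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```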
The Unicode Character Database
Code points are documented in...
128,237 characters
26 data files
109 Properties
- Name (UnicodeData.txt)
- Block (Blocks.txt)
- General_Category (UnicodeData.txt)
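Some of these properties can be queried from Java without parsing the data files. The sketch below assumes the JDK's built-in tables are good enough for a demo; `UcdPropertiesDemo` and its helper methods are hypothetical names, while `Character.UnicodeBlock.of` and `Character.getType` are standard JDK calls.

```java
class UcdPropertiesDemo {
    // Block property: which named range the code point belongs to.
    static Character.UnicodeBlock block(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    // General_Category "Lu" (uppercase letter).
    static boolean isUppercaseLetter(int codePoint) {
        return Character.getType(codePoint) == Character.UPPERCASE_LETTER;
    }

    // General_Category "Mn" (non-spacing mark), e.g. combining accents.
    static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(Character.getName(0x0041) + " / " + block(0x0041)
                + " / uppercase letter: " + isUppercaseLetter(0x0041));
        System.out.println(Character.getName(0x0300) + " / " + block(0x0300)
                + " / non-spacing mark: " + isNonSpacingMark(0x0300));
    }
}
```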
A Character Is Not A Character!
There is no bijection between graphemes and code points
À
U+00C0
LATIN CAPITAL LETTER A WITH GRAVE
U+0041
LATIN CAPITAL LETTER A
U+0300
COMBINING GRAVE ACCENT
À
⚠ Normalization is needed to compare or search Unicode text
What's the length of a Unicode string?
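A minimal sketch of why normalization matters, using the two spellings of À shown above. `NormalizationDemo` and `sameText` are names invented for this example, and NFC is chosen here only as one reasonable form; `java.text.Normalizer` is part of the standard JDK.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Compares two strings after bringing both to the same normalization form (NFC).
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0";  // LATIN CAPITAL LETTER A WITH GRAVE
        String decomposed = "A\u0300";  // LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT

        System.out.println(precomposed.equals(decomposed));   // false: different code points
        System.out.println(sameText(precomposed, decomposed)); // true: same text once normalized
    }
}
```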
À la claire fontaine
- 20 characters ("graphemes")
- ?? number of code points
String foo = "À la claire fontaine";
System.out.println(foo.length());
Java quiz: which one does 'length()' return?
It returns the number of UTF-16 code units
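The three ways of counting can be compared side by side. This is a sketch, not a definitive grapheme counter: `LengthDemo` and its methods are hypothetical names, and `java.text.BreakIterator` is used as an approximation of grapheme boundaries whose results for complex sequences (for example some emoji) may vary between JDK versions.

```java
import java.text.BreakIterator;

class LengthDemo {
    // Number of UTF-16 code units: this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate number of user-perceived characters (graphemes).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "À la claire fontaine" written with a decomposed À (A + combining grave accent).
        String foo = "A\u0300 la claire fontaine";
        System.out.println(codeUnits(foo));  // 21
        System.out.println(codePoints(foo)); // 21
        System.out.println(graphemes(foo));  // 20

        // U+1F47B GHOST needs two code units but is a single code point.
        String ghost = "\uD83D\uDC7B";
        System.out.println(codeUnits(ghost));  // 2
        System.out.println(codePoints(ghost)); // 1
    }
}
```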
Serialization Into Bytes
Code Point: U+0041
Bytes: 00000000 00000000 00000000 01000001
Code Point: U+1F47B
Bytes: 00000000 00000001 11110100 01111011
Useless?
Completely useless
Easiest: UTF-32 (32-bit code units)
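UTF-32 simply stores the code point as a 32-bit integer. A possible sketch, assuming big-endian byte order (the order used in the byte dumps above); `Utf32Demo`, `encodeUtf32BE` and `bits` are illustrative names, not an existing API.

```java
import java.nio.ByteBuffer;

class Utf32Demo {
    // Writes the code point as one 32-bit big-endian code unit.
    static byte[] encodeUtf32BE(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a code point: " + codePoint);
        }
        // ByteBuffer is big-endian by default.
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Renders bytes as groups of 8 binary digits separated by spaces.
    static String bits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(s.length())).append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(encodeUtf32BE(0x0041)));  // 00000000 00000000 00000000 01000001
        System.out.println(bits(encodeUtf32BE(0x1F47B))); // 00000000 00000001 11110100 01111011
    }
}
```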
Code Point: U+0041
Bytes: 00000000 01000001
Code Point: U+1F47B
Bytes: 11011000 00111101 - 11011100 01111011
Only one useless byte
Better space efficiency: UTF-16 (16-bit code units)
High surrogate code point
U+D83D
Low surrogate code point
U+DC7B
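The surrogate pair for U+1F47B can be derived by hand: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits. The sketch below uses a hypothetical `SurrogateDemo.toSurrogates` helper and checks the result against the JDK's own `Character.highSurrogate` and `Character.lowSurrogate`.

```java
class SurrogateDemo {
    // Splits a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("No surrogates needed for " + codePoint);
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F47B);
        System.out.println(String.format("U+%04X U+%04X", (int) pair[0], (int) pair[1])); // U+D83D U+DC7B

        // The JDK provides the same computation.
        System.out.println(pair[0] == Character.highSurrogate(0x1F47B)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x1F47B));  // true
    }
}
```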
Code Point: U+0041
Bytes: 01000001
Code Point: U+2F840
Bytes: 11110000 - 10101111 - 10100001 - 10000000
(F0 - AF - A1 - 80)
Only what's necessary — Compatible with ASCII
Most common: UTF-8 (8-bit code units)
⚠ UTF-8 is also the most complex
| Range | Code point (binary) | Code units (1-4 x 8 bits) |
|---|---|---|
| U+0000..U+007F | 0xxxxxxx | 0xxxxxxx |
| U+0080..U+07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx |
| U+0800..U+D7FF, U+E000..U+FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx |
| U+10000..U+10FFFF | 000uuuzz zzzzyyyy yyxxxxxx | 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx |
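The bit layout in the table can be turned into a small hand-written encoder. This is a teaching sketch only (`Utf8Encoder`, `encode` and `hex` are made-up names); in real code, `String.getBytes(StandardCharsets.UTF_8)` does the same job.

```java
class Utf8Encoder {
    // Encodes one code point into 1 to 4 bytes, following the UTF-8 bit layout.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {           // 110yyyyy 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp <= 0xFFFF) {          // 1110zzzz 10yyyyyy 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        return new byte[] {          // 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Renders bytes as two-digit uppercase hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x0041)));  // 41
        System.out.println(hex(encode(0x00E8)));  // C3 A8
        System.out.println(hex(encode(0x26A0)));  // E2 9A A0
        System.out.println(hex(encode(0x2F840))); // F0 AF A1 80
    }
}
```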
è (U+00E8) => C3 A8 (UTF-8) => è (the same bytes misread as ISO-8859-1)
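This garbling is easy to reproduce: encode with one charset, decode with another. The sketch below assumes the wrong decoder is ISO-8859-1, which maps bytes C3 and A8 to Ã and ¨; `MojibakeDemo` and `misdecode` are illustrative names. What the console actually prints also depends on the terminal's own encoding.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text as UTF-8, then decodes the bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00E8";            // è
        String garbled = misdecode(original);  // two characters: U+00C3 and U+00A8
        System.out.println(garbled.length());  // 2
        System.out.println(garbled);
    }
}
```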
U+26A0 WARNING SIGN
⚠ UTF-8 can use more space than UTF-16!
UTF-16: 00100110 10100000
(26A0)
UTF-8: 11100010 - 10011010 - 10100000
(E2 - 9A - A0)
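The size difference can be checked directly in Java. In this sketch, `EncodedSizeDemo` and its methods are hypothetical names; UTF-16BE is used instead of plain UTF-16 so that no byte order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Number of bytes needed to store the string in UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed in UTF-16 (big-endian, no byte order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String warning = "\u26A0"; // U+26A0 WARNING SIGN
        System.out.println(utf8Size(warning));  // 3
        System.out.println(utf16Size(warning)); // 2

        String a = "A";            // ASCII text favors UTF-8
        System.out.println(utf8Size(a));  // 1
        System.out.println(utf16Size(a)); // 2
    }
}
```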
Thank you!