Benoit Averty - @Kaidjin
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
A x 3
é ç
$ @ %
ඣ
👻
Unicode is about...
Characters are encoded with...
U+0041 LATIN CAPITAL LETTER A
Code point
between 0x0 and 0x10FFFF
Name
U+0DA3 SINHALA LETTER MAHAAPRAANA JAYANNA (ඣ)
U+1F47B GHOST (👻)
U+0040 COMMERCIAL AT (@)
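The pairing of a code point with its name can be checked from Java. This is a minimal sketch, assuming a JDK whose bundled Unicode data covers these characters (Java 7 or later); the class name `CodePointNames` and the helper `describe` are illustrative only.

```java
class CodePointNames {
    // Formats a code point the way the Unicode charts do: U+XXXX NAME
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x0040, 0x0DA3, 0x1F47B};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
        // Upper bound of the code point range
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT));
    }
}
```

`Character.getName` reads the name from the JDK's own copy of the Unicode Character Database, so a very recent character may be missing on an older runtime.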
Code points are documented in...
128 237 characters
26 data files
109 Properties
There is no bijection between graphemes and code points
À
U+00C0
LATIN CAPITAL LETTER A WITH GRAVE
U+0041
LATIN CAPITAL LETTER A
U+0300
COMBINING GRAVE ACCENT
À
⚠ Normalization is needed to compare or search Unicode text
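One way to do this in Java is `java.text.Normalizer`. The sketch below assumes NFC is an acceptable target form (NFD would work as well, as long as both sides use the same form); the class and method names are illustrative.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Compares two strings after bringing both to NFC (canonical composition)
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0";  // LATIN CAPITAL LETTER A WITH GRAVE
        String decomposed = "A\u0300";  // A followed by COMBINING GRAVE ACCENT

        System.out.println(precomposed.equals(decomposed));     // false
        System.out.println(sameText(precomposed, decomposed));  // true
    }
}
```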
À la claire fontaine
String foo = "À la claire fontaine";
System.out.println(foo.length());
Java quiz: what does 'length()' return?
It returns the number of UTF-16 code units
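The answer to the quiz depends on how the À was typed. A small sketch, with both spellings of the sentence written out explicitly; the class name and the `codePoints` helper are illustrative.

```java
class LengthQuiz {
    // Number of code points, as opposed to the UTF-16 code units counted by length()
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0 la claire fontaine";  // single code point for the first letter
        String decomposed = "A\u0300 la claire fontaine";  // base letter plus combining accent
        String ghost = new String(Character.toChars(0x1F47B));

        System.out.println(precomposed.length());  // 20
        System.out.println(decomposed.length());   // 21
        System.out.println(ghost.length());        // 2
        System.out.println(codePoints(ghost));     // 1
    }
}
```

Neither count matches what a reader sees in every case: the decomposed sentence still displays 20 graphemes even though it holds 21 code points.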
Code Point : U+0041
Bytes (UTF-32) : 00000000 00000000 00000000 01000001
Code Point : U+1F47B
Bytes (UTF-32) : 00000000 00000001 11110100 01111011
Useless ?
Completely useless
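The fixed four-byte layout can be reproduced without any charset support. This sketch assumes big-endian byte order (UTF-32BE) and simply writes the code point as a 32-bit integer; `Utf32Demo`, `encode`, and `toBits` are illustrative names.

```java
import java.nio.ByteBuffer;

class Utf32Demo {
    // UTF-32BE: the code point itself, stored as one big-endian 32-bit integer
    static byte[] encode(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Renders bytes as space-separated groups of 8 bits
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Setting bit 8 then dropping it keeps the leading zeros
            sb.append(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits(encode(0x0041)));
        System.out.println(toBits(encode(0x1F47B)));
    }
}
```

Since code points stop at 0x10FFFF, the first byte is always zero and the second never exceeds 0x10.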
Code Point : U+0041
Bytes (UTF-16) : 00000000 01000001
Code Point : U+1F47B
Bytes (UTF-16) : 11011000 00111101 - 11011100 01111011
Only one useless byte
High surrogate code point
U+D83D
Low surrogate code point
U+DC7B
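The surrogate pair is obtained by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two halves of 10. A minimal sketch of that arithmetic, checked against the JDK helpers `Character.highSurrogate` and `Character.lowSurrogate`; the class and method names are illustrative, and the input is assumed to lie in U+10000..U+10FFFF.

```java
class SurrogateDemo {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 surrogate pair by hand
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F47B);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);

        // The JDK helpers give the same pair
        System.out.println(pair[0] == Character.highSurrogate(0x1F47B));
        System.out.println(pair[1] == Character.lowSurrogate(0x1F47B));
    }
}
```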
Code Point : U+0041
Bytes (UTF-8) : 01000001
Code Point : U+2F840
Bytes : 11110000 - 10101111 - 10100001 - 10000000
(F0 - AF - A1 - 80)
Only what's necessary — Compatible with ASCII
| Range | Code point (binary) | Code units (1-4 x 8 bits) |
|---|---|---|
| U+0000..U+007F | 0xxxxxxx | 0xxxxxxx |
| U+0080..U+07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx |
| U+0800..U+D7FF, U+E000..U+FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx |
| U+10000..U+10FFFF | 000uuuzz zzzzyyyy yyxxxxxx | 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx |
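The bit layout in the table translates almost line for line into code. The sketch below is a hand-written encoder for a single code point, compared against the JDK's own UTF-8 encoder; `Utf8Demo` and `encode` are illustrative names, not an existing API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Encodes one code point by following the four rows of the table
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point");
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        int cp = 0x2F840;
        byte[] manual = encode(cp);
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, jdk));  // true
    }
}
```

Surrogate code points are rejected because they exist only as halves of UTF-16 pairs, which is why the third row of the table skips U+D800..U+DFFF.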
è (U+00E8) => C3 A8 (UTF-8) => è (bytes misread as ISO-8859-1)
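This garbling is easy to reproduce. Strict ASCII has no byte values above 0x7F, so the sketch below assumes the bytes are decoded as ISO-8859-1 (Latin-1), which yields exactly the two characters shown; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text as UTF-8, then decodes the bytes with the wrong charset
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00E8";  // LATIN SMALL LETTER E WITH GRAVE
        String garbled = misread(original);

        System.out.println(garbled.length());                // 2
        System.out.println(garbled.equals("\u00C3\u00A8"));  // true
    }
}
```

Each UTF-8 byte becomes its own character: C3 is read as Ã and A8 as the diaeresis sign.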
U+26A0 WARNING SIGN
UTF-16 : 00100110 10100000
(26A0)
UTF-8 : 11100010 - 10011010 - 10100000
(E2 - 9A - A0)