Unicode and UTF

Motivation

Motivation

Quiz

Quiz

Go: len("世界") ?
Java: "世界 ".length() ?

Quiz

Go: len("世界") => 6
Java: "世界 ".length() => 2

Quiz

utf8.RuneCountInString("世界") // => 2

fmt.Printf("% X\n", "世界") 
// => E4 B8 96 E7 95 8C (will come to this later)

Quiz

utf8.RuneCountInString("世界") // => 2

What is Unicode?

Available code points up to

0x10FFFF (21 bits) (1 114 111),
although not every code point has a character assigned to it

What is Unicode?

What is Unicode?

What is Unicode?

What is Unicode?

Terminology

  • Character
  • Character set
  • Coded character set
  • Code points
  • Basic Multilingual Plane (BMP)
  • Supplementary characters
  • Character encoding scheme
  • UTF-{32, 16, 8}

Terminology

A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. 


"A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries. 

A character set is a collection of characters.

For example, the Han characters are the characters originally invented by the Chinese, 
which have been used to write Chinese, Japanese, Korean, and Vietnamese. 

Terminology

A coded character set is a character set where each character has been assigned a unique number.

At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041 16 and the letter "€" the number 20AC 16. The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041". 

Code points are the numbers that can be used in a coded character set.

A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points. 

Terminology

Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode.

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character. 

Terminology

A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units.

The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard. 

What is

UTF-{32, 16, 8}?

What is UTF?

What is UTF-32?

What is UTF-16?

What is UTF-8?

What is UTF-8?

What is UTF-8?

What is UTF?

What is UTF?

Byte Order Mark (BOM)

Hands-on

Hands-on

FEFF = 1111 1110 1111 1111
              zzzz yyyy yyxx xxxx
1110zzzz 10yyyyyy 10xxxxxx

1110(1111) 10(111011) 10(111111) = EF BB BF


4E16 = 0100 1110 0001 0110
1110(0100) 10(111000) 10(010110) = E4 B8 96


754C = 0111 0101 0100 1100
1110(0111) 10(010101) 10(001100) = E7 95 8C

Hands-on

But how did you get the code points at the first place?

Java:
        

        

 

Go:
 

System.out.printf("%X%n", Character.codePointAt("世界", 0)); // => 4E16
System.out.printf("%X%n", Character.codePointAt("世界", 1)); // => 754C
System.out.printf("%X%n", (int) '世'); // => 4E16
System.out.printf("%X%n", (int) '界'); // => 754C
fmt.Printf("%U %U\n", '世', '界') // => U+4E16 U+754C

Encoding in Java

Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it has to support supplementary characters. 

Properties files, unfortunately, are still limited to ISO 8859-1 as their encoding (unless your application uses the new XML format). 
This means you always have to use escape sequences for supplementary characters, and again probably will want to write in a different encoding and then convert with a tool such as native2ascii.

Encoding in Java

java.lang.Character

Encoding in Go

Source code in Go is defined to be UTF-8 text; no other representation is allowed

(considering Rob Pike is one of UTF-8 creators)

Normalization

java.text.Normalizer

golang.org/x/text/unicode/norm

Code walk through

UTF-8 in the wild

References

Summary

Made with Slides.com