Intro to i18n for Web Developers

Me

Jared Anderson

ICS Stack Team

LDS Church

not an i18n expert

but I did attend a Unicode conference

🙄

i18n & l10n

what the h2k?

i18n & l10n

internationalization and localization are means of adapting computer software to different languages, regional differences and technical requirements of a target locale

INTERNATIONALIZATION (I18N)

the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.

LOCALIZATION (L10N)

the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a locale).

i18n

  • adaptable
  • flexible
  • enabling

l10n

  • a specific adaptation

we write software this way

To Allow for Localization

so "they" can configure it to a target locale

i18n

  • allowing Unicode
  • separating content from source code
  • no hardcoded strings
  • careful string concatenation
  • customizable currency preferences
  • customizable date/time preferences

For EXAMPLE

l10n

  • displaying/storing Kanji characters
  • translating English content to Japanese (sans developer)
  • injecting in Japanese strings bundle
  • converting USD to yen currency (or just starting from yen)

allows for

i18n efforts enabling a Japanese l10n

L10N Customizations & Considerations

  • language / translation
  • text / writing systems
  • text & UI direction
  • number formats
  • numeral systems
  • date and time formats
  • calendar systems
  • currency systems
  • keyboard usage
  • collation and sorting
  • symbols, icons and colors
  • diverse cultural interpretation of graphics and action-text
  • varying legal requirements
  • rethinking of logic
  • visual design
  • and more...

Text, Writing Systems & Unicode

the foundation

Written Communication

h  e  l  l  o

0110 1000
0110 0101
0110 1100
0110 1100
0110 1111

we agree on meaning?

computers agree on meaning?

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

The Unicode Standard

  • A standard for text
  • provides a unique number (called a code point) for every character
  • replaces ASCII, UCS-2, 8-bit, double-byte and 240 other “standards”
  • covers all known business languages today (& then some)

The Unicode Standard

Every character in Unicode carries data describing its characteristics, such as its:

  • Code Point (that's the number)
  • Name
  • General Category
    • Letter, Mark, Number, Punctuation, Symbol, Separator, etc
  • Bidi properties

The Unicode Standard

717,993

  • ASCII: room for 128 code points
  • Unicode: room for 1,114,112 code points 😱
    • 2^16 * 17

Fun Fact

The Unicode Standard

  • Unicode 10.0 (June 2017) added 8,518 characters, for a total of 136,690 characters.
  • These additions include 4 new scripts, for a total of 139 scripts, as well as 56 new emoji characters.

The Unicode Standard

Fun Facts

  • 1,114,112 (2^16 * 17) code points
  • divided into 17 planes
    • each plane contains 65,536 (2^16) code points
    • the first plane, plane 0, is called the Basic Multilingual Plane (BMP)
      • contains all the most common characters
      • smaller code points
    • the other 16 planes (1-16) are called Astral or Supplementary
      • contain emoji, symbols, less commonly used chars
      • higher code points = more "space" required
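The plane arithmetic above can be sketched in a few lines. This is only an illustration; `planeOf` is a made-up helper name, not a built-in.

```javascript
// Each plane holds 0x10000 (65,536) code points, so the plane number is
// the code point divided by 0x10000, rounded down.
// `planeOf` is a hypothetical helper for illustration.
function planeOf(char) {
  const codePoint = char.codePointAt(0);
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf("A"));          // 0  (Basic Multilingual Plane)
console.log(planeOf("\u{1F606}"));  // 1  (😆 lives in an astral plane)
console.log(planeOf("\u{10FFFF}")); // 16 (the last code point of the last plane)
```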

important to know

ignore & ☠️

Character

a single logical unit of text

  • includes things like: A B 1 2 😆 🤡 ⻯ `  ̃
Array
  .from("AB12😆🤡⻯` ̃")
  .map(c => `${c.codePointAt(0)} is ${c}`)

/*
[ '65 is A',
  '66 is B',
  '49 is 1',
  '50 is 2',
  '128518 is 😆',
  '129313 is 🤡',
  '12015 is ⻯',
  '96 is `',
  '32 is  ',
  '771 is ̃' ]
*/

Character, Grapheme, Glyph, Ligature

ideograms, logograms, pictographs 🤕

"👨‍👩‍👧‍👦".replace("👦", "")

a. 👨‍👩‍👧‍👦

b. 👨‍👩‍👧

c. 👨👩👧

d. ""

Pop Quiz

Grapheme

the smallest unit of a writing system of any given language; user perceived characters

  • may be a single Unicode code point, or multiple code points whose component glyphs are positioned appropriately

  • letters, numbers, punctuation, CJK characters, symbols, etc

GLYPH

a single visual unit of text

  • one character may have multiple glyphs.
  • sometimes, 2+ characters side by side are represented by a single glyph.

LIGATURE

when two+ sequential graphemes are represented by one glyph.

PlayGround

Character Encoding

from zeroes & ones to other zeroes & ones

"💩u"[0] === "u"
"💩u"[1] === "u"
"💩u"[2] === "u"

a. true false false

b. false true false

c. false false true

d. false false false

Pop Quiz

"🥓=❤".indexOf("=")

a. 0

b. 1

c. 2

d. 3

char        code point (decimal)   code point (binary)
A           65                     00000000 00000000 01000001
🥓          129,363                00000001 11111001 01010011
[not used]  1,114,111              00010000 11111111 11111111

inefficient to transport, put in memory or store all these zeros, so we encode them

A🥓

00000000 00000000 01000001
00000001 11111001 01010011

Encoding

Unicode Encoding

  • UTF-8: variable width. 1-4 bytes. ASCII is valid UTF-8. Most popular for web documents.
  • UTF-16: variable width. 2 or 4 bytes, using a "surrogate" pair when 4 bytes.
  • UTF-32: fixed width. One 32-bit code unit per code point. Inefficient storage / less common.

an algorithmic mapping from every Unicode code point to a unique sequence of code units (the physical bit sequences)
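To make the size trade-off between the three encodings concrete, here is a rough sketch that measures the same text in each. `encodedSizes` is a hypothetical helper; `TextEncoder` is the standard API and always produces UTF-8.

```javascript
// TextEncoder always encodes to UTF-8.
const utf8Encoder = new TextEncoder();

// Hypothetical helper: byte size of `text` in each Unicode encoding.
function encodedSizes(text) {
  const codePointCount = Array.from(text).length; // Array.from iterates by code point
  return {
    utf8Bytes: utf8Encoder.encode(text).length,
    utf16Bytes: text.length * 2,      // .length counts 16-bit code units
    utf32Bytes: codePointCount * 4,   // one 32-bit code unit per code point
  };
}

console.log(encodedSizes("A"));         // { utf8Bytes: 1, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodedSizes("\u00F1"));    // ñ:  { utf8Bytes: 2, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodedSizes("\u{1F953}")); // 🥓: { utf8Bytes: 4, utf16Bytes: 4, utf32Bytes: 4 }
```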

utf-8 encoding algorithm

2^7 = 128 (1 byte)
2^11 = 2,048 (2 bytes)
2^16 = 65,536 (3 bytes)
2^21 = 2,097,152 (4 bytes)
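Those four thresholds are the payload sizes of 1, 2, 3 and 4-byte UTF-8 sequences. A hand-rolled sketch of the algorithm, for illustration only (in real code use `TextEncoder`; `encodeUtf8` is a made-up name and it does not reject lone surrogates):

```javascript
// Hypothetical teaching helper: encode one code point as UTF-8 bytes.
// Does not validate input (e.g. surrogate code points are not rejected).
function encodeUtf8(codePoint) {
  if (codePoint < 0x80) {
    // up to 7 bits: 0xxxxxxx
    return [codePoint];
  }
  if (codePoint < 0x800) {
    // up to 11 bits: 110xxxxx 10xxxxxx
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    // up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

const toHex = bytes => bytes.map(b => b.toString(16).toUpperCase());

console.log(toHex(encodeUtf8(0x41)));    // [ '41' ]
console.log(toHex(encodeUtf8(0x1f953))); // [ 'F0', '9F', 'A5', '93' ]
```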
so back to this problem....

"💩u"[2] === "u"

JavaScript exposes strings as sequences of 2-byte (16-bit) code units and treats each code unit as a "character". This encoding translates roughly to a UCS-2 / UTF-16 Frankenstein monster

UTF-16

  • variable width, 2 or 4 bytes
    • 2 bytes: all the most common characters (the BMP)
    • 4 bytes: less common characters (the astral planes)
  • encoded using special markings to denote surrogate pairs
    • informs if char is a 2-byte or 2x2-byte character

UCS-2/UTF-16 JS Encoding

  • sort of like UTF-16, but defers actual decoding
  • surrogate pairs are only recombined into a single Unicode character when they're displayed by the browser (during layout).
  • happens outside of the JS engine
'𝌆'.length === 2;           // true: one character, two code units
'\uD834'.concat('\uDF06');  // '𝌆'
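Where `\uD834` and `\uDF06` come from can be shown with the UTF-16 surrogate arithmetic. A minimal sketch; `toSurrogatePair` is a hypothetical helper name:

```javascript
// Hypothetical helper: split a code point into UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint];                     // BMP: a single code unit
  }
  const offset = codePoint - 0x10000;       // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);     // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);    // bottom 10 bits
  return [high, low];
}

const asHex = units => units.map(u => u.toString(16).toUpperCase());

console.log(asHex(toSurrogatePair(0x1d306))); // [ 'D834', 'DF06' ]
console.log(String.fromCharCode(...toSurrogatePair(0x1d306)) === "\u{1D306}"); // true
```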

explore more?

JS Frankencoding

Solution

be aware, be careful, use modern, Unicode-aware JS APIs, and/or a 3rd-party lib

[..."💩u"][0] === "💩"
[..."💩u"][1] === "u"
[..."💩u"][2] === undefined
for (const symb of "💩u") {
  console.log(symb);
}

Good Advice

Strings: not a sequence of Glyphs

Strings: a sequence of Code Units
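A quick way to see the gap between code units, code points and user-perceived characters. This sketch assumes a runtime that ships `Intl.Segmenter`, which is newer than this talk; `countUnits` is a hypothetical helper.

```javascript
// The family emoji, spelled out: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helper: count the same text three different ways.
function countUnits(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                                 // what .length reports
    codePoints: Array.from(text).length,                    // what for...of iterates
    graphemes: Array.from(segmenter.segment(text)).length,  // user-perceived characters
  };
}

console.log(countUnits(family));     // { codeUnits: 11, codePoints: 7, graphemes: 1 }
console.log(countUnits("n\u0303"));  // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```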

String Normalization

"ñ".length === 1;

without running the code, is this true or false?

Pop Quiz

a. true

b. false

c. literally nobody knows

const string1 = "ñ";
const string2 = "ñ";

console.log(
  `string1 (${string1}) and string2 (${string2}) double equal?: `,
  string1==string2,
);

console.log(
  `string1 (${string1}) and string2 (${string2}) triple equal?: `,
  string1===string2,
);

a. true, true

b. true, false

c. false, true

d. false, false

Pop Quiz

Combining Character

characters that are intended to modify other characters

  • In the Latin script, combining diacritical marks (including combining accents) are most common
  • Many CJK symbols are constructed using combining characters

t̴̓̇h̶̀̂e̸͋̇ ̵̏m̶̀̔a̶̍͛t̴̉̿r̴̽͗i̸̔x̷͆̈

Try It

n +  ̃ = ñ (copy/paste individually)

"n".concat("\u0303")

Fun Fact

String Comparison is NOT Simple

there are many correct ways to construct a string with the same canonical meaning

For example, there are 3 correct ways to make an Angstrom character (Å)

  • U+00C5 (Å)
  • U+0041 (A) + U+030A (combining ring above) = Å
  • U+212B (Å)
"\u00C5" === "A\u030A" && "\u00C5" === "\u212B"

try it?

String.prototype.normalize

const string1 = "n\u0303"; // n + combining tilde
const string2 = "\u00F1";  // precomposed ñ

console.log(
  `string1 (${string1}) and string2 (${string2}) compare?: `,
  string1 === string2
);

console.log(
  `normalized string1 (${string1.normalize()}) and normalized string2 (${string2.normalize()}) compare?: `,
  string1.normalize() === string2.normalize()
);
"\u00C5".normalize() === "A\u030A".normalize() && "\u00C5".normalize() === "\u212B".normalize()

try it?

4 Ways to Normalize (TMI?)

Composed (NFC)

decomposed and then recomposed by canonical equivalence (same look & meaning). Recommended by W3C & default for String.prototype.normalize

Decomposed (NFD)

decomposed by canonical equivalence, with multiple combining characters arranged in a specific order.

Compatible Composed (NFKC)

decomposed by compatibility (may look different, same meaning in most contexts), then recomposed by canonical equivalence. Used by IETF for domain names

Compatible Decomposed (NFKD)

decomposed by compatibility, with multiple combining characters arranged in a specific order
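To see the four forms side by side, this sketch prints the code points each one produces for the Angstrom sign and for the "ff" ligature. `codePointsOf` and `allForms` are hypothetical helpers; `String.prototype.normalize` is the only real API used.

```javascript
// Hypothetical helper: render a string as "U+XXXX U+XXXX ...".
const codePointsOf = text =>
  Array.from(text, ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");

// Hypothetical helper: apply every normalization form to the same input.
function allForms(text) {
  const result = {};
  for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
    result[form] = codePointsOf(text.normalize(form));
  }
  return result;
}

console.log(allForms("\u212B")); // Angstrom sign
// { NFC: 'U+00C5', NFD: 'U+0041 U+030A', NFKC: 'U+00C5', NFKD: 'U+0041 U+030A' }

console.log(allForms("\uFB00")); // "ff" ligature: only the K forms split it
// { NFC: 'U+FB00', NFD: 'U+FB00', NFKC: 'U+0066 U+0066', NFKD: 'U+0066 U+0066' }
```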

"fluff.com" === "fluﬀ.com"

"fluff.com".normalize() === "fluﬀ.com".normalize()

"fluff.com".normalize("NFKC") === "fluﬀ.com".normalize("NFKC")

try it?

pretty fly for a

BIDI

<div dir="rtl">ABC123</div>

How will this render in the browser?

Pop Quiz

a. ABC123

b. 321CBA

c. CBA123

d. 123CBA

Answer

Base Direction

the direction that the bidirectional algorithm falls back on when calculating how it should display

  • is set with a `dir` attribute
  • defaults to LTR when not explicitly set.
<div dir="rtl">12345, 🙈🙉🙊!</div>

How will this render in the browser?

Pop Quiz

a. 12345, 🙈🙉🙊!

b. !🙊🙉🙈 ,54321

c. !🙊🙉🙈 ,12345

d. just shut up and tell me

Answer 🤯

Bidirectional Text

languages that are read from right-to-left (RTL) aren't 100% RTL

  • quoting "latin characters" is LTR
  • numbers are rendered LTR

simply writing (anywhere) a "ש", followed by a "ר", followed by a "ה" renders how?

Pop Quiz

a. שרה

b. הרש

BUILT IN Unicode BIDI Properties

  • strongly typed: characters with an inherent LTR or RTL direction display in that direction regardless of base direction
    • Latin characters (ABCD) are strongly typed LTR
    • Arabic & Hebrew (שרה) are strongly typed RTL
    • always display in their direction unless explicitly overridden

  • neutrally typed: directionality indeterminable without context.
    • most white-space characters and some punctuation.
      • when between 2 strongly typed chars of same directionality, assumes that directionality
      • when between 2 strongly typed characters of different directionality, assumes prevailing base direction
  • weakly typed: characters with vague directionality
    • european digits (1234) & arabic-indic digits (١٢٣٤)*
    • arithmetic and currency symbols
    • punctuation common to many scripts (:,)
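JavaScript regular expressions cannot query a character's bidi class directly, but the strong / weak / neutral idea can be approximated. The sketch below guesses a base direction from the first strongly typed letter, similar in spirit to what `dir="auto"` does. It is a rough heuristic, not the Unicode Bidirectional Algorithm: `guessBaseDirection` is a made-up name, and treating only Hebrew and Arabic script letters as RTL is an assumption.

```javascript
// Assumption: only Hebrew and Arabic letters count as strongly RTL here.
const rtlLetter = /[\p{Script=Hebrew}\p{Script=Arabic}]/u;
const anyLetter = /\p{L}/u;

// Hypothetical helper: pick a base direction from the first letter found.
function guessBaseDirection(text) {
  for (const char of text) {                 // iterates by code point
    if (!anyLetter.test(char)) continue;     // skip digits, spaces, punctuation
    return rtlLetter.test(char) ? "rtl" : "ltr";
  }
  return "ltr"; // no letters at all: fall back to the HTML default
}

console.log(guessBaseDirection("abc \u05E9\u05E8\u05D4"));     // ltr
console.log(guessBaseDirection("123 \u05E9\u05E8\u05D4 abc")); // rtl (digits are skipped)
console.log(guessBaseDirection("12345, !"));                   // ltr
```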

we input the following characters in the following logical order; it renders how?

  • "a"
  • "b"
  • "c"
  • " " (space)
  • "ש"
  • "ר"
  • "ה"

Pop Quiz

1.

abc שרה

2.

שרה abc

3.

neither

answer: either 1 or 2 depending on the base direction

Directional Run

When text with different directionality is mixed inline, the bidi algorithm produces a separate directional run out of each sequence of contiguous characters with the same directionality (no markup required)

the order in which directional runs are displayed across the page depends on the prevailing base direction.

<div dir="rtl">
  <bdi>&gt;</bdi>&gt;
</div>

How will this render in the browser?

Pop Quiz

a. >>

b. <<

c. ><

d. <>

Mirrored Characters

Certain characters have mirror-imaged shapes, depending on the direction of the text where they are found. For example, parentheses and brackets have mirrored pairs that are used when direction changes.

Explore Mirrored Characters

bidirectional isolation (<bdi>)

isolates a span of text that might be formatted in a different direction from other text outside it. No effect on directionally-strong characters

bidirectional overrides (<bdo>)

overrides the current directionality of text, so that the text within is rendered in a different direction.

  • affects directionally-strong characters
    • render order, does not cause them to mirror
  • affects mirrored characters

Explore Isolates and Overrides

Language & Locale Codes

ISO 639

is a set of standards by the International Organization for Standardization that is concerned with representation of names for languages and language groups.

language   ISO 639-1   ISO 639-3
English    en          eng
Spanish    es          spa

IETF Language Tag

an abbreviated language code defined by the Internet Engineering Task Force (IETF) in the BCP 47 document series, which is currently composed of normative RFC 5646 and RFC 4647, along with the normative content of the IANA Language Subtag Registry 😅

Description                                                            IETF Language Tag
English                                                                en
Spanish                                                                es
Brazilian Portuguese                                                   pt-BR
Min Nan Chinese as spoken in Taiwan using traditional Han characters   nan-Hant-TW

IETF Language Tag

Components of language tags are drawn from ISO 639, ISO 15924, ISO 3166-1, and UN M.49.

for language codes, it wants "shortest ISO 639 code"

  • uses the 2-letter language code (ISO 639-1) when available
  • falls back to the 3-letter (ISO 639-3) code when not.

Provides an authoritative list of language codes at:

https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
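For picking a tag apart in code, newer runtimes expose `Intl.Locale` and `Intl.getCanonicalLocales` (both arrived after this talk, so treat their availability as an assumption about your target environment). `parseTag` is a hypothetical wrapper.

```javascript
// Hypothetical helper: split a language tag into its main subtags.
function parseTag(tag) {
  const locale = new Intl.Locale(tag); // throws RangeError on malformed tags
  return {
    language: locale.language,
    script: locale.script,  // undefined when the tag has no script subtag
    region: locale.region,  // undefined when the tag has no region subtag
  };
}

console.log(parseTag("nan-Hant-TW")); // { language: 'nan', script: 'Hant', region: 'TW' }
console.log(parseTag("ja"));          // { language: 'ja', script: undefined, region: undefined }

// Canonicalization fixes the casing conventions of each subtag.
console.log(Intl.getCanonicalLocales("PT-br")); // [ 'pt-BR' ]
```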

IETF Language Tag

Used by/in:

  • HTTP
  • W3C
    • HTML
    • XML
    • PNG
  • ANSI
  • ECMA
  • Unicode

IETF LANGUAGE TAG: As Experienced by Web Devs

In the Window

window.navigator.language

In JavaScript

Intl.NumberFormat

Intl.DateTimeFormat

Intl.Collator

Intl.PluralRules

In HTTP

Accept-Language

Content-Language

In HTML

<html lang="">

In CSS

:lang()

In JS Libraries

MomentJS

GlobalizeJS

IETF LANGUAGE TAG

The golden rule when creating language tags is to keep the tag as short as possible.

Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.

Converting Formats

import langs from "langs";

langs.where("3", "kor");

/*
{
  "name":"Korean",
  "local":"한국어",
  "1":"ko",
  "2":"kor",
  "2T":"kor",
  "2B":"kor",
  "3":"kor"
}
*/

Regular Expressions

it's all about u

/a.b/.test("a𝌆b")

true or false?

Pop Quiz

a. true

b. false

c. ?

false

why? no u

/a.b/u.test("a𝌆b") // true

/𝌆{2}/.test('𝌆𝌆');

true or false?

Pop Quiz

a. true

b. false

c. ?

false

why? no u

/𝌆{2}/u.test('𝌆𝌆') // true

/^[^a]$/.test("💩")

true or false (do I really need u here)?

Pop Quiz

a. true

b. false

c. ?

false

yes you need u

/^[^a]$/u.test("💩") // true

the /u flag is an ES6 feature, but Babel will convert it

Fun Fact

Babel and U

More to Come

Unicode Property Escapes

stage 3 feature

\p{UnicodePropertyName=UnicodePropertyValue}
\p{UnicodePropertyNameAlias=UnicodePropertyValueAlias}
\p{LoneUnicodePropertyNameOrValue}
/^\p{Decimal_Number}+$/u
    .test('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼');

/^\P{Decimal_Number}+$/u
    .test('Իմ օդաթիռը լի է օձաձկերով');

/^\p{Number}+$/u
    .test('²³¹¼½¾𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼㉛㉜㉝ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ');

Until then

libraries

http://xregexp.com/

Language Translation

const createName = ({firstName, lastName}) =>
    `${firstName} ${lastName}`;

what name comes first?

what's a last name?

what's wrong with this?

const createMessage = ({greeting, name, message}) => `
${greeting} ${name},

${message}
`;

are you sure about that comma?

what's wrong with this?

// Hey translation team, can you translate
// my strings bundle?

const strings = {
    home: "Home",
    date: "Date"
};

Date: courtship or calendar?

what's wrong with this?

Translation

  • careful with string concatenation
  • no hardcoded strings
  • translators need context
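One way to act on all three points at once is a bundle whose entries carry whole sentences with named placeholders plus a note for translators. Everything in this sketch is hypothetical (the bundle shape, the keys, and the `formatMessage` helper); real projects usually lean on a library format such as ICU MessageFormat instead.

```javascript
// Hypothetical bundle: each entry holds the text and context for translators.
const messages = {
  "nav.home": {
    text: "Home",
    context: "Link to the site's landing page",
  },
  "event.date": {
    text: "Date",
    context: "Calendar date of an event (not a romantic outing)",
  },
  "mail.greeting": {
    text: "{greeting} {name},",
    context: "Opening line of an email; word order and punctuation may change per locale",
  },
};

// Hypothetical helper: fill {placeholders} in a whole translated sentence,
// so translators control word order and punctuation.
function formatMessage(bundle, key, values = {}) {
  const entry = bundle[key];
  if (!entry) throw new Error(`Missing message: ${key}`);
  return entry.text.replace(/\{(\w+)\}/g, (match, name) =>
    name in values ? String(values[name]) : match
  );
}

console.log(formatMessage(messages, "mail.greeting", { greeting: "Hello", name: "Ana" }));
// Hello Ana,
```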

Date Formatting

const d = new Date();

d.toLocaleDateString();

d.toLocaleTimeString();

Intl.DateTimeFormat('en-US')
  .format(d);

Intl.DateTimeFormat('en-US', {
    weekday: 'long',
    year: 'numeric',
    month: 'long',
    day: 'numeric' 
  })
  .format(d);
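The same instant formats very differently per locale. This sketch pins the date and time zone so the output is repeatable; `formatDate` is a hypothetical wrapper, and the outputs assume a runtime with full locale data.

```javascript
// A fixed date (15 January 2018, midnight UTC) so the output is repeatable.
const sampleDate = new Date(Date.UTC(2018, 0, 15));

// Hypothetical wrapper around Intl.DateTimeFormat, pinned to UTC.
function formatDate(date, locale) {
  return new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(date);
}

console.log(formatDate(sampleDate, "en-US")); // 1/15/2018
console.log(formatDate(sampleDate, "de-DE")); // 15.1.2018
console.log(formatDate(sampleDate, "ja-JP")); // 2018/1/15
```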

Built In

Built In Support

Jan 2018

Playground

Date Libraries

  • MomentJS
  • date-fns
  • GlobalizeJS

Number Formatting

var number = 3500;

new Intl.NumberFormat()
  .format(number);
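Passing an explicit locale and options shows what the formatter actually varies. `formatNumber` is a hypothetical wrapper, and the outputs assume a runtime with full locale data. Note that currency formatting only changes presentation; it does not convert between currencies.

```javascript
// Hypothetical wrapper around Intl.NumberFormat.
function formatNumber(value, locale, options) {
  return new Intl.NumberFormat(locale, options).format(value);
}

console.log(formatNumber(3500, "en-US")); // 3,500
console.log(formatNumber(3500, "de-DE")); // 3.500
console.log(formatNumber(3500, "en-US", { style: "currency", currency: "USD" })); // $3,500.00
```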

Built In

Playground

Built In Support

Jan 2018

Number Libraries

  • GlobalizeJS

Collation & Sorting

[string1, string2, string3].sort()

should I do this?

Pop Quiz

a. no

b. no

Why?

The default sort compares UTF-16 code unit values, which only gives a sensible alphabetical order for same-case ASCII

[string1, string2, string3].sort(
    (a, b) => a.localeCompare(b)
);

Solution

// build the collator once and reuse its bound compare function
[string1, string2, string3]
  .sort(new Intl.Collator("es").compare);

Solution
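A small comparison makes the difference visible: the same three strings sorted by code unit, by German rules, and by Swedish rules. The variable names are illustrative only.

```javascript
const words = ["z", "\u00E4", "a"]; // "\u00E4" is ä

// Default sort: compares UTF-16 code units, so ä (U+00E4) lands after z.
const byCodeUnit = [...words].sort();

// Locale-aware sorts: German treats ä like a; Swedish puts ä after z.
const german = [...words].sort(new Intl.Collator("de").compare);
const swedish = [...words].sort(new Intl.Collator("sv").compare);

console.log(byCodeUnit); // [ 'a', 'z', 'ä' ]
console.log(german);     // [ 'a', 'ä', 'z' ]
console.log(swedish);    // [ 'a', 'z', 'ä' ]
```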

Common Locale Data Repository

tl;dr: cldr

Unicode CLDR

provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available

Building Blocks

  • Locale-specific patterns for formatting and parsing: dates, times, timezones, numbers and currency values
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, timezones, cities, and time units
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences
  • Country information: language usage, currency information, calendar preference and week conventions, and telephone codes
  • Other: ISO & BCP 47 code support (cross mappings, etc.), keyboard layouts
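The "plural cases" data in that list is what `Intl.PluralRules` exposes in JavaScript (implementations typically get it from CLDR through ICU). A short sketch; `pluralCategory` is a hypothetical wrapper, and the outputs assume a runtime with full locale data.

```javascript
// Hypothetical wrapper: which plural category does `count` fall into?
function pluralCategory(locale, count, type = "cardinal") {
  return new Intl.PluralRules(locale, { type }).select(count);
}

console.log(pluralCategory("en", 1)); // one
console.log(pluralCategory("en", 2)); // other
console.log(pluralCategory("ar", 2)); // two   (Arabic has a dedicated dual form)
console.log(pluralCategory("ja", 1)); // other (Japanese has a single category)

// Ordinals use a different rule set: 1st, 2nd, 3rd, 4th.
console.log([1, 2, 3, 4].map(n => pluralCategory("en", n, "ordinal")));
// [ 'one', 'two', 'few', 'other' ]
```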

ICU

International Components for Unicode (ICU)

an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization

  • parts are built into core NodeJS (may require a special Node build) and browser JS (Intl)

International Components for Unicode (ICU)

  • Code Page Conversion
  • Collation
  • Formatting
    • numbers
    • dates
    • times
    • currency
  • Time Calculations
  • Unicode Support
  • Regular Expressions
  • BiDi
  • Text Boundaries

Internal Help

i18n.ldschurch.org

i18n@ldschurch.org

i18n for web developers

By Jared Anderson
