intro to
Jared Anderson
ICS Stack Team
LDS Church
but I did attend a Unicode conference
๐
what the h2k?
internationalization and localization are means of adapting computer software to different languages, regional differences and technical requirements of a target locale
i18n: the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.
l10n: the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a locale).
we write software this way
so "they" can configure it to a target locale
i18n
l10n
allows for
i18n efforts enabling a Japanese l10n
the foundation
h e l l o
h e l l o
h e l l o
h e l l o
0110 1000 0110 0101 0110 1100 0110 1100 0110 1111
we agree on meaning?
computers agree on meaning?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
Every character in Unicode contains data describing its characteristics such as its:
717,993
Fun Fact
Fun Fact
Fun Facts
important to know
ignore & ⚠️
a single logical unit of text
Array
.from("AB12😆🤡⻯` ̃")
.map(c => `${c.codePointAt(0)} is ${c}`)
/*
[ '65 is A',
'66 is B',
'49 is 1',
'50 is 2',
'128518 is 😆',
'129313 is 🤡',
'12015 is ⻯',
'96 is `',
'32 is ',
'771 is ̃' ]
*/
ideograms, logograms, pictographs ๐ค
"👨‍👩‍👧‍👦".replace("👦", "")
a. 👨‍👩‍👧‍👦
b. 👨‍👩‍👧
c. 👨👩👧
d. ""
ligature: when two or more sequential graphemes are represented by one glyph.
grapheme: the smallest unit of a writing system of any given language; a user-perceived character
may be a single Unicode code point or several, with component glyphs positioned appropriately
letters, numbers, punctuation, CJK characters, symbols, etc
glyph: a single visual unit of text
from zeroes & ones to other zeroes & ones
"๐ฉu"[0] === "u"
"๐ฉu"[1] === "u"
"๐ฉu"[2] === "u"
a. true false false
b. false true false
c. false false true
d. false false false
"๐ฅ=โค".indexOf("=")
a. 0
b. 1
c. 2
d. 3
1
2
char | code point (decimal) | code point (binary) |
---|---|---|
A | 65 | 00000000 00000000 01000001 |
🥓 | 129,363 | 00000001 11111001 01010011 |
[not used] | 1,114,111 | 00010000 11111111 11111111 |
it is inefficient to transport, hold in memory or store all these zeros, so we encode them
A🥓
00000000 00000000 01000001
00000001 11111001 01010011
encoding: an algorithmic mapping from every Unicode code point to a unique, physical bit sequence made up of code units
utf-8 encoding algorithm
2^7 = 128
2^11 = 2,048
2^16 = 65,536
2^21 = 2,097,152
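The four powers of two above are the payload sizes of 1-, 2-, 3- and 4-byte UTF-8 sequences. A minimal sketch to see that in practice, assuming a runtime with the standard `TextEncoder` global (browsers, Node.js); `utf8Bits` is a hypothetical helper name, not part of any API:

```javascript
// TextEncoder always produces UTF-8 bytes.
const encoder = new TextEncoder();

// Hypothetical helper: the UTF-8 bytes of a string, each shown as 8 binary digits.
const utf8Bits = (text) =>
  Array.from(encoder.encode(text), (byte) => byte.toString(2).padStart(8, "0"));

// One sample per UTF-8 length: A, ñ, ⻯, 🥓
const samples = ["A", "\u00F1", "\u2EEF", "\u{1F953}"];
const byteCounts = samples.map((sample) => encoder.encode(sample).length);

console.log(byteCounts); // [ 1, 2, 3, 4 ]
console.log(utf8Bits("A").join(" ")); // 01000001
console.log(utf8Bits("\u{1F953}").join(" ")); // 11110000 10011111 10100101 10010011
```

The leading bits of the first byte (0, 110, 1110, 11110) announce how many bytes follow; every continuation byte starts with 10.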
"๐ฉu"[2] === "u"
so back to this problem....
JavaScript stores strings as a sequence of 2-byte, fixed-width code units. This encoding translates roughly to a UCS-2 / UTF-16 Frankenstein monster
UTF-16
UCS-2/UTF-16 JS Encoding
'๐'.length === 2;
'\uD834'.concat('\uDF06');
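Where do `\uD834` and `\uDF06` come from? A small sketch of the UTF-16 surrogate arithmetic for code points above U+FFFF; `toSurrogatePair` is a hypothetical helper written for this slide, not a built-in:

```javascript
// Hypothetical helper: split an astral code point (above U+FFFF)
// into its UTF-16 high and low surrogate code units.
const toSurrogatePair = (codePoint) => {
  const offset = codePoint - 0x10000; // 20 bits remain
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
};

const [high, low] = toSurrogatePair(0x1d306);
const rebuilt = String.fromCharCode(high, low);

console.log(high.toString(16), low.toString(16)); // d834 df06
console.log(rebuilt.length); // 2 code units
console.log(rebuilt.codePointAt(0) === 0x1d306); // true, one code point
```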
explore more?
be aware, be careful, use modern, Unicode-aware JS APIs, and/or a 3rd-party lib
[..."๐ฉu"][0] === "๐ฉ"
[..."๐ฉu"][1] === "u"
[..."๐ฉu"][2] === undefined
for( const symb of "๐ฉu" ) {
console.log(symb);
}
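Spread and `for...of` walk code points, which is still not the same as user-perceived characters. A sketch of counting graphemes, assuming the runtime ships `Intl.Segmenter` (recent browsers, Node.js 16+); `countGraphemes` is a hypothetical helper name:

```javascript
// Family emoji: four people joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helper: count user-perceived characters (grapheme clusters).
const countGraphemes = (text, locale = "en") => {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
};

console.log(family.length); // 11 code units
console.log([...family].length); // 7 code points
console.log(countGraphemes(family)); // 1 grapheme
console.log(countGraphemes("n\u0303")); // 1 grapheme (n + combining tilde)
```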
Strings: sequence of Glyphs
Strings: sequence of Code Units
"n\u0303".length === 1; // n + combining tilde (U+0303)
without running the code, is this true or false?
a. true
b. false
c. literally nobody knows
const string1 = "n\u0303"; // n + combining tilde (U+0303)
const string2 = "\u00F1"; // precomposed ñ (U+00F1)
console.log(
`string1 (${string1}) and string2 (${string2}) double equal?: `,
string1==string2,
);
console.log(
`string1 (${string1}) and string2 (${string2}) triple equal?: `,
string1===string2,
);
a. true, true
b. true, false
c. false, true
d. false, false
characters that are intended to modify other characters
tฬดฬฬอฬฬฬอฬออฬ ฬอ ฬฬฬฒฬซฬซอฬฉฬฬฬฅฬจฬอฬอ ฬhฬถฬฬฬฬฬฬฬฬฬฬฝออฬฬฬฬณฬ ฬอ ฬญeฬธอออฬฬอฬฬฬฬอฬปฬฑอฬฬฏอฬฬฬญฬอ ฬตฬฬฬณฬ ฬฬฬฬฬณอฬฬฑฬฬบฬบฬmฬถฬฬออฬฬฬอฬฬอ ฬฝฬอฬอฬออฬณฬฬ aฬถฬอฬฬฬนฬฉฬฆtฬดอฬฬฬฟอฬฟออฬจฬผอฬซฬณอฬคrฬดฬฝฬอฬอฬกฬฉฬฬณออออฬผอออฬฒiฬธอฬอฬฬอฬฬฉฬปฬ ฬผฬกอฬณฬอฬฒอฬฉxฬทอฬออฬฬอออฬออฬฬซอ
n + ◌̃ (U+0303) = ñ (copy/paste individually)
"n".concat("\u0303")
Fun Fact
there are many correct ways to construct a string with the same canonical meaning
For example, there are 3 correct ways to make an Angstrom character (Å)
// precomposed U+00C5, A + combining ring U+030A, Angstrom sign U+212B
"\u00C5" === "A\u030A" && "\u00C5" === "\u212B"
try it?
const string1 = "n\u0303"; // n + combining tilde (U+0303)
const string2 = "\u00F1"; // precomposed ñ (U+00F1)
console.log(
`string1 (${string1}) and string2 (${string2}) compare?: `,
string1 === string2
);
console.log(
`normalized string1 (${string1}) and normalized string2 (${string2}) compare?: `,
string1.normalize() === string2.normalize()
);
"\u00C5".normalize() === "A\u030A".normalize() && "\u00C5".normalize() === "\u212B".normalize()
try it?
NFC: decomposed and then recomposed by canonical equivalence (same look & meaning). Recommended by W3C & default for String.prototype.normalize
NFD: decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFKC: decomposed by compatibility (may look different, same meaning sometimes), then recomposed by canonical equivalence. Used by IETF for domain names
NFKD: decomposed by compatibility, and multiple combining characters are arranged in a specific order
// "\uFB00" is the single-character ff ligature (U+FB00)
"fluff.com" === "flu\uFB00.com"
"fluff.com".normalize() === "flu\uFB00.com".normalize()
"fluff.com".normalize("NFKC") === "flu\uFB00.com".normalize("NFKC")
try it?
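To compare the four forms side by side, here is a small sketch that normalizes a few strings and reports how many code units come out. `lengthsByForm` is a hypothetical helper; only `String.prototype.normalize` is a real API:

```javascript
const forms = ["NFC", "NFD", "NFKC", "NFKD"];

// Hypothetical helper: code unit length of `text` under each normalization form.
const lengthsByForm = (text) =>
  Object.fromEntries(forms.map((form) => [form, text.normalize(form).length]));

console.log(lengthsByForm("\u00F1")); // ñ: canonical forms differ (1 vs 2)
console.log(lengthsByForm("\uFB00")); // ff ligature: only the K forms split it
console.log("\u212B".normalize("NFC") === "\u00C5"); // Angstrom sign becomes Å
console.log("flu\uFB00.com".normalize("NFKC") === "fluff.com"); // true
console.log("flu\uFB00.com".normalize("NFC") === "fluff.com"); // false
```

Canonical forms (NFC/NFD) keep the ligature because it only has a compatibility decomposition, which is why the default `normalize()` call does not make the two domain names equal.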
pretty fly for a
<div dir="rtl">ABC123</div>
How will this render in the browser?
a. ABC123
b. 321CBA
c. CBA123
d. 123CBA
base direction: the direction that the bidirectional algorithm falls back on when calculating how it should display
<div dir="rtl">12345, ๐๐๐!</div>
How will this render in the browser?
a. 12345, ๐๐๐!
b. !๐๐๐ ,54321
c. !๐๐๐ ,12345
d. just shut up and tell me
languages that are read from right-to-left (RTL) aren't 100% RTL
simply writing (anywhere) a "ืฉ", followed by a "ืจ", followed by a "ื" renders how?
a. ืฉืจื
b. ืฉืจื
always display in their direction unless explicitly overridden
we input the following characters in the following logical order; it renders how?
1.
abc ืฉืจื
2.
abc ืฉืจื
3.
neither
answer: either 1 or 2 depending on the base direction
When text with different directionality is mixed inline, the bidi algorithm produces a separate directional run out of each sequence of contiguous characters with the same directionality (no markup required)
the order in which directional runs are displayed across the page depends on the prevailing base direction.
<div dir="rtl">
<bdi>></bdi>>
</div>
How will this render in the browser?
a. >>
b. <<
c. ><
d. <>
Certain characters have mirror-imaged shapes, depending on the direction of the text where they are found. For example, parentheses and brackets have mirrored pairs that are used when direction changes.
<bdi>: isolates a span of text that might be formatted in a different direction from other text outside it. No effect on directionally-strong characters
<bdo>: overrides the current directionality of text, so that the text within is rendered in a different direction.
ISO 639 is a set of standards by the International Organization for Standardization that is concerned with representation of names for languages and language groups.
language | ISO 639-1 | ISO 639-3 |
---|---|---|
English | en | eng |
Spanish | es | spa |
IETF language tag: an abbreviated language code defined by the Internet Engineering Task Force (IETF) in the BCP 47 document series, which is currently composed of normative RFC 5646 and RFC 4647, along with the normative content of the IANA Language Subtag Registry
Description | IETF Language Tag |
---|---|
English | en |
Spanish | es |
Brazilian Portuguese | pt-BR |
Min Nan Chinese as spoken in Taiwan using traditional Han characters | nan-Hant-TW |
Components of language tags are drawn from ISO 639, ISO 15924, ISO 3166-1, and UN M.49.
for language codes, it wants "shortest ISO 639 code"
IANA provides an authoritative list of language codes at:
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Used by/in:
window.navigator.language
Intl.NumberFormat
Intl.DateTimeFormat
Intl.Collator
Intl.PluralRules
Accept-Language
Content-Language
<html lang="">
:lang()
MomentJS
GlobalizeJS
The golden rule when creating language tags is to keep the tag as short as possible.
Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
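A quick sketch for poking at language tags from JS, assuming a runtime with `Intl.getCanonicalLocales` and `Intl.Locale`; `isWellFormedTag` is a hypothetical helper. Note that these APIs check the structure of a tag, not whether each subtag is registered with IANA:

```javascript
// Canonicalization fixes casing: language lowercase, region uppercase.
const [canonical] = Intl.getCanonicalLocales("PT-br");

// Intl.Locale exposes the individual subtags.
const minNan = new Intl.Locale("nan-Hant-TW");

// Hypothetical helper: structurally invalid tags throw a RangeError.
const isWellFormedTag = (tag) => {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (error) {
    return false;
  }
};

console.log(canonical); // pt-BR
console.log(minNan.language, minNan.script, minNan.region); // nan Hant TW
console.log(isWellFormedTag("ja")); // true
console.log(isWellFormedTag("en_US")); // false, underscores are not allowed
```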
import langs from "langs";
langs.where("3", "kor");
/*
{
"name":"Korean",
"local":"한국어",
"1":"ko",
"2":"kor",
"2T":"kor",
"2B":"kor",
"3":"kor"
}
*/
it's all about u
/a.b/.test("a๐b")
true or false?
a. true
b. false
c. ?
/a.b/u.test("a๐b")
why? no u
/๐{2}/.test('๐๐');
true or false?
a. true
b. false
c. ?
/๐{2}/u.test('๐๐')
why? no u
/^[^a]$/.test("๐ฉ")
true or false (do I really need u here)?
a. true
b. false
c. ?
/^[^a]$/u.test("๐ฉ")
yes you need u
the /u flag is an ES6 feature, but Babel will convert it
Fun Fact
stage 3 feature
\p{UnicodePropertyName=UnicodePropertyValue}
\p{UnicodePropertyNameAlias=UnicodePropertyValueAlias}
\p{LoneUnicodePropertyNameOrValue}
/^\p{Decimal_Number}+$/u
.test('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼');
/^\P{Decimal_Number}+$/u
.test('Իմ օդաթիռը լի է օձաձկերով');
/^\p{Number}+$/u
.test('²³¹¼½¾𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼㉛㉜㉝ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ');
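The `Name=Value` form is handy for scripts. A small sketch, assuming the engine supports Unicode property escapes; `isWrittenIn` is a hypothetical helper that builds the pattern from a script name:

```javascript
// Hypothetical helper: true when every code point in `text` belongs to `script`.
const isWrittenIn = (script, text) =>
  new RegExp(`^\\p{Script=${script}}+$`, "u").test(text);

const greekWord = "\u03B1\u03B2\u03B3"; // αβγ

console.log(isWrittenIn("Greek", greekWord)); // true
console.log(isWrittenIn("Latin", greekWord)); // false
console.log(isWrittenIn("Hebrew", "\u05E9")); // true (ש)

// The lone form works for general categories such as Uppercase_Letter.
console.log(/^\p{Uppercase_Letter}/u.test("\u00D1andu")); // true (Ñandu)
```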
const createName = ({firstName, lastName}) =>
`${firstName} ${lastName}`;
what name comes first?
what's a last name?
what's wrong with this?
const createMessage = ({greeting, name, message}) => `
${greeting} ${name},
${message}
`;
are you sure about that comma?
what's wrong with this?
// Hey translation team, can you translate
// my strings bundle?
const strings = {
home: "Home",
date: "Date"
};
Date: courtship or calendar?
what's wrong with this?
const d = new Date();
d.toLocaleDateString();
d.toLocaleTimeString();
new Intl.DateTimeFormat('en-US')
.format(d);
new Intl.DateTimeFormat('en-US', {
weekday: 'long',
year: 'numeric',
month: 'long',
day: 'numeric'
})
.format(d);
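Same date, same options, different locale. A sketch assuming the runtime ships full ICU locale data (the default in current browsers and Node.js); the date and the `formatIn` helper are made up for illustration, and `timeZone: "UTC"` just keeps the output independent of the machine's time zone:

```javascript
// A fixed moment: 15 January 2018, noon UTC.
const someDay = new Date(Date.UTC(2018, 0, 15, 12));

const options = {
  weekday: "long",
  year: "numeric",
  month: "long",
  day: "numeric",
  timeZone: "UTC",
};

// Hypothetical helper: format the same date for a given locale.
const formatIn = (locale) =>
  new Intl.DateTimeFormat(locale, options).format(someDay);

const english = formatIn("en-US");
const german = formatIn("de-DE");

console.log(english);
console.log(german);
```

Word order, punctuation and month names all change, which is why hand-assembling date strings from parts does not survive localization.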
Jan 2018
var number = 3500;
new Intl.NumberFormat()
.format(number);
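Without an argument, `Intl.NumberFormat` uses the runtime's default locale. A sketch that passes the locale explicitly, assuming full ICU locale data is available; `formatNumber` is a hypothetical helper:

```javascript
// Hypothetical helper: format a number for an explicit locale.
const formatNumber = (locale, value) =>
  new Intl.NumberFormat(locale).format(value);

const usStyle = formatNumber("en-US", 3500);
const germanStyle = formatNumber("de-DE", 3500);
const indianStyle = formatNumber("en-IN", 1234567.891);

console.log(usStyle); // 3,500
console.log(germanStyle); // 3.500
console.log(indianStyle); // 12,34,567.891
```

The grouping separator, the decimal separator and even the group sizes are locale data, not something to hard-code.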
Jan 2018
[string1, string2, string3].sort()
should I do this?
a. no
b. no
The default sort compares UTF-16 code unit values, which only gives a sensible order for ASCII text in a single case
[string1, string2, string3].sort(
(a, b) => a.localeCompare(b)
);
// build the collator once and reuse its compare function
const collator = new Intl.Collator("es");
[string1, string2, string3]
.sort(collator.compare);
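The right order depends on the language. A sketch sorting the same three strings three ways, assuming full ICU locale data; the word list is made up for illustration:

```javascript
const letters = ["z", "\u00E4", "a"]; // z, ä, a

// Default sort: raw UTF-16 code unit order, ä lands after z.
const byCodeUnit = [...letters].sort();

// German sorts ä next to a; Swedish treats ä as a letter after z.
const german = [...letters].sort(new Intl.Collator("de").compare);
const swedish = [...letters].sort(new Intl.Collator("sv").compare);

console.log(byCodeUnit.join(" ")); // a z ä
console.log(german.join(" ")); // a ä z
console.log(swedish.join(" ")); // a z ä

// Collator options can also loosen the comparison, e.g. ignore case.
const ignoreCase = new Intl.Collator("en", { sensitivity: "base" });
console.log(ignoreCase.compare("a", "A")); // 0
```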
tl;dr: cldr
CLDR (Unicode Common Locale Data Repository): provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available
ICU (International Components for Unicode): an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization
i18n.ldschurch.org
i18n@ldschurch.org
https://kev.inburke.com/kevin/node-js-string-encoding/
https://mathiasbynens.be/notes/javascript-encoding
https://www.w3.org/International/questions/qa-what-is-encoding
https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/