Oh sir, and who localized this for you like that?

Szymon Teżewski

About me

Szymon Teżewski 👨
@jasisz1 🐦
Applause 👔
jasisz.githubio.com 🖋

Questions? Comments?

Please hold your questions until the end of the presentation.

*I will only answer the ones I know how to answer

Legend:

🐍 - Python arcana

🎸 - Django arcana

Concepts

Internationalization

i18n

that is, everything we do so that a program can be adapted to different groups of users

Localization

l10n

that is, adapting a previously internationalized program to one specific group

e.g. Arabization, Russification

Locale

What would I like?

in Polish 🇵🇱
if not in Polish, then in English 🇬🇧
or, at a pinch, in Russian 🇷🇺
prices in złoty, please
failing that, in euros or dollars
temperature in degrees Celsius
SI units?
and when I am abroad, it should show me the time...

POSIX locale

locale aspects

LANGUAGE
LC_ALL
LC_XX (LC_COLLATE, LC_MONETARY, etc.)
LANG

For example, assume you are a Swedish user in Spain, and you want your programs to handle numbers and dates according to Spanish conventions, and only the messages should be in Swedish.

Then you could create a locale named ‘sv_ES’ or ‘sv_ES.UTF-8’ by use of the localedef program.

But it is simpler, and achieves the same effect, to set the LANG variable to es_ES.UTF-8 and the LC_MESSAGES variable to sv_SE.UTF-8; these two locales come already preinstalled with the operating system.

https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html#Locale-Environment-Variables
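
A minimal sketch of the lookup order listed above, assuming the GNU gettext conventions from the quoted manual. The function names are made up for illustration, and the real C library does more (locale aliases, validation, fallbacks).

```python
def effective_locale(env, category="LC_MESSAGES"):
    """Return the locale name used for one category: LC_ALL > LC_xx > LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are skipped
            return value
    return "C"


def message_languages(env):
    """Return the ordered list of locales tried for message catalogs."""
    base = effective_locale(env, "LC_MESSAGES")
    language = env.get("LANGUAGE")
    # GNU gettext ignores LANGUAGE when the resolved locale is plain "C".
    if language and base != "C":
        return [item for item in language.split(":") if item]
    return [base]


# The Swedish user in Spain from the quoted example.
swede_in_spain = {"LANG": "es_ES.UTF-8", "LC_MESSAGES": "sv_SE.UTF-8"}
print(effective_locale(swede_in_spain, "LC_NUMERIC"))   # es_ES.UTF-8
print(effective_locale(swede_in_spain, "LC_MESSAGES"))  # sv_SE.UTF-8
```

LANGUAGE only affects messages and may hold a colon-separated priority list, which is why it gets a separate function here.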

ISO 15897

en_US
en_US.UTF-8
pl_PL.ISO8859-2
wa_BE.iso885915@euro
ca_ES.utf8@valencia
tt_RU.utf8@iqtelif
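
The names above follow the pattern language_TERRITORY.codeset@modifier. A rough sketch of splitting them apart, with a regular expression written for this slide only (it does not accept special names such as "C" or "POSIX"):

```python
import re

# language[_TERRITORY][.codeset][@modifier]
LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale_name(name):
    """Split a POSIX-style locale name into its four optional parts."""
    match = LOCALE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected locale name: {name!r}")
    return match.groupdict()


print(parse_locale_name("wa_BE.iso885915@euro"))
print(parse_locale_name("en_US"))
```

Parts that are missing from the name come back as None.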

IETF language tag / BCP 47

en
en-US (preferred over en-us, although the RFC says case does not matter)
es-419 (Spanish of Latin America and the Caribbean)
i-klingon (deprecated in favor of tlh)
i-enochian (the occult language of angels)
art-lojban (deprecated in favor of jbo)
sr-Cyrl
zh-yue-HK (Chinese, but Cantonese (sic!), Hong Kong; same as yue-HK)
sr-Latn-RS (Serbian in Latin script, Serbia)
sl-rozaj-biske (the San Giorgio dialect of the Resian dialect of Slovenian)
de-CH-1901 (the orthography variant from 1901)
hy-Latn-IT-arevela (Eastern Armenian in Latin script, Italy)

pl-Cyrl-151-kociewie

ja-Latn-030-hepburn-heploc

sl-Cyrl-155-rozaj-biske-1994

sl-Cyrl-155-nedis-rozaj-biske-lipaw-njiva-osojs-solba-bohoric-dajnko-metelko-1994*

*violates only SHOULDs and no MUSTs; its validity cannot be checked
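
A simplified sketch of how the subtags in the tags above can be told apart by length and character class alone, following the grammar in RFC 5646. The function name is an assumption. It does not consult the subtag registry, so it cannot tell whether a tag is valid, and grandfathered tags such as i-klingon are out of scope.

```python
def _alpha(subtag, length):
    return len(subtag) == length and subtag.isascii() and subtag.isalpha()


def _is_variant(subtag):
    if not (subtag.isascii() and subtag.isalnum()):
        return False
    # 5-8 alphanumerics, or 4 characters starting with a digit (e.g. "1901")
    return 5 <= len(subtag) <= 8 or (len(subtag) == 4 and subtag[0].isdigit())


def split_language_tag(tag):
    """Classify the subtags of a regular (non-grandfathered) language tag."""
    subtags = tag.split("-")
    result = {"language": subtags.pop(0).lower(), "extlang": [], "script": None,
              "region": None, "variants": [], "rest": []}
    # Up to three 3-letter extended language subtags, e.g. "yue" in zh-yue-HK.
    while subtags and len(result["extlang"]) < 3 and _alpha(subtags[0], 3):
        result["extlang"].append(subtags.pop(0).lower())
    if subtags and _alpha(subtags[0], 4):
        result["script"] = subtags.pop(0).title()
    if subtags and (_alpha(subtags[0], 2)
                    or (len(subtags[0]) == 3 and subtags[0].isdigit())):
        result["region"] = subtags.pop(0).upper()
    while subtags and _is_variant(subtags[0]):
        result["variants"].append(subtags.pop(0).lower())
    result["rest"] = subtags  # extensions (-u-, -t-) and private use, unparsed
    return result


print(split_language_tag("sl-Cyrl-155-rozaj-biske-1994"))
```

The casing applied here (lowercase language, title-case script, uppercase region) is the conventional form; matching itself is case-insensitive.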

CLDR

Unicode Common Locale Data Repository

a huge collection of localization data of every kind

uses a slightly modified BCP 47 and defines the -u and -t extensions

🐍 this is what Babel uses

sl-155-rozaj-biske-1994-u-ca-islamic-civil-co-gb2312han-cu-pln-em-default-fw-sun-hc-h11-ka-noignore-kb-false-kc-false-kf-lower-kh-false-kk-false-kn-false-ks-identic-kv-currency-lb-strict-lw-breakall-nu-mathbold-ms-ussystem-ss-standard-tz-gldkshvn-t-googlevk...

Accept-Language

part of Content Negotiation in HTTP (RFC 7231)

values and matching are described in RFC 4647:

pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4
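
A small sketch of reading such a header into weighted language ranges. The function name is made up, and real frameworks handle more edge cases (wildcards, q=0 exclusions, malformed input).

```python
def parse_accept_language(header):
    """Return (language_range, quality) pairs, highest quality first."""
    choices = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        quality = 1.0  # a range without q= has the default weight 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unreadable weight: treat as not acceptable
        choices.append((language_range, quality))
    # sorted() is stable, so ranges with equal weight keep their header order
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


print(parse_accept_language("pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4"))
```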

Browser fingerprinting

uses BCP 47

How do we marry the standards? 🎸


In [1]: from django.utils.translation.trans_real import to_locale, to_language

In [2]: to_locale('sr-Latn-RS')  # sr_RS.UTF-8@latin ?
Out[2]: u'sr_Latn-rs'

In [3]: to_language('ca_ES.utf8@valencia')  # ca-ES-valencia ?
Out[3]: u'ca-es.utf8@valencia'

*later on: why we marry them at all

The HTTP Accept-Language header was originally only intended to specify the user's language. However, since many applications need to know the locale of the user, common practice has used Accept-Language to determine this information. It is not a good idea to use the HTTP Accept-Language header alone to determine the locale of the user.
(...)
A language preference of es-MX doesn't necessarily mean that a postal address form should be formatted or validated for Mexican addresses. The user might still live in the USA (or elsewhere).

https://www.w3.org/International/questions/qa-accept-lang-locales

So where do we get it from?

This is what people usually do:

URL > user's choice > Accept-Language

+ GeoIP/geolocation somewhere in the mix for currencies and time
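
The priority chain above, sketched as one function. All names and the set of supported languages are hypothetical; the list of accepted ranges is assumed to be already ordered by preference.

```python
def pick_language(url_language, user_choice, accepted, supported, default="en"):
    """Pick the first supported language: URL > user's choice > Accept-Language."""
    for candidate in [url_language, user_choice, *accepted]:
        if not candidate:
            continue
        tag = candidate.lower()
        # Try the full tag first, then just its primary language subtag.
        for option in (tag, tag.split("-")[0]):
            if option in supported:
                return option
    return default


supported = {"pl", "en"}
print(pick_language(None, None, ["pl-PL", "en"], supported))  # pl
print(pick_language("en", "pl", ["de"], supported))           # en
```

This only picks the language of the interface; as the W3C note quoted above warns, it says nothing reliable about currency, address format, or time zone.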

Translations

gettext


# django/conf/locale/zh_Hans/LC_MESSAGES/django.po
msgid "Select a valid choice. That choice is not one of the available choices."
msgstr "选择一个有效的选项： 该选择不在可用的选项中。"

# example from the gettext documentation
msgid "One file removed"
msgid_plural "%d files removed"
msgstr[0] "%d slika je uklonjena"
msgstr[1] "%d datoteke uklonjenih"
msgstr[2] "%d slika uklonjenih"

The letters PO in .po files means Portable Object, to distinguish it from .mo files, where MO stands for Machine Object.

This paradigm, as well as the PO file format, is inspired by the NLS standard developed by Uniforum, and first implemented by Sun in their Solaris system.

https://www.gnu.org/software/gettext/manual/html_node/Files.html

ICU MessageFormat


{gender_of_host, select,
  female {
    {num_guests, plural, offset:1
      =0 {{host} does not give a party.}
      =1 {{host} invites {guest} to her party.}
      =2 {{host} invites {guest} and one other person to her party.}
      other {{host} invites {guest} and # other people to her party.}}}
  male {
    {num_guests, plural, offset:1
      =0 {{host} does not give a party.}
      =1 {{host} invites {guest} to his party.}
      =2 {{host} invites {guest} and one other person to his party.}
      other {{host} invites {guest} and # other people to his party.}}}
  other {
    {num_guests, plural, offset:1
      =0 {{host} does not give a party.}
      =1 {{host} invites {guest} to their party.}
      =2 {{host} invites {guest} and one other person to their party.}
      other {{host} invites {guest} and # other people to their party.}}}}

International Components for Unicode

not only for translations, something like Babel + gettext + more

Mozilla L20N


<brandShortName {
  *nominative: "Aurora",
  genitive: "Aurore",
  dative: "Aurori",
  accusative: "Auroro",
  locative: "Aurori",
  instrumental: "Auroro"
}>
<aboutOld "O brskalniku {{ brandShortName }}">
<about "O {{ brandShortName.locative }}">

🐍 💌 gettext

in Python we are de facto stuck with it

there is a built-in gettext module

In Django too 🎸

https://code.djangoproject.com/ticket/14974


🐧
xgettext --keyword=_ --output=messages.pot `find html/ -name "*.html"`
msginit --input=messages.pot --locale=zh_TW -o locale/zh_TW/LC_MESSAGES/messages.po
msgfmt locale/zh_TW/LC_MESSAGES/messages.po -o locale/zh_TW/LC_MESSAGES/messages.mo


🐍
pybabel extract -F babel.cfg -o messages.pot .
pybabel init -i messages.pot -d locale -l zh_TW
pybabel compile -i locale/zh_TW/LC_MESSAGES/messages.po -d locale -l zh_TW


🎸
django-admin makemessages -l zh_TW
django-admin compilemessages

What happens between .po and .mo?

What about changes?

What about the database?

Translations are simple! 🎸


from django.utils.translation import string_concat
from django.utils.translation import ugettext_lazy

name = ugettext_lazy('John Lennon')
instrument = ugettext_lazy('guitar')
result = string_concat(name, ': ', instrument)

John Lennon: gitara

John Lennon : guitare

约翰·列侬：吉他

https://docs.djangoproject.com/en/1.9/topics/i18n/translation/#joining-strings-string-concat
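
The three outputs above differ in the punctuation between the parts, which a hard-coded ': ' cannot express. One possible way around it, sketched with a hypothetical hand-written catalog (in a real project the template would be a translatable string in the .po files), is to translate the whole pattern rather than its pieces:

```python
# Hypothetical catalog: one template per language, punctuation included.
LABEL_WITH_VALUE = {
    "pl": "{label}: {value}",
    "fr": "{label} : {value}",  # French puts a space before the colon
    "zh": "{label}：{value}",    # full-width colon, no space
}


def join_label(language, label, value):
    """Join two already translated strings with language-specific punctuation."""
    template = LABEL_WITH_VALUE.get(language, "{label}: {value}")
    return template.format(label=label, value=value)


print(join_label("pl", "John Lennon", "gitara"))
print(join_label("fr", "John Lennon", "guitare"))
print(join_label("zh", "约翰·列侬", "吉他"))
```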

Punctuation is fun! 🎸


#django/conf/locale/fr/LC_MESSAGES/django.po
#. Translators: This is the default suffix added to form field labels
msgid ":"
msgstr " :"


# django/conf/locale/zh_Hans/LC_MESSAGES/django.po
msgid "Select a valid choice. That choice is not one of the available choices."
msgstr "选择一个有效的选项： 该选择不在可用的选项中。"

Plurals, 1

Now, how do these functions solve the problem of the plural forms? Without the input of linguists (which was not available) it was not possible to determine whether there are only a few different forms in which plural forms are formed or whether the number can increase with every new supported language.

https://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms

Plurals, 2


# English
nplurals=2;
plural=(n != 1);

# Polish
nplurals=3;
plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);

# Russian
nplurals=3;
plural=(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);

# Arabic
nplurals=6;
plural=(n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5);

https://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms
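
The same four formulas transcribed by hand into Python, as a sketch for checking which msgstr index a given number selects. The function names and the Polish word forms are illustrative only.

```python
def plural_en(n):
    return int(n != 1)


def plural_pl(n):
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def plural_ru(n):
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def plural_ar(n):
    if n in (0, 1, 2):
        return n
    if 3 <= n % 100 <= 10:
        return 3
    if n % 100 >= 11:
        return 4
    return 5


# Polish forms of "file", indexed the way msgstr[0..2] would be.
POLISH_FILE = ("plik", "pliki", "plików")
for n in (1, 2, 5, 12, 22):
    print(n, POLISH_FILE[plural_pl(n)])
```

Note the difference at 21: Russian goes back to form 0, while Polish uses form 2.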

So it works! 🎸

Django does not support custom plural equations in po files. As all translation catalogs are merged, only the plural form for the main Django po file (in django/conf/locale/<lang_code>/LC_MESSAGES/django.po) is considered. Plural forms in all other po files are ignored. Therefore, you should not use different plural equations in your project or application po files.

https://code.djangoproject.com/ticket/23520


nplurals=4;
plural=(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<12 || n%100>14) ? 1 \
       : n%10==0 || (n%10>=5 && n%10<=9) || (n%100>=11 && n%100<=14)? 2 : 3);

Collation

not the same thing as "kolacja" (Polish for supper)

Sorting 🎸

android

Borowinka

kalafiory w sosie

Złoża uranu

Älbercik

łódź w ogniu


Entry.objects.order_by(Lower('headline'))

https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.order_by

There is no such thing as that sort order

You cannot sort "well" without knowing the user's language.

You cannot sort "well" without knowing the usage within that language.

*but you can always try to offend the smallest possible group of them
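
A sketch of how much the result depends on the rules chosen, using the sample headlines from the sorting slide. The Polish key below is a hand-rolled, single-level approximation made up for this example (foreign letters are folded to their base letter, case is ignored); it is not the Unicode collation algorithm.

```python
import unicodedata

HEADLINES = ["android", "Borowinka", "kalafiory w sosie",
             "Złoża uranu", "Älbercik", "łódź w ogniu"]

POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {letter: index for index, letter in enumerate(POLISH_ALPHABET)}


def polish_key(text):
    """Map text to a list of alphabet positions (approximate Polish order)."""
    key = []
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch not in RANK:
            # Fold letters outside the Polish alphabet ("ä") to their base letter.
            ch = unicodedata.normalize("NFD", ch)[0]
        # Anything still unknown (spaces, digits) sorts before all letters.
        key.append(RANK.get(ch, -1))
    return key


print(sorted(HEADLINES))                  # by code point: capitals first
print(sorted(HEADLINES, key=str.lower))   # roughly what Lower() gives
print(sorted(HEADLINES, key=polish_key))  # Polish alphabet order
```

Lowercasing alone still leaves "Älbercik" and "łódź w ogniu" at the very end, because their first letters have code points above "z".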

Unicode collation algorithm

Default Unicode Collation Element Table

Defined in Unicode Technical Standard #10

CLDR adds further tailorings per language and variant.

UCA and DUCET

ICU collation

The most popular implementation of UCA plus CLDR data.

Problem solved! (sic!)

http://demo.icu-project.org/icu-bin/locexp?_=pl_PL&d_=en&x=col

https://wiki.postgresql.org/wiki/Todo:ICU

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-collation.html

Script

Numbers

https://www.google.pl/webhp?authuser=1#authuser=1&q=7*3%2C14

Anything else?

123,456,789.00, 123.456.789,00, 12,34,56,789.00
-99, 99-, (99), 99
98%, 98 %, 98 pct, %98
﬩ - the alternative Hebrew plus sign
not to mention currencies!
not to mention numerals!

零一二三四五六七八九十
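
The three grouping styles from the first line of the list, reproduced with a small hand-written formatter. This is only a sketch with made-up parameter names; a real application would take these conventions from CLDR (for example through Babel) instead of hard-coding them.

```python
def format_number(value, group_sep=",", decimal_sep=".",
                  first_group=3, other_groups=3):
    """Format a non-negative number with two decimals and digit grouping."""
    if value < 0:
        raise ValueError("this sketch handles non-negative numbers only")
    whole, fraction = f"{value:.2f}".split(".")
    # The rightmost group may have a different size than the rest
    # (Indian style: 3 digits, then groups of 2).
    groups = [whole[-first_group:]]
    rest = whole[:-first_group]
    while rest:
        groups.append(rest[-other_groups:])
        rest = rest[:-other_groups]
    return group_sep.join(reversed(groups)) + decimal_sep + fraction


print(format_number(123456789))                  # 123,456,789.00
print(format_number(123456789, ".", ","))        # 123.456.789,00
print(format_number(123456789, other_groups=2))  # 12,34,56,789.00
```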

Time and date

Time formats?

Pff, CLDR and off we go

What year is it?

It depends on where and whom you ask

9 March
AD 2016

19 Esfand
1394 SH

29 Jumada al-awwal
1437 AH

Time zones

Time zones drive me crazy.

Especially the UTC+05:45 zone in Nepal.

And UTC+12:45 on the Chatham Islands.

Accept-Timezone!

Oh wait, no...

Accept-Datetime is an extension for historical versions of pages

People use GeoIP, ugly JavaScript, or

ask the user.

tz database

# Poland

# The 1919 dates and times can be found in Tygodnik Urzędowy nr 1 (1919-03-20),
# <http://www.wbc.poznan.pl/publication/32156> pp 1-2.

# Rule	NAME	FROM	TO	TYPE	IN	ON	AT	SAVE	LETTER/S
Rule	Poland	1918	1919	-	Sep	16	2:00s	0	-
Rule	Poland	1919	only	-	Apr	15	2:00s	1:00	S
Rule	Poland	1944	only	-	Apr	 3	2:00s	1:00	S
# Whitman gives 1944 Nov 30; go with Shanks & Pottenger.
Rule	Poland	1944	only	-	Oct	 4	2:00	0	-
# For 1944-1948 Whitman gives the previous day; go with Shanks & Pottenger.
Rule	Poland	1945	only	-	Apr	29	0:00	1:00	S
Rule	Poland	1945	only	-	Nov	 1	0:00	0	-
# For 1946 on the source is Kazimierz Borkowski,
# Toruń Center for Astronomy, Dept. of Radio Astronomy, Nicolaus Copernicus U.,
# http://www.astro.uni.torun.pl/~kb/Artykuly/U-PA/Czas2.htm#tth_tAb1
# Thanks to Przemysław Augustyniak (2005-05-28) for this reference.
# He also gives these further references:
# Mon Pol nr 13, poz 162 (1995) <http://www.abc.com.pl/serwis/mp/1995/0162.htm>
# Druk nr 2180 (2003) <http://www.senat.gov.pl/k5/dok/sejm/053/2180.pdf>
Rule	Poland	1946	only	-	Apr	14	0:00s	1:00	S
Rule	Poland	1946	only	-	Oct	 7	2:00s	0	-
Rule	Poland	1947	only	-	May	 4	2:00s	1:00	S
Rule	Poland	1947	1949	-	Oct	Sun>=1	2:00s	0	-
Rule	Poland	1948	only	-	Apr	18	2:00s	1:00	S
Rule	Poland	1949	only	-	Apr	10	2:00s	1:00	S
Rule	Poland	1957	only	-	Jun	 2	1:00s	1:00	S
Rule	Poland	1957	1958	-	Sep	lastSun	1:00s	0	-
Rule	Poland	1958	only	-	Mar	30	1:00s	1:00	S
Rule	Poland	1959	only	-	May	31	1:00s	1:00	S
Rule	Poland	1959	1961	-	Oct	Sun>=1	1:00s	0	-
Rule	Poland	1960	only	-	Apr	 3	1:00s	1:00	S
Rule	Poland	1961	1964	-	May	lastSun	1:00s	1:00	S
Rule	Poland	1962	1964	-	Sep	lastSun	1:00s	0	-
# Zone	NAME		GMTOFF	RULES	FORMAT	[UNTIL]
Zone	Europe/Warsaw	1:24:00 -	LMT	1880
			1:24:00	-	WMT	1915 Aug  5 # Warsaw Mean Time
			1:00	C-Eur	CE%sT	1918 Sep 16  3:00
			2:00	Poland	EE%sT	1922 Jun
			1:00	Poland	CE%sT	1940 Jun 23  2:00
			1:00	C-Eur	CE%sT	1944 Oct
			1:00	Poland	CE%sT	1977
			1:00	W-Eur	CE%sT	1988
			1:00	EU	CE%sT
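
To make the column layout of the Rule lines above easier to read, here is a rough sketch that splits one line into named fields. The class and function names are invented, and only the values that appear in this excerpt are handled (a TO column of "max", used elsewhere in the database, is not).

```python
from typing import NamedTuple


class Rule(NamedTuple):
    name: str       # rule set referenced from the RULES column of Zone lines
    from_year: int
    to_year: int
    month: str      # IN
    day: str        # ON: a day number, "lastSun", "Sun>=1", ...
    at: str         # AT: time of the change; a trailing "s" means standard time
    save: str       # SAVE: amount added to standard time
    letter: str     # LETTER/S: replaces %s in the zone abbreviation, "-" for none


def parse_rule(line):
    """Parse a single 'Rule' line of a tz database source file."""
    fields = line.split("#", 1)[0].split()  # drop trailing comments
    if len(fields) != 10 or fields[0] != "Rule":
        raise ValueError(f"not a Rule line: {line!r}")
    _, name, from_year, to_year, _type, month, day, at, save, letter = fields
    start = int(from_year)
    end = start if to_year == "only" else int(to_year)
    return Rule(name, start, end, month, day, at, save, letter)


print(parse_rule("Rule\tPoland\t1947\t1949\t-\tOct\tSun>=1\t2:00s\t0\t-"))
```

Read together with the Zone lines, CE%sT becomes CEST while a rule with letter S is in effect and CET otherwise.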

And my beloved Indiana!

Pytz 🐍


from datetime import datetime, timedelta
import pytz

warsaw = pytz.timezone('Europe/Warsaw')

# as the documentation warns, this does not work
datetime(2002, 10, 27, 12, 0, 0, tzinfo=warsaw).isoformat()
# 2002-10-27T12:00:00+01:24 LOL


utc_dt = datetime(2002, 10, 27, 12, 0, 0, tzinfo=pytz.utc)
loc_dt = utc_dt.astimezone(warsaw)
loc_dt.isoformat()
# 2002-10-27T13:00:00+01:00 great success!!!
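
With pytz, the documented way to attach a zone to a naive datetime is warsaw.localize(...). For comparison, a sketch using the standard library's zoneinfo module, which is not part of the original slides. It assumes Python 3.9 or newer and that time zone data is available (from the operating system or the tzdata package). Here the tzinfo= argument behaves as one would expect:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

warsaw = ZoneInfo("Europe/Warsaw")

# Passing tzinfo= directly is fine with zoneinfo: the offset is resolved
# for the given date instead of defaulting to the zone's first entry (LMT).
local_noon = datetime(2002, 10, 27, 12, 0, 0, tzinfo=warsaw)
print(local_noon.isoformat())  # 2002-10-27T12:00:00+01:00

# Offsets do not have to be whole hours.
kathmandu_noon = datetime(2016, 3, 9, 12, 0, 0, tzinfo=ZoneInfo("Asia/Kathmandu"))
print(kathmandu_noon.utcoffset() == timedelta(hours=5, minutes=45))  # True
```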

Culture and customs

Social media

🇨🇳 QQ/QZone - 700+ million

🇨🇳 Sina Weibo - 400+ million

🇷🇺 VK - 350+ million

🇷🇺 Odnoklassniki - 200+ million

🇨🇳 Tencent Weibo, WeiXin, Douban, Renren - 100+ million

🇦🇷 Taringa! - 27 million

🇱🇻 Draugiem - 2.4 million

🇮🇷 Facenama - ?

Cultural differences

Names, Japanese blood types, Polish rural addresses, and many more.

little things such as the note names H and B