Tokenize Japanese Addresses Using Machine Learning
lulalala
Ruby/Rails developer
Github: lulalala
Twitter: lulalala_it
I also draw doujinshi because I am an otaku.
Goal
From:
"東京都西多摩郡秋多町"
To:
{
  "prefecture"=>"東京都",
  "gun"=>"西多摩郡",
  "municipality"=>"秋多町"
}
How do we do it?
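I think it is difficult to do with if statements and regular expressions, because the suffix characters also appear inside place names:
新潟県 十日町市 (町 inside a city name)
奈良県 大和郡山市 (郡 inside a city name)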
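
To make the difficulty concrete, here is a minimal sketch of a naive regular expression that splits on the suffix characters. The pattern and the `naive_parse` helper are illustrative assumptions, not code from the project. It handles the address from the Goal slide but breaks on both examples above.

```ruby
# Naive pattern: prefecture, optional gun, then municipality.
# Illustrative guess at a regex approach, not the project's code.
NAIVE_ADDRESS = /\A(?<prefecture>.+?[都道府県])(?<gun>.+?郡)?(?<municipality>.+?[市区町村])/

# Returns a Hash of the captured fields, or nil when nothing matches.
def naive_parse(address)
  m = NAIVE_ADDRESS.match(address)
  return nil unless m
  m.names.zip(m.captures).to_h.compact
end

p naive_parse("東京都西多摩郡秋多町")
# => {"prefecture"=>"東京都", "gun"=>"西多摩郡", "municipality"=>"秋多町"}

p naive_parse("新潟県十日町市")
# => {"prefecture"=>"新潟県", "municipality"=>"十日町"}   (cut at 町, 市 is lost)

p naive_parse("奈良県大和郡山市")
# => {"prefecture"=>"奈良県", "gun"=>"大和郡", "municipality"=>"山市"}   (大和郡 is not a gun)
```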
Parserator library / usaddress service
- Python
- Can parse English addresses
- Uses machine learning
Conditional Random Fields (CRF)
I don't fully understand them, but I still tried them on Taiwanese addresses.
- Obtain data
- Clean data
- Train model
OpenStreetMap
- Regional data can be downloaded from Gisgraphy
- .pbf file (Protocolbuffer Binary Format)
gem install pbf_parser
Obtain data
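
Once the .pbf file has been read, each OSM element carries a set of tags. The sketch below assumes those tags arrive as a plain Ruby Hash (the reading step with pbf_parser is not shown, and the input shape is an assumption) and keeps only the `addr:*` keys that match the labels used later in this talk. The `address_components` helper is hypothetical.

```ruby
# OSM address tag keys, in the order the parts appear in a Taiwanese address.
ADDRESS_KEYS = %w[addr:city addr:suburb addr:street addr:housenumber].freeze

# tags: Hash of OSM tags for one element (assumed input shape).
# Returns [[label, text], ...], or nil when no address tags are present.
def address_components(tags)
  parts = ADDRESS_KEYS.map do |key|
    value = tags[key].to_s.strip
    [key.sub("addr:", ""), value] unless value.empty?
  end.compact
  parts.empty? ? nil : parts
end

tags = {
  "addr:city" => "台北市",
  "addr:suburb" => "南港區",
  "addr:street" => "研究院路二段",
  "addr:housenumber" => "128號",
  "amenity" => "university" # unrelated tag, ignored
}
p address_components(tags)
# => [["city", "台北市"], ["suburb", "南港區"], ["street", "研究院路二段"], ["housenumber", "128號"]]
```

Elements without any address tag return nil, so they can be skipped during cleaning.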
wapiti gem
- Written by inukshuk, a very nice person who helped me a lot
- Wrapper around wapiti, a C library:
  "A simple and fast discriminative sequence labelling toolkit"
Train model
台北市 南港區 研究院路二段 128號
city suburb street housenumber
["台 city", "北 city", "市 city",
"南 suburb", "港 suburb", "區 suburb",
"研 street", "究 street", "院 street", "路 street"...]
Input
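
The per-character input above can be produced mechanically from labeled address parts. This is a minimal sketch; `to_labeled_chars` is a hypothetical helper name, and the `[label, text]` pair shape is an assumption.

```ruby
# components: [[label, text], ...] in address order (assumed shape).
# Returns one "character label" string per character.
def to_labeled_chars(components)
  components.flat_map do |label, text|
    text.each_char.map { |char| "#{char} #{label}" }
  end
end

components = [
  ["city", "台北市"],
  ["suburb", "南港區"],
  ["street", "研究院路二段"],
  ["housenumber", "128號"]
]
tokens = to_labeled_chars(components)
p tokens.first(6)
# => ["台 city", "北 city", "市 city", "南 suburb", "港 suburb", "區 suburb"]
```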
Training: input --> CRF training --> model file
Parsing: address text --> model file --> parsed result
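
The last arrow, from per-character labels to the parsed result, can be sketched as follows. It assumes the labelling step yields `[character, label]` pairs, which is an assumption about the output shape rather than the wapiti gem's documented return value. `group_labels` is a hypothetical helper.

```ruby
# pairs: [[char, label], ...] for one address (assumed shape).
# Concatenates characters by label; assumes each label forms one contiguous run.
def group_labels(pairs)
  pairs.each_with_object({}) do |(char, label), result|
    result[label] = result.fetch(label, "") + char
  end
end

pairs = [
  ["東", "prefecture"], ["京", "prefecture"], ["都", "prefecture"],
  ["西", "gun"], ["多", "gun"], ["摩", "gun"], ["郡", "gun"],
  ["秋", "municipality"], ["多", "municipality"], ["町", "municipality"]
]
p group_labels(pairs)
# => {"prefecture"=>"東京都", "gun"=>"西多摩郡", "municipality"=>"秋多町"}
```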
- 15,460 addresses
- 2/3 used for training
- Training takes a few seconds
The case of Japan
But OpenStreetMap does not have many Japanese addresses.
The city field's data is not consistent.
I decided to generate fake addresses instead.
30,000 addresses were generated like this:
青森県南津軽郡藤崎町7丁目43-50
25,000 used for training
5,000 used for verification
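
A generator along these lines could produce such addresses. This is only a sketch: the tiny `PLACES` list reuses names that appear in these slides, the number ranges are guesses based on the example above, and the helper names are hypothetical. Because each field is known at generation time, every fake address comes with its correct labels for free.

```ruby
# Small sample of place names taken from these slides; a real run
# would need a full list of prefectures, gun, and municipalities.
PLACES = [
  ["青森県", "南津軽郡", "藤崎町"],
  ["東京都", "西多摩郡", "秋多町"],
  ["新潟県", nil, "十日町市"],
  ["奈良県", nil, "大和郡山市"]
].freeze

# Returns the labeled fields of one fake address.
def fake_address(rng = Random.new)
  prefecture, gun, municipality = PLACES.sample(random: rng)
  # Number ranges are assumptions, not the project's actual ranges.
  other = "#{rng.rand(1..9)}丁目#{rng.rand(1..100)}-#{rng.rand(1..100)}"
  {
    "prefecture" => prefecture,
    "gun" => gun,
    "municipality" => municipality,
    "other" => other
  }.compact
end

# Joins the fields back into the raw address string.
def address_text(fields)
  fields.values.join
end

fields = fake_address(Random.new(1))
puts address_text(fields)
```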
Not always successful
Failure rate: 1.02%
兵庫県南あわじ市5丁目69-25
{"prefecture"=>"兵庫県", "gun"=>"南", "municipality"=>"あわじ市", "other"=>"5丁目69-25"}
鹿児島県南九州市1丁目65-89
{"prefecture"=>"鹿児島県", "gun"=>"南", "municipality"=>"九州市", "other"=>"1丁目65-89"}
岐阜県中津川市4丁目32-100
{"prefecture"=>"岐阜県", "gun"=>"中", "municipality"=>"津川市", "other"=>"4丁目32-100"}
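
A failure rate like the one above can be computed by comparing each parsed result with the fields the address was generated from. The sketch below is an assumption about how such a check might look, not the project's code, and the two-item data set is a toy example. If all 5,000 verification addresses were scored this way, 1.02% would correspond to 51 failures.

```ruby
# Percentage of parses that differ from the expected fields.
def failure_rate(predicted, expected)
  failures = predicted.zip(expected).count { |got, want| got != want }
  (100.0 * failures / expected.size).round(2)
end

expected = [
  # Assumed correct fields for the first failing example above.
  { "prefecture" => "兵庫県", "municipality" => "南あわじ市", "other" => "5丁目69-25" },
  { "prefecture" => "青森県", "gun" => "南津軽郡", "municipality" => "藤崎町", "other" => "7丁目43-50" }
]
predicted = [
  { "prefecture" => "兵庫県", "gun" => "南", "municipality" => "あわじ市", "other" => "5丁目69-25" },
  expected[1] # toy stand-in for a correct parse
]
puts failure_rate(predicted, expected) # prints 50.0
```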
I need to learn CRF properly to improve this.
Project URL:
https://github.com/lulalala/japan_address
Thank you for listening.
If you are an anime otaku, please chat with me :D