Let Machine Learn to Tokenize Chinese Address
讓機器學習來切
中文地址
lulalala
Ruby/Rails developer at
Github: lulalala
Twitter: lulalala_it
likes anime, manga, games and drawing
So I wrote a parser in pure Ruby using...
1. if / else
2. regular expressions
if statements and
regular expressions
are not enough
lesson
Timecop.travel(4.months.ago)
Parserator and usaddress
- Python packages
- Can parse English addresses
- Use machine learning
You can
I can
Conditional Random Fields
條件隨機域
- Obtain data (取得資料)
- Clean data (整理資料)
- Train (餵食訓練)
OpenStreetMap
- Regional data download at Gisgraphy
- .pbf file (Protocolbuffer Binary Format)
- Extract the address data
~ 15000 records
gem install pbf_parser
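The filtering step can be sketched roughly as below. This is only an illustration and does not use the pbf_parser API: it assumes the elements have already been read into plain hashes with a `:tags` hash (a hypothetical shape), and the `extract_addresses` helper is a made-up name. The `addr:*` keys are the standard OpenStreetMap address tags.

```ruby
# Hypothetical input shape: each OSM element is a Hash with a :tags Hash,
# e.g. { id: 1, tags: { "addr:street" => "研究院路二段" } }.
ADDRESS_KEYS = %w[addr:city addr:suburb addr:street addr:housenumber].freeze

# Keep only elements that carry at least a street or a house number,
# and reduce each one to its address tags.
def extract_addresses(elements)
  elements.filter_map do |element|
    tags = element.fetch(:tags, {})
    next unless tags.key?("addr:street") || tags.key?("addr:housenumber")

    tags.slice(*ADDRESS_KEYS)
  end
end
```

Elements without any address tag (roads, shops without an address, and so on) are simply dropped.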
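Clean data (整理資料)
OpenStreetMap data can be messy.
People put data in all kinds of places.
It has no dedicated fields for units like "里" (village), "鄰" (neighborhood), "巷" (lane), "弄" (alley).
Some people put the "巷" (lane) in "street".
Some people put the "直轄市" (special municipality) in "suburb".
Some people accidentally type in Zhuyin (Bopomofo) phonetic symbols.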
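One way to spot records with stray Zhuyin is a character-range check. This is a minimal sketch, not the script used for the talk; `contains_zhuyin?` is a hypothetical helper, and it only looks for Bopomofo letters (U+3105..U+312F), not tone marks.

```ruby
# Bopomofo (Zhuyin) letters live in the Unicode range U+3105..U+312F.
ZHUYIN = /[\u3105-\u312F]/

# True when the text contains at least one Zhuyin letter.
def contains_zhuyin?(text)
  text.match?(ZHUYIN)
end
```

Records flagged this way can then be fixed by hand or dropped from the training set.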
Clean data (整理資料)
1. Import into ActiveRecord
2. Write scripts to clean data
3. Repeat step 2 many many times
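Address.find_each do |a|
  # Move a trailing lane ("巷") such as "12巷" out of the street field.
  if m = a.street.match(/[0-9一二三四五六七八九十]+ ?巷$/)
    a.xiang = m.to_s.strip
    # Escape the matched text before reusing it inside a regex.
    a.street = a.street.sub(/#{Regexp.escape(a.xiang)}$/, '').strip
    a.save
  end
end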
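The same lane-splitting rule can be tried on plain strings, without a database. This is a sketch under the assumption that the lane always sits at the end of the street value; `split_lane` is a hypothetical helper, not part of the original scripts.

```ruby
# A trailing lane: Arabic or Chinese numerals, an optional space, then "巷".
LANE = /[0-9一二三四五六七八九十]+ ?巷\z/

# Returns [street_without_lane, lane]; lane is nil when none is found.
def split_lane(street)
  m = street.match(LANE)
  return [street.strip, nil] unless m

  [m.pre_match.strip, m.to_s.strip]
end
```

Keeping the rule in a small pure function makes it easier to rerun while step 2 is repeated.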
After all the hard work,
finally...
TRAINING TIME!
wapiti gem
- Written by inukshuk
  (a very nice person who helped me a lot)
- Wrapper around wapiti, a C library
"A simple and fast discriminative sequence labelling toolkit"
台北市 南港區 研究院路二段 128號
city suburb street housenumber
["台 city", "北 city", "市 city",
"南 suburb", "港 suburb", "區 suburb", "研 street", "究 street", "院 street", "路 street"...]
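The per-character labels above can be produced from labeled segments with a few lines of Ruby. This is a minimal sketch; `to_training_tokens` is a hypothetical helper name, and the exact input format expected by the trainer is not shown here.

```ruby
# Expand [text, label] segments into one "character label" string per character.
def to_training_tokens(segments)
  segments.flat_map do |text, label|
    text.chars.map { |ch| "#{ch} #{label}" }
  end
end

segments = [
  ["台北市", "city"],
  ["南港區", "suburb"],
  ["研究院路二段", "street"],
  ["128號", "housenumber"]
]
tokens = to_training_tokens(segments)
```

Every character carries the label of the field it belongs to, so the model learns where one field ends and the next begins.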
wapiti gem
- 15,460 addresses
- 2/3 used for training
- Training takes only a few seconds
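The 2/3 split can be sketched as follows. The shuffle, the fixed seed and the `split_dataset` name are assumptions for illustration; the talk does not say how the records were actually divided.

```ruby
# Shuffle deterministically, then cut the records into training and test parts.
def split_dataset(records, train_ratio: Rational(2, 3), seed: 42)
  shuffled = records.shuffle(random: Random.new(seed))
  cut = (shuffled.size * train_ratio).floor
  [shuffled.take(cut), shuffled.drop(cut)]
end

train, test = split_dataset((1..15_460).to_a)
```

The held-out third is what the trained model can be checked against.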
http://addresstokenizer.lulalala.com/
https://goo.gl/KKdXmL
高雄市路竹區路科五路23號1-3樓
958台東縣池上鄉錦園村鳳梨園13鄰41號
Why is Machine Learning Good?
if result.wrong?
  say "Not me! It's its fault! It is too stupid to learn~~"
  shrug
  guilt = 0
else
  say "Yay! Thanks for the compliment (to me)"
  happiness += 100
end
Did you know that below 巷 (lane) there is also 衖?
Did you know that below 衖 there is also 衕?
tokenizer = LulalalaAddressTokenizer.new('address.mod')
tokenizer.parse("AA縣BB鎮CC路D號")
# {"city"=>"AA縣",
#  "district"=>"BB鎮",
#  "street"=>"CC路",
#  "housenumber"=>"D號"}
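Turning per-character labels back into a hash like the one above could look roughly like this. It is a sketch of the idea only, not the gem's actual implementation; `group_labels` is a hypothetical helper, and it simply concatenates all characters that share a label.

```ruby
# Merge [character, label] pairs into { label => joined text }.
def group_labels(pairs)
  pairs.each_with_object({}) do |(ch, label), fields|
    fields[label] = fields.fetch(label, "") + ch
  end
end

pairs = [
  ["A", "city"], ["A", "city"], ["縣", "city"],
  ["B", "district"], ["B", "district"], ["鎮", "district"]
]
fields = group_labels(pairs)
```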
lulalala_address_tokenizer gem
3Q
(thank you)
Chinese Address Tokenization
By lulalala