Let Machine Learning Split
Chinese Addresses
Github: lulalala
Twitter: lulalala_it
likes anime, manga, games and drawing
So I wrote a parser in pure Ruby using...
1. if / else
2. regular expression
if statements and
regular expressions
are not enough
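To illustrate why, here is a minimal sketch of a regex-only parser. It is not the original parser from this talk; `NAIVE_ADDRESS` and `naive_parse` are hypothetical names, and the pattern is deliberately simplistic.

```ruby
# Hypothetical naive parser: one regular expression, one named capture per field.
NAIVE_ADDRESS = /\A(?<city>.+?[市縣])(?<suburb>.+?[區鄉鎮市])(?<street>.+?[路街])(?<housenumber>.+號)/

# Returns a Hash of field name => text, or nil when the pattern does not match.
def naive_parse(address)
  m = NAIVE_ADDRESS.match(address)
  m && m.named_captures
end

# The section ("二段") leaks out of the street and into the house number.
p naive_parse("台北市南港區研究院路二段128號")

# No "路" or "街" at all, so the whole match fails.
p naive_parse("958台東縣池上鄉錦園村鳳梨園13鄰41號")
```

Each new exception needs another branch or another pattern, which is how the if / else pile grows.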
lesson
Timecop.travel(4.months.ago)
gem install pbf_parser
OpenStreetMap data can be messy.
People put data in all kinds of places.
It has no dedicated fields for Taiwan-specific units like "里" (village), "鄰" (neighborhood), "巷" (lane), "弄" (alley).
Some people put "巷" (lane) in "street"
Some people put "直轄市" (special municipality) in "suburb"
Some people accidentally type Zhuyin (Bopomofo) symbols
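Stray Zhuyin input is one kind of mess that is easy to flag automatically. A minimal sketch, assuming the fields are plain strings; `contains_zhuyin?` and the sample values are hypothetical, not taken from the real data set.

```ruby
# Hypothetical helper: true when the text contains Zhuyin (Bopomofo) symbols,
# which usually means the input method was not finished converting to Chinese characters.
def contains_zhuyin?(text)
  text.match?(/\p{Bopomofo}/)
end

# Made-up sample record for illustration only.
fields = { "street" => "研究院路二段", "suburb" => "南ㄍㄤˇ區" }

# Keep only the fields that look like unfinished Zhuyin input.
suspicious = fields.select { |_name, value| contains_zhuyin?(value) }
p suspicious.keys
```

This only flags the record; deciding what the person meant to type still needs a human or another data source.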
1. Import into ActiveRecord
2. Write scripts to clean data
3. Repeat step 2 many, many times
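# Move a trailing lane ("巷") out of the street field into its own column.
Address.find_each do |a|
  if m = a.street.match(/[0-9一二三四五六七八九十]+ ?巷$/)
    a.xiang = m.to_s.strip
    # Escape the matched text before reusing it inside a pattern.
    a.street = a.street.sub(/#{Regexp.escape(m.to_s)}$/, '').strip
    a.save
  end
end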
台北市 南港區 研究院路二段 128號
city suburb street housenumber
["台 city", "北 city", "市 city",
"南 suburb", "港 suburb", "區 suburb", "研 street", "究 street", "院 street", "路 street"...]
http://addresstokenizer.lulalala.com/
https://goo.gl/KKdXmL
高雄市路竹區路科五路23號1-3樓
958台東縣池上鄉錦園村鳳梨園13鄰41號
if result.wrong?
  say "Not me! It's the machine's fault! It is too stupid to learn~~"
  shrug
  guilt = 0
else
  say "Yay! Thanks for the compliment (to me)"
  happiness += 100
end
tokenizer = LulalalaAddressTokenizer.new(
  'address.mod'
)
tokenizer.parse("AA縣BB鎮CC路D號")
# => {"city"=>"AA縣",
#     "district"=>"BB鎮",
#     "street"=>"CC路",
#     "housenumber"=>"D號"}
lulalala_address_tokenizer gem
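Going from per-character labels back to a hash like the one `parse` returns is a small grouping step. This is a sketch of the idea only, not the gem's actual internals; `group_labels` is a hypothetical helper, and it assumes each label covers one contiguous run of characters.

```ruby
# Concatenate characters that share a label, keeping first-seen label order.
# Assumption: a label never reappears after a different label has started.
def group_labels(pairs)
  pairs.each_with_object({}) do |(char, label), fields|
    fields[label] = fields.fetch(label, "") + char
  end
end

# Hand-written labels for the placeholder address "AA縣BB鎮CC路D號".
pairs = [
  ["A", "city"], ["A", "city"], ["縣", "city"],
  ["B", "district"], ["B", "district"], ["鎮", "district"],
  ["C", "street"], ["C", "street"], ["路", "street"],
  ["D", "housenumber"], ["號", "housenumber"]
]

p group_labels(pairs)
```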