Let Machines Learn to Tokenize Chinese Addresses

lulalala

Ruby/Rails developer at

Github: lulalala
Twitter: lulalala_it

likes anime, manga, games and drawing

So I wrote a parser in pure Ruby using...

1. if / else

2. regular expression

if statements and
regular expressions

are not enough

lesson

Timecop.travel(4.months.ago)

Parserator and usaddress

  • Python packages
  • Can parse English-language addresses
  • Use machine learning

You can
I can

Conditional Random Fields (條件隨機域)

  1. Obtain data

  2. Clean data

  3. Train (feed the data to the model)

OpenStreetMap

  • Regional data downloads at Gisgraphy
  • .pbf file (Protocolbuffer Binary Format)
  • Filter for address data:
    ~15,000 records

gem install pbf_parser
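
The filtering step can be sketched in plain Ruby. This is only an illustration: it makes no pbf_parser calls, and the element shape (a Hash with a :tags Hash) is an assumption. The "addr:" key prefix is the usual OpenStreetMap convention for address tags.

```ruby
# Hypothetical element shape: each OSM element is a Hash with a :tags Hash.
ADDRESS_PREFIX = "addr:"

# Returns a Hash of address fields (prefix stripped), or nil if there are none.
def extract_address(element)
  tags = element[:tags] || {}
  fields = tags.select { |key, _| key.start_with?(ADDRESS_PREFIX) }
  return nil if fields.empty?

  fields.each_with_object({}) do |(key, value), out|
    out[key.delete_prefix(ADDRESS_PREFIX)] = value
  end
end

# Made-up sample elements, for illustration only.
elements = [
  { id: 1, tags: { "addr:city" => "台北市", "addr:street" => "研究院路二段", "addr:housenumber" => "128號" } },
  { id: 2, tags: { "highway" => "bus_stop" } },
  { id: 3, tags: {} }
]

# Keep only the elements that carry at least one address tag.
addresses = elements.map { |e| extract_address(e) }.compact
```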

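Clean data

OpenStreetMap data can be messy.

People put data in all kinds of places.

It has no dedicated fields for finer units such as "里" (village), "鄰" (neighborhood), "巷" (lane), and "弄" (alley).

Some people put "巷" (lane) in "street".

Some people put "直轄市" (special municipality) in "suburb".

Some people accidentally typed Zhuyin (phonetic symbols) instead of characters.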
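
Stray Zhuyin input is one problem that can be flagged mechanically. A minimal sketch, assuming that any character from the Unicode Bopomofo blocks in an address field is a typing mistake. The helper name is hypothetical, and tone marks (which live in other Unicode blocks) are not covered.

```ruby
# Bopomofo (U+3100-U+312F) and Bopomofo Extended (U+31A0-U+31BF) blocks.
ZHUYIN = /[\u3100-\u312F\u31A0-\u31BF]/

# True when the text contains at least one Zhuyin symbol; nil counts as clean.
def contains_zhuyin?(text)
  text.to_s.match?(ZHUYIN)
end

# Made-up sample values, for illustration only.
samples = ["研究院路二段", "ㄓㄨㄥ山路", nil]
suspicious = samples.select { |s| contains_zhuyin?(s) }
```

Records flagged this way still need a human to decide what the intended characters were.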

Clean data

  1. Import into ActiveRecord

  2. Write scripts to clean data

  3. Repeat step 2 many many times

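# Move a trailing lane ("巷") out of the street field into its own column.
LANE_PATTERN = /[0-9一二三四五六七八九十]+ ?巷\z/

Address.find_each do |a|
  next if a.street.blank?          # some records have no street at all

  m = a.street.match(LANE_PATTERN)
  next unless m

  a.xiang  = m.to_s.strip          # the lane, e.g. "12巷"
  a.street = m.pre_match.strip     # the street without the lane suffix
  a.save
end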
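
Because step 2 gets repeated many times, it can help to try a rule on plain strings before running it against the database. A sketch of the lane rule above as a standalone function. The function name and the sample streets are made up for illustration.

```ruby
LANE = /[0-9一二三四五六七八九十]+ ?巷\z/

# Returns [street_without_lane, lane]; lane is nil when there is no trailing lane.
def split_lane(street)
  m = street.to_s.match(LANE)
  return [street, nil] unless m

  [m.pre_match.strip, m.to_s.strip]
end

split_lane("中山路12巷")    # => ["中山路", "12巷"]
split_lane("研究院路二段")  # => ["研究院路二段", nil]
```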

After all the hard work,

finally...

TRAINING TIME!

wapiti gem

  • Written by inukshuk,
    a very nice person who helped me a lot
  • Wrapper around Wapiti, a C library
    "A simple and fast discriminative sequence labelling toolkit"

台北市 → city
南港區 → suburb
研究院路二段 → street
128號 → housenumber

["台 city", "北 city", "市 city", "南 suburb", "港 suburb", "區 suburb",
 "研 street", "究 street", "院 street", "路 street"...]
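
The per-character labelling shown above can be produced from cleaned records with a few lines of Ruby. A minimal sketch, assuming each record is already split into (text, label) pairs. The function name is hypothetical.

```ruby
# Turns [[text, label], ...] into one "char label" string per character.
def label_characters(segments)
  segments.flat_map do |text, label|
    text.each_char.map { |char| "#{char} #{label}" }
  end
end

segments = [
  ["台北市", "city"],
  ["南港區", "suburb"],
  ["研究院路二段", "street"],
  ["128號", "housenumber"]
]

rows = label_characters(segments)
rows.first(4) # => ["台 city", "北 city", "市 city", "南 suburb"]
```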

wapiti gem

  • 15,460 addresses
  • 2/3 used for training
  • Training takes only a few seconds
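
The 2/3 split can be sketched as follows. Whether the original data was shuffled first is not stated, so the seeded shuffle here is an assumption, and the function name is made up. The remaining third is held out for checking the trained model.

```ruby
# Shuffles with a fixed seed so the split is reproducible, then cuts at 2/3.
def train_test_split(records, ratio: Rational(2, 3), seed: 42)
  shuffled = records.shuffle(random: Random.new(seed))
  cut = (shuffled.size * ratio).floor
  [shuffled.take(cut), shuffled.drop(cut)]
end

records = (1..15_460).to_a # stand-ins for the 15,460 addresses
train, held_out = train_test_split(records)

train.size    # => 10306
held_out.size # => 5154
```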

http://addresstokenizer.lulalala.com/
https://goo.gl/KKdXmL

高雄市路竹區路科五路23號1-3樓

958台東縣池上鄉錦園村鳳梨園13鄰41號

Why is Machine Learning Good?

if result.wrong?
  say "Not me! It's its fault!
       it is too stupid to learn~~"
  shrug
  
  guilt = 0
else
  say "Yay! Thanks for the compliment (to me)"
  
  happiness += 100
end

Did you know that below "巷" (lane) there is also "衖"?

Did you know that below that there is also "衕"?

tokenizer = LulalalaAddressTokenizer.new(
  'address.mod'
)
tokenizer.parse("AA縣BB鎮CC路D號")
# {"city"=>"AA縣",
#  "district"=>"BB鎮",
#  "street"=>"CC路",
#  "housenumber"=>"D號"}
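
How the gem turns per-character labels back into a hash is not shown here. One way it could be done, as a sketch with a hypothetical function name, is to append each character to the field named by its label:

```ruby
# pairs: [[char, label], ...] in address order, as a sequence labeller might return.
def group_by_label(pairs)
  pairs.each_with_object({}) do |(char, label), fields|
    fields[label] = fields.fetch(label, "") + char
  end
end

# Hand-built labels for the example address, for illustration only.
pairs =
  "AA縣".chars.map { |c| [c, "city"] } +
  "BB鎮".chars.map { |c| [c, "district"] } +
  "CC路".chars.map { |c| [c, "street"] } +
  "D號".chars.map  { |c| [c, "housenumber"] }

group_by_label(pairs)
# => {"city"=>"AA縣", "district"=>"BB鎮", "street"=>"CC路", "housenumber"=>"D號"}
```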

lulalala_address_tokenizer gem

3Q

(thank you)

Chinese Address Tokenization

By lulalala
