Deep Address Parser Brownbag
Jinkela Huang (v-jinkhu)
Mentor: George Wang
Manager: Alex Chiang
Outline
- Design Architecture
- word2vec
- Lexicon
-
Delexicalization
- Data processing for training
- Delexicalization test
- Reduce Memory
- Report
Design Architecture
Our Model
BOS Rua Canindé, 102, São Paulo, Brasil EOS
O B-road I-road B-house_number B-city I-city B-country pt-br
What We Want To Train
Encoding to floating point matrix
Output probability matrix
Words Input
Tags Output
language code
Training Flowchart
Words Input
Tags Input
encoding
AP model using TF with Keras
Our
Module
Test
encoding
biLSTM
* Model design is based on paper: A bi-model based rnn semantic frame parsing model for intent detection and slot filling (Link)
UML
NLUModel
word2vec
DataIniter
KerasModelHelper
Report
Builder
TrainData
Source
TestData
Source
DelexicalizationTest
DataSource
Association
Inherit
word2vec
Brasil
...
Brasil: 6
Rua: 3
Canindé: 7
...
Word To Unsigned int Map
Word Input
Embedding
Layer
B-country
...
I-road: 4
B-country: 9
I-city: 5
...
Tag To Unsigned int Map
Tag Input
0 | 1 | ... | 8 | 9 | ... | 39 |
---|---|---|---|---|---|---|
0 | 0 | ... | 0 | 1 | ... | 0 |
One Hot encoding
6
9
Brasil
Word To Vector Map
Word Input
0 | 1 | ... | 98 | 99 |
---|---|---|---|---|
0.01 | 0.13 | ... | 0.19 | 0.11 |
vector
a lot of sentence
Input
gensim
word2vec training
word2vec
Lexicon
Barcelona
Word Input
0 | 1 | ... | 98 | 99 |
---|---|---|---|---|
0.21 | 0.03 | ... | 0.19 | 0.11 |
vector
6
Get All Possible Tags
Of Word Input
City: 4
Province: 7
100 | ... | 104 | ... | 107 | 108 | ... | 149 |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
Word To Vector Map
Delexicalization
Data processing
for training
Tag2WordDict
- We create mapping Tag2WordDict:
- Key is each word in Train Data.
- Value is the most corresponding tag of this word.
...
Brasil: country
Toledo: city
615: house_number
Guilherme: road
...
Original Data
Training Data
Words | Tags |
---|---|
BOS São Paulo EOS | O B-city I-city pt-br |
Words | Tags |
---|---|
BOS São Paulo EOS | O B-city I-city pt-br |
BOS city Paulo EOS | O B-city I-city pt-br |
BOS São city EOS | O B-city I-city pt-br |
Replace word by Tag2WordDict
Delexicalization test
Words
number | number | number |
---|
Tags
0.85 | 0.72 | 0.76 |
---|
Probability
567 | 2nd | ave |
---|
Round 1
Round 2
567 |
road_w |
ave |
---|
Words
number | road | number |
---|
Tags
0.84 | 0.83 | 0.74 |
---|
Probability
Round 3
567 | road_w | type_w |
---|
Words
number | road | type |
---|
Tags
0.87 | 0.90 | 0.96 |
---|
Probability
Merge
User given expanding rate Tr between 0~1
For example: \(Tr=0.9\)
Words
Probability
merge 567, road_w
because \(0.87<Tr\)
and
\(0.87<0.90\)
Words
road_w | type_w |
---|
567 |
road_w |
type_w |
---|
0.87 | 0.90 | 0.96 |
---|
Round 4
Words
Tags
0.89 |
0.94 |
---|
Probability
road_w | type_w |
---|
road |
type |
---|
road |
road |
type |
---|
Tags
0.89 |
0.89 |
0.94 |
---|
Probability
\(\times 2\)
\(\times 2\)
get result tags
road | road | type |
---|
Tags
0.89 | 0.89 | 0.94 |
---|
Probability
number | road | type |
---|
0.87 | 0.90 | 0.96 |
---|
number | road | number |
---|
0.84 | 0.83 | 0.74 |
---|
number | number | number |
---|
0.85 | 0.72 | 0.76 |
---|
avg
0.776
0.807
0.910
0.907
Reduce Memory
Use Generator
def generator(features, labels, batch_size):
batch_words = np.zeros((batch_size, ...))
batch_tags = np.zeros((batch_size, ...))
while True:
for i in range(batch_size):
# choose random index in Data (but how?)
index = random.choice(DataSize,1)
batch_words[i] = some_processing_1(words[index])
batch_tags[i] = some_processing_2(tags[index])
yield batch_words, batch_tags
linecache
import linecache
linecache.getline(linecache.__file__, 8)
Get any line from a file, while attempting to optimize internally, using a cache.
Report
train size = 19356
test size =25808
normal
word2vec
94.950208%
95.810693%
delexicalization
95.539979%
delexicalization
test
90.277966%
92.260949%
delexicalization
92.260949%
delexicalization
test
94.811467%
lexcion
94.669825%
delexicalization
94.628734%
delexicalization
test
train size = 1000
test size =25808
normal
word2vec
47.582906%
delexicalization
79.918302%
delexicalization
test
70.644397%
83.244223%
delexicalization
83.244223%
delexicalization
test
73.698637%
lexcion
85.942183%
delexicalization
85.851784%
delexicalization
test
80.092816%
END
BingGC brownbag
By jacky860226
BingGC brownbag
- 62