Kevin Clark, Minh-Thang Luong, Christopher D. Manning, Quoc V. Le
Focus: sequence modeling tasks attached to a shared Bi-LSTM sentence encoder
Aim: improve the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data
Semi-supervised learning
[Labeled examples] standard supervised learning.
[Unlabeled examples] train auxiliary prediction modules with different views of the input to agree with the primary prediction module.
[Figure: auxiliary prediction modules, each using a different, restricted view of the input (views 1-4)]
CVT trains the auxiliary modules to match the primary prediction module on the unlabeled data by minimizing:

\mathcal{L}_{\text{CVT}}(\theta) = \frac{1}{|D_{ul}|} \sum_{x_i \in D_{ul}} \sum_{j} \mathrm{KL}\big(p_\theta(y \mid x_i) \,\|\, p_\theta^{(j)}(y \mid x_i)\big)

p_\theta: primary prediction; p_\theta^{(j)}: j-th auxiliary prediction (the primary prediction is treated as a fixed target and receives no gradient from this loss).

The primary module is trained on labeled examples by minimizing the standard cross-entropy loss:

\mathcal{L}_{\text{sup}}(\theta) = \frac{1}{|D_l|} \sum_{(x_i, y_i) \in D_l} \mathrm{CE}\big(y_i,\; p_\theta(y \mid x_i)\big)
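A minimal PyTorch-style sketch of these two losses, assuming simple per-example class predictions (function and tensor names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def supervised_loss(primary_logits, labels):
    """L_sup: standard cross entropy on labeled examples.

    primary_logits: [batch, num_classes] scores from the primary module.
    labels:         [batch] gold labels.
    """
    return F.cross_entropy(primary_logits, labels)

def cvt_loss(primary_logits, aux_logits_list):
    """L_CVT: train each auxiliary module (a restricted view of the input)
    to match the primary module's prediction on unlabeled examples."""
    with torch.no_grad():                      # primary prediction is a fixed target
        target = F.softmax(primary_logits, dim=-1)
    loss = 0.0
    for aux_logits in aux_logits_list:
        log_q = F.log_softmax(aux_logits, dim=-1)
        # KL(primary || auxiliary), averaged over the batch
        loss = loss + F.kl_div(log_q, target, reduction="batchmean")
    return loss
```

During training, labeled and unlabeled minibatches are combined so that the model minimizes L_sup + L_CVT; the auxiliary gradients flow into the shared Bi-LSTM encoder and improve its representations.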
Primary module:
Auxiliary modules:
"head"
Dependency parsing: words in a sentence are treated as nodes in a graph; each edge links a "head" word to a dependent and carries a "relation" label, e.g. (solved, I, SUB).
The probability of an edge (u, t, r) is given as:

p_\theta((u, t, r) \mid x_i) \propto \exp\big(h_u^\top (W_r + W)\, h_t\big)

h_u: encoder hidden state of the candidate head word u
h_t: encoder hidden state of the candidate dependent word t

The bilinear classifier uses a weight matrix W_r specific to the candidate relation as well as a weight matrix W shared across all relations.
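A rough sketch of this bilinear edge scorer (tensor names, shapes, and the choice to normalize over candidate dependents for a fixed head and relation are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeScorer(nn.Module):
    """Scores a candidate edge (u, t, r): head word u, dependent word t, relation r."""

    def __init__(self, hidden_size, num_relations):
        super().__init__()
        # W_r: one matrix per relation; W: shared across all relations
        self.W_r = nn.Parameter(torch.randn(num_relations, hidden_size, hidden_size) * 0.01)
        self.W = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)

    def score(self, h_u, h_t, r):
        # s(h_u, h_t, r) = h_u^T (W_r + W) h_t
        return h_u @ (self.W_r[r] + self.W) @ h_t

    def edge_probs(self, h_u, encoder_states, r):
        # p((u, t, r) | x) ∝ exp(s(h_u, h_t, r)), normalized here over candidate dependents
        scores = torch.stack([self.score(h_u, h_t, r) for h_t in encoder_states])
        return F.softmax(scores, dim=0)
```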
Auxiliary modules:
Model: encoder-decoder with attention
[Figure: encoder-decoder with attention; the attention distribution over the encoder states forms a context vector for the decoder]
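Where the figure showed attention producing a context vector, here is a minimal sketch of one bilinear attention step (a generic formulation for illustration; the exact parameterization in the paper may differ):

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_states, W_a):
    """Compute attention weights over the encoder states and the resulting context vector.

    decoder_state:  [hidden]            current decoder hidden state
    encoder_states: [src_len, hidden]   encoder hidden states
    W_a:            [hidden, hidden]    bilinear attention parameters
    """
    scores = encoder_states @ (W_a @ decoder_state)   # [src_len] alignment scores
    weights = F.softmax(scores, dim=0)                 # attention distribution
    context = weights @ encoder_states                 # [hidden] context vector
    return context, weights
```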
Two auxiliary decoders share embedding and LSTM parameters with the primary decoder, but maintain independent parameters for attention and softmax.
[Figure: shared encoder feeding the primary decoder and two auxiliary decoders (auxiliary 1, auxiliary 2)]
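A sketch of the parameter layout this sharing implies (module names and sizes are assumptions for illustration; each decoder would apply its own attention as in the sketch above):

```python
import torch.nn as nn

class DecodersWithSharing(nn.Module):
    """Primary decoder (index 0) plus two auxiliary decoders (indices 1 and 2)."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        # shared by all three decoders: embedding and LSTM parameters
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # independent per decoder: attention and output-softmax parameters
        self.attn = nn.ModuleList([nn.Linear(hidden_size, hidden_size, bias=False)
                                   for _ in range(3)])
        self.out = nn.ModuleList([nn.Linear(2 * hidden_size, vocab_size)
                                  for _ in range(3)])
```

Because the embedding and LSTM are shared, gradients from the auxiliary decoders' CVT loss also update the shared parameters, while each decoder keeps its own attention and softmax.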
No target sequence for unlabeled examples.
With no target sequence to feed back via teacher forcing, how do we get an output distribution over the vocabulary from the primary decoder at each time step?
Solution: produce hard targets for the auxiliary modules by running the primary decoder with beam search on the input sequence.
Example: beam size = 2, vocab = {a, b, c} (a code sketch follows the reference below)
[Step 1] Select the 2 words with the highest probability, e.g. a, c
[Step 2] Form all combinations {aa, ab, ac, ca, cb, cc} and keep the 2 sequences with the highest cumulative probability, e.g. aa, ca
[Step 3] Repeat until <END> is generated; output the 2 sequences with the highest probability
Ref: https://www.zhihu.com/question/54356960
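A toy beam-search sketch matching the example above; next_token_probs is a hypothetical stand-in for the primary decoder's per-step output distribution:

```python
import math

def beam_search(next_token_probs, beam_size=2, end_token="<END>", max_len=10):
    """Toy beam search: keep the beam_size highest-probability partial sequences.

    next_token_probs(seq) -> dict mapping each vocabulary token to P(token | seq) > 0.
    Scores are accumulated in log space.
    """
    beams = [([], 0.0)]                      # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], logp + math.log(p)))
        # keep only the beam_size best expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, logp))
        if not beams:
            break
    return finished or beams
```

The highest-scoring finished sequence can then serve as the hard target for training the auxiliary decoders with teacher forcing, as described above.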
Cross-View Training is evaluated on 7 tasks:
Semi-supervised learning is a promising way to address the shortage of labeled data in deep learning.
Elegance: the method is general across many tasks and has a succinct mathematical formulation.
What's more?