Traditional Method
[Figure: the standard TextCNN pipeline on the example "I love the pizza". Each of the L words in a text is mapped to a D-dimensional word embedding (w0, w1, w2, w3). Convolution kernels of sizes ks=3, 4, 5 (Kn kernels per size) slide over the text, producing one window per position (e.g. (L-3) and (L-4) windows) and a Kn-dimensional feature per window. Pooling over the windows yields one Kn-dimensional vector per kernel size (the 3-gram, 4-gram, and 5-gram features), and the concatenated vector feeds a feed-forward layer that outputs a one-hot label: Positive, Negative, Neutral, or NIL.]
Notation: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number Kn.
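A minimal sketch of the traditional pipeline above, assuming a standard TextCNN-style model in PyTorch; the names L, D, Ks, Kn follow the slide's notation, while layer sizes, vocabulary size, and class count are illustrative, not the deck's actual configuration.

```python
# Hedged sketch of the traditional TextCNN pipeline (not the author's exact code).
# Notation follows the slide: L = text length, D = embedding dim,
# ks = kernel sizes, Kn = kernels per size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, D=128, Kn=100, kernel_sizes=(3, 4, 5), num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D)
        # One Conv1d per kernel size: each yields Kn feature maps of length L - ks + 1.
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_channels=D, out_channels=Kn, kernel_size=ks) for ks in kernel_sizes]
        )
        self.fc = nn.Linear(Kn * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (B, L)
        x = self.embed(token_ids).transpose(1, 2)      # (B, D, L)
        pooled = []
        for conv in self.convs:
            fmap = F.relu(conv(x))                     # (B, Kn, L - ks + 1) windows
            pooled.append(fmap.max(dim=2).values)      # pool over windows -> (B, Kn)
        h = torch.cat(pooled, dim=1)                   # 3-, 4-, 5-gram features
        return self.fc(h)                              # Positive / Negative / Neutral logits

logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (2, 20)))
```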
Our Method
[Figure: the same example "I love the pizza" is embedded as before, but instead of pooling each n-gram feature map down to a single vector, the text is split into U chunks. The ks=3, 4, 5 feature maps interact across N-grams within each chunk, producing chunk features c0, c1, ..., cU (U sets of high-level features), which feed a feed-forward layer that outputs Positive, Negative, Neutral, or NIL.]
Notation: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number Kn.
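A minimal sketch of the chunked variant, assuming (this is my reading of the figure, not the author's code) that each n-gram feature map is pooled into U chunks, the ks=3/4/5 chunk features are fused ("interact N-grams"), and the resulting c0..cU feed the classifier.

```python
# Hedged sketch: chunk-level pooling + cross-n-gram interaction (my reading of the figure).
import torch
import torch.nn as nn
import torch.nn.functional as F

def chunk_pool(fmap, U):
    """Pool a feature map (B, Kn, W) into U chunk vectors (B, U, Kn)."""
    return F.adaptive_max_pool1d(fmap, U).transpose(1, 2)

class ChunkedTextCNN(nn.Module):
    def __init__(self, vocab_size, D=128, Kn=100, kernel_sizes=(3, 4, 5), U=4, num_classes=3):
        super().__init__()
        self.U = U
        self.embed = nn.Embedding(vocab_size, D)
        self.convs = nn.ModuleList([nn.Conv1d(D, Kn, ks) for ks in kernel_sizes])
        # "interact N-grams": fuse the 3/4/5-gram features of the same chunk.
        self.interact = nn.Linear(Kn * len(kernel_sizes), Kn)
        self.fc = nn.Linear(U * Kn, num_classes)

    def forward(self, token_ids):                            # (B, L)
        x = self.embed(token_ids).transpose(1, 2)            # (B, D, L)
        chunks = [chunk_pool(F.relu(conv(x)), self.U) for conv in self.convs]
        c = torch.cat(chunks, dim=2)                         # (B, U, 3*Kn)
        c = F.relu(self.interact(c))                         # c0..cU: U high-level features
        return self.fc(c.flatten(1))                         # class logits

logits = ChunkedTextCNN(vocab_size=10000)(torch.randint(0, 10000, (2, 20)))
```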
Baselines
- Baseline 1 (full-cont): map the whole representation through linear layers.
- Baselines 2 & 3 (seg-cont): use diverse "small" linear mappings on segments of the representation.
[Figure: each head is applied on top of the ks=3, 4, 5 feature maps.]

Results (two scores per baseline; plotted against segment number):

| Baseline | Head | Scores |
|---|---|---|
| 1 | deep connected linear layers | 62.69, 55.89 |
| 2 | independent view | 62.99, 54.40 |
| 3 | independent partial view | 63.87, 55.91 |
| 4 | weighted partial view | 64.93, 56.91 |
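A rough sketch of the contrast between these heads: Baseline 1 puts one deep linear stack over the whole representation, Baselines 2/3 apply a separate small linear mapping per segment, and Baseline 4 weights the segment views before combining them. Segment count, hidden sizes, and the weighting layer are illustrative assumptions, not the deck's exact models.

```python
# Hedged sketch of the baseline heads (illustrative shapes, not the author's exact models).
import torch
import torch.nn as nn

B, U, Kn = 8, 4, 100
features = torch.randn(B, U, Kn)                 # U segments of Kn-dim features

# Baseline 1 (full-cont): deep connected linear layers over the whole representation.
full_head = nn.Sequential(nn.Linear(U * Kn, 256), nn.ReLU(), nn.Linear(256, 3))
out1 = full_head(features.flatten(1))

# Baselines 2/3 (seg-cont): a diverse "small" linear mapping per segment,
# each segment mapped independently (independent / partial view).
seg_heads = nn.ModuleList([nn.Linear(Kn, 64) for _ in range(U)])
seg_feats = torch.stack([head(features[:, i]) for i, head in enumerate(seg_heads)], dim=1)
out23 = nn.Linear(U * 64, 3)(seg_feats.flatten(1))

# Baseline 4 (weighted partial view): weight the segment views before combining.
weights = torch.softmax(nn.Linear(64, 1)(seg_feats), dim=1)   # (B, U, 1)
out4 = nn.Linear(64, 3)((weights * seg_feats).sum(dim=1))
```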
- Baseline 5: self-attention weights. [Figure: a multi-head attention block (Query, Key, Value) followed by a feed-forward layer is applied over the ks=3, 4, 5 feature maps, and compared against Ours.]
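A minimal sketch of Baseline 5, assuming multi-head self-attention over the segment features followed by a feed-forward layer; the standard nn.MultiheadAttention module is used, and the head count and dimensions are illustrative.

```python
# Hedged sketch of Baseline 5: self-attention weights over segment features.
import torch
import torch.nn as nn

B, U, d = 8, 4, 128
segs = torch.randn(B, U, d)                         # segment features as the token sequence

attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

# Query = Key = Value = segment features (self-attention).
ctx, attn_weights = attn(segs, segs, segs)          # ctx: (B, U, d), weights: (B, U, U)
out = ffn(ctx)                                      # feed-forward on the attended features
logits = nn.Linear(d, 3)(out.mean(dim=1))           # pool over segments + classify
```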
Why subspace?
High-dimensional data can be clustered around a collection of linear or affine subspaces.
[Figure: varying levels of text vectors for "I love the pizza": word-level (4, D1) with words as axes; 2-gram (3, D2) with terms as axes; context (U, D3) with anchors as axes.]
Idea: cluster high-dimensional data sets distributed around a collection of linear and affine subspaces.
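A small sketch of the three levels of text vectors for the 4-word example. How the 2-gram and context vectors are actually built is not stated on the slide; here 2-gram vectors are formed by concatenating adjacent word embeddings and context vectors stand in for the U chunk features, both purely illustrative.

```python
# Hedged sketch: three levels of text vectors for "I love the pizza" (illustrative construction).
import torch

D1 = 128
words = torch.randn(4, D1)                            # word-level: (4, D1), words are the axes

# 2-gram level: one vector per adjacent word pair -> (3, D2), terms are the axes.
bigrams = torch.cat([words[:-1], words[1:]], dim=1)   # here D2 = 2 * D1 (an assumption)

# Context level: U chunk/anchor features -> (U, D3), anchors are the axes.
U, D3 = 2, 100
context = torch.randn(U, D3)                          # e.g. the c0..cU features from the model

# Each level is a point cloud that can lie near its own low-dimensional subspace.
```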
Linear case: supports with similar patterns yield heavy weights, and their feature points lie closer together; if two points are far apart, the weight is small, so that support has less influence on the final representation.
A rare sentiment feature has fewer neighbors; it can be enriched by constructing neighbors in the subspaces.
If the weight is small, the optimization moves the supports backwards, and vice versa. (Baseline 4 uses only the encodings.)
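The intuition above matches the standard self-expressive model from sparse subspace clustering; the objective below is shown only as a reference formulation and is an assumption, not necessarily the exact objective used in this work.

```latex
% Standard self-expressive model (sparse subspace clustering), for reference.
% Each point is written as a weighted combination of its "supports":
%   x_j \approx \sum_{k \ne j} c_{kj} x_k
\min_{C}\; \|C\|_1 + \frac{\lambda}{2}\,\|X - XC\|_F^2
\quad \text{s.t.}\; \operatorname{diag}(C) = 0 .
% Points from the same subspace (similar patterns) receive heavy weights c_{kj};
% far-away points receive small weights and contribute less to the representation.
```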
For each layer:
- linear mapping -> random select
- by relation rather than attention: all elements are related
[Figure: ks=3, 4, 5 feature maps; result: 64.48, 56.50.]
Sub-space clustering
Consider the problem of modeling a collection of data points with a union of subspaces.
Let $\{x_j \in \mathbb{R}^D\}_{j=1}^{N}$ be a given set of points drawn from an unknown union of $n$ linear or affine subspaces $\{S_i\}_{i=1}^{n}$ of dimensions $d_i = \dim(S_i)$, $0 < d_i < D$. The subspaces are described as
$$S_i = \{x \in \mathbb{R}^D : x = \mu_i + U_i y\}, \quad i = 1, \ldots, n,$$
where $\mu_i \in \mathbb{R}^D$ is an arbitrary point in subspace $S_i$ ($\mu_i = 0$ for linear subspaces), $U_i \in \mathbb{R}^{D \times d_i}$ is a basis for subspace $S_i$, and $y \in \mathbb{R}^{d_i}$ is a low-dimensional representation for subspace $S_i$.
The goal is to find the number of subspaces $n$, their dimensions $d_i$, the bases $U_i$, the points $\mu_i$, and the segmentation of the points according to the subspaces.
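A compact sketch of one common way to pursue this goal, assuming the usual self-expressive route (a regularized least-squares self-representation followed by spectral clustering on the resulting affinity); this is not the deck's algorithm, just an illustration of the definition above.

```python
# Hedged sketch: subspace clustering via self-expression + spectral clustering.
# Illustrative only; ridge-regularized least squares stands in for a sparse/low-rank solver.
import numpy as np
from sklearn.cluster import SpectralClustering

def self_expressive_coeffs(X, lam=1e-2):
    """X: (D, N) data matrix. Solve x_j ~ X c_j with diag(C) = 0 (ridge-regularized)."""
    D, N = X.shape
    C = np.zeros((N, N))
    for j in range(N):
        others = np.delete(np.arange(N), j)
        A = X[:, others]                                           # all points except x_j
        c = np.linalg.solve(A.T @ A + lam * np.eye(N - 1), A.T @ X[:, j])
        C[others, j] = c
    return C

def subspace_cluster(X, n_subspaces):
    C = self_expressive_coeffs(X)
    W = np.abs(C) + np.abs(C).T                                    # symmetric affinity
    labels = SpectralClustering(n_clusters=n_subspaces,
                                affinity="precomputed").fit_predict(W)
    return labels                                                  # segmentation of the points

# Toy example: points from two 1-D linear subspaces in R^3.
rng = np.random.default_rng(0)
X = np.hstack([np.outer(rng.standard_normal(3), rng.standard_normal(20)),
               np.outer(rng.standard_normal(3), rng.standard_normal(20))])
print(subspace_cluster(X, n_subspaces=2))
```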
[Figure: for each i in 1, 2, ..., n, the ks=3, 4, 5 feature maps are split into chunks (chunk size is a hyper-parameter) and passed through encodings and decodings.]
Local Feature Modeling
By Jing Liao