[Figure: Traditional Method, illustrated on the example sentence "I love the pizza".
Each of the L words w_0, w_1, ..., w_L is mapped to a word embedding in R^D, giving an L x D input matrix.
Convolution: kernels of size Ks = 3, 4, 5, with K_n kernels per size, slide over this matrix. Ks = 3 produces windows t_{0,2}, t_{1,3}, ... (a total of (L-3) windows); Ks = 4 produces t_{0,3}, t_{1,4}, ... (a total of (L-4) windows); Ks = 5 likewise. The outputs are collected in a feature-map table.
Pooling over the 3-gram, 4-gram, and 5-gram feature maps yields the concatenated feature vector f_1 of size 3K_n.
A feed-forward layer maps f_1 to an M x C output: a one-hot sentiment label (Positive, Negative, Neutral, NIL) for each aspect A_1, A_2, ..., A_M.
Legend: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number K_n.]
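
For reference, a minimal PyTorch sketch of this traditional pipeline; the class name TextCNNBaseline, the default sizes, and the flattened M x C classification head are illustrative assumptions rather than the authors' code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNBaseline(nn.Module):
    """Embedding -> Conv1d (Ks = 3, 4, 5, K_n kernels each) -> max-pool -> feed-forward."""
    def __init__(self, vocab_size, D=100, K_n=64, M=5, C=4, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D)
        self.convs = nn.ModuleList([nn.Conv1d(D, K_n, ks) for ks in kernel_sizes])
        # f_1 has size 3*K_n; the head predicts C classes for each of the M aspects.
        self.ffn = nn.Linear(len(kernel_sizes) * K_n, M * C)
        self.M, self.C = M, C

    def forward(self, tokens):                        # tokens: (batch, L)
        x = self.embed(tokens).transpose(1, 2)        # (batch, D, L)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        f1 = torch.cat(pooled, dim=1)                 # (batch, 3*K_n)
        return self.ffn(f1).view(-1, self.M, self.C)  # (batch, M, C)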

[Figure: Our Method. The convolutional front end is unchanged: the sentence (length L) is embedded into an L x D matrix and convolved with kernels of size ks = 3, 4, 5 into the pooled feature vector v = f_1 of size 3K_n.
Instead of passing v straight to the classifier, it is split into U chunks c_0, c_1, ..., c_U (U sets of high-level features) which interact across N-grams before a feed-forward layer produces the M x C output (aspects A_1, A_2, ..., A_M; classes Positive, Negative, Neutral, NIL).
Legend: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number K_n.]
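
A minimal sketch of the chunking step, assuming v is the pooled 3K_n feature from the convolutional front end and that U divides its size evenly (the interaction between chunks is sketched further below):

import torch

def split_into_chunks(v: torch.Tensor, U: int):
    """Split the pooled feature v of shape (batch, 3*K_n) into U chunks
    c_0, ..., c_{U-1}, each of shape (batch, 3*K_n / U)."""
    return torch.chunk(v, U, dim=1)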

Baseline designs (C: segment number; each baseline reports two scores):

Baseline 1 (full-cont, deep connected linear layers): the whole 3K_n representation (ks = 3, 4, 5 concatenated) is transformed by stacked 3K_n x 3K_n linear layers. Results: 62.69, 55.89.

Baseline 2 (seg-cont, independent view): 3K_n -> C x 3K_n -> 3K_n. Results: 62.99, 54.40.

Baseline 3 (seg-cont, independent partial view): diverse "small" linear mappings, one (3K_n/C) x (3K_n/C) layer per segment (3K_n/C -> 3K_n/C), re-concatenated to 3K_n. Results: 63.87, 55.91.

Baseline 4 (weighted partial view): the C segment features v'_i (each of size 3K_n/C) are weighted and combined through W (of size 3K_n) into u of size 3K_n. Results: 64.93, 56.91.
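
For concreteness, a hedged PyTorch sketch of Baselines 1 and 3 as described above; the depth, the ReLU, and the class names are assumptions made for illustration:

import torch
import torch.nn as nn

class FullContBaseline(nn.Module):
    """Baseline 1: stacked 3K_n x 3K_n linear layers over the whole representation."""
    def __init__(self, dim_3Kn: int, depth: int = 2):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(dim_3Kn, dim_3Kn), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, v):                  # v: (batch, 3*K_n)
        return self.net(v)

class SegContBaseline(nn.Module):
    """Baseline 3: one small (3K_n/C x 3K_n/C) linear map per segment, then re-concatenate."""
    def __init__(self, dim_3Kn: int, C: int):
        super().__init__()
        self.C = C
        seg = dim_3Kn // C
        self.maps = nn.ModuleList([nn.Linear(seg, seg) for _ in range(C)])

    def forward(self, v):                  # v: (batch, 3*K_n)
        chunks = torch.chunk(v, self.C, dim=1)
        return torch.cat([m(c) for m, c in zip(self.maps, chunks)], dim=1)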

[Figure: Baseline 5 (self-attn weights) vs. Ours. Baseline 5 replaces the relation weight ω_0 with multi-head attention: Key, Query, and Value projections W_K, W_Q, W_V (each 3K_n x 3K_n) followed by a 3K_n x 3K_n feed-forward, applied to the pooled feature v of size 3K_n (ks = 3, 4, 5).
Ours: the segment features r_α, r_β, v'_i live in R^d with d = 3K_n/C; they interact through the relation weight ω_0 and element-wise ⊙ / ⊗ operations with a σ nonlinearity, plus two 3K_n x 3K_n maps, to produce u in R^D with D = 3K_n.]
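
A rough sketch of the self-attention variant (Baseline 5). The slide only gives the 3K_n x 3K_n projection sizes, so treating the input as a (batch, seq, 3K_n) sequence and the head count of 3 are assumptions:

import torch
import torch.nn as nn

class SelfAttnBaseline(nn.Module):
    """Baseline 5 (sketch): W_Q, W_K, W_V and a feed-forward, all 3K_n x 3K_n,
    applied to a sequence of 3K_n-dim feature vectors."""
    def __init__(self, dim_3Kn: int, num_heads: int = 3):
        super().__init__()
        # 3 heads so that 3*K_n is always divisible by the head count.
        self.attn = nn.MultiheadAttention(dim_3Kn, num_heads, batch_first=True)
        self.ffn = nn.Linear(dim_3Kn, dim_3Kn)

    def forward(self, x):                  # x: (batch, seq, 3*K_n)
        attended, _ = self.attn(x, x, x)   # self-attention: query = key = value = x
        return self.ffn(attended)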

Why subspace?

High-dimensional data can be clustered around a collection of linear or affine subspaces.

A text vector exists at several levels of granularity; for "I love the pizza": word level (4, D1), where words are the axes; 2-gram level (3, D2), where terms are the axes; context level (U, D3), where anchors are the axes.

Idea: cluster high-dimensional data sets distributed around a collection of linear and affine subspaces.

Relation weights between segments: for segment features r_α, r_β ∈ R^{3K_n/C}, the weight is the cosine similarity

    ω_0 = (r_α^0 · r_β^0) / (|r_α^0| · |r_β^0|),    i.e. Ω = cos(r_α, r_β).

Linear case: supports with similar patterns yield heavy weights, and their feature points lie closer together. If two points are far apart, ω_0 is small and the segment v'_i ∈ R^{3K_n/C} receives less weight (via ⊗ and σ) in the final representation u ∈ R^{3K_n}.

A rare sentiment feature has fewer neighbors; it can be enriched by constructing neighbors in subspaces.
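
A minimal sketch of this cosine-relation weighting. The choice of a single reference segment r, reading σ as a sigmoid, and concatenating the re-weighted segments into u are illustrative assumptions:

import torch
import torch.nn.functional as F

def cosine_relation_combine(chunks, ref):
    """chunks: list of C segment tensors, each (batch, 3*K_n / C); ref: (batch, 3*K_n / C).
    Weight each segment by cos(ref, segment) and concatenate into u of shape (batch, 3*K_n)."""
    weighted = []
    for v_i in chunks:
        omega_i = F.cosine_similarity(ref, v_i, dim=1, eps=1e-8)   # (batch,)
        weighted.append(omega_i.unsqueeze(1) * v_i)                # scale the whole segment
    # The slide's sigma is read here as a sigmoid gate (assumption).
    return torch.sigmoid(torch.cat(weighted, dim=1))               # u: (batch, 3*K_n)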

Idea (cont.): clustering high-dimensional data sets distributed around a collection of linear and affine subspaces. With unnormalized relation weights Ω = r_α r_β, α, β ∈ C, each segment v'_i ∈ R^{3K_n/C} is weighted by ω_0, gated by σ, and combined (+ and W ×) into the encoding e ∈ R^{3K_n}.

Linear case: if ω_i < 0, the optimization moves away from the supports, and vice versa (ω_0 > 0 pulls towards the supports, ω_0 < 0 pushes away).

Baseline 4 uses only the encodings.

For each layer: r_α, r_β, v'_i ∈ R^{3K_n/C}; e ∈ R^{3K_n}; relation weights Ω ∈ R^{C×1}; a 3K_n/C x 3K_n map W per kernel size (ks = 3, 4, 5). Results: 64.48, 56.50.

Ablation: replacing the linear mapping with random selection. Because the weights come from relations rather than attention, all elements are related to one another.

Sub-space clustering

Consider the problem of modeling a collection of data points with a union of subspaces:

Let {x_j ∈ R^D} be a given set of points drawn from an unknown union of n linear or affine subspaces {S_i}_{i=1}^n of dimension d_i = dim(S_i), 0 < d_i < D, i = 1, ..., n.

The subspaces are described as:

S_i = { x ∈ R^D : x = μ_i + U_i y },  i = 1, ..., n

where
μ_i: an arbitrary point in subspace S_i (μ_i = 0 for linear subspaces),
U_i ∈ R^{D×d_i}: a basis for subspace S_i,
y: a low-dimensional representation of the point x.

The goal is to find μ_i, U_i, and n.
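
To make x = μ_i + U_i y concrete, a small NumPy sketch on synthetic data; recovering μ_i and U_i for a single known subspace via PCA is an illustration of the model, not the clustering algorithm itself:

import numpy as np

# Synthetic points near a d_i = 2 affine subspace of R^D, with D = 10.
rng = np.random.default_rng(0)
D, d_i, N = 10, 2, 200
mu_i = rng.normal(size=D)                          # offset point of the affine subspace
U_i, _ = np.linalg.qr(rng.normal(size=(D, d_i)))   # orthonormal basis, shape (D, d_i)
y = rng.normal(size=(N, d_i))                      # low-dimensional representations
X = mu_i + y @ U_i.T + 0.01 * rng.normal(size=(N, D))   # x = mu_i + U_i y + noise

# Recover mu_i as the mean and U_i from the top-d_i right singular vectors (PCA).
mu_hat = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu_hat, full_matrices=False)
U_hat = Vt[:d_i].T                                 # estimated basis, shape (D, d_i)

# Low-dimensional representation and reconstruction error.
y_hat = (X - mu_hat) @ U_hat
reconstruction = mu_hat + y_hat @ U_hat.T
print("mean reconstruction error:", np.abs(reconstruction - X).mean())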

[Figure: full model, on the example sentence "I love the pizza".
Front end: the L words w_0, w_1, ..., w_L are embedded into an L x D matrix and convolved (ks = 3, 4, 5) into the pooled feature v of size 3K_n, split into segments v'_0, v'_1, ..., v'_C, each of size 3K_n/C (C x 3K_n/C in total).
Encodings: for each i in 1, 2, ..., n, relation weights Ω = softmax(r_α W_r r_β) weight the segments as ω_0 v'_0, ω_1 v'_1, ..., ω_C v'_C, producing encodings e_0, e_1, ..., e_C; the encodings play the role of y in the subspace model, and n (the chunk size/number) is a hyper-parameter.
Decodings: θ': R^d → R^D (d = 3K_n/C, D = 3K_n) maps each encoding back, with W playing the role of the basis U_i; the decoded output is the residual combination v + W e, i.e. v_0 + W_0 e_0, v_1 + W_1 e_1, ..., v_C + W_C e_C.
Head: a feed-forward layer produces the M x C output, with a one-hot sentiment label (Positive, Negative, Neutral, NIL) per aspect A_1, A_2, ..., A_M.
Legend: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number K_n.]
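
A minimal PyTorch sketch of one plausible reading of this encode/decode block; the bilinear softmax for Ω = softmax(r_α W_r r_β), the per-segment residual v_i + W_i e_i, and the projection sizes are assumptions made for illustration:

import torch
import torch.nn as nn

class SubspaceInteraction(nn.Module):
    """Split v (3*K_n) into C segments, relate them with a bilinear softmax,
    build per-segment encodings, and decode residually as v_i + W_i e_i."""
    def __init__(self, dim_3Kn: int, C: int):
        super().__init__()
        self.C = C
        seg = dim_3Kn // C
        self.W_r = nn.Parameter(torch.randn(seg, seg) * 0.02)  # relation bilinear form
        self.W_dec = nn.ModuleList([nn.Linear(seg, seg, bias=False) for _ in range(C)])

    def forward(self, v):                                       # v: (batch, 3*K_n)
        segs = torch.stack(torch.chunk(v, self.C, dim=1), dim=1)  # (batch, C, seg)
        scores = segs @ self.W_r @ segs.transpose(1, 2)           # (batch, C, C): r_a W_r r_b
        omega = torch.softmax(scores, dim=-1)                     # relation weights per segment
        e = omega @ segs                                           # encodings: weighted neighbor sums
        decoded = [seg_i + self.W_dec[i](e[:, i])                  # v_i + W_i e_i
                   for i, seg_i in enumerate(segs.unbind(dim=1))]
        return torch.cat(decoded, dim=1)                           # back to (batch, 3*K_n)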

Local Featuring Modeling

By Jing Liao
