[Figure: Traditional Method, illustrated on the example sentence "I love the pizza".
Each of the L words w_0, w_1, ..., w_L is mapped to a word embedding in R^D, giving an L x D input matrix.
Convolution: kernels of size Ks = 3, 4, 5, with K_n kernels per size, slide over this matrix. Ks = 3 produces windows t_{0,2}, t_{1,3}, ... (a total of (L-3) windows); Ks = 4 produces t_{0,3}, t_{1,4}, ... (a total of (L-4) windows); Ks = 5 likewise. The outputs are collected in a feature-map table.
Pooling over the 3-gram, 4-gram, and 5-gram feature maps yields the concatenated feature vector f_1 of size 3K_n.
A feed-forward layer maps f_1 to an M x C output: a one-hot sentiment label (Positive, Negative, Neutral, NIL) for each aspect A_1, A_2, ..., A_M.
Legend: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number K_n.]
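
For reference, a minimal PyTorch sketch of this traditional pipeline; the class name TextCNNBaseline, the default sizes, and the flattened M x C classification head are illustrative assumptions rather than the authors' code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNBaseline(nn.Module):
    """Embedding -> Conv1d (Ks = 3, 4, 5, K_n kernels each) -> max-pool -> feed-forward."""
    def __init__(self, vocab_size, D=100, K_n=64, M=5, C=4, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D)
        self.convs = nn.ModuleList([nn.Conv1d(D, K_n, ks) for ks in kernel_sizes])
        # f_1 has size 3*K_n; the head predicts C classes for each of the M aspects.
        self.ffn = nn.Linear(len(kernel_sizes) * K_n, M * C)
        self.M, self.C = M, C

    def forward(self, tokens):                        # tokens: (batch, L)
        x = self.embed(tokens).transpose(1, 2)        # (batch, D, L)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        f1 = torch.cat(pooled, dim=1)                 # (batch, 3*K_n)
        return self.ffn(f1).view(-1, self.M, self.C)  # (batch, M, C)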

[Figure: Our Method. The convolutional front end is unchanged: the sentence (length L) is embedded into an L x D matrix and convolved with kernels of size ks = 3, 4, 5 into the pooled feature vector v = f_1 of size 3K_n.
Instead of passing v straight to the classifier, it is split into U chunks c_0, c_1, ..., c_U (U sets of high-level features) which interact across N-grams before a feed-forward layer produces the M x C output (aspects A_1, A_2, ..., A_M; classes Positive, Negative, Neutral, NIL).
Legend: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number K_n.]
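
A minimal sketch of the chunking step, assuming v is the pooled 3K_n feature from the convolutional front end and that U divides its size evenly (the interaction between chunks is sketched further below):

import torch

def split_into_chunks(v: torch.Tensor, U: int):
    """Split the pooled feature v of shape (batch, 3*K_n) into U chunks
    c_0, ..., c_{U-1}, each of shape (batch, 3*K_n / U)."""
    return torch.chunk(v, U, dim=1)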

Baseline designs (C: segment number; each baseline reports two scores):

Baseline 1 (full-cont, deep connected linear layers): the whole 3K_n representation (ks = 3, 4, 5 concatenated) is transformed by stacked 3K_n x 3K_n linear layers. Results: 62.69, 55.89.

Baseline 2 (seg-cont, independent view): 3K_n -> C x 3K_n -> 3K_n. Results: 62.99, 54.40.

Baseline 3 (seg-cont, independent partial view): diverse "small" linear mappings, one (3K_n/C) x (3K_n/C) layer per segment (3K_n/C -> 3K_n/C), re-concatenated to 3K_n. Results: 63.87, 55.91.

Baseline 4 (weighted partial view): the C segment features v'_i (each of size 3K_n/C) are weighted and combined through W (of size 3K_n) into u of size 3K_n. Results: 64.93, 56.91.
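
For concreteness, a hedged PyTorch sketch of Baselines 1 and 3 as described above; the depth, the ReLU, and the class names are assumptions made for illustration:

import torch
import torch.nn as nn

class FullContBaseline(nn.Module):
    """Baseline 1: stacked 3K_n x 3K_n linear layers over the whole representation."""
    def __init__(self, dim_3Kn: int, depth: int = 2):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(dim_3Kn, dim_3Kn), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, v):                  # v: (batch, 3*K_n)
        return self.net(v)

class SegContBaseline(nn.Module):
    """Baseline 3: one small (3K_n/C x 3K_n/C) linear map per segment, then re-concatenate."""
    def __init__(self, dim_3Kn: int, C: int):
        super().__init__()
        self.C = C
        seg = dim_3Kn // C
        self.maps = nn.ModuleList([nn.Linear(seg, seg) for _ in range(C)])

    def forward(self, v):                  # v: (batch, 3*K_n)
        chunks = torch.chunk(v, self.C, dim=1)
        return torch.cat([m(c) for m, c in zip(self.maps, chunks)], dim=1)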

[Figure: Baseline 5 (self-attn weights) vs. Ours. Baseline 5 replaces the relation weight ω_0 with multi-head attention: Key, Query, and Value projections W_K, W_Q, W_V (each 3K_n x 3K_n) followed by a 3K_n x 3K_n feed-forward, applied to the pooled feature v of size 3K_n (ks = 3, 4, 5).
Ours: the segment features r_α, r_β, v'_i live in R^d with d = 3K_n/C; they interact through the relation weight ω_0 and element-wise ⊙ / ⊗ operations with a σ nonlinearity, plus two 3K_n x 3K_n maps, to produce u in R^D with D = 3K_n.]
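
A rough sketch of the self-attention variant (Baseline 5). The slide only gives the 3K_n x 3K_n projection sizes, so treating the input as a (batch, seq, 3K_n) sequence and the head count of 3 are assumptions:

import torch
import torch.nn as nn

class SelfAttnBaseline(nn.Module):
    """Baseline 5 (sketch): W_Q, W_K, W_V and a feed-forward, all 3K_n x 3K_n,
    applied to a sequence of 3K_n-dim feature vectors."""
    def __init__(self, dim_3Kn: int, num_heads: int = 3):
        super().__init__()
        # 3 heads so that 3*K_n is always divisible by the head count.
        self.attn = nn.MultiheadAttention(dim_3Kn, num_heads, batch_first=True)
        self.ffn = nn.Linear(dim_3Kn, dim_3Kn)

    def forward(self, x):                  # x: (batch, seq, 3*K_n)
        attended, _ = self.attn(x, x, x)   # self-attention: query = key = value = x
        return self.ffn(attended)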

Why subspace?

High-dimensional data can be clustered around a collection of linear or affine subspaces.

A text vector exists at several levels of granularity; for "I love the pizza": word level (4, D1), where words are the axes; 2-gram level (3, D2), where terms are the axes; context level (U, D3), where anchors are the axes.

Idea: cluster high-dimensional data sets distributed around a collection of linear and affine subspaces.

Relation weights between segments: for segment features r_α, r_β ∈ R^{3K_n/C}, the weight is the cosine similarity

    ω_0 = (r_α^0 · r_β^0) / (|r_α^0| · |r_β^0|),    i.e. Ω = cos(r_α, r_β).

Linear case: supports with similar patterns yield heavy weights, and their feature points lie closer together. If two points are far apart, ω_0 is small and the segment v'_i ∈ R^{3K_n/C} receives less weight (via ⊗ and σ) in the final representation u ∈ R^{3K_n}.

A rare sentiment feature has fewer neighbors; it can be enriched by constructing neighbors in subspaces.
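
A minimal sketch of this cosine-relation weighting. The choice of a single reference segment r, reading σ as a sigmoid, and concatenating the re-weighted segments into u are illustrative assumptions:

import torch
import torch.nn.functional as F

def cosine_relation_combine(chunks, ref):
    """chunks: list of C segment tensors, each (batch, 3*K_n / C); ref: (batch, 3*K_n / C).
    Weight each segment by cos(ref, segment) and concatenate into u of shape (batch, 3*K_n)."""
    weighted = []
    for v_i in chunks:
        omega_i = F.cosine_similarity(ref, v_i, dim=1, eps=1e-8)   # (batch,)
        weighted.append(omega_i.unsqueeze(1) * v_i)                # scale the whole segment
    # The slide's sigma is read here as a sigmoid gate (assumption).
    return torch.sigmoid(torch.cat(weighted, dim=1))               # u: (batch, 3*K_n)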

Idea (cont.): clustering high-dimensional data sets distributed around a collection of linear and affine subspaces. With unnormalized relation weights Ω = r_α r_β, α, β ∈ C, each segment v'_i ∈ R^{3K_n/C} is weighted by ω_0, gated by σ, and combined (+ and W ×) into the encoding e ∈ R^{3K_n}.

Linear case: if ω_i < 0, the optimization moves away from the supports, and vice versa (ω_0 > 0 pulls towards the supports, ω_0 < 0 pushes away).

Baseline 4 uses only the encodings.

For each layer: r_α, r_β, v'_i ∈ R^{3K_n/C}; e ∈ R^{3K_n}; relation weights Ω ∈ R^{C×1}; a 3K_n/C x 3K_n map W per kernel size (ks = 3, 4, 5). Results: 64.48, 56.50.

Ablation: replacing the linear mapping with random selection. Because the weights come from relations rather than attention, all elements are related to one another.

Sub-space clustering

Consider the problem of modeling a collection of data points with a union of subspaces:

Let {x_j ∈ R^D} be a given set of points drawn from an unknown union of n linear or affine subspaces {S_i}_{i=1}^n of dimension d_i = dim(S_i), 0 < d_i < D, i = 1, ..., n.

The subspaces are described as:

S_i = { x ∈ R^D : x = μ_i + U_i y },  i = 1, ..., n

where
μ_i: an arbitrary point in subspace S_i (μ_i = 0 for linear subspaces),
U_i ∈ R^{D×d_i}: a basis for subspace S_i,
y: a low-dimensional representation of the point x.

The goal is to find μ_i, U_i, and n.
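
To make x = μ_i + U_i y concrete, a small NumPy sketch on synthetic data; recovering μ_i and U_i for a single known subspace via PCA is an illustration of the model, not the clustering algorithm itself:

import numpy as np

# Synthetic points near a d_i = 2 affine subspace of R^D, with D = 10.
rng = np.random.default_rng(0)
D, d_i, N = 10, 2, 200
mu_i = rng.normal(size=D)                          # offset point of the affine subspace
U_i, _ = np.linalg.qr(rng.normal(size=(D, d_i)))   # orthonormal basis, shape (D, d_i)
y = rng.normal(size=(N, d_i))                      # low-dimensional representations
X = mu_i + y @ U_i.T + 0.01 * rng.normal(size=(N, D))   # x = mu_i + U_i y + noise

# Recover mu_i as the mean and U_i from the top-d_i right singular vectors (PCA).
mu_hat = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu_hat, full_matrices=False)
U_hat = Vt[:d_i].T                                 # estimated basis, shape (D, d_i)

# Low-dimensional representation and reconstruction error.
y_hat = (X - mu_hat) @ U_hat
reconstruction = mu_hat + y_hat @ U_hat.T
print("mean reconstruction error:", np.abs(reconstruction - X).mean())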

[Figure: full model, on the example sentence "I love the pizza".
Front end: the L words w_0, w_1, ..., w_L are embedded into an L x D matrix and convolved (ks = 3, 4, 5) into the pooled feature v of size 3K_n, split into segments v'_0, v'_1, ..., v'_C, each of size 3K_n/C (C x 3K_n/C in total).
Encodings: for each i in 1, 2, ..., n, relation weights Ω = softmax(r_α W_r r_β) weight the segments as ω_0 v'_0, ω_1 v'_1, ..., ω_C v'_C, producing encodings e_0, e_1, ..., e_C; the encodings play the role of y in the subspace model, and n (the chunk size/number) is a hyper-parameter.
Decodings: θ': R^d → R^D (d = 3K_n/C, D = 3K_n) maps each encoding back, with W playing the role of the basis U_i; the decoded output is the residual combination v + W e, i.e. v_0 + W_0 e_0, v_1 + W_1 e_1, ..., v_C + W_C e_C.
Head: a feed-forward layer produces the M x C output, with a one-hot sentiment label (Positive, Negative, Neutral, NIL) per aspect A_1, A_2, ..., A_M.
Legend: text length L, word embedding dim D, chunk number U, kernel size Ks, kernel number K_n.]
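
A minimal PyTorch sketch of one plausible reading of this encode/decode block; the bilinear softmax for Ω = softmax(r_α W_r r_β), the per-segment residual v_i + W_i e_i, and the projection sizes are assumptions made for illustration:

import torch
import torch.nn as nn

class SubspaceInteraction(nn.Module):
    """Split v (3*K_n) into C segments, relate them with a bilinear softmax,
    build per-segment encodings, and decode residually as v_i + W_i e_i."""
    def __init__(self, dim_3Kn: int, C: int):
        super().__init__()
        self.C = C
        seg = dim_3Kn // C
        self.W_r = nn.Parameter(torch.randn(seg, seg) * 0.02)  # relation bilinear form
        self.W_dec = nn.ModuleList([nn.Linear(seg, seg, bias=False) for _ in range(C)])

    def forward(self, v):                                       # v: (batch, 3*K_n)
        segs = torch.stack(torch.chunk(v, self.C, dim=1), dim=1)  # (batch, C, seg)
        scores = segs @ self.W_r @ segs.transpose(1, 2)           # (batch, C, C): r_a W_r r_b
        omega = torch.softmax(scores, dim=-1)                     # relation weights per segment
        e = omega @ segs                                           # encodings: weighted neighbor sums
        decoded = [seg_i + self.W_dec[i](e[:, i])                  # v_i + W_i e_i
                   for i, seg_i in enumerate(segs.unbind(dim=1))]
        return torch.cat(decoded, dim=1)                           # back to (batch, 3*K_n)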

Local Featuring Modeling

By Jing Liao
