Week 15 Report
b02901085 徐瑞陽
b02901054 方為
We focus on Subtask 1 first.
Given a review text about a target entity (laptop, restaurant, etc.),
identify the following information:
Originally, we thought this was a simple task...
At first, the high input dimension for the SVM seemed awful :(
We need to reduce it!
Every application domain has a feature set that describes it best:
e.g. MFCC for audio,
Gabor filters for video... etc.
For a text corpus... it's keywords!
1000 vs. 3000
remove stopwords
stemming
tf-idf
simple word clustering...
None of these worked... OTL
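A minimal sketch of that pipeline, assuming sklearn's TfidfVectorizer plus NLTK's PorterStemmer (the toy texts and the max_features value are just for illustration):

```python
# Sketch of the keyword pipeline: stopword removal, stemming, tf-idf.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["The battery life is great", "The waiters were rude"]  # toy data

stemmer = PorterStemmer()
base_analyzer = TfidfVectorizer(stop_words='english').build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize and drop stopwords via the default analyzer, then stem.
    return [stemmer.stem(tok) for tok in base_analyzer(doc)]

vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer, max_features=3000)
X_train = vectorizer.fit_transform(train_texts)  # sparse tf-idf matrix
```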
Find short descriptions of the members of a collection
while preserving the essential relationships
How this differs from tf-idf:
tf-idf is "indexing" (it gives a proper weight to every unigram),
something like normalization... I think
Use SVD to find the subspace of the tf-idf matrix that
captures most of the variance in the collection.
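In sklearn terms this is just a truncated SVD of the tf-idf matrix; a sketch (assuming X_train is the tf-idf matrix from above):

```python
from sklearn.decomposition import TruncatedSVD

# LSI: the top singular directions span the subspace that captures
# most of the variance in the collection.
lsi = TruncatedSVD(n_components=1000)
X_lsi = lsi.fit_transform(X_train)          # X_train: tf-idf matrix from above
print(lsi.explained_variance_ratio_.sum())  # how much variance is preserved
```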
Model each word as a sample from a mixture model
whose mixture components are hidden topics.
(the document index d is only a dummy label :p)
The number of parameters grows with the size of the mixture (the corpus).
We don't have a generative model for the topic proportions of an arbitrary (unseen) document.
But I haven't fully figured this out yet, actually :p
Assumption: BOW (the order of words can be neglected)
= exchangeability
de Finetti's theorem:
any infinitely exchangeable collection of RVs
has a representation as a mixture distribution
In our case,
=> words and topics are drawn from mixture (multinomial) distributions,
and some hidden variables will determine those distributions
(their priors are all "distributions over distributions")
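Written out, the representation looks like this (for words w_1..w_N and a hidden parameter theta):

```latex
% de Finetti: exchangeable words are i.i.d. given a hidden parameter
% \theta, mixed over a prior p(\theta):
p(w_1, \dots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) \mathrm{d}\theta
```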
Randomly choose the corpus-level parameters;
for each subsequent document, draw document-level parameters
that decide the topic distribution,
which is then used to decide the word probability conditioned on each topic.
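A toy numpy sketch of that generative story (all sizes and hyperparameter values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, N = 1000, 20, 50                 # vocab size, #topics, words per doc
alpha = np.full(K, 0.1)                # corpus-level Dirichlet parameter
beta = rng.dirichlet(np.full(V, 0.01), size=K)  # per-topic word distributions

theta = rng.dirichlet(alpha)           # document-level topic distribution
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)            # pick a topic for this word position
    doc.append(rng.choice(V, p=beta[z]))  # word prob. conditioned on that topic
```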
The Dirichlet is the multivariate version of the beta distribution.
Advantage: it avoids the over-fitting issue of MLE (ref: CMU)
(it is conjugate to the multinomial distribution)
After using PCA (SVD) to reduce the dimension, we got an error...
I found I had been using sklearn's fit_transform incorrectly before...
OTL
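For the record, a sketch of the correct usage, assuming the bug was re-fitting on the test split:

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1000)
X_train_red = svd.fit_transform(X_train)  # fit the subspace on train data only

# Wrong: svd.fit_transform(X_test) re-fits a *different* subspace on test data.
X_test_red = svd.transform(X_test)        # right: project into the same subspace
```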
Without LSI:
restaurant: 61% (12 categories)
laptop: 52% (81 categories)
restaurant: 60% (12 categories)
laptop: 50.05% (81 categories)
With LSI (dimension reduced to 1000):
restaurant: 60% (12 categories)
laptop: 52% (81 categories)
Need more time...
There seems to be no improvement,
but the 1000 dimensions do seem to preserve the essential information.
Bigram
Domain | 9000/? (original size) | 3000 (unigram size) |
---|---|---|
restaurant | 61% | 61% |
laptop | 51% | 51% |
It seems high input dimension is not a problem :p
Trigram
Domain | 15k/17k (original size) | 3000 (unigram size) |
---|---|---|
restaurant | 61% | |
laptop | 51% | |
Nothing special :p
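These n-gram runs amount to changing the vectorizer's ngram_range; a sketch, assuming sklearn's TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams + bigrams; for the trigram runs, use ngram_range=(1, 3).
bigram_vec = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
X_bigram = bigram_vec.fit_transform(train_texts)
print(X_bigram.shape)  # the feature count is the "original size" in the tables
```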
Two kinds of classifiers:
1. Use the samples to classify directly (something like dimensionality reduction).
2. Use the samples to estimate a hidden model's parameters (learn a model),
   then use these models to predict unseen samples (classify).
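In sklearn terms the two kinds could look like this (LogisticRegression and MultinomialNB are just illustrative stand-ins, not necessarily what we used):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Kind 1: discriminative -- learn the decision rule directly from samples.
direct_clf = LogisticRegression().fit(X_train, y_train)

# Kind 2: generative -- estimate a hidden model's parameters (class priors,
# per-class word distributions), then use that model to classify new samples.
model_clf = MultinomialNB().fit(X_train, y_train)
predictions = model_clf.predict(X_test)
```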
Domain | 3-class accuracy |
---|---|
Restaurant | 71.23% |
Laptop | 74.93% |
This is without using aspect information, and with conflicting polarities removed.
Seems like we're on the right track... we hope so...
When modifying the TreeLSTM code, the model can be trained on our dataset, but we can only feed-forward data from the train and dev sets, not the test set.
Our guess is that the dictionary was built from the train/dev data but not the test data, so some words in the test set cannot be encoded, causing the error.
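A sketch of the usual fix, assuming we map every out-of-vocabulary word to a reserved <unk> index (all names here are hypothetical, not from the TreeLSTM code):

```python
UNK = '<unk>'

def build_vocab(token_lists):
    # Build the word -> index dictionary from train/dev tokens; reserve <unk>.
    vocab = {UNK: 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Words never seen in train/dev fall back to <unk> instead of failing.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]
```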