Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

Socher et al., 2011

Context

  • Different from Socher 2011a
  • Growing field
  • Based on previous model

Summary

  • Domain: NLP
  • Introduces a novel framework
  • Achieves state-of-the-art results

Background

Semi-Supervised Learning

Semi-supervised learning uses both labeled data (S) and unlabeled data (U) as inputs. Its goal can either be to learn the labels of U or to learn the correct mapping from X to Y.

Semi-Supervised Learning

  • For classification problems:
    1. learn categories from S
    2. learn features/boundaries from U
  • Improves learning accuracy
  • Great for NLP

Autoencoders

Autoencoders are a form of unsupervised learning using neural networks.  They map a set of inputs X to itself by learning an approximation of the identity function.

Recursive Autoencoders

A recursive autoencoder (RAE) takes a set of inputs and recursively merges pairs until a single element remains, which captures the information of all inputs.

Sentiment Distributions

Sentiment analysis is the process of identifying and classifying the opinion or emotion expressed in a piece of text.  A sentiment distribution is the distribution of sentiment labels over a body of texts.

Sentiment Distributions

  • One-dimensional
  • Multi-dimensional

Building the Model

Problem

Learn how to identify the sentiment of a piece of text.

Hypothesis

A semi-supervised RAE can learn this mapping without the help of traditional resources.

Continuous Word Vectors

A continuous word vector is a mapping of a word to a vector in a feature space where each dimension captures some syntactic or semantic meaning.  

Continuous Word Vectors

  • random initialization
    • sample from N(0, sigma^2)
  • learned initialization
    • pre-train with an unsupervised neural language model
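
The random initialization option can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the vector dimension, and the value of sigma are assumptions chosen for the example.

```python
import random


def init_word_vectors(vocabulary, dim, sigma=0.05, seed=0):
    """Draw every coordinate of every word vector from N(0, sigma^2).

    sigma and seed are illustrative defaults, not values from the slides.
    """
    rng = random.Random(seed)
    return {
        word: [rng.gauss(0.0, sigma) for _ in range(dim)]
        for word in vocabulary
    }


def embed(sentence, vectors):
    """Map a whitespace-tokenized sentence to its list of word vectors."""
    return [vectors[word] for word in sentence.split()]


vectors = init_word_vectors(["a", "great", "movie"], dim=4)
sentence_vectors = embed("a great movie", vectors)
```

With learned initialization, the same lookup table would instead be filled with pre-trained vectors; either way the vectors are then treated as model parameters.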

Structure Prediction

Structure prediction uses an RAE to find the vector representation of a sentence that minimizes the total reconstruction error over all levels of the recursion tree.

Greedy Algorithm

  1. For each neighboring pair of vectors (c1, c2):
    • Give them as input to the autoencoder
    • Record the parent node p and reconstruction error
  2. Choose the pair with the lowest error and replace them with p
  3. Repeat the process with the new set of vectors until only a single vector remains
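
The three steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: a one-layer tanh encoder, a linear decoder, a squared-error reconstruction loss, and untrained random weights. The names make_params, merge, and greedy_tree are hypothetical.

```python
import math
import random


def make_params(dim, seed=0, scale=0.1):
    """Small random autoencoder weights (illustrative, untrained)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "W1": matrix(dim, 2 * dim), "b1": [0.0] * dim,        # encoder
        "W2": matrix(2 * dim, dim), "b2": [0.0] * (2 * dim),  # decoder
    }


def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def merge(c1, c2, params):
    """Encode a pair of children into a parent; return (parent, error)."""
    children = c1 + c2  # concatenation [c1; c2]
    parent = [math.tanh(v)
              for v in affine(params["W1"], children, params["b1"])]
    reconstruction = affine(params["W2"], parent, params["b2"])
    error = 0.5 * sum((c - r) ** 2
                      for c, r in zip(children, reconstruction))
    return parent, error


def greedy_tree(vectors, params):
    """Merge the lowest-error neighboring pair until one vector remains."""
    if not vectors:
        raise ValueError("need at least one vector")
    # Each node is (vector, subtree); leaves are word positions.
    nodes = [(list(v), i) for i, v in enumerate(vectors)]
    total_error = 0.0
    while len(nodes) > 1:
        candidates = [merge(nodes[i][0], nodes[i + 1][0], params)
                      for i in range(len(nodes) - 1)]
        best = min(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, error = candidates[best]
        subtree = (nodes[best][1], nodes[best + 1][1])
        nodes[best:best + 2] = [(parent, subtree)]
        total_error += error
    root_vector, tree = nodes[0]
    return root_vector, tree, total_error
```

The returned tree is a nested tuple of word positions, for example ((0, 1), (2, 3)), so the chosen bracketing can be inspected alongside the root vector and the summed reconstruction error.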

Semi-Supervised RAEs

A semi-supervised RAE applies an additional softmax layer at each level of the recursion tree to predict the sentiment distribution of the parent feature vector, p.
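
A minimal sketch of that softmax layer, assuming a single weight matrix with one row per sentiment label and a cross-entropy loss against the target distribution. The name W_label and the omission of a bias term are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(parent, W_label):
    """Predict a sentiment distribution from a parent vector p."""
    scores = [sum(w * x for w, x in zip(row, parent)) for row in W_label]
    return softmax(scores)


def cross_entropy(target, predicted):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(q)
                for t, q in zip(target, predicted) if t > 0)


# Hypothetical 3-dimensional parent vector and 2-label weight matrix.
p = [0.2, -0.4, 0.7]
W_label = [[0.5, 0.1, -0.3],
           [-0.2, 0.4, 0.6]]
predicted = predict_distribution(p, W_label)
loss = cross_entropy([1.0, 0.0], predicted)
```

Because the layer is applied at every node of the tree, the same function would be called once per parent vector, not only at the root.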

Learning (Step 1)

Given the current set of parameters, greedily construct the optimal tree for each sentence.

Learning (Step 2)

Compute the gradient of the objective function and update the parameter values using L-BFGS.

Regularization

Weight decay

Length Normalization
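
Both ideas are short enough to sketch directly. The version below assumes weight decay is an L2 penalty added to the objective and that length normalization rescales each parent vector to unit Euclidean length. The function names and the coefficient lam are hypothetical.

```python
import math


def normalize_length(p):
    """Rescale a parent vector to unit length: p / ||p||."""
    norm = math.sqrt(sum(v * v for v in p))
    if norm == 0.0:
        return list(p)  # leave an all-zero vector unchanged
    return [v / norm for v in p]


def weight_decay_penalty(weight_matrices, lam):
    """L2 penalty (lam / 2) * sum of squared weights over all matrices."""
    squared = sum(w * w
                  for matrix in weight_matrices
                  for row in matrix
                  for w in row)
    return 0.5 * lam * squared


unit_parent = normalize_length([3.0, 4.0])
penalty = weight_decay_penalty([[[1.0, 2.0], [3.0, 4.0]]], lam=0.1)
```

Normalizing parent vectors keeps the autoencoder from lowering the reconstruction error simply by shrinking the magnitude of its hidden representation.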

Components of Learning

  • Input: sentence
  • Output: sentiment distribution
  • Hypothesis Set: semi-supervised RAEs
  • Learning Algorithm: L-BFGS, with gradients from backpropagation through structure (BPTS)
  • f: ideal mapping of sentences to sentiment labels

Experiments

Remember this?

Example of multinomial sentiment distribution.

Experience Project

Website for anonymous sharing of personal stories with other users who can respond by "voting" for one of five categories:

  • Sorry, Hugs (condolences)
  • You Rock (approval/congratulations)
  • Teehee (amusement)
  • I Understand (empathy)
  • Wow, Just Wow (surprise)
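
Votes for one story can be turned into a sentiment distribution by normalizing the counts. The sketch below uses made-up vote counts, listed in the same order as the five categories above. The function name is hypothetical.

```python
def votes_to_distribution(votes):
    """Normalize raw vote counts into a probability distribution."""
    total = sum(votes)
    if total == 0:
        raise ValueError("cannot normalize an entry with no votes")
    return [count / total for count in votes]


# Hypothetical counts for a single story, in the category order above.
example_votes = [6, 1, 0, 12, 1]
example_distribution = votes_to_distribution(example_votes)
```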

EP: Dataset

  • 6,129 observations
  • label distribution: [.22, .20, .11, .37, .10]
  • 129 words per entry on average 
  • 49% training, 21% validation, 30% testing

EP: Predicting the Label

EP: Predicting the Distribution

KL(g||p) = .03
KL(g||p) = .92
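
The KL(g||p) score measures how far a predicted distribution p is from the gold distribution g, with lower values indicating a closer match. A minimal sketch, assuming the natural logarithm and a prediction that is nonzero wherever g is nonzero; the example distributions are made up:

```python
import math


def kl_divergence(g, p):
    """KL(g || p) = sum_i g_i * log(g_i / p_i), skipping terms with g_i = 0."""
    return sum(gi * math.log(gi / pi) for gi, pi in zip(g, p) if gi > 0)


# Hypothetical gold and predicted distributions over five labels.
gold = [0.30, 0.05, 0.00, 0.60, 0.05]
close_prediction = [0.28, 0.07, 0.02, 0.58, 0.05]
poor_prediction = [0.05, 0.60, 0.25, 0.05, 0.05]

close_score = kl_divergence(gold, close_prediction)
poor_score = kl_divergence(gold, poor_prediction)
```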

Comparing Approaches

Most traditional sentiment analysis methods only perform binary polarity classification.  In order to compare semi-supervised RAEs to these methods, the model was evaluated on two common datasets:

  • movie reviews (MR)
  • opinions (MPQA)

Conclusion
