Machine Learning

in the Wild

#wins and #fails

 

Part 3

Steven Rich

The Washington Post

@dataeditor

 

http://bit.ly/machinelearningnicar

Sentiment Analysis

What is Sentiment Analysis?

  • Opinion Mining
  • Useful when attempting to determine the attitude of a writer toward a topic
  • Also good for determining the contextual polarity of a document (or documents)
  • Can be categorical or ordinal

Sentiment Analysis is a classification problem

Mostly deals in shades of positive, negative and neutral
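
To make "categorical or ordinal" concrete, here is a minimal sketch of turning one numeric sentiment score into either kind of output. It assumes the score runs from -1 (most negative) to 1 (most positive); the function names, the `neutral_band` parameter and the 1-to-5 scale are hypothetical choices for illustration, not taken from the analysis described in this talk.

```python
def categorical_label(score, neutral_band=0.1):
    """Categorical output: one of three unordered buckets.

    neutral_band is a hypothetical cutoff; scores inside it count as neutral.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


def ordinal_label(score):
    """Ordinal output: a 1 (most negative) to 5 (most positive) rating."""
    clamped = max(-1.0, min(1.0, score))  # keep the score inside [-1, 1]
    return 1 + round((clamped + 1) * 2)   # stretch [-1, 1] onto 1..5


for s in (-1.0, -0.05, 0.5, 1.0):
    print(s, categorical_label(s), ordinal_label(s))
```

The categorical version only says which bucket a document lands in; the ordinal version keeps the ordering, so a 2 is known to be more negative than a 4.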

What are some of our options?

  • Could query for 30 positive and 30 negative terms and do a simple count
  • Could choose the two or three documents we were told were the worst and do in-depth reporting on them
  • Could manually comb through each document and identify positive, negative and neutral terms
  • Could use sentiment analysis
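
For comparison, the first option (a simple count of positive and negative terms) might look something like the sketch below. The term lists are hypothetical and much shorter than the 30 per side mentioned above, and `count_terms` is an illustrative name only.

```python
import re

# Hypothetical term lists; the option above imagines about 30 per side.
POSITIVE = {"effective", "improved", "successful", "adequate"}
NEGATIVE = {"failed", "fraud", "waste", "deficient"}


def count_terms(text):
    """Return (positive_hits, negative_hits) for one document."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos, neg


sample = "The mission failed to prevent waste, though oversight improved."
print(count_terms(sample))  # (1, 2)
```

A count like this is quick to run, but it ignores context and only sees the words on the lists.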

The story

Whistleblowers say USAID’s IG removed critical details from public reports

I never had a #win without having a #fail (or five) first

 

#fail

The training set, while good for most types of documents, was awful for audits

#win

I could edit the training set (repeatedly) to fit my needs
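
Here is a rough sketch of why editing the training set matters. The talk does not say which classifier or tool was used, so this uses a tiny naive Bayes classifier as a stand-in, and every training sentence below is made up for illustration. The point is only that adding audit-style examples changes how audit language gets classified.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Build per-label word counts from (text, label) pairs."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def classify(counts, text):
    """Pick the label with the highest add-one-smoothed log-likelihood.

    Assumes equal prior probability for every label.
    """
    vocab = set().union(*counts.values())
    best_label, best_score = None, -math.inf
    for label, words in counts.items():
        total = sum(words.values())
        score = sum(
            math.log((words[w] + 1) / (total + len(vocab)))
            for w in tokenize(text)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical general-purpose examples: "significant" only appears as praise.
general = [
    ("a great and significant success", "positive"),
    ("a terrible and poor failure", "negative"),
]
# Hypothetical audit-style examples added by hand.
audit_examples = [
    ("significant findings of material weakness", "negative"),
    ("significant deficiencies in internal controls", "negative"),
]

phrase = "significant findings"
print(classify(train(general), phrase))                   # positive
print(classify(train(general + audit_examples), phrase))  # negative
```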

#fail

The English language sucks. There can be 1,000 ways to say something, so there are 1 million words to analyze. 

#partialwin

You can keep training sets for future analyses to make those better than the last. 

#fail

Words can have multiple connotations

#win

Could tweak the analysis to understand the context in which words were used
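
One simple way to account for context is to look at the word that follows a sentiment term. The sketch below is an assumption about how such a tweak could work, not the method used for the story: it skips "poor" when the next word suggests the writer is describing poverty rather than quality. Both word lists are hypothetical.

```python
import re

NEGATIVE = {"poor", "weak"}  # hypothetical negative terms
# Hypothetical nouns after which "poor" describes poverty, not quality.
DESCRIPTIVE_NEXT = {"communities", "countries", "families", "households"}


def negative_hits(text):
    """Return (term, next_word) pairs for negative terms used negatively."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word not in NEGATIVE:
            continue
        next_word = words[i + 1] if i + 1 < len(words) else ""
        if word == "poor" and next_word in DESCRIPTIVE_NEXT:
            continue  # descriptive use, so do not count it as negative
        hits.append((word, next_word))
    return hits


print(negative_hits("Poor oversight hurt a program serving poor communities."))
# [('poor', 'oversight')]
```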

#fail

Negative and positive words could have been used to talk about anything

#win

Could tweak the analysis to use words only associated with other terms (USAID, mission, etc.)
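
Restricting the analysis to words associated with certain terms could be sketched as a proximity filter: only count sentiment words that fall within a few words of a target term. The targets come from the slide above; the sentiment lists, the `window` size and the function name are hypothetical.

```python
import re

TARGETS = {"usaid", "mission"}          # target terms named above
NEGATIVE = {"failed", "poor", "waste"}  # hypothetical lexicon
POSITIVE = {"effective", "improved"}    # hypothetical lexicon


def targeted_counts(text, window=5):
    """Count sentiment words within `window` words of any target term."""
    words = re.findall(r"[a-z']+", text.lower())
    target_positions = [i for i, w in enumerate(words) if w in TARGETS]
    pos = neg = 0
    for i, word in enumerate(words):
        if not any(abs(i - t) <= window for t in target_positions):
            continue  # too far from every target term
        if word in POSITIVE:
            pos += 1
        elif word in NEGATIVE:
            neg += 1
    return pos, neg


text = (
    "The mission failed to document its spending. Separately, the contractor "
    "based in a poor rural district improved local roads and schools."
)
print(targeted_counts(text))  # (0, 1)
```

Only "failed" is counted here, because "poor" and "improved" appear too far from any target term.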

#fail

Sentiment analysis isn't perfect and there's a certain level of uncertainty

#partialwin

Went with an overly conservative model to reduce the amount of potential error
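
One way to picture a conservative model: refuse to label a document unless the evidence is both plentiful and lopsided. This is a sketch under assumed thresholds, not the model used for the story; `min_hits` and `min_share` are hypothetical parameters.

```python
def conservative_label(pos_hits, neg_hits, min_hits=5, min_share=0.75):
    """Label a document only when the evidence is strong.

    min_hits:  minimum number of sentiment words before any call is made.
    min_share: share of hits one side needs in order to win.
    """
    total = pos_hits + neg_hits
    if total < min_hits:
        return "uncertain"
    if neg_hits / total >= min_share:
        return "negative"
    if pos_hits / total >= min_share:
        return "positive"
    return "uncertain"


print(conservative_label(1, 2))   # uncertain: too few hits
print(conservative_label(2, 10))  # negative: 10 of 12 hits
print(conservative_label(6, 6))   # uncertain: evenly split
```

Raising either threshold trades coverage for confidence: fewer documents get a label, but the labels that remain are harder to dispute.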

Why is this better than the other options? 

  • Negative words aren't always negative
    • "poor" 
  • Documents can have multiple subjects
    • People, places, objects, etc. 
  • There are way more than 30 positive and negative words
  • Not all portions of documents are created equal
  • It can take a hell of a lot less time
  • It's more thorough
  • It allows for tweaking while keeping the underlying model
  • It's largely reusable (with some tweaks)
