Person of Interest [WIP]

Raúl Roa

raul@tagshelf.com

Who's this guy

Lead Software Architect

WE MAKE COMPUTERS DO AMAZING THINGS...

PUT STUFF WHERE THEY BELONG

MAKE THEM UNDERSTAND

what's this talk about?

MAKING COMPUTERS UNDERSTAND PEOPLE THROUGH TEXT

DISCLAIMER

WORD VECTOR WILL BE USED A LOT!

SERIOUSLY...

A LOT!

WE'VE SEEN HOW COMPUTERS SEE 

OPERATIONS OVER DENSE ENCODED VECTORS

SEEING IS COOL, BUT HOW ABOUT READING?!

I saw a man on a hill with a telescope.

Source: nlpforhackers.io

LET'S TALK ABOUT LIMITATIONS

NLP systems traditionally treat words as discrete atomic symbols

Fancy way of saying: finite unique symbols

No useful info regarding "relationship" between words
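A minimal sketch of what "discrete atomic symbols" means in practice, assuming a one-hot encoding over a tiny made-up vocabulary (the word list and helper names are illustrative, not from the talk):

```python
def one_hot(vocab):
    """Map each word to a vector with a single 1 at that word's index."""
    index = {word: i for i, word in enumerate(vocab)}
    return {
        word: [1 if i == index[word] else 0 for i in range(len(vocab))]
        for word in vocab
    }


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical vocabulary, chosen only for illustration.
vocab = ["hill", "man", "mountain", "telescope"]
vectors = one_hot(vocab)

# Any two different words score 0, so "hill" is no closer
# to "mountain" than it is to "telescope".
print(dot(vectors["hill"], vectors["mountain"]))
print(dot(vectors["hill"], vectors["telescope"]))
print(dot(vectors["hill"], vectors["hill"]))
```

Every pair of distinct words looks equally unrelated, which is the "no useful info" problem in a nutshell.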

YES! YOU GOT IT!

CONTEXT? WHAT'S THAT? OH NO!

WE CAN STILL DO INTERESTING THINGS...

NO CONTEXT, NO PROBLEMO!

WORD FREQUENCY CAN HELP US ACHIEVE SOME INTERESTING THINGS

  • Similarity between text corpora

    • Useful for classification, clustering, etc.

  • How limited or rich the vocabulary is

    • From text, but text is written by people
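One rough way to put a number on vocabulary richness is the type-token ratio (unique words divided by total words). This is a sketch under assumptions: the tokenizer is a naive regex, the function names are made up for illustration, and the ratio is sensitive to text length, so it is only fair to compare texts of similar size.

```python
import re


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Unique words divided by total words; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


rich = "I saw a man on a hill with a telescope."
poor = "the cat and the cat and the cat"

print(type_token_ratio(rich))  # 8 unique words out of 10
print(type_token_ratio(poor))  # 3 unique words out of 8
```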

Useful stuff we can achieve with TF-IDF over a bag of words (BOW)
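To make the text similarity idea concrete, here is a small standard-library sketch of TF-IDF weighting followed by cosine similarity. The three documents are invented for illustration, the IDF uses one common smoothed variant (others exist), and a real project would more likely reach for an existing library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption; one of several common variants).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up documents, only for illustration.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
]
v = tfidf_vectors(docs)

print(cosine(v[0], v[1]))  # shared words, so clearly above 0
print(cosine(v[0], v[2]))  # no shared words, so exactly 0.0
```

No neural network involved: word counts alone are enough to tell that the first two texts are about the same thing and the third is not.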

THERE'S NO NEED FOR NEURAL NETWORKS

BUT WHAT ABOUT CONTEXT?

Word Embeddings/Sentence Embeddings

Word embeddings are one of the most popular ways to represent a document's vocabulary. They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

  • Word2vec, GloVe, etc.
  • BERT makes use of the Transformer, an attention-based architecture that learns contextual relations between words (or sub-words) in a text.
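The intuition behind embeddings is that words used in similar contexts end up with similar vectors. The sketch below is NOT Word2vec, GloVe, or BERT; it is a plain count-based stand-in (co-occurrence counts within a small window) over an invented toy corpus, with hypothetical function names, just to show the idea.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, made up for illustration.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on monday",
]
vectors = cooccurrence_vectors(sentences)

# "cat" and "dog" share most of their contexts; "stocks" shares none.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["stocks"]))
```

Unlike one-hot symbols, these vectors carry some relationship information: "cat" lands close to "dog" and far from "stocks". Learned embeddings push the same idea much further with dense vectors.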

 

Still, these approaches have their own shortcomings

  • Lexical Inference

  • Superficial Correlation

  • Adversarial Evaluation

  • Semantic Variability

This is all great but how do we map this to people?

Well, because of the Big 5!

The Big Five theory is built on common-language descriptors and proposes five broad dimensions commonly used to describe human personality and psyche.

Big Five

  • openness to experience
    • I have excellent ideas.
  • conscientiousness
    • I am always prepared.
  • extraversion
    • I am the life of the party.
  • agreeableness
    • I am interested in people.
  • neuroticism
    • I get upset easily.

Big Five

BUT!

Simpler techniques can be used to

Survey data on categorized likes and dislikes could be encoded as vectors and then compared using cosine similarity.
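A minimal sketch of that idea, assuming each survey answer is encoded as +1 (like), -1 (dislike), or 0 (no answer). The categories, respondents, and answers below are all hypothetical.

```python
import math

# Hypothetical survey categories, for illustration only.
CATEGORIES = ["music", "sports", "reading", "travel", "cooking"]


def encode(answers):
    """Encode answers as +1 (like), -1 (dislike), 0 (no answer)."""
    return [answers.get(category, 0) for category in CATEGORIES]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up respondents.
alice = encode({"music": 1, "sports": -1, "reading": 1, "travel": 1})
bob = encode({"music": 1, "sports": -1, "reading": 1, "cooking": -1})
carol = encode({"music": -1, "sports": 1, "reading": -1, "travel": -1})

print(cosine(alice, bob))    # mostly the same tastes
print(cosine(alice, carol))  # opposite on every shared answer
```

Scores near 1 mean similar tastes, scores near -1 mean opposite tastes, and scores near 0 mean little overlap either way.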

How good is this?

Let's ask Johnny V

THINGS WILL ALWAYS BE AS GOOD AS YOUR DATA!


By Raúl G. Roa Gómez
