Person of Interest [WIP]
Raúl Roa
raul@tagshelf.com

Who's this guy

Lead Software Architect
WE MAKE COMPUTERS DO AMAZING THINGS...
PUT STUFF WHERE IT BELONGS

MAKE THEM UNDERSTAND

what's this talk about?
MAKING COMPUTERS UNDERSTAND PEOPLE THROUGH TEXT
DISCLAIMER
WORD VECTOR WILL BE USED A LOT!
SERIOUSLY...
A LOT!
WE'VE SEEN HOW COMPUTERS SEE

OPERATIONS OVER DENSE ENCODED VECTORS

SEEING IS COOL, BUT HOW ABOUT READING?!
I saw a man on a hill with a telescope.


Source: nlpforhackers.io
LET'S TALK ABOUT LIMITATIONS
NLP systems traditionally treat words as discrete atomic symbols
Fancy way of saying: finite unique symbols
No useful info regarding "relationship" between words
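To make the "discrete atomic symbols" point concrete, here is a minimal sketch using one-hot vectors. The toy vocabulary and the helper names `one_hot` and `dot` are assumptions made up for illustration; they are not from any particular NLP library.

```python
# Toy vocabulary (hypothetical, for illustration only).
vocab = ["man", "hill", "telescope"]


def one_hot(word, vocab):
    """Return a one-hot vector: 1 at the word's index, 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


man = one_hot("man", vocab)
hill = one_hot("hill", vocab)

# A word only ever matches itself; every other pair scores zero.
print(dot(man, man))   # 1
print(dot(man, hill))  # 0
```

Every pair of different words gets the same score, so the encoding says nothing about how two words relate to each other.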
YES! YOU GOT IT!
CONTEXT? WHAT'S THAT? OH NO!
WE CAN STILL DO INTERESTING THINGS...
NO CONTEXT, NO PROBLEMO!
WORD FREQUENCY CAN HELP US ACHIEVE SOME INTERESTING THINGS
- Similarity between text corpora
- Useful for classification, clustering, etc.
- How limited or rich the vocabulary is
- From text, but text is written by people
- Useful stuff we can achieve with TF-IDF (BOW)
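The word-frequency idea can be sketched with nothing but the standard library. This is a minimal illustration, assuming whitespace tokenization and a smoothed IDF formula; the function names `tokenize`, `tfidf_vectors` and `cosine` and the three sample documents are hypothetical and only there for the demo.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one TF-IDF vector per document over a shared vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid dividing by zero on empty docs
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "i saw a man on a hill",
    "i saw a man with a telescope",
    "computers do amazing things",
]
vocab, vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # documents sharing several words
sim_far = cosine(vecs[0], vecs[2])    # documents sharing no words
print(sim_close > sim_far)
```

Word order is thrown away entirely here, which is exactly the "no context" trade-off of bag-of-words.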

THERE'S NO NEED FOR NEURAL NETWORKS
BUT WHAT ABOUT CONTEXT?
Word Embeddings/Sentence Embeddings
Word embeddings are one of the most popular representations of document vocabulary. They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
- Word2vec, GloVe, etc.
- BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text.
Still, these approaches have their own shortcomings
- Lexical Inference
- Superficial Correlation
- Adversarial Evaluation
- Semantic Variability
This is all great but how do we map this to people?
Well, because of the Big 5!

This theory uses descriptors of common language and therefore suggests five broad dimensions commonly used to describe the human personality and psyche.
Big Five
- openness to experience
  - I have excellent ideas.
- conscientiousness
  - I am always prepared.
- extraversion
  - I am the life of the party.
- agreeableness
  - I am interested in people.
- neuroticism
  - I get upset easily.
BUT!
Simpler techniques can be used to
Survey data on categorized likes and dislikes could be vector encoded and then tested using cosine similarity.
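A minimal sketch of that idea, assuming each survey answer is encoded as 1 (like), -1 (dislike) or 0 (no answer). The category list, the respondents and the helper names `encode` and `cosine_similarity` are all hypothetical, made up for illustration.

```python
import math

# Hypothetical survey categories (assumption, not from real survey data).
CATEGORIES = ["sports", "music", "reading", "travel", "cooking"]


def encode(answers):
    """Map answers to a vector: like -> 1, dislike -> -1, unanswered -> 0."""
    return [answers.get(category, 0) for category in CATEGORIES]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up respondents.
alice = encode({"sports": 1, "music": 1, "reading": -1})
bob = encode({"sports": 1, "music": 1, "reading": -1, "travel": 1})
carol = encode({"sports": -1, "music": -1, "reading": 1})

print(cosine_similarity(alice, bob))    # close to 1: similar tastes
print(cosine_similarity(alice, carol))  # close to -1: opposite tastes
```

Scores near 1 mean two people answered alike, scores near -1 mean they answered the opposite way, and scores near 0 mean their answers barely overlap.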
How good is this?
Let's ask Johnny V
THINGS WILL ALWAYS BE AS GOOD AS YOUR DATA!

Person of Interest [WIP]
By Raúl G. Roa Gómez