Linguistics
written language
spoken language
statistics
machine learning
Computer Science
deep learning
Field that aims to analyze natural language by computational means
... to "understand" the content of a corpus of documents
tweets
forum messages
e-mails
articles
books
...
surveys
meetings
songs
speeches
recipes
conversations
Text & speech processing
Natural Language Understanding
Lexical semantics
Syntactic analysis
Relational semantics
Discourse
Natural Language Generation
Morphological analysis
Natural Language Understanding
Natural Language Generation
Automatic Summarization
Dialogue management
Grammatical error correction
Machine translation
Question answering
Music generation
Text generation
Voice synthesis
Text & speech processing
Optical Character Recognition
Speech recognition
Speech segmentation
Text-to-speech
Speech-to-text
Word segmentation
Morphological analysis
Lemmatization
Morphological segmentation
Part of speech tagging
Stemming
Syntactic analysis
Grammar induction
Sentence parsing (parse trees)
Sentence breaking
Lexical semantics
Distributional semantics
Sentiment analysis
Named entity recognition
Terminology extraction
Word sense disambiguation
Entity linking
Relational semantics
Relationship extraction
Semantic parsing
Discourse
Coreference resolution
Argument mining
Topic segmentation
Implicit semantic role labelling
Recognizing textual entailment
Discourse analysis
Textual data
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus ac nunc tristique, maximus enim eget, sodales neque.
text 1
Aenean lobortis ornare diam, nec pellentesque odio viverra at.
text 2
Corpus
Numerical data
variables
...
...
observations
1. Segmenting the documents into tokens (N-grams)
2. Counting how often each token appears in each document
3. Applying optional normalization
Steps:
Transform a collection of documents into a table of numerical data
(without requiring knowledge of the domain studied)
Result:
frequency vector of tokens
...
...
documents
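The three steps above can be sketched in plain Python (a minimal illustration with a whitespace tokenizer and two made-up toy documents; real pipelines use more careful tokenization):

```python
# Minimal bag-of-words: tokenize, count, normalize.
from collections import Counter

docs = ["text 1 example text", "another text example"]

# 1. Segment each document into tokens (1-grams here)
tokenized = [doc.lower().split() for doc in docs]

# 2. Count token frequencies per document
counts = [Counter(tokens) for tokens in tokenized]

# 3. Optional normalization: divide by document length
normalized = [
    {tok: n / len(tokens) for tok, n in c.items()}
    for c, tokens in zip(counts, tokenized)
]

print(counts[0])      # 'text' is counted twice in the first document
print(normalized[0])  # frequencies relative to document length
```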
1. Imaging databases can get huge
4 documents (Coelho & Richert, 2015)
2. Most imaging databases save images permanently
3. Imaging databases store images
4. Imaging databases store images. Imaging databases store images. Imaging databases store images
Counting the frequency of tokens (ngram = 1)

| documents | imaging | databases | can | get | huge | most | save | images | permanently | store |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 |
Some words are not intrinsically meaningful
removal of stop-words
| documents | imaging | databases | can | get | huge | most | save | images | permanently | store |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 |
Some words have the same semantic content
stemming & lemmatization
Term Frequency - Inverse Document Frequency (TF-IDF)
The raw count does not take into account the overall frequency of a token across the corpus
The context of words and phrases is not encoded!
Its dimensionality is often very high
The bag-of-words data table is special:
It contains many zeros
dimensionality reduction methods may be needed afterwards
use sparse-vector representation to save memory
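The sparsity point can be illustrated with SciPy's sparse matrices (a sketch; NumPy and SciPy availability is assumed):

```python
# Sparse storage: only the non-zero counts are kept in memory.
import numpy as np
from scipy.sparse import csr_matrix

# Dense bag-of-words table: mostly zeros
dense = np.array([
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 1],
])

sparse = csr_matrix(dense)  # compressed sparse row format
print(sparse.nnz, "non-zeros out of", dense.size, "cells stored")
```

On a realistic corpus with tens of thousands of token columns, the fraction of stored cells is far smaller than in this toy table.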
dictionary of common words
#$%&'{}*+
...
the
your
is
...
reduce words sharing the same morphological stem to that stem
playing, played, plays
play
reduce words sharing the same meaning to a common semantic form (lemma)
am, are, is
be
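A toy illustration of the difference (the suffix-stripping stemmer below is a naive sketch written for this example, not a real algorithm such as Porter's, and the lemma table is a hand-made sample):

```python
# Naive stemming: strip common suffixes to reach a shared stem.
def naive_stem(word):
    for suffix in ("ing", "ed", "s"):
        # keep at least a few characters so short words survive intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays"]])  # all -> 'play'

# Lemmatization needs linguistic knowledge, e.g. a lookup table:
lemmas = {"am": "be", "are": "be", "is": "be"}
print([lemmas.get(w, w) for w in ["am", "are", "is"]])          # all -> 'be'
```

Note that no surface-level suffix rule could map "am", "are" and "is" to "be"; this is precisely what separates lemmatization from stemming.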
Term Frequency
(TF)
Term Frequency - Inverse Document Frequency
(TF-IDF)
Creation of vectors of tokens capable of encoding the meaning contained in the documents
We count the frequency of appearance of the tokens
1-gram
2-gram
...
Tokens
We divide this frequency by the number of documents in which the token appears
databases
imaging databases
| documents | imaging | databases | ... |
|---|---|---|---|
| 1 | 1 | 1 | |
| 2 | 1 | 1 | |
| 3 | 1 | 1 | |
| 4 | 3 | 3 | |
| documents | imaging | databases | ... |
|---|---|---|---|
| 1 | 1/4 | 1/4 | |
| 2 | 1/4 | 1/4 | |
| 3 | 1/4 | 1/4 | |
| 4 | 3/4 | 3/4 | |
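The division step can be written out directly. This is a plain-Python sketch of the simplified TF-IDF used on the slide (token count in a document divided by the number of documents containing that token), on pre-tokenized versions of the four example documents:

```python
# Simplified TF-IDF: term count divided by document frequency.
docs = [
    ["imaging", "databases", "can", "get", "huge"],
    ["most", "imaging", "databases", "save", "images", "permanently"],
    ["imaging", "databases", "store", "images"],
    ["imaging", "databases", "store", "images"] * 3,  # tripled document
]

def doc_frequency(token, docs):
    """Number of documents in which the token appears at least once."""
    return sum(token in doc for doc in docs)

def tfidf(token, doc, docs):
    return doc.count(token) / doc_frequency(token, docs)

print(tfidf("imaging", docs[0], docs))  # 1 / 4 = 0.25
print(tfidf("store", docs[2], docs))    # 1 / 2 = 0.5
```

"imaging" occurs in all four documents, so its weight is damped everywhere; "store" occurs in only two documents, so its occurrences weigh more.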
Word embedding
The vectors are created by models that take into account, for each token, the context of the surrounding tokens
The semantic proximity between two words is captured by the distance between the two vectors encoded this way
Output property
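This "distance captures semantic proximity" property can be sketched with cosine similarity on hand-made toy vectors (the 3-dimensional vectors below are illustrative values, not real trained embeddings, which typically have hundreds of dimensions):

```python
# Cosine similarity between toy embedding vectors: vectors pointing in
# similar directions stand for (assumed) similar meanings.
import numpy as np

embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),  # toy values, not a trained model
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low
```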