Boolean Retrieval
Untitled TechShare 2021/03/18
Agenda
- Information Retrieval Problem
- Inverted Index
- Boolean Queries
- General Problems with Boolean Search & Extended Boolean Model
Information Retrieval Problem
What's "Information Retrieval"?
- in everyday life
- in the academic field

finding material (usually documents)
of an unstructured nature (usually text)
that satisfies an information need
from within large collections (usually stored on computers)
Consider the Following Scenario
- Which plays of Shakespeare contain the words Brutus and Caesar but not Calpurnia? (Brutus AND Caesar AND NOT Calpurnia)
Linear Scan?
- fast processing
- flexible matching
- ranked retrieval
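A linear scan over the scenario above could look roughly like the sketch below. The three tiny documents and the name `linear_scan` are made up for illustration; the point is only that every query reads every document from start to finish.

```python
def linear_scan(docs, required, excluded):
    """Return titles of documents containing all `required` and none of `excluded` words."""
    hits = []
    for title, text in docs.items():
        # Every document is read in full for every query: O(total text size).
        words = set(text.lower().split())
        has_all = all(word in words for word in required)
        has_none = not any(word in words for word in excluded)
        if has_all and has_none:
            hits.append(title)
    return hits


# Hypothetical toy collection, not real Shakespeare text.
docs = {
    "play_a": "brutus stabs caesar",
    "play_b": "calpurnia warns caesar about brutus",
    "play_c": "caesar returns to rome",
}

hits = linear_scan(docs, required=["brutus", "caesar"], excluded=["calpurnia"])
print(hits)
```

This works for a handful of plays, but it gets slow on large collections and offers no flexible matching and no ranking.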
Index

binary term-document incidence matrix
Brutus AND Caesar AND NOT Calpurnia

110100 AND 110111 AND 101111 = 100100
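The bit-vector query above can be reproduced with integer bitwise operators. This is a minimal sketch: the three row vectors come from the slide (Calpurnia's row is the complement of 101111), while the column order of the plays is an assumption taken from the usual textbook version of this example.

```python
# Column order is an assumption (textbook example), not stated on the slide.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One row of the term-document incidence matrix per term (leftmost bit = first play).
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

WIDTH = len(PLAYS)
MASK = (1 << WIDTH) - 1  # keeps the complement within six bits


def complement(vector):
    """Bitwise NOT restricted to WIDTH bits."""
    return ~vector & MASK


result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & complement(INCIDENCE["Calpurnia"])

# Map the set bits back to play titles.
matching_plays = [
    play for pos, play in enumerate(PLAYS)
    if (result >> (WIDTH - 1 - pos)) & 1
]

print(format(result, "06b"))
print(matching_plays)
```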
What's Its Major Flaw?

Observation: Few Non-zero Entries


Inverted Index
Inverted Index
a data structure mapping each term to the documents that contain it

How to Build
- Collect the documents
- Tokenize the text
- Do linguistic preprocessing
- Create an inverted index
Tokenize the text
- document ➜ a list of tokens

Just chop on whitespace and throw away punctuation characters?
Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.


language-specific ➜ Language identification
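To see why "chop on whitespace and throw away punctuation" is not enough, here is a small sketch of that naive approach applied to the example sentence. The function name `naive_tokenize` is made up for this illustration.

```python
import string

SENTENCE = (
    "Mr. O'Neill thinks that the boys' stories about "
    "Chile's capital aren't amusing."
)


def naive_tokenize(text):
    """Delete all ASCII punctuation, then split on whitespace."""
    without_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return without_punctuation.split()


tokens = naive_tokenize(SENTENCE)
print(tokens)
```

The apostrophes are where it goes wrong: `O'Neill` becomes `ONeill`, `aren't` becomes `arent`, and `Chile's` becomes `Chiles`. Whether those should be one token, two tokens, or something else depends on the language, which is why language identification comes first.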
Tokenize the text
- drop common words (stop words), or not?
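A rough sketch of the trade-off, assuming a tiny hand-picked stop list (`STOP_WORDS` here is hypothetical, not a standard list):

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "be", "not", "of", "or", "the", "to"})


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


kept = remove_stop_words("the boys stories about the capital".split())
phrase = remove_stop_words("to be or not to be".split())
print(kept)
print(phrase)
```

Dropping stop words shrinks the index, but a phrase made up entirely of common words disappears completely, so the "or not" is a real design decision.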

Do linguistic preprocessing
- Equivalence classes
- Case-folding
- Stemming
- Lemmatization
- Equivalence classes
- Case-folding
Dog ➜ dog

Do linguistic preprocessing
- Lemmatization
thought ➤ think
studies ➤ study
better ➤ good
- Stemming
catty ➤ cat
studies ➤ studi
the boy's cars are different colors
➤ the boy car be differ color
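The difference between the three normalizations can be sketched with toy code. Everything here is an assumption for illustration: `LEMMAS` is a hand-written lookup holding only the slide's examples, and `crude_stem` is a two-rule suffix stripper, not a real stemmer such as Porter's.

```python
# Hand-written lookup containing only the lemmatization examples from the slide.
LEMMAS = {"thought": "think", "studies": "study", "better": "good"}


def case_fold(token):
    """Case-folding: reduce every letter to lower case."""
    return token.lower()


def lemmatize(token):
    """Dictionary-based lemmatization: return the base form if it is known."""
    return LEMMAS.get(token, token)


def crude_stem(token):
    """Toy stemming: chop a suffix by rule, with no vocabulary knowledge."""
    for suffix, replacement in (("ies", "i"), ("s", "")):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + replacement
    return token


print(case_fold("Dog"))
print(lemmatize(case_fold("Studies")), crude_stem(case_fold("Studies")))
```

Lemmatization needs vocabulary knowledge and returns a real word (`study`), while stemming only applies string rules and may return a non-word (`studi`).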
Do linguistic preprocessing
Create an Inverted Index

each document has a unique serial number
➜ docID


Create an Inverted Index
- sort

Create an Inverted Index
- same term from the same document ➜ merged.
- same term ➜ grouped
- dictionary + postings lists

- dictionary ➜ memory
- postings lists ➜ disk
- document frequency: what is it for?
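The sort / merge / group steps above can be sketched in a few lines. This is a minimal in-memory version under simple assumptions: tokens are produced by lower-casing and splitting on whitespace, and the three documents are made up. A real system would keep the dictionary in memory and the postings lists on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build {term: sorted list of docIDs} from {docID: text}."""
    # 1. Collect (term, docID) pairs; the set merges repeats of a term within one document.
    pairs = {
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    }
    # 2. Sort by term, then by docID.
    # 3. Group pairs with the same term into one postings list.
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)


# Hypothetical toy collection; docIDs are the unique serial numbers.
docs = {
    1: "brutus killed caesar",
    2: "caesar was ambitious brutus said caesar",
    3: "calpurnia warned caesar",
}

index = build_inverted_index(docs)
# Document frequency = length of the postings list.
doc_freq = {term: len(postings) for term, postings in index.items()}

print(index["caesar"], doc_freq["caesar"])
```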
Summary
- Collect the documents
- Tokenize the text
  - more than removing whitespace and punctuation
- Do linguistic preprocessing
  - Equivalence classes
  - Case-folding
  - Stemming
  - Lemmatization
- Create an inverted index
  - sort/merge/group
  - dictionary + postings list
Boolean Queries
Conjunctive Query
Brutus AND Calpurnia

Conjunctive Query
- Locate Brutus in the Dictionary
- Retrieve its postings
- Locate Calpurnia in the Dictionary
- Retrieve its postings
- Intersect the two postings lists (merging)
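The intersection step is a merge over two sorted lists. A sketch, assuming each postings list is sorted by docID; the docIDs below are illustrative and not taken from the slides.

```python
def intersect(postings_a, postings_b):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # docID is in both lists: keep it and advance both pointers.
            answer.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


# Illustrative postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]

common = intersect(brutus, calpurnia)
print(common)
```

Each pointer only moves forward, so the cost is linear in the combined length of the two lists. This is why the postings must be kept sorted.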

Heuristic
- process terms in order of increasing document frequency
(eyes AND trees) AND kaleidoscope

➜ (kaleidoscope AND eyes) AND trees
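The reordering above can be sketched as: sort the query terms by postings-list length, then intersect from the rarest term onward so intermediate results stay small. The postings in `toy_index` are hypothetical; only the relative ordering (kaleidoscope rarest, trees most common) is assumed.

```python
def intersect(postings_a, postings_b):
    """Merge-style intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            answer.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return answer


def processing_order(index, terms):
    """Order terms by increasing document frequency (postings-list length)."""
    return sorted(terms, key=lambda term: len(index.get(term, [])))


def conjunctive_query(index, terms):
    """AND together all terms, starting with the rarest."""
    ordered = processing_order(index, terms)
    if not ordered:
        return []
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # nothing left to match; skip the remaining lists
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical postings lists.
toy_index = {
    "eyes": [1, 2, 3, 5, 8, 9],
    "trees": [1, 2, 3, 4, 5, 6, 8, 9, 10],
    "kaleidoscope": [2, 8],
}

order = processing_order(toy_index, ["eyes", "trees", "kaleidoscope"])
answer = conjunctive_query(toy_index, ["eyes", "trees", "kaleidoscope"])
print(order, answer)
```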
General Problems with Boolean Search & Extended Boolean Model
General Problems with Boolean Search
- AND operators
- high precision but low recall searches
- OR operators
- low precision but high recall searches
difficult or impossible to find a satisfactory middle ground
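A toy calculation makes the AND/OR trade-off concrete. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved. The postings sets and the set of relevant documents below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical docID sets for two query terms and the truly relevant documents.
postings_a = {1, 2, 5, 6, 7}
postings_b = {1, 3, 4, 8, 9, 10}
relevant = {1, 2, 3, 4}

and_scores = precision_recall(postings_a & postings_b, relevant)  # a AND b
or_scores = precision_recall(postings_a | postings_b, relevant)   # a OR b

print("AND:", and_scores)
print("OR: ", or_scores)
```

In this toy case AND returns one document, which is relevant (precision 1.0, recall 0.25), while OR returns all ten (precision 0.4, recall 1.0). Pure Boolean operators offer nothing in between.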
Extended Boolean Model
- tolerant to spelling mistakes
- search for compounds or phrases
- rank
- term frequency information
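One way the ranking and term-frequency points could fit together, as a sketch only: keep the Boolean AND as a filter, then order the matches by how often the query terms occur. The scoring (a plain sum of term counts) and the toy documents are assumptions, not a description of any particular extended Boolean model.

```python
from collections import Counter


def rank_by_term_frequency(docs, query_terms):
    """Return docIDs matching all query terms, highest total term frequency first."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        if all(counts[term] > 0 for term in query_terms):  # Boolean AND filter
            scores[doc_id] = sum(counts[term] for term in query_terms)
    # Sort by descending score; ties are broken by docID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical toy collection.
docs = {
    1: "caesar praised brutus",
    2: "brutus brutus caesar caesar caesar",
    3: "calpurnia warned caesar",
}

ranking = rank_by_term_frequency(docs, ["brutus", "caesar"])
print(ranking)
```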
Summary
- Information Retrieval Problem
- Inverted Index
- Boolean Queries
- General Problems with Boolean Search & Extended Boolean Model
By hsutzu
Untitled TechShare 2021/03/11