Boolean Retrieval

Untitled TechShare 2021/03/18

Agenda

  • Information Retrieval Problem
  • Inverted Index
  • Boolean Queries
  • General Problems with Boolean Search & Extended Boolean Model

Information Retrieval Problem

What's "Information Retrieval"?

  • in everyday life

  • in academia

finding material (usually documents)

of an unstructured nature (usually text)

that satisfies an information need

from within large collections (usually stored on computers)

Consider the Following Scenario

  • Which plays of Shakespeare satisfy the query Brutus AND Caesar AND NOT Calpurnia?

Linear Scan? Not enough when we need:

  • fast processing of large collections
  • flexible matching
  • ranked retrieval

Index

binary term-document incidence matrix

Brutus AND Caesar AND NOT Calpurnia

110100 AND 110111 AND 101111 = 100100
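The bit-vector query above can be sketched in a few lines. The play titles and their order are an assumption borrowed from the classic textbook example (they do not appear on the slide); the three vectors match the ones shown, with Calpurnia's vector recovered as the complement of 101111.

```python
# Illustrative data: one bit per play, leftmost bit = first play.
# Titles and order are assumed from the textbook example.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

ALL_DOCS = (1 << len(PLAYS)) - 1  # mask with one bit per play: 0b111111


def bits_to_plays(bits):
    """Translate a bit vector back into play titles."""
    width = len(PLAYS)
    return [PLAYS[i] for i in range(width) if bits & (1 << (width - 1 - i))]


# NOT is a complement restricted to the size of the collection.
not_calpurnia = ~INCIDENCE["Calpurnia"] & ALL_DOCS
answer = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia

print(format(answer, "06b"), bits_to_plays(answer))
```

Under these assumptions the two set bits correspond to "Antony and Cleopatra" and "Hamlet".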

What's Its Major Flaw?

Observation: the matrix is sparse (few non-zero entries)

Inverted Index

a data structure that maps each term to the documents containing it

How to Build

  1. Collect the documents
  2. Tokenize the text
  3. Do linguistic preprocessing
  4. Create an inverted index

Tokenize the text

  • document ➜ a list of tokens

Just chop on whitespace and throw away punctuation characters?

Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.

tokenization is language-specific ➜ language identification comes first
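To see why the naive approach falls short, here is a minimal sketch that does exactly what the question proposes: split on whitespace and delete punctuation. The function name `naive_tokenize` is made up for this illustration.

```python
import string

SENTENCE = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")


def naive_tokenize(text):
    """Split on whitespace, then delete every ASCII punctuation character."""
    table = str.maketrans("", "", string.punctuation)
    tokens = (word.translate(table) for word in text.split())
    # Drop tokens that consisted of punctuation only.
    return [tok for tok in tokens if tok]


tokens = naive_tokenize(SENTENCE)
print(tokens)
```

The result contains tokens such as `ONeill`, `Chiles`, and `arent`, which are unlikely to match what a user would type in a query.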

Tokenize the text

  • drop common words (stop words), or keep them?

Do linguistic preprocessing

  • Equivalence classes
  • Case-folding
  • Stemming
  • Lemmatization

  • Case-folding

Dog ➜ dog
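A minimal sketch of equivalence classing plus case-folding. It assumes one common convention, deleting periods and hyphens, so that variants of the same word collapse to one form; the function name `normalize` and the sample tokens other than "Dog" are illustrative.

```python
def normalize(token):
    """Map a token to a canonical form (its equivalence class)."""
    # Equivalence classing by deleting characters: periods and hyphens.
    for ch in ".-":
        token = token.replace(ch, "")
    # Case-folding: reduce everything to lower case.
    return token.lower()


for raw in ["Dog", "U.S.A.", "anti-discriminatory"]:
    print(raw, "->", normalize(raw))
```

The same function must be applied to documents at index time and to queries at search time, otherwise the forms will not match.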

Do linguistic preprocessing

  • Lemmatization

thought ➤ think

studies ➤ study

better ➤ good

  • Stemming

catty ➤ cat

studies ➤ studi

the boy's cars are different colors ➤ the boy car be differ color

Create an Inverted Index

each document has a unique serial number

➜ docID

  • sort the (term, docID) pairs by term, then by docID
  • same term from the same document ➜ merged
  • same term ➜ grouped
  • dictionary + postings lists
  • dictionary ➜ memory
  • postings lists ➜ disk
  • document frequency (length of each postings list): what is it for?
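The sort / merge / group steps can be sketched as follows. This is a toy, in-memory version: the dictionary and the postings lists live in one Python dict, whereas a real system would keep postings on disk. The function name and the two sample documents are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build {term: (document frequency, sorted postings list)}.

    `docs` maps docID -> list of already normalized tokens.
    """
    # 1. Emit one (term, docID) pair per token occurrence.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]
    # 2. Sort by term, then by docID.
    pairs.sort()
    # 3. Merge duplicates and group by term into postings lists.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    # 4. Store the document frequency (postings length) next to each term.
    return {term: (len(plist), plist) for term, plist in postings.items()}


DOCS = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}
index = build_inverted_index(DOCS)
print(index["caesar"], index["killed"])
```

Keeping the document frequency in the dictionary means a query processor can compare postings sizes without reading the postings themselves.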

Summary

  1. Collect the documents
  2. Tokenize the text
    • more than removing whitespace and punctuation
  3. Do linguistic preprocessing
    • Equivalence classes
    • Case-folding
    • Stemming
    • Lemmatization
  4. Create an inverted index
    • sort/merge/group
    • dictionary + postings list

Boolean Queries

Conjunctive Query

Brutus AND Calpurnia

  1. Locate Brutus in the Dictionary
  2. Retrieve its postings
  3. Locate Calpurnia in the Dictionary
  4. Retrieve its postings
  5. Intersect the two postings lists (merging)
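Step 5 can be sketched as a two-pointer merge, assuming both postings lists are sorted by docID. The docIDs below are sample values chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer sitting on the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
result = intersect(brutus, calpurnia)
print(result)
```

Each pointer only moves forward, which is why the postings must be kept sorted when the index is built.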

Heuristic: process terms in order of increasing document frequency

(eyes AND trees) AND kaleidoscope

➜ (kaleidoscope AND eyes) AND trees, assuming kaleidoscope is the rarest term and trees the most common
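A minimal sketch of the ordering heuristic: sort the query terms by postings length (document frequency) and intersect the shortest lists first, so intermediate results stay small. The toy postings are invented; only their relative sizes matter.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(terms, index):
    """AND together all `terms`, rarest term first."""
    if not terms:
        return [], []
    # Postings length equals document frequency.
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # nothing left to match, stop early
        result = intersect(result, index[term])
    return ordered, result


# Hypothetical postings lists, sized so that kaleidoscope is the rarest term.
INDEX = {
    "eyes": [1, 3, 5, 7, 9],
    "trees": [1, 2, 3, 4, 5, 6, 7, 8],
    "kaleidoscope": [3, 7],
}
order, hits = intersect_many(["eyes", "trees", "kaleidoscope"], INDEX)
print(order, hits)
```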

General Problems with Boolean Search & Extended Boolean Model

General Problems with Boolean Search

  • AND operators
    • high precision but low recall searches
  • OR operators
    • low precision but high recall searches

difficult or impossible to find a satisfactory middle ground
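The trade-off can be made concrete with the usual definitions: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved. The docID sets below are invented to mimic a strict AND query and a loose OR query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


RELEVANT = {1, 2, 3, 4}                    # hypothetical ground truth
and_result = {1}                           # strict AND query: few results
or_result = {1, 2, 3, 4, 5, 6, 7, 8}       # loose OR query: many results

print("AND:", precision_recall(and_result, RELEVANT))
print("OR: ", precision_recall(or_result, RELEVANT))
```

With these made-up sets the AND query is fully precise but finds only a quarter of the relevant documents, while the OR query finds all of them at half the precision.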

Extended Boolean Model

  • tolerant of spelling mistakes
  • search for compounds or phrases
  • rank results
    • using term frequency information
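As a rough illustration of ranking with term frequency, the sketch below scores each document by the total number of query-term occurrences. This is the simplest possible scoring and an assumption of this example, not the scoring of any particular extended Boolean model; the documents are invented.

```python
from collections import Counter


def rank_by_term_frequency(query_terms, docs):
    """Score each document by how often the query terms occur in it."""
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by docID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


DOCS = {
    1: "brutus killed caesar".split(),
    2: "caesar caesar was ambitious said brutus".split(),
    3: "the tempest".split(),
}
ranking = rank_by_term_frequency(["brutus", "caesar"], DOCS)
print(ranking)
```

Unlike a pure Boolean result set, the output is ordered, so the user sees the most likely matches first.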

Summary

  • Information Retrieval Problem
  • Inverted Index
  • Boolean Queries
  • General Problems with Boolean Search & Extended Boolean Model

Boolean Retrieval

By hsutzu

Untitled TechShare 2021/03/11
