Building a Modern Enterprise Search Platform: Introduction & Concepts

Han Yi

2018.06.07

Three major enterprise data-oriented systems

OLTP system, mostly based on a relational database: Oracle, MySQL, etc.
Core search engine, commonly Elasticsearch/Solr
OLAP system, such as the Hadoop ecosystem

General vs Enterprise purpose search engine

Data source types: Web, image, document, multimedia, database, log, data from other commercial systems
Data retrieval: Public / Internal
Data update frequency: Near real time
Data integrity: No / High
Sort: Relevancy, time, business attributes
Allow SEO: Yes / No
Result control: Fully automatic / Strongly customized
Index structure: Web / Varies with business requirements

Methodology for designing & building a modern eCommerce search engine

Business driven

Need to deeply understand Business Vision, Domain Knowledge, and Technology of Information Retrieval

Continuous feedback

Build a better understanding of the source data and of the user's search intent
Keep collecting user feedback and react fast

Challenges

Relational database data model
Index model needs to be flexible enough to satisfy query requirements
Advanced search
API integration
Performance
Endless expectations for conversion rate

Information Retrieval Basics

Target data types
- Structured data
- Unstructured data
- Semistructured data

Information Retrieval Basics

Basic query types: Boolean Query
- Linear searching: grep (Global Regular Expression Print)
- Non-linear searching: Incidence Matrix
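A minimal sketch of Boolean retrieval over a term-document incidence matrix. The three documents and the query are made up for illustration; each term's row is stored as an integer bit vector so that AND / NOT become bitwise operations.

```python
# Toy collection; the documents and the query are hypothetical examples.
docs = {
    1: "fast search engine for enterprise data",
    2: "relational database for enterprise transactions",
    3: "search engine with relational data source",
}

doc_ids = sorted(docs)
vocabulary = sorted({term for text in docs.values() for term in text.split()})

# incidence[term] is a bit vector: bit i is set when doc_ids[i] contains the term.
incidence = {term: 0 for term in vocabulary}
for position, doc_id in enumerate(doc_ids):
    for term in set(docs[doc_id].split()):
        incidence[term] |= 1 << position


def matching_docs(bits):
    """Translate a result bit vector back into document IDs."""
    return [doc_id for position, doc_id in enumerate(doc_ids) if (bits >> position) & 1]


all_docs = (1 << len(doc_ids)) - 1  # bit mask with one bit per document

# Query: search AND engine AND NOT relational
result = incidence["search"] & incidence["engine"] & (all_docs & ~incidence["relational"])
print(matching_docs(result))  # prints [1]
```

Unlike grep, the collection is scanned once at build time; each query then costs only a few bitwise operations.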

Information Retrieval Basics

Advanced query: Ad hoc Query
- One-time query expressing a user's information need
- Information need != query

Information Retrieval Basics

Effectiveness Assessment
- Precision
- Recall
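As a quick illustration, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the sample document IDs below are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents returned, 3 documents are actually relevant, 2 of them were found.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, round(r, 3))  # prints 0.5 0.667
```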

Information Retrieval Basics

Inverted Index
- Dictionary
- Posting list (Inverted List)
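A minimal in-memory sketch of the two parts: a dictionary that maps each term to its postings list (sorted document IDs), plus the classic merge that answers an AND query. The sample documents are made up, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict

# Hypothetical toy collection.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}


def build_inverted_index(docs):
    """Dictionary: term -> postings list of document IDs in ascending order."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both (AND)."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


index = build_inverted_index(docs)
print(index["july"])                            # prints [2, 3]
print(intersect(index["home"], index["july"]))  # prints [2, 3]
```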

Information Retrieval Basics

Building Inverted Index
- Index data structure
- Dictionary data structure
- Indexing algorithm
- Distributed indexing
- Dynamic indexing
- Index Compression

Information Retrieval Basics

Tolerant Retrieval
- Wildcard query
- Spelling correction
- Phonetic correction

Information Retrieval Basics

Relevancy & Score
Term Weighting
Vector Space Model

Information Retrieval Basics

Evaluation
- measurement
- test collections
- unranked result
- ranked result
- user utility

Information Retrieval Basics: Building Index

Steps to create Inverted Index
- Raw document analysis
- Tokenization (language-dependent; can also be used for language identification)
- Linguistic preprocessing
- Physical index building

Information Retrieval Basics: Raw Doc Analysis

Raw document analysis
- Raw document collection
- Char sequence generation
- Indexing granularity
  - Precision vs Recall
  - Business specific requirement

Information Retrieval Basics: Token Analysis

Tokenization
- Token
- Type
- Term
Stop word deletion
- Stop word list extraction from collection frequency
Token normalization
- Equivalence, case-folding, language specific
Stemming & lemmatization (resolve inflection)
- Stem vs Root
- Pragmatic analysis
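The analysis chain above can be sketched roughly as follows. This is a toy pipeline under stated assumptions: the regular expression only handles ASCII letters and digits, the stop word list is simply the `top_n` most frequent tokens of the collection, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into tokens and case-fold them (a simple normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stop_words_by_collection_frequency(token_lists, top_n):
    """Pick the top_n most frequent tokens of the whole collection as stop words."""
    cf = Counter(token for tokens in token_lists for token in tokens)
    return {term for term, _ in cf.most_common(top_n)}


def crude_stem(token):
    """Strip a few English suffixes; NOT a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, stop_words):
    """Tokens -> stop word deletion -> stemming; the result is the list of terms."""
    return [crude_stem(token) for token in tokenize(text) if token not in stop_words]


collection = ["The user searches the documents", "Searching the document"]
stop_words = stop_words_by_collection_frequency([tokenize(d) for d in collection], top_n=1)
print(stop_words)                          # prints {'the'}
print(analyze(collection[0], stop_words))  # prints ['user', 'search', 'document']
print(analyze(collection[1], stop_words))  # prints ['search', 'document']
```

After analysis both documents share the terms "search" and "document", so a query for either surface form can match both.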

Information Retrieval Basics: Index Data Structure

Example 1: Skip pointers for fast postings list intersection (Boolean query)
- Skip step of sqrt(P), where P is the length of the postings list
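A sketch of intersection with skip pointers, assuming evenly spaced skips of about sqrt(P) entries. The skip pointers are implicit here (computed from list positions) rather than stored, and the sample postings lists are made up.

```python
import math


def has_skip(index, step, length):
    """A position carries a skip pointer if it is a multiple of step and the target exists."""
    return index % step == 0 and index + step < length


def advance(postings, index, step, target):
    """Move forward toward target, following skip pointers while they do not overshoot."""
    if has_skip(index, step, len(postings)) and postings[index + step] <= target:
        while has_skip(index, step, len(postings)) and postings[index + step] <= target:
            index += step
        return index
    return index + 1


def intersect_with_skips(p1, p2):
    """AND of two sorted postings lists, using sqrt(P)-spaced skip pointers."""
    step1 = max(1, math.isqrt(len(p1)))
    step2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, step1, p2[j])
        else:
            j = advance(p2, j, step2, p1[i])
    return answer


p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 43, 51, 60, 71]
print(intersect_with_skips(p1, p2))  # prints [2, 8, 43]
```

A skip is taken only when its target is still less than or equal to the other list's current ID, so no common document can be jumped over.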

Example 2: Positional index for fast phrase search
- Phrase queries commonly have fewer than 4 words

angels: 2: {36,174,252,651}; 4: {12,22,102,432}; 7: {17}; 
fools: 2: {1,17,74,222}; 4: {8,78,108,458}; 7: {3,13,23,193}; 
fear: 2: {87,704,722,901}; 4: {13,43,113,433}; 7: {18,328,528}; 
in: 2: {3,37,76,444,851}; 4: {10,20,110,470,500}; 7: {5,15,25,195}; 
rush: 2: {2,66,194,321,702}; 4: {9,69,149,429,569}; 7: {4,14,404}; 
to: 2: {47,86,234,999}; 4: {14,24,774,944}; 7: {199,319,599,709}; 
tread: 2: {57,94,333}; 4: {15,35,155}; 7: {20,320}; 
where: 2: {67,124,393,1001}; 4: {11,41,101,421,431}; 7: {16,36,736};
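To make the phrase lookup concrete, here is a sketch that loads the positional index listed above and finds phrase matches by checking for consecutive positions. The function name `phrase_search` is illustrative; a production engine would merge the position lists instead of doing membership tests.

```python
# term -> {docID: [positions]}, copied from the listing above.
positional_index = {
    "angels": {2: [36, 174, 252, 651], 4: [12, 22, 102, 432], 7: [17]},
    "fools": {2: [1, 17, 74, 222], 4: [8, 78, 108, 458], 7: [3, 13, 23, 193]},
    "fear": {2: [87, 704, 722, 901], 4: [13, 43, 113, 433], 7: [18, 328, 528]},
    "in": {2: [3, 37, 76, 444, 851], 4: [10, 20, 110, 470, 500], 7: [5, 15, 25, 195]},
    "rush": {2: [2, 66, 194, 321, 702], 4: [9, 69, 149, 429, 569], 7: [4, 14, 404]},
    "to": {2: [47, 86, 234, 999], 4: [14, 24, 774, 944], 7: [199, 319, 599, 709]},
    "tread": {2: [57, 94, 333], 4: [15, 35, 155], 7: [20, 320]},
    "where": {2: [67, 124, 393, 1001], 4: [11, 41, 101, 421, 431], 7: [16, 36, 736]},
}


def phrase_search(index, phrase):
    """Return {docID: [start positions]} where the terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return {}
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = {}
    for doc_id in sorted(candidates):
        positions = [set(index[term][doc_id]) for term in terms]
        starts = [
            start
            for start in index[terms[0]][doc_id]
            if all(start + offset in positions[offset] for offset in range(len(terms)))
        ]
        if starts:
            hits[doc_id] = starts
    return hits


print(phrase_search(positional_index, "fools rush in"))
# prints {2: [1], 4: [8], 7: [3, 13]}
print(phrase_search(positional_index, "fools rush in where angels fear to tread"))
# prints {4: [8]}
```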

Information Retrieval Basics: Index Construction

Blocked sort-based Indexing
- First pass to create termID-docID pairs
- Second pass to sort pairs based on termID and docID
- Final pass to organize docIDs for each term into a posting list
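The three passes can be sketched as below. This is a simplified in-memory illustration: real blocked sort-based indexing writes each sorted block to disk and merges the files, whereas here the sorted runs are plain Python lists and `block_size` is an arbitrary small number.

```python
import heapq
from itertools import groupby


def bsbi_index(documents, block_size):
    """documents: iterable of (doc_id, text). Returns term -> sorted docID list."""
    term_ids = {}  # term -> termID, assigned on first sight
    runs = []      # sorted blocks of (termID, docID) pairs
    block = []
    for doc_id, text in documents:
        for term in text.lower().split():
            term_id = term_ids.setdefault(term, len(term_ids))
            block.append((term_id, doc_id))       # pass 1: collect pairs
            if len(block) == block_size:
                runs.append(sorted(set(block)))   # pass 2: sort the full block
                block = []
    if block:
        runs.append(sorted(set(block)))

    # Final pass: merge the sorted runs and group docIDs per term.
    id_to_term = {term_id: term for term, term_id in term_ids.items()}
    index = {}
    for term_id, pairs in groupby(heapq.merge(*runs), key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in pairs:
            if not postings or postings[-1] != doc_id:  # drop cross-block duplicates
                postings.append(doc_id)
        index[id_to_term[term_id]] = postings
    return index


documents = [(1, "home sales rise"), (2, "home prices rise"), (3, "sales fall")]
index = bsbi_index(documents, block_size=4)
print(index["sales"])  # prints [1, 3]
```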

Information Retrieval Basics: Index Construction

Single-pass in-memory Indexing
- Adds each posting directly to its postings list, instead of first collecting all termID-docID pairs and then sorting them
- Runs faster because no sorting of pairs is required, and saves memory because each postings list is tracked by its term, so the termIDs of postings need not be stored
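A rough sketch of the idea, assuming the token stream arrives in docID order. Blocks are kept in memory here instead of being written to disk, and `max_postings` is a hypothetical stand-in for "memory is full".

```python
def spimi_invert(token_stream, max_postings):
    """Consume (term, doc_id) pairs and return a list of blocks.

    Each block is a list of (term, postings) sorted by term, as it would be written to disk.
    """
    blocks = []
    dictionary = {}  # term -> postings list; no termIDs, no pair sorting
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # posting goes straight into its list
            count += 1
        if count >= max_postings:    # "memory full": flush the block
            blocks.append(sorted(dictionary.items()))
            dictionary, count = {}, 0
    if dictionary:
        blocks.append(sorted(dictionary.items()))
    return blocks


def merge_blocks(blocks):
    """Merge blocks in creation order; docIDs stay sorted because the stream was ordered."""
    index = {}
    for block in blocks:
        for term, postings in block:
            merged = index.setdefault(term, [])
            for doc_id in postings:
                if not merged or merged[-1] != doc_id:
                    merged.append(doc_id)
    return index


documents = [(1, "home sales rise"), (2, "home prices rise"), (3, "sales fall")]
stream = [(term, doc_id) for doc_id, text in documents for term in text.split()]
blocks = spimi_invert(stream, max_postings=4)
index = merge_blocks(blocks)
print(len(blocks), index["rise"])  # prints 2 [1, 2]
```

Only the terms of a block are sorted when it is flushed; the postings themselves are never sorted.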

Information Retrieval Basics: Index Construction

Distributed Indexing
- Term-partitioned index
- Document-partitioned index
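The difference between the two partitioning schemes can be illustrated with a toy router. Everything here is an assumption for illustration: shards are plain dictionaries, and routing uses a CRC32 hash of the docID or of the term.

```python
import zlib
from collections import defaultdict


def shard_of(key, num_shards):
    """Stable hash routing (Python's built-in hash() is randomized for strings)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def document_partitioned(docs, num_shards):
    """Each shard holds a complete inverted index for its own subset of documents."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(docs):
        shard = shards[shard_of(doc_id, num_shards)]
        for term in set(docs[doc_id].split()):
            shard[term].append(doc_id)
    return shards


def term_partitioned(docs, num_shards):
    """Each shard holds the full postings lists for its own subset of terms."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            shards[shard_of(term, num_shards)][term].append(doc_id)
    return shards


docs = {1: "home sales rise", 2: "home prices rise", 3: "sales fall", 4: "home sales fall"}
by_document = document_partitioned(docs, num_shards=2)
by_term = term_partitioned(docs, num_shards=2)

# A query for "home" must ask every document shard, but only one term shard.
home_from_documents = sorted(d for shard in by_document for d in shard.get("home", []))
home_from_terms = next(shard["home"] for shard in by_term if "home" in shard)
print(home_from_documents == home_from_terms)  # prints True
```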

Information Retrieval Basics: Tolerant Retrieval

Wildcard query
- Example: hel*, *llo, he*o
- N-gram index

POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
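To show how such n-grams answer the wildcard queries above, here is a small sketch of a bigram index. It is not how Elasticsearch implements wildcards; the `$` boundary marker, the sample vocabulary, and the function names are assumptions for illustration. Because n-gram matching is only a necessary condition, candidates are post-filtered against the original pattern.

```python
import fnmatch
from collections import defaultdict

K = 2  # bigrams


def kgrams(text):
    """All substrings of length K (empty set if text is shorter than K)."""
    return {text[i:i + K] for i in range(len(text) - K + 1)}


def build_kgram_index(vocabulary):
    """Map each bigram of '$term$' to the set of terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(f"${term}$"):
            index[gram].add(term)
    return index


def wildcard_search(index, vocabulary, pattern):
    """Answer patterns that use '*' by intersecting bigram postings, then filtering."""
    grams = set()
    for part in f"${pattern}$".split("*"):
        grams |= kgrams(part)
    candidates = set(vocabulary)
    for gram in grams:
        candidates &= index.get(gram, set())
    return sorted(term for term in candidates if fnmatch.fnmatchcase(term, pattern))


vocabulary = ["hello", "help", "hero", "halo", "jello", "cello"]
index = build_kgram_index(vocabulary)
print(wildcard_search(index, vocabulary, "hel*"))  # prints ['hello', 'help']
print(wildcard_search(index, vocabulary, "*llo"))  # prints ['cello', 'hello', 'jello']
print(wildcard_search(index, vocabulary, "he*o"))  # prints ['hello', 'hero']
```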

Information Retrieval Basics: Tolerant Retrieval

Spelling correction
- Isolated term: hallo wolrd
  - Edit (Levenshtein) distance
  - N-gram overlap
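Both measures can be sketched in a few lines. The edit distance below is the standard dynamic-programming Levenshtein distance (insert, delete, substitute, each at cost 1); the n-gram overlap is shown as a Jaccard coefficient over character bigrams. The small dictionary and the `max_distance` threshold are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def bigram_jaccard(a, b):
    """Overlap of character bigrams: |A & B| / |A | B|."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def suggest(word, dictionary, max_distance=2):
    """Dictionary terms within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, candidate), candidate) for candidate in dictionary)
    return [candidate for distance, candidate in scored if distance <= max_distance]


print(edit_distance("hallo", "hello"))                     # prints 1
print(edit_distance("wolrd", "world"))                     # prints 2
print(suggest("hallo", ["hello", "help", "world", "halo"]))  # prints ['halo', 'hello']
```

Note that a transposition such as "wolrd" costs 2 under plain Levenshtein distance.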

Information Retrieval Basics: Tolerant Retrieval

Spelling correction
- Context sensitive: flew form Heathrow
  - Try candidate corrections built from the top-ranked alternative terms and check their search results

Information Retrieval Basics: Tolerant Retrieval

Phonetic correction
- Syllable Analysis (commonly 4 at most)
- Soundex
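A sketch of a simplified Soundex variant that produces a 4-character code: the first letter plus three digits. It follows the common textbook formulation in which vowels and h, w, y map to 0 and are dropped at the end; official American Soundex treats h and w slightly differently, so some names will get different codes there.

```python
# Letter groups that share a digit in this simplified variant.
_GROUPS = (
    ("aeiouhwy", "0"),
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(word):
    """Return a 4-character phonetic code, e.g. 'H655'."""
    letters = [ch for ch in word.lower() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters]
    # Collapse runs of identical digits into one.
    collapsed = [digits[0]]
    for digit in digits[1:]:
        if digit != collapsed[-1]:
            collapsed.append(digit)
    # The first letter is kept as a letter; zeros are dropped from the rest.
    tail = [digit for digit in collapsed[1:] if digit != "0"]
    return (letters[0].upper() + "".join(tail) + "000")[:4]


print(soundex("Hermann"), soundex("Herman"))  # prints H655 H655
print(soundex("Robert"), soundex("Rupert"))   # prints R163 R163
```

Terms with the same code are treated as phonetic matches, so a query for "Herman" can also retrieve "Hermann".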

Information Retrieval Basics: Classic Ranking

Parametric Index
- A document may contain multiple "fields"
- A "zone" is a free-text field
- Build an inverted index for each zone
- Ranked Boolean retrieval
  - Example: Boolean query scored by weighted zone matching
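Weighted zone matching can be sketched like this: each zone gets a weight, the weights sum to 1, and a document's score is the sum of the weights of the zones in which the Boolean query matches. The zone names, weights, and product record are hypothetical.

```python
def weighted_zone_score(document, query_terms, zone_weights):
    """Sum the weights of all zones that contain every query term (Boolean AND)."""
    score = 0.0
    for zone, weight in zone_weights.items():
        zone_terms = set(document.get(zone, "").lower().split())
        if all(term in zone_terms for term in query_terms):
            score += weight
    return score


# Hypothetical weights: a title match counts more than a description match.
zone_weights = {"title": 0.5, "brand": 0.25, "description": 0.25}
product = {
    "title": "Red running shoes",
    "brand": "Acme",
    "description": "Lightweight shoes for running on trails",
}

print(weighted_zone_score(product, ["running", "shoes"], zone_weights))  # prints 0.75
print(weighted_zone_score(product, ["acme"], zone_weights))              # prints 0.25
```

In practice the weights are often tuned by hand from business rules or learned from relevance judgments.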

Term frequency and weighting
- tf: term frequency, occurrences of the term in a specific zone
- cf: collection frequency, total occurrences of the term across the whole index
- df: document frequency, number of documents in the whole index that contain the term
- idf: inverse document frequency
- Weight of a specific term: tf-idf = tf x idf
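A small worked sketch of these quantities. It assumes the common definition idf = log10(N / df), where N is the number of documents; the slide does not specify the logarithm base, and real engines use smoothed variants. The three documents are made up.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return (weights, idf): weights[doc_id][term] = tf * idf, idf[term] = log10(N / df)."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # df: in how many documents does the term appear at least once?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log10(n / count) for term, count in df.items()}
    weights = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return weights, idf


docs = {1: "car insurance car", 2: "car repair", 3: "car wash"}
weights, idf = tf_idf_weights(docs)
print(weights[1]["car"])                  # prints 0.0
print(round(weights[1]["insurance"], 3))  # prints 0.477
```

"car" occurs in every document, so its idf and therefore its weight is 0 even though its tf in document 1 is 2; the rarer "insurance" dominates.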

Information Retrieval Basics: Classic Ranking

Vector space model

Information Retrieval Basics: Classic Ranking

Document/Search Similarity

Example: search "car insurance" (tf as weighting)

[0.707, 0, 0.707, 0]
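The vector above can be reproduced with a short sketch: the query is turned into a tf vector over a fixed vocabulary and length-normalized, and similarity to a document is the cosine of the two vectors. The vocabulary and its order are assumptions chosen to match the example vector, and the sample document is made up.

```python
import math

# Assumed vocabulary order; the slide does not state it.
vocabulary = ["car", "auto", "insurance", "best"]


def tf_vector(text):
    """Raw term-frequency vector over the fixed vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def unit_vector(vector):
    """Scale the vector to Euclidean length 1 (all zeros stay zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm if norm else 0.0 for x in vector]


def cosine(u, v):
    """Cosine similarity: dot product of the two length-normalized vectors."""
    return sum(a * b for a, b in zip(unit_vector(u), unit_vector(v)))


query = unit_vector(tf_vector("car insurance"))
print([round(x, 3) for x in query])  # prints [0.707, 0.0, 0.707, 0.0]

document = tf_vector("best car insurance")
print(round(cosine(tf_vector("car insurance"), document), 3))  # prints 0.816
```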

Information Retrieval Basics: Evaluation

Evaluation
- test collection
  - commonly 50 documents at least
- test cases
  - expressed as queries
- relevancy standards
  - user feedback

Information Retrieval Basics: Evaluation

Publicly available test collections
- GOV2
- NTCIR
- CLEF
- Reuters
- 20 Newsgroups

Information Retrieval Basics: Evaluation

Evaluation of unranked search results
- Precision vs Recall
Evaluation of ranked search results
- Precision vs Recall for top K items
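For ranked results, the same two measures are computed on the top K items only. A minimal sketch, with a made-up ranking and made-up relevance judgments:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


ranked = [3, 1, 7, 5, 2]  # engine output, best first
relevant = {1, 2, 9}      # judged relevant for this query

for k in (1, 3, 5):
    print(k, round(precision_at_k(ranked, relevant, k), 3), round(recall_at_k(ranked, relevant, k), 3))
# prints:
# 1 0.0 0.0
# 3 0.333 0.333
# 5 0.4 0.667
```

As K grows, recall can only increase, while precision moves up or down depending on where the relevant documents sit in the ranking.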

Information Retrieval Basics: Evaluation

System measurement
- index time
- index size
- query time
- friendliness of API
User utility
- for design: study on user satisfaction
- for improvement: clickstream mining, A/B testing

Thanks
