Search Engine

Beyond Simple Text Match

Srikanth Venugopalan

ThoughtWorks, Chennai

Why do we need Search?

Addresses the disconnect between how users and providers define entities
Search is a very prominent user flow
Users are so used to Google that they end up searching for almost everything

Why not use something like Google?

Google/Yahoo/Bing/Baidu etc

are web search engines: they crawl web pages
and are very mature at retrieving them

But!!

Webpages denormalize data!

Web Search Engines

Pros

Indexing pages in Google etc. provides a wider reach
Coherent, independent source of data
Optionally, one could get statistics out of the box.

Cons

Not customizable to various data structures
Multiple views become difficult (faceting / sorting)

Tools available

Lucene based

Solr
Elasticsearch

Endeca

Support in RDBMS
Postgres, MS SQL Server, Oracle, etc. have full-text indexing support

Going beyond Text Match

Search on structured data!

Or rather, semi-structured

De-normalize highly nested structures.

Extremes!

Foo has Bar and Baz. Bar comes with Fizz1 and Fizz2, whereas Baz consists of Blah with Buzz1 and Buzz2

Foo
-Bar
--Fizz1
--Fizz2
-Baz
--Blah
---Buzz1
---Buzz2

[ nth normal form ] <----------------------------> [flat free text]

Ideal for search engines?

Foo
-Bar
-Bar_Fizz1
-Bar_Fizz2
-Baz
-Baz_Blah
-Baz_Blah_Buzz1
-Baz_Blah_Buzz2
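
A minimal sketch of the flattening above, assuming the nested structure is held as a Python dict of dicts. The function name `flatten_fields` and the `_` separator are illustrative choices, not something a particular search engine requires.

```python
def flatten_fields(children, prefix=""):
    """Return flat field names, prefixing each child with its parent's name."""
    names = []
    for key, sub in children.items():
        name = f"{prefix}_{key}" if prefix else key
        names.append(name)
        # Recurse so that deeper levels carry the full path in their name.
        names.extend(flatten_fields(sub, name))
    return names


# The nested Foo structure from the slide (leaves are empty dicts).
foo = {
    "Bar": {"Fizz1": {}, "Fizz2": {}},
    "Baz": {"Blah": {"Buzz1": {}, "Buzz2": {}}},
}

for field in flatten_fields(foo):
    print(field)
```

Each nested path becomes one flat field, so a document about Foo can be indexed as a single record.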

Exploiting Data-structures

Look-up by a specific field

Combine fields to get multiple combinations

(Nothing new, we've been doing this since SQL)

But...

..SQL fails when

You get collision of matches across fields.

Name = Kawasaki

Brand = Kawasaki

Weights / Boosts

Another Example:

Surface ~ A reflective surface

Surface ~ http://www.microsoft.com/surface/

So when I search for Surface, what should I get?

contd..

Weights / Boosts

"Surface" by itself

Rarely used by a user looking for a reflective surface.
High chance that this is a search for "Microsoft Surface"

Analyze as many such cases as possible to arrive at the fields users search by most often (the priorities).

Most search engine tools support defining boosts at both query time and index time.
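
A toy illustration of query-time boosts, using the Kawasaki collision from earlier. This is not how Solr or Elasticsearch score documents; the boost values, the `score` and `search` helpers, and the sample documents are all assumptions made up for the sketch.

```python
# Hypothetical boosts: a match on brand counts three times a match on name.
BOOSTS = {"brand": 3.0, "name": 1.0}


def score(doc, term, boosts=BOOSTS):
    """Add up the boost of every field that contains the term as a whole word."""
    term = term.lower()
    return sum(
        boost
        for field, boost in boosts.items()
        if term in doc.get(field, "").lower().split()
    )


def search(docs, term, boosts=BOOSTS):
    """Return matching documents, highest score first."""
    scored = [(score(doc, term, boosts), doc) for doc in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]


docs = [
    {"id": 1, "name": "Kawasaki Ninja poster", "brand": "Acme Prints"},
    {"id": 2, "name": "Ninja 300", "brand": "Kawasaki"},
]
results = search(docs, "Kawasaki")
print([doc["id"] for doc in results])
```

Both documents match "Kawasaki", but the brand match outranks the name match because of its higher boost.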

Defining Weights / Boosts

Identifying Priority of relevance
Eliminate the ambiguity
Work around the fuzziness of natural language

Identify the right field type
Lets you use the operations meant for each data type

Some More Examples

RANGE QUERY

These queries fetch results whose value for a given field falls within a given range.

An Example

price:[100 TO 200]

A user could fetch all documents that have price between 100 and 200, for example.
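
A rough sketch of what such a query does, assuming numeric fields and inclusive bounds (square brackets are inclusive in Lucene query syntax). The helpers `parse_range` and `range_filter` are hypothetical; a real engine answers this from its index rather than by scanning documents.

```python
import re

# Matches queries of the form field:[low TO high]
RANGE = re.compile(r"^(\w+):\[(\S+) TO (\S+)\]$")


def parse_range(query):
    """Split 'price:[100 TO 200]' into ('price', 100.0, 200.0)."""
    match = RANGE.match(query)
    if match is None:
        raise ValueError(f"not a range query: {query!r}")
    field, low, high = match.groups()
    return field, float(low), float(high)


def range_filter(docs, query):
    """Keep documents whose field value lies within the inclusive range."""
    field, low, high = parse_range(query)
    return [doc for doc in docs if field in doc and low <= doc[field] <= high]


docs = [
    {"id": 1, "price": 99},
    {"id": 2, "price": 100},
    {"id": 3, "price": 150},
    {"id": 4, "price": 200},
    {"id": 5, "price": 250},
]
matches = range_filter(docs, "price:[100 TO 200]")
print([doc["id"] for doc in matches])
```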

Some more Examples

Aggregate query
Conditional query
Function query
Dynamic fields
Multilingual
Date and time
Geospatial

Iterative development and testing

Why?

Number of cases to be tested gets distributed across the span of development

Regression testing is continuous - changing/adding/removing one rule can be immediately tested for impact.

So why not do it in all cases?

...it comes at a cost.

Analyzing the data and choosing the right training set could be a daunting task, particularly when the dataset is large and complex.

Choosing a training set

A training set is a subset of the actual data that can be used to run the rules and verify behaviour.

Guidelines -

Good representation of the actual data
Right size: small enough for a tester to handle, yet big enough that ripple effects can be simulated
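
One possible way to follow these guidelines is stratified sampling: take a few documents from every category so that each kind of data is represented while the set stays small. This is only a sketch; the `category` field, the `per_group` size and the fixed seed are assumptions, and the right grouping depends on the actual data.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, per_group, seed=0):
    """Pick up to per_group documents from each distinct value of key."""
    rng = random.Random(seed)  # fixed seed keeps the training set reproducible
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Made-up catalogue: 50 bikes, 30 tablets, 1 mirror.
catalogue = (
    [{"id": f"bike-{i}", "category": "bikes"} for i in range(50)]
    + [{"id": f"tablet-{i}", "category": "tablets"} for i in range(30)]
    + [{"id": "mirror-0", "category": "mirrors"}]
)
training_set = stratified_sample(catalogue, key="category", per_group=5)
print(len(training_set))
```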

Implementing using a training set

Steps

Model the search engine schema using a training set
Make sure all rules are satisfied independently
Test combinations of the rules as well, and refine the schema until all conditions are satisfied

Identify the following

fields to be indexed
analyzers
tokenizers
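
To make the last two items concrete: a tokenizer splits text into tokens, and an analyzer wraps a tokenizer with further steps such as lowercasing and stop-word removal. The sketch below is a plain-Python stand-in, not the Lucene API; the stop-word list and function names are assumptions.

```python
import re

# Hypothetical stop-word list; real analyzers ship language-specific ones.
STOPWORDS = {"a", "an", "the", "of"}


def tokenize(text):
    """Tokenizer: split on anything that is not a letter or a digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def analyze(text):
    """Analyzer: tokenize, lowercase, then drop stop words."""
    return [t.lower() for t in tokenize(text) if t.lower() not in STOPWORDS]


tokens = analyze("The Surface of a Mirror")
print(tokens)
```

The same analyzer should normally run at index time and at query time, so that "Surface" in a query matches "surface" in the index.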

The iteration

Pass n : run rules -> anomalies -> tweak rules & schema -> regression -> Pass (n+1)
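
The loop can be sketched as a small regression harness. Everything here is illustrative: the two-document training set, the rules written as (query, expected top hit) pairs, and the toy `make_search` scorer standing in for a real engine.

```python
TRAINING_SET = [
    {"id": "mirror", "name": "surface mirror", "brand": "acme"},
    {"id": "tablet", "name": "tablet", "brand": "microsoft surface"},
]

# Each rule: (query, id of the document expected as the top hit).
RULES = [("surface", "tablet"), ("mirror", "mirror")]


def make_search(boosts):
    """Build a toy search function that ranks by summed field boosts."""
    def search(query):
        term = query.lower()
        scored = []
        for doc in TRAINING_SET:
            s = sum(b for f, b in boosts.items() if term in doc[f].split())
            if s > 0:
                scored.append((s, doc["id"]))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored]
    return search


def run_rules(search, rules):
    """Run every rule; return (query, expected, actual) for each failure."""
    anomalies = []
    for query, expected in rules:
        hits = search(query)
        top = hits[0] if hits else None
        if top != expected:
            anomalies.append((query, expected, top))
    return anomalies


# Pass n: equal boosts leave "surface" ambiguous, so a rule fails.
pass_n = run_rules(make_search({"name": 1.0, "brand": 1.0}), RULES)
# Tweak the schema (boost brand), then rerun ALL rules as regression.
pass_n_plus_1 = run_rules(make_search({"name": 1.0, "brand": 2.0}), RULES)
print(pass_n, pass_n_plus_1)
```

Rerunning every rule after each tweak is what makes the regression continuous: a change that fixes one rule but breaks another shows up in the very next pass.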

Some Tools

..that help in implementing free text search

Apache Nutch

A web crawler that can index web pages. Has integration support with Solr.

contd..

Apache UIMA

Unstructured Information Management Architecture

UIMA applications analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.

Ex - ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

Thank You

Questions?