Search Engine

Beyond Simple Text Match

Srikanth Venugopalan

ThoughtWorks, Chennai

Need for Search?

  •     Address the disconnect between how users and providers define entities
  •     A very prominent user flow
  •     Users are so used to Google that they end up searching for almost everything

Why not use something like Google?

    Google/Yahoo/Bing/Baidu etc.

  • are web search engines that crawl webpages
  • are very mature at retrieving them


Webpages denormalize data!

Web Search Engines


  • Indexing pages in Google etc. provides a wider reach
  • coherent, independent source of data
  • Optionally, one could get statistics out of the box.


  • not customizable to various data structures
  • Multiple views become difficult (Faceting / Sorting)

Tools available

Lucene based

  • Solr
  • Elasticsearch


Support in RDBMS
Postgres, MS SQL Server, Oracle etc. have full-text indexing support


Going beyond Text Match

Search on structured data!

Or rather, semi-structured

    De-normalize highly nested structures.


Foo has Bar and Baz. Bar comes with Fizz1 and Fizz2, whereas Baz consists of Blah with Buzz1 and Buzz2


[ nth normal form ] <----------------------------> [flat free text]
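
As a rough illustration of moving from the normalized end of that spectrum toward the flat end, the sketch below flattens the Foo example into a single document with dotted field names. The nested layout, the key names, and the `flatten` helper are all assumptions made for illustration (in particular, it assumes Fizz1 and Fizz2 belong to Bar); a real schema would decide which paths to keep.

```python
def flatten(node, prefix=""):
    """Flatten a nested dict/list into {dotted_path: [leaf values]}."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            for k, v in flatten(value, path).items():
                flat.setdefault(k, []).extend(v)
    elif isinstance(node, list):
        # List items share the parent's path, giving a multi-valued field.
        for item in node:
            for k, v in flatten(item, prefix).items():
                flat.setdefault(k, []).extend(v)
    else:
        flat.setdefault(prefix, []).append(node)
    return flat


# Hypothetical nested representation of the Foo example.
foo = {
    "name": "Foo",
    "bar": {"fizz": ["Fizz1", "Fizz2"]},
    "baz": {"blah": {"buzz": ["Buzz1", "Buzz2"]}},
}

doc = flatten(foo)
```

The result keeps the field structure (so it is semi-structured rather than free text) while removing the nesting a search index would otherwise have to join across.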


Ideal for search engines?


Exploiting Data-structures

Look-up by a specific field

Combine fields to get multiple combinations

(Nothing new, we've been doing this since SQL)


..SQL fails when

Matches collide across fields: the same term matches different fields in different records.

Name = Kawasaki

Brand = Kawasaki
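
A small sketch of the collision, using Python's built-in `sqlite3` module. The `products` table and its rows are made up for illustration. A plain `WHERE` clause returns both the record named "Kawasaki" and the record branded "Kawasaki", with nothing to say which match is more relevant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, brand TEXT)"
)
# Hypothetical rows: one matches on brand, one on name, one not at all.
conn.executemany(
    "INSERT INTO products (name, brand) VALUES (?, ?)",
    [
        ("Ninja 300", "Kawasaki"),
        ("Kawasaki", "Acme Posters"),
        ("Road Bike", "Acme Cycles"),
    ],
)

# Matching either field gives an unranked mix of name hits and brand hits.
rows = conn.execute(
    "SELECT name, brand FROM products "
    "WHERE name LIKE ? OR brand LIKE ? ORDER BY id",
    ("%Kawasaki%", "%Kawasaki%"),
).fetchall()
```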


Weights / Boosts

Another Example:

Surface ~ A reflective surface

Surface ~ Microsoft Surface

So when I search for Surface, what should I get?


Weights / Boosts

"Surface" by itself

  • Rarely used by a user looking for a reflective surface.
  • High chance that this is a search for "Microsoft Surface"

Analyze as many such cases as possible to arrive at the fields users search by most (the priorities).

Most search engine tools support defining boosts at both query time and index time.
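
To make the idea concrete, here is a toy query-time boost in plain Python. It is only a sketch: the field names `product` and `description`, the weights, and the term-count scoring are assumptions, and real engines use far richer relevance formulas.

```python
# Assumed weights: a hit in the product name counts three times as much
# as a hit in the description.
BOOSTS = {"product": 3.0, "description": 1.0}


def score(doc, term, boosts=BOOSTS):
    """Sum of (field weight x number of occurrences of term in the field)."""
    term = term.lower()
    total = 0.0
    for field, weight in boosts.items():
        tokens = doc.get(field, "").lower().split()
        total += weight * tokens.count(term)
    return total


def search(docs, term, boosts=BOOSTS):
    """Return matching docs, highest score first."""
    scored = [(score(d, term, boosts), d) for d in docs]
    hits = [(s, d) for s, d in scored if s > 0]
    return [d for s, d in sorted(hits, key=lambda pair: pair[0], reverse=True)]


docs = [
    {"product": "Mirror", "description": "a reflective surface"},
    {"product": "Microsoft Surface", "description": "tablet computer"},
]
results = search(docs, "Surface")
```

With these weights the Microsoft Surface record ranks above the mirror. In Solr, a comparable effect is commonly expressed through the DisMax/eDisMax `qf` parameter, for example `qf=product^3 description`.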

Defining Weights / Boosts

Identify the priority of relevance
  • Eliminates ambiguity
  • Works around the fuzziness of natural language

Identify the right field type
  • Allows using the operations meant for each data type


Some More Examples


        Range queries fetch results whose value for a given field falls within a given range.
An Example

price:[100 TO 200] 

        A user could, for example, fetch all documents with a price between 100 and 200.
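
A minimal sketch of building such a query from code, plus an in-memory equivalent of the inclusive range check. The helper names and the sample documents are hypothetical; the square-bracket syntax is the inclusive range form of the Lucene query syntax used above.

```python
from urllib.parse import urlencode


def range_query(field, low, high):
    """Inclusive range in Lucene query syntax, e.g. price:[100 TO 200]."""
    return f"{field}:[{low} TO {high}]"


def matches(doc, field, low, high):
    """In-memory equivalent of the inclusive range check."""
    value = doc.get(field)
    return value is not None and low <= value <= high


q = range_query("price", 100, 200)
# URL-encoded form, ready to append to a search endpoint's query string.
params = urlencode({"q": q})

docs = [
    {"id": 1, "price": 99},
    {"id": 2, "price": 100},
    {"id": 3, "price": 150},
    {"id": 4, "price": 201},
]
hits = [d["id"] for d in docs if matches(d, "price", 100, 200)]
```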

Some More Examples

  • Aggregate query
  • Conditional query
  • Function query
  • Dynamic fields
  • Multilingual
  • Date/time
  • Geospatial

Iterative development and testing


  • The cases to be tested are spread across the span of development

  • Regression testing is continuous: changing, adding, or removing a rule can be tested for impact immediately.

So why not do it in all cases? It comes at a cost.

Analyzing the data and choosing the right training set could be a daunting task, particularly when the dataset is large and complex.

Choosing a training set

A training set is a subset of the actual data that can be used to run the rules and verify behaviour.

Guidelines:

  • Good representation of the actual data
  • Right size: small enough for a tester to handle, yet big enough to simulate ripple effects

Implementing using a training set


  1. Model the search engine schema using a training set
  2. Make sure each rule is satisfied independently
  3. Test combinations of the rules, and refine the schema until all conditions are satisfied

Identify the following

  • fields to be indexed
  • analyzers
  • tokenizers
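
The roles of these three pieces can be sketched with a toy pipeline: a tokenizer splits text into tokens, an analyzer chains the tokenizer with filters, and only the chosen fields go into the index. Everything here (the stopword list, the regex tokenizer, the function names, the sample documents) is an assumption for illustration, not how any particular engine implements it.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "and"}  # assumed, deliberately tiny


def tokenize(text):
    """Tokenizer: split text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters=(lowercase, remove_stopwords)):
    """Analyzer: tokenizer followed by a chain of token filters."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


def build_index(docs, fields):
    """Inverted index over the chosen fields only: (field, token) -> doc ids."""
    index = {}
    for doc_id, doc in docs.items():
        for field in fields:
            for token in analyze(doc.get(field, "")):
                index.setdefault((field, token), set()).add(doc_id)
    return index


docs = {
    1: {"name": "Microsoft Surface", "description": "A tablet"},
    2: {"name": "Mirror", "description": "A reflective surface"},
}
index = build_index(docs, fields=["name"])
```

Because only `name` is indexed here, a lookup for "surface" finds the tablet but not the mirror, which shows how the choice of indexed fields shapes the results.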

The iteration

Pass n : run rules -> anomalies -> tweak rules & schema -> regression -> Pass (n+1)
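
One pass of this loop might look like the sketch below. A rule is modelled as a (query, expected top document) pair, and the "schema" is reduced to a set of field weights. The training set, the rules, and the scoring are all hypothetical stand-ins.

```python
# Hypothetical training set: a small subset of the real data.
TRAINING_SET = {
    "mirror": {"product": "Mirror", "description": "a reflective surface"},
    "tablet": {"product": "Microsoft Surface", "description": "tablet computer"},
}

# Each rule: (query, id of the document expected to rank first).
RULES = [
    ("surface", "tablet"),
    ("reflective", "mirror"),
]


def make_search(boosts):
    """Build a search function for one candidate set of field weights."""
    def search(query):
        term = query.lower()
        scores = {}
        for doc_id, doc in TRAINING_SET.items():
            s = sum(
                weight * doc.get(field, "").lower().split().count(term)
                for field, weight in boosts.items()
            )
            if s > 0:
                scores[doc_id] = s
        return sorted(scores, key=scores.get, reverse=True)
    return search


def run_rules(search, rules):
    """Run every rule; return the anomalies (query, expected, actual top)."""
    anomalies = []
    for query, expected_id in rules:
        results = search(query)
        top = results[0] if results else None
        if top != expected_id:
            anomalies.append((query, expected_id, top))
    return anomalies


# Pass n: description outweighs product, so "surface" surfaces the mirror.
anomalies_n = run_rules(make_search({"product": 1, "description": 2}), RULES)

# Pass n+1: tweak the weights, then rerun all rules as a regression check.
anomalies_n1 = run_rules(make_search({"product": 3, "description": 1}), RULES)
```

Rerunning the full rule list after every tweak is what makes the regression check continuous: a change that fixes one rule but breaks another shows up in the same pass.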

Some Tools

..that help in implementing free text search

Apache Nutch

A web crawler that can index web pages. It integrates with Solr.


Apache UIMA

Unstructured Information Management applications analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.

Example: ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at.


