Search Engine

Beyond Simple Text Match



Srikanth Venugopalan

ThoughtWorks, Chennai

Why the need for Search?

  •     Bridges the gap between how users describe entities and how the provider defines them
  •     Search is a very prominent user flow
  •     Users are so used to Google that they end up searching for almost everything


Why not use something like Google?


    Google/Yahoo/Bing/Baidu etc.

  • are web search engines that crawl webpages
  • are very mature at retrieving them

But!!


Webpages denormalize data!

Web Search Engines

    Pros

  • Indexing pages in Google etc. provides a wider reach
  • coherent, independent source of data
  • Optionally, one could get statistics out of the box.


    Cons

  • not customizable to various data structures
  • Multiple views become difficult (Faceting / Sorting)



Tools available

Lucene based

  • Solr
  • ElasticSearch


Endeca


Support in RDBMS
Postgres, MS SQL Server, Oracle etc. have full-text indexing support



 

Going beyond Text Match

Search on structured data!

Or rather, semi-structured


    De-normalize highly nested structures.


Extremes!

Foo has Bar and Baz. Bar comes with Fizz1 and Fizz2, whereas Baz consists of Blah with Buzz1 and Buzz2

Foo
-Bar
--Fizz1
--Fizz2
-Baz
--Blah
---Buzz1
---Buzz2


[ nth normal form ] <----------------------------> [flat free text]

                                                           ^

Ideal for search engines?

Foo
-Bar
-Bar_Fizz1
-Bar_Fizz2
-Baz
-Baz_Blah
-Baz_Blah_Buzz1
-Baz_Blah_Buzz2 
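
The flattening above can be sketched in a few lines of Python. This is only an illustration: it assumes the nested record is held as plain dictionaries, and the function name `flatten` and the underscore separator are choices made for this example, not part of any search engine's API.

```python
def flatten(tree, prefix=""):
    """Return flat field names for a nested dict, each parent before its children."""
    fields = []
    for name, children in tree.items():
        # Join the parent path and the child name with an underscore.
        path = f"{prefix}_{name}" if prefix else name
        fields.append(path)
        fields.extend(flatten(children, path))
    return fields


# The Foo example from the slide; leaves are represented by empty dicts.
foo = {
    "Bar": {"Fizz1": {}, "Fizz2": {}},
    "Baz": {"Blah": {"Buzz1": {}, "Buzz2": {}}},
}

flat_fields = flatten(foo)
print(flat_fields)
```

Each flat name keeps the full path of the original nesting, so a lookup on a single field can still tell which branch a value came from.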

Exploiting Data-structures

Look-up by a specific field

Combine fields to get multiple combinations


(Nothing new, we've been doing this since SQL)


But...

..SQL fails when


You get collision of matches across fields.

Name = Kawasaki

Brand = Kawasaki

??

Weights / Boosts

Another Example:

Surface ~ A reflective surface

Surface ~ http://www.microsoft.com/surface/


So when I search for Surface, what should I get?


contd..

Weights / Boosts

"Surface" by itself

  • Rarely used by a user looking for a reflective surface.
  • High chance that this is a search for "Microsoft Surface"


Analyze a broad sample of such cases to arrive at the fields users search by most (the priorities).


Most search engine tools support defining boosts at both query time and index time.
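
To make the idea of field boosts concrete, here is a toy scorer in Python. It is a sketch under stated assumptions, not how Lucene or any real engine computes relevance: the weights in `FIELD_BOOSTS`, the field names, and the sample documents are all hypothetical.

```python
# Hypothetical per-field weights: a match in "name" counts more than one in "description".
FIELD_BOOSTS = {"name": 3.0, "brand": 2.0, "description": 1.0}


def score(doc, term, boosts=FIELD_BOOSTS):
    """Sum the boost of every field whose text contains the term as a whole word."""
    term = term.lower()
    total = 0.0
    for field, boost in boosts.items():
        words = doc.get(field, "").lower().split()
        if term in words:
            total += boost
    return total


def search(docs, term):
    """Return the matching documents, highest score first."""
    scored = [(score(d, term), d) for d in docs]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits]


docs = [
    {"id": 1, "name": "Mirror", "description": "a reflective surface"},
    {"id": 2, "brand": "Microsoft", "name": "Surface", "description": "tablet computer"},
]

results = search(docs, "surface")
print([d["id"] for d in results])
```

Both documents match "surface", but the one that matches in the higher-priority `name` field ranks first, which is the behaviour the boosts are meant to produce.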

Defining Weights / Boosts



Identifying Priority of relevance
Eliminate the ambiguity
Work around the fuzziness of natural language



Identify the right field type
This makes the operations meant for each datatype available (for example, range queries on numeric fields)

 

Some More Examples

RANGE QUERY

        These queries fetch results whose value for a given field falls within a range.
       
An Example

price:[100 TO 200] 


        For example, a user could fetch all documents with a price between 100 and 200 (in Lucene query syntax, square brackets make both bounds inclusive).
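
What such a range query does can be imitated in plain Python. The sketch below parses only the simple `field:[low TO high]` form with numeric bounds and treats both bounds as inclusive; `parse_range`, `range_filter`, and the sample products are hypothetical names and data for this example, not a real query parser.

```python
import re

# Matches only the simple numeric form, e.g. price:[100 TO 200]
RANGE_RE = re.compile(r"^(\w+):\[(\d+(?:\.\d+)?) TO (\d+(?:\.\d+)?)\]$")


def parse_range(query):
    """Split 'field:[low TO high]' into (field, low, high)."""
    match = RANGE_RE.match(query.strip())
    if match is None:
        raise ValueError(f"not a range query: {query!r}")
    field, low, high = match.groups()
    return field, float(low), float(high)


def range_filter(docs, query):
    """Keep documents whose field value lies within the inclusive range."""
    field, low, high = parse_range(query)
    return [d for d in docs if field in d and low <= d[field] <= high]


products = [
    {"name": "A", "price": 99},
    {"name": "B", "price": 100},
    {"name": "C", "price": 150},
    {"name": "D", "price": 200},
    {"name": "E", "price": 201},
]

in_range = range_filter(products, "price:[100 TO 200]")
print([p["name"] for p in in_range])
```

This only works because `price` is stored as a number; with a text field the comparison would be meaningless, which is why picking the right field type matters.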

Some more Examples

  • Aggregate query
  • Conditional query
  • Function query
  • Dynamic fields
  • Multilingual
  • Date/time
  • Geospatial


Iterative development and testing

Why?

  • Number of cases to be tested gets distributed across the span of development


  • Regression testing is continuous: changing, adding, or removing one rule can be tested for impact immediately.


So why not do it in all cases?


...it comes at a cost.


Analyzing the data and choosing the right training set could be a daunting task, particularly when the dataset is large and complex.

Choosing a training set

A training set is a subset of the actual data that can be used to run the rules and verify behaviour.


Guidelines -

  • good representation of actual data
  • Right size: small enough for a tester to handle, yet big enough that ripple effects can be simulated
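
One possible way to meet both guidelines is stratified sampling: take the same fraction from every group so that small groups are not lost. The sketch below assumes each record carries a `category` field to stratify on, which is an assumption of this example; the talk does not prescribe a sampling method.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from each group, keeping at least one record per group."""
    rng = random.Random(seed)  # fixed seed so the training set is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical catalogue: 90 bikes and 10 tablets.
catalogue = [{"id": i, "category": "bike"} for i in range(90)]
catalogue += [{"id": 90 + i, "category": "tablet"} for i in range(10)]

training_set = stratified_sample(catalogue, key="category", fraction=0.1)
print(len(training_set))
```

The sample keeps the 9:1 ratio of the full catalogue while being a tenth of its size.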


Implementing using a training set

Steps

  1. Model the search engine schema using a training set
  2. Make sure each rule is satisfied independently
  3. Test combinations of the rules as well, and refine the schema until all conditions are satisfied


Identify the following

  • fields to be indexed
  • analyzers
  • tokenizers
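
In Lucene-based tools, an analyzer is essentially a tokenizer followed by a chain of token filters. The toy chain below illustrates that shape in plain Python; the stop-word list and the function names are hypothetical, and real analyzers do considerably more (stemming, synonyms, language-specific rules).

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}  # hypothetical stop-word list


def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def analyze(text):
    """Analyzer: tokenize, then lowercase, then drop stop words."""
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


print(analyze("A reflective Surface"))
```

The same chain has to be applied to both the indexed text and the query, otherwise "Surface" in a query would not match "surface" in the index.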



The iteration

Pass n : run rules -> anomalies -> tweak rules & schema -> regression -> Pass (n+1)
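
One pass of this loop can be sketched as a small regression harness. Here a rule is assumed to be a pair of a query and the result expected at the top; `toy_search`, `INDEX`, and `RULES` are hypothetical stand-ins for the real engine and rule set.

```python
def run_rules(search, rules):
    """Run every rule and return the anomalies as (query, expected, actual) tuples."""
    anomalies = []
    for query, expected_top in rules:
        results = search(query)
        actual_top = results[0] if results else None
        if actual_top != expected_top:
            anomalies.append((query, expected_top, actual_top))
    return anomalies


# Hypothetical stand-in for the search engine under test.
INDEX = {
    "surface": ["Microsoft Surface", "Mirror"],
    "kawasaki": ["Kawasaki Ninja"],
}


def toy_search(query):
    return INDEX.get(query.lower(), [])


RULES = [
    ("Surface", "Microsoft Surface"),
    ("Kawasaki", "Kawasaki Ninja"),
    ("Ninja", "Kawasaki Ninja"),  # fails until the schema handles partial names
]

anomalies = run_rules(toy_search, RULES)
print(anomalies)
```

After tweaking the rules or schema, rerunning the full rule list is the regression step: any rule that used to pass and now appears among the anomalies is a ripple effect of the change.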

Some Tools

..that help in implementing free text search

Apache Nutch

A web crawler that can index web pages. Has integration support with Solr.


contd..

Apache UIMA


Unstructured Information Management applications analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.


Example: ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at.

Thank You


Questions?
