Beyond Simple Text Match
Srikanth Venugopalan
ThoughtWorks, Chennai
Webpages denormalize data!
Pros
Cons
Lucene based
Endeca
Support in RDBMS
Postgres, MS SQL Server, Oracle, etc. have full-text indexing support
Or rather, semi-structured
De-normalize highly nested structures.
Foo has Bar and Baz. Bar comes with Fizz1 and Fizz2, whereas Baz consists of Blah with Buzz1 and Buzz2
Foo
-Bar
--Fizz1
--Fizz2
-Baz
--Blah
---Buzz1
---Buzz2
[ nth normal form ] <----------------------------> [flat free text]
^
Ideal for search engines?
Foo
-Bar
-Bar_Fizz1
-Bar_Fizz2
-Baz
-Baz_Blah
-Baz_Blah_Buzz1
-Baz_Blah_Buzz2
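A minimal sketch of the flattening shown above, in Python. The function name `flatten` and the underscore separator are illustrative assumptions, not part of any search engine's API; the nested dict simply mirrors the Foo example.

```python
def flatten(node, prefix=""):
    """Flatten a nested dict into a list of underscore-joined field names."""
    fields = []
    for name, children in node.items():
        # Prefix each child with its parent's path, e.g. Baz -> Baz_Blah.
        path = f"{prefix}_{name}" if prefix else name
        fields.append(path)
        fields.extend(flatten(children, path))
    return fields


# The children of Foo, as in the tree above.
foo = {
    "Bar": {"Fizz1": {}, "Fizz2": {}},
    "Baz": {"Blah": {"Buzz1": {}, "Buzz2": {}}},
}

flat_fields = flatten(foo)
print(flat_fields)
```

Each flattened name can then be declared as its own field in the search schema, which keeps the parent-child context that plain free text would lose.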
Look-up by a specific field
Combine fields to get multiple combinations
(Nothing new, we've been doing this since SQL)
But...
You get collision of matches across fields.
Name = Kawasaki
Brand = Kawasaki
??
Another Example:
Surface ~ A reflective surface
Surface ~ http://www.microsoft.com/surface/
So when I search for Surface, what should I get?
contd..
"Surface" by itself
Analyze most such cases to arrive at the fields users search by most often (the priorities).
Most search engines support defining boosts at both query time and index time.
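As a rough illustration of a query-time boost, the sketch below builds a query string in the Lucene classic query syntax (`field:term^boost`). The helper name `boosted_query` and the boost values are assumptions chosen for the Kawasaki example; the term is not escaped, so this only suits simple single-word terms.

```python
def boosted_query(term, field_boosts):
    """Build a Lucene-style query that searches `term` in each field with a boost."""
    clauses = [f"{field}:{term}^{boost}" for field, boost in field_boosts.items()]
    return " OR ".join(clauses)


# Assumed priorities: a match on brand matters more than a match on name.
query = boosted_query("Kawasaki", {"brand": 3, "name": 1})
print(query)  # brand:Kawasaki^3 OR name:Kawasaki^1
```

With a query like this, a document matching on the higher-boosted field would typically rank above one that matches only on the lower-boosted field.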
Identifying Priority of relevance
Eliminate the ambiguity
Work around the fuzziness of natural language
Identify the right field type
Makes it possible to use the operations meant for each data type
Range queries, for instance, fetch documents whose value for a given field falls within a range.
An Example
price:[100 TO 200]
A user could fetch all documents that have price between 100 and 200, for example.
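The sketch below mimics `price:[100 TO 200]` over a small in-memory list, mainly to show why the field type matters. In Lucene's query syntax square brackets make both ends inclusive; the helper `in_range` and the sample documents are made up for illustration.

```python
def in_range(doc, field, low, high):
    """Return True when doc[field] lies in [low, high], both ends inclusive."""
    value = doc.get(field)
    return value is not None and low <= value <= high


docs = [
    {"id": 1, "price": 99},
    {"id": 2, "price": 100},
    {"id": 3, "price": 150},
    {"id": 4, "price": 200},
    {"id": 5, "price": 201},
]

matches = [d["id"] for d in docs if in_range(d, "price", 100, 200)]
print(matches)  # [2, 3, 4]

# If price were stored as text, the comparison would be lexicographic,
# and "1000" would wrongly fall between "100" and "200".
text_match = "100" <= "1000" <= "200"
print(text_match)  # True
```

This is the reason for declaring price as a numeric field rather than a text field before relying on range queries.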
...it comes at a cost.
Analyzing the data and choosing the right training set could be a daunting task, particularly when the dataset is large and complex.
A training set is a subset of the actual data that can be used to run the rules and verify their behaviour.
Guidelines -
Steps
Identify the below
Pass n : run rules -> anomalies -> tweak rules & schema -> regression -> Pass (n+1)
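One pass of this loop might look like the following sketch. Everything here is hypothetical: the rule names, the sample records, and the `run_pass` helper are placeholders for whatever rules and training set a real project would use.

```python
def run_pass(rules, training_set):
    """Apply every rule to every record and collect the records that fail."""
    anomalies = []
    for record in training_set:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            anomalies.append((record, failed))
    return anomalies


# Hypothetical rules: each maps a record to True (passes) or False (anomaly).
rules = {
    "has_name": lambda r: bool(r.get("name")),
    "price_is_number": lambda r: isinstance(r.get("price"), (int, float)),
}

# Hypothetical training set: a small subset of the actual data.
training_set = [
    {"name": "Kawasaki", "price": 150},
    {"name": "", "price": 120},
    {"name": "Surface", "price": "n/a"},
]

anomalies = run_pass(rules, training_set)
for record, failed in anomalies:
    print(record, failed)
```

After tweaking the rules or the schema, re-running `run_pass` on the same training set serves as the regression step before moving on to pass n+1.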
..that help in implementing free text search
Apache Nutch
A web crawler that can index web pages. Has integration support with Solr.
Unstructured Information Management applications analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
Example: ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at.
Questions?