Introduction to Solr

NKHumphreys

What is Solr

"Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search..." - Wikipedia


It is written in Java, and runs as a standalone server within a servlet container (we use Tomcat)





What Solr is not

It is not a scraper.

It will index the documents you pass to it, but it will not scrape the content from your site or your database.


There are various packages and libraries that will do this for you.

Alternatively, you can write your own site spider and, while you are at it, I think the world is in need of a circular-shaped item for the front of wheelbarrows.

Quick overview of Solr

  • Define the contents of the documents you want to be searchable (Schema) e.g. Product name, description, price etc...
  • Deploy Solr to your application server
  • Pass Solr the documents you want to be searchable (index)
  • Expose search functionality in your application


Solr queries are RESTful and the responses can be returned as XML, JSON, or CSV
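As a sketch of what that looks like in practice (host, port, and the core name "default" are assumptions matching the examples later in this deck), a search is just an HTTP GET, and the wt parameter picks the response format:

```shell
# Query the "default" core over HTTP; wt selects the response format
curl 'http://localhost:5432/solr/default/select?q=title:sharkie&wt=json'

# The same query, but asking for CSV instead
curl 'http://localhost:5432/solr/default/select?q=title:sharkie&wt=csv'
```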


Solr also supports replication

What you need to know before you start...

Before you start implementing your schema:

  • Talk to your PM and understand the client's search requirements
  • Agree on your search criteria and get some examples of desired results
  • Get to know the data you are indexing and searching
  • Provide your PM with your intended implementation (description of the schema)
  • And finally, after the initial setup is done, provide your PM with an interface where they can test the search (talk to DD about exposing the admin interface to our sub-domain)

Schema

The Solr schema tells Solr what your data is going to look like and how to treat each element of your data.

e.g.

 <field name="title" type="text_en" stored="true" indexed="true" />


This tells Solr to treat the title attribute as type text_en. It also tells Solr that the field should be retrievable during search (stored) and that it should itself be searchable (indexed)
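The two flags can be mixed independently; a hedged illustration (the field names here are invented for the example):

```xml
<!-- Retrievable and searchable: the common case -->
<field name="title" type="text_en" stored="true" indexed="true" />
<!-- Searchable but never returned in results (index-only) -->
<field name="search_text" type="text_en" stored="false" indexed="true" />
<!-- Returned in results but not searchable (display-only) -->
<field name="thumbnail_url" type="string" stored="true" indexed="false" />
```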

Schema... how to treat data types

We can then tell Solr how to treat a data type during indexing. This is called an Analyzer:

 <fieldType name="text_en" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
The text field will be passed through the "pipe" of tokenizers and filters...

Schema... how to treat data types

We can also tell Solr how to treat data types during a query:


<analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>

Again,  the query text input  will be passed through a "pipe" of tokenizers and filters...

Tokenizers

A Tokenizer breaks text up into a stream of tokens. A token is usually a subset of the characters in the text input.


Tokens also contain metadata, such as where in the text the token value occurs.

e.g.

<analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
</analyzer>

This creates tokens from runs of characters separated by whitespace

Filters

Like Tokenizers, Filters produce a stream of tokens. However, a Filter's input is itself a token stream, which can come from either a Tokenizer or another Filter.

e.g.

<analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPorterFilterFactory" />
</analyzer>
This first breaks the text up into a stream of whitespace-separated tokens, then lower-cases each token, and finally applies standard English (Porter) stemming to each token (more on this later).
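To make the pipeline concrete, here is roughly what happens to one input string as it passes through that analyzer (token values shown as comments; the exact stemmer output is an assumption and depends on the stemmer version):

```xml
<analyzer>
    <!-- "Murdering Crows"      -> ["Murdering", "Crows"] -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- ["Murdering", "Crows"] -> ["murdering", "crows"] -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- ["murdering", "crows"] -> ["murder", "crow"]     -->
    <filter class="solr.EnglishPorterFilterFactory" />
</analyzer>
```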

Putting it all together

<fieldType>
    <analyzer type="index">
        <tokenizer>
        <filter>
    </analyzer>
    <analyzer type="query">
        <tokenizer>
        <filter>
    </analyzer>
</fieldType>
You can have one of these for each fieldType you define

The result of applying the tokenizers and filters is that Solr stores and searches several forms of the input, e.g. if a user searches for "Murdering", Solr will also search for "murdering" and the stem "murder", and return matches for those forms.

Some useful filters and tokenizers

  • WhitespaceTokenizerFactory
  • ApostropheFilterFactory
  • LowerCaseFilterFactory
  • StopFilterFactory
  • SynonymFilterFactory
  • PorterStemFilterFactory


For a list of tokenizers and filters and what they do, see:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Stopwords and Synonyms

You can supply a stopwords.txt and a synonyms.txt


When the StopFilterFactory is applied in an analyzer, any words listed in stopwords.txt will be removed from the token stream
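A stopwords.txt is just a plain-text file with one term per line (an illustrative sketch; choose stopwords to suit your own data):

```
# stopwords.txt
a
an
and
the
of
```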


When the SynonymFilterFactory is applied in an analyzer, any words listed in synonyms.txt will be replaced (or expanded) according to the mappings in synonyms.txt

synonyms.txt
i-pod, i pod => ipod
# multiple instances are merged
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz

Copy Fields

Another useful feature of the Solr schema is "copy fields". A copy field tells Solr to take a copy of the original field value, before tokenizers and filters are applied, and store it in another field, so that you may interpret the same value in more than one way (different types -> different tokenizers and filters applied).


A good example of copy field usage is storing the title field for sorting:

<field name="title_sort" type="string" stored="true" indexed="true"/>
<copyField source="title" dest="title_sort"/>
This copies what is in the title field (which has type "text_en") and stores it in a field called title_sort (which has type "string"), because the string type is sortable.
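The copied field can then be used in the sort parameter of a query (a sketch, assuming the local setup used elsewhere in this deck):

```shell
# Search the analysed title field, but sort on the raw string copy
curl 'http://localhost:5432/solr/default/select?q=title:hobbit&sort=title_sort+asc&wt=json'
```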

Solr Config

In the solrconfig.xml we can specify the parameters for configuring Solr.


As well as default parameter configuration, you can define Request Handlers in this file.

Request Handlers

A request handler defines the logic executed for a request, as well as default parameters for the search:

<requestHandler name="/selectSimple" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="wt">json</str>
        <str name="defType">edismax</str>
        <str name="fl">product_id,score</str>
        <str name="qf">author^10 title^10 search_text</str>
        <str name="bf">availability^15 scale(rating,0,100)</str>
        <str name="sort">author, published_date</str>
        <str name="pf">author^30 full_title^30 search_isbn^10</str>
        <str name="facet.field">author</str>
        <str name="f.author.facet.mincount">1</str>
    </lst>
</requestHandler>
This request handler can then be accessed via its "name" in the request URL
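For example, with the handler above deployed to a core named "default" (the core name is an assumption), all the configured defaults apply simply by querying the handler's path:

```shell
# The defaults (wt=json, defType=edismax, fl, qf, bf, ...) all apply here
curl 'http://localhost:5432/solr/default/selectSimple?q=tolkien'
```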

Request Handlers...Cont

What do these parameters mean? Here are some of them...

  • wt (writer type): which response format should be returned (XML, JSON, CSV)
  • defType: specifies the query parser to use (you should probably be using edismax - Extended DisMax)
  • fl (field list): specifies which fields to return (limits the amount of information in the response)
  • qf (query fields): a list of fields, and the "boosts" to apply to them, used when building the query (part of edismax, more on this later)
  • bf (boost functions): functions, with optional boosts, whose results are added to the relevancy score

(more on boosts and relevancy later)

Request Handlers Cont...

To find out what the rest of the parameters mean, and how to use edismax, visit:


http://wiki.apache.org/solr/ExtendedDisMax


There are too many to go into in this talk

Let's look at the Solr interface


http://127.0.0.1:5432/solr

Facets

Faceted results are just results broken up into categories (often showing counts for each category).


These can be defined in the solrconfig.xml Request Handler definition.

<str name="facet.field">author</str>

The "author" field must exist in the schema


A filter query parameter (fq) can then be passed in the query URL

Facets...Cont

For example:

We could index the author of books using a copy field of type "string" (solr.StrField)

e.g. Tolkien, Rowling etc...

Then, when a user searches for a term (e.g. "Fantasy") and clicks on a category (facet) to drill down on an author (e.g. "Tolkien"), we can reissue the query with an fq parameter, e.g.

 fq=author:"Tolkien"
This will only return results in that facet.
The copy fields and facets must already exist in the schema and config.
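Putting the pieces together, the drill-down re-query might look like this (host, port, and core name follow the other examples in this deck):

```shell
# Initial search, asking for author facet counts alongside the results
curl 'http://localhost:5432/solr/default/select?q=fantasy&facet=true&facet.field=author&wt=json'

# User clicks the "Tolkien" facet: same query, restricted by fq
curl 'http://localhost:5432/solr/default/select?q=fantasy&fq=author:"Tolkien"&wt=json'
```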

BOOSTING

Field Boosting

The default order in which results are returned from Solr is by "relevancy". You can boost a field's contribution to the relevancy score using the following syntax (you should be using eDisMax, coming up in a few slides):

<str name="bf">author^15 title^10</str>
This takes the score for author, and boosts the result by 15.  The boost is additive, so 15 is added to the relevancy score.

This code is placed in the Solr config for a request handler.


Field Boosting Cont...

So if a query for the word "Tolkien" is performed, and one result has "Tolkien" in the "author" field while another result has "Tolkien" in the description field, the first result will appear higher in the results

You can also use a multiplicative boost

 <str name="boost">author^15 description^2</str>
A word of warning: using a multiplicative boost with a large boost value can render all other relevancy contributions redundant.

Getting the relevancy score correct is a bit of a dark art, and requires a bit of careful planning and a lot of trial and error.

Other Boost Functions

There are boost functions other than additive and multiplicative ones. Some useful examples:

  • Scale: scales the relevancy scores relative to other relevancy scores so they appear between a min and max value.
  • Pow: raises the relevancy to the supplied power


 <str name="bf">scale(rating,0,100)</str>


For a full list of available functions see:

http://wiki.apache.org/solr/FunctionQuery


eDisMax

(Extended Disjunction Max)


This is a query parser with extended functionality.


It allows you to specify additional boosting queries and filtering queries.  YOU SHOULD BE USING THIS!!


All of the eDisMax parameters can be specified in the SolrConfig or overridden using GET parameters in the request URL
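For example, the defaults configured earlier can be overridden per request (a sketch; the field names match the earlier request handler example):

```shell
# Override the configured qf and response writer for this one request
curl 'http://localhost:5432/solr/default/selectSimple?q=tolkien&qf=title^20+search_text&wt=xml'
```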

eDisMax Cont...

All of the query parameters/attributes you have seen in these slides are available when using eDisMax (even if they are not specific to eDisMax)


Have I said that YOU SHOULD BE USING IT?

Committing data to the Solr Index

Once you have scraped the data from your site, and put it into the format that your Solr Schema is expecting, you can POST it as a JSON document to:

<solr_url>/solr/<core>/update?commit=true

e.g.

curl 'http://localhost:5432/solr/default/update?commit=true' -H 'Content-type:application/json' -d '
[
    {
        "section": "detectives",
        "title": "Sharkie and George",
        "content": "Sharkie and George, crime busters of the sea, Sharkie and George, solve every mystery",
        "link": "http://www.google.com",
        "meta_bob": ["bob","bob2"]
    }
]'
Don't forget the "commit=true"!!

Delete the entire index

Sometimes it is useful to delete the entire index and re-index your data. To delete the entire index, simply use the following curl request:


curl http://<host>:<port>/solr/<core>/update?commit=true -d '<delete><query>*:*</query></delete>'
Here we are passing an XML document; the previous slide passed a JSON document. Solr can handle both.
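The same delete-by-query mechanism works for a subset of documents. For example, to remove only the "detectives" section indexed earlier (the field name is taken from the indexing example):

```shell
# Delete only the documents whose section field is "detectives"
curl 'http://localhost:5432/solr/default/update?commit=true' -d '<delete><query>section:detectives</query></delete>'
```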

Gotchas

Some gotchas that we have found along the way:


  • The NGramFilterFactory should probably never be used: it breaks the index/query text down into chunks of a specified size and searches against them. e.g. with an ngram size of 3, "batteries" would search for "bat", "ter" and "ies", which would have a positive match with "mysteries"... which is obviously wrong
  • The cause of most mysterious search results seems to be boosting; what works for one query can cause drastic effects with other queries... test, test, test!

Gotchas Cont...

  • The Solr Schema and Config are never finished, they are living documents and should be altered as your understanding of your data changes

SUMMARY

  • The two most important files are your schema.xml and solrconfig.xml
  • schema.xml describes what your data will look like and how to treat incoming data during indexing and queries
  • solrconfig.xml describes the default solr parameters and how to handle requests
  • request handling parameters can be overridden in the request URL
  • there are various tokenizers and filters you can apply to incoming data
  • results can be faceted
  • USE EDISMAX!!




localhost:5432/solr/default/select?q=questions&fq=question_type:"not-stupid"&wt=json
