NKHumphreys
"Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search..." - Wikipedia
It is written in Java, and runs as a standalone server within a servlet container (we use Tomcat)
It is not a scraper.
It will index what you pass to it, but it will not scrape the content from your site or your database.
There are various packages and libraries that will do this for you.
Alternatively, you can write your own site spider and, while you are at it, I think the world is in need of a circular-shaped item for the front of wheelbarrows.
Solr queries are RESTful, and the responses can be in XML, JSON, or CSV
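For example, a hypothetical request and (trimmed) JSON response might look like this; the wt parameter selects the response format:

```
http://localhost:8983/solr/mycore/select?q=title:solr&wt=json

{
  "responseHeader": { "status": 0, "QTime": 1 },
  "response": { "numFound": 1, "start": 0, "docs": [ ... ] }
}
```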
Solr also supports replication
Before you start implementing your schema:
The Solr schema tells Solr what your data is going to look like and how to treat each element of your data
e.g
<field name="title" type="text_en" stored="true" indexed="true" />
This tells Solr to treat the title attribute as type text_en. It also tells Solr that the field's value should be retrievable during search (stored) and that the field should itself be searchable (indexed)
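To illustrate, the two flags can be set independently (hypothetical field names):

```xml
<!-- searchable, but its value is not returned in results -->
<field name="search_text" type="text_en" stored="false" indexed="true" />
<!-- returned in results, but not searchable -->
<field name="thumbnail_url" type="string" stored="true" indexed="false" />
```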
We can then tell Solr how to treat a data type during indexing. This is called an Analyzer:
<fieldType name="text_en" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
The text field will be passed through the "pipe" of tokenizers and filters... We can also tell Solr how to treat data types during a query:
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
Again, the query text input will be passed through a "pipe" of tokenizers and filters...
A Tokenizer breaks text up into a stream of tokens. A token is usually a subset of the characters in the text input.
Tokens also carry metadata, such as where in the text the token value occurs.
e.g.
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
</analyzer>
This creates tokens of characters separated by whitespace
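In Python terms, a whitespace tokenizer behaves roughly like the sketch below (not Solr's actual code), including the positional metadata mentioned above:

```python
import re

# A rough simulation of what a whitespace tokenizer produces: each token
# is a value plus metadata recording where in the original text it occurs.
def whitespace_tokenize(text):
    """Return (value, start_offset, end_offset) for each whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

print(whitespace_tokenize("Sharkie and George"))
# → [('Sharkie', 0, 7), ('and', 8, 11), ('George', 12, 18)]
```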
Like Tokenizers, filters take an input and produce a stream of tokens. However, a Filter's input is also a token stream. This input can come from either a Tokenizer or another Filter.
e.g.
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.EnglishPorterFilterFactory" />
</analyzer>
This first breaks the text up into a stream of tokens separated by whitespace, then lower-cases each token, and finally applies standard English stemming to each token (more on this later).
The general shape of a fieldType definition is:
<fieldType>
  <analyzer type="index">
    <tokenizer ... />
    <filter ... />
  </analyzer>
  <analyzer type="query">
    <tokenizer ... />
    <filter ... />
  </analyzer>
</fieldType>
You can have one of these for each fieldType you define.
For a list of tokenizers and filters and what they do, see:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
You can supply a stopwords.txt and a synonyms.txt
When the StopFilterFactory is applied in an analyzer, any words listed in stopwords.txt are removed from the token stream
When the SynonymFilterFactory is applied in an analyzer, any words listed in synonyms.txt are replaced according to its mappings
synonyms.txt
i-pod, i pod => ipod
# multiple instances are merged
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz
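A toy sketch of how stopword and synonym filters transform a token stream (the real work happens inside Solr's filter factories; the word lists here stand in for stopwords.txt and synonyms.txt):

```python
# Stand-ins for stopwords.txt and synonyms.txt
STOPWORDS = {"the", "and", "of"}
SYNONYMS = {"i-pod": "ipod", "i pod": "ipod"}

def stop_filter(tokens):
    # drop any token listed in the stopwords file
    return [t for t in tokens if t not in STOPWORDS]

def synonym_filter(tokens):
    # replace a token with its mapped value, if one exists
    return [SYNONYMS.get(t, t) for t in tokens]

print(synonym_filter(stop_filter(["the", "new", "i-pod"])))
# → ['new', 'ipod']
```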
Another useful feature of the Solr schema is "copy fields". A copy field tells Solr to take a copy of the original field's value, before tokenizers and filters are applied, and store it, so that you may interpret the field in more than one way (different types -> different tokenizers and filters applied).
A good example of copy field usage is storing the title field for sorting
<field name="title_sort" type="string" stored="true" indexed="true"/>
<copyField source="title" dest="title_sort"/>
This copies what is in the title field (which has type "text_en") and stores it in a field called title_sort (which has type "string"). This is because the string type is sortable.
In the solrconfig.xml we can specify the parameters for configuring Solr
As well as default parameter configuration, you can define Request Handlers in this file.
A request handler defines the logic executed for any request as well as default parameters for the search
<requestHandler name="/selectSimple" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <str name="fl">product_id,score</str>
    <str name="qf">author^10 title^10 search_text</str>
    <str name="bf">availability^15 scale(rating,0,100)</str>
    <str name="sort">author, published_date</str>
    <str name="pf">author^30 full_title^30 search_isbn^10</str>
    <str name="facet.field">author</str>
    <str name="f.author.facet.mincount">1</str>
  </lst>
</requestHandler>
This request handler can be accessed via the "name" parameter in the URL
What do these parameters mean? Here are some of them...
(more on boosts and relevancy later)
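Roughly, for the handler above (these glosses are assumptions based on the standard eDisMax parameter documentation):

```xml
<str name="wt">json</str>                                   <!-- response format -->
<str name="defType">edismax</str>                           <!-- which query parser to use -->
<str name="fl">product_id,score</str>                       <!-- fields to return -->
<str name="qf">author^10 title^10 search_text</str>         <!-- fields to search, with per-field boosts -->
<str name="pf">author^30 full_title^30 search_isbn^10</str> <!-- extra boost when the query matches as a phrase -->
<str name="bf">availability^15 scale(rating,0,100)</str>    <!-- additive boost functions -->
<str name="sort">author, published_date</str>               <!-- sort order -->
<str name="facet.field">author</str>                        <!-- field to facet on -->
```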
To find out what the rest of the parameters mean, and how to use edismax, visit:
http://wiki.apache.org/solr/ExtendedDisMax
There are too many to go into in this talk
Faceted results are just results broken up into categories (often showing counts for each category).
These can be defined in the solrconfig.xml Request Handler definition.
<str name="facet.field">author</str>
The "author" field must exist in the schema
A filter query parameter (fq) can then be passed in the query URL
For example:
We could index the author of books using a copy field of type solr.StrField
e.g. Tolkien, Rowling etc...
Then when a user searches for a term e.g. "Fantasy" and clicks on a category (facet) to drill down on an author e.g. "Tolkien" we can reissue the query with a fq parameter e.g.
fq=author:"Tolkien"
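Putting it together, the reissued request might look like this (hypothetical host and core):

```
http://localhost:8983/solr/books/select?q=Fantasy&fq=author:"Tolkien"&wt=json
```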
This will only return results in that facet.
The default order in which results are returned from Solr is "Relevancy". You can boost a field's contribution to the relevancy score using the following syntax (you should be using eDisMax, coming up in a few slides):
<str name="bf">author^15 title^10</str>
This takes the score for author, and boosts the result by 15. The boost is additive, so 15 is added to the relevancy score.
So if a query for the word "Tolkien" is performed, and one result has "Tolkien" in the "author" field and another result has "Tolkien" in the description field, the first result will appear first in the results
<str name="boost">author^15 description^2</str>
A word of warning: the "boost" parameter is multiplicative, and using a multiplicative boost with a large boost value can render all other relevancy contributions redundant. Besides additive and multiplicative boosts, there are other boost functions. Some useful ones are:
<str name="bf">scale(rating,0,100)</str>
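A couple of other commonly used functions, with hypothetical field names (these examples follow the FunctionQuery wiki):

```xml
<!-- favour recently published documents -->
<str name="bf">recip(ms(NOW,published_date),3.16e-11,1,1)</str>
<!-- damp the effect of very large field values -->
<str name="bf">log(popularity)</str>
```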
For a full list of available functions see:
http://wiki.apache.org/solr/FunctionQuery
eDisMax (Extended Disjunction Max)
This is a query parser with extended functionality.
It allows you to specify additional boosting queries and filtering queries. YOU SHOULD BE USING THIS!!
All of the eDisMax parameters can be specified in the SolrConfig or overridden using GET parameters in the request URL
All of the query parameters/attributes you have seen in these slides are available when using eDisMax (even if they are not specific to eDisMax)
Have I said that YOU SHOULD BE USING IT?
Once you have scraped the data from your site, and put it into the format that your Solr schema is expecting, you can POST the data as a JSON document to:
<solr_url>/solr/<core>/update?commit=true
e.g.
curl 'http://localhost:5432/solr/default/update?commit=true' -H 'Content-type:application/json' -d '
[
{
"section": "detectives",
"title": "Sharkie and George",
"content": "Sharkie and George, crime busters of the sea, Sharkie and George, solve every mystery",
"link": "http://www.google.com",
"meta_bob": ["bob","bob2"]
}
]'
Don't forget the "commit=true"!!
Sometimes it is useful to delete the entire index and re-index your data. To delete the entire index, simply issue the following curl request:
curl http://<host>:<port>/solr/<core>/update?commit=true -d '<delete><query>*:*</query></delete>'
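The same endpoint accepts narrower delete queries. For example, to delete only the documents in one (hypothetical) section:

```
curl http://<host>:<port>/solr/<core>/update?commit=true -d '<delete><query>section:detectives</query></delete>'
```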
Here we are passing an XML document; the previous slide passed a JSON document. Solr can handle both.
Some gotchas that we have found along the way:
localhost:5432/solr/default/select?q=questions&fq=question_type="not-stupid"&wt=json