SOLR in Rails

SOLR

  • Full-text search server with Apache Lucene at the backend
  • Open source, maintained by Apache
  • It's not an abbreviation. :P
  • Exposes Lucene's Java API as REST-like APIs that can be called over HTTP from any programming language or platform.
  • Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
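
Because Solr is reachable over plain HTTP, a query is essentially a URL. The following is a minimal sketch using only Ruby's standard library. The host, port and core name (`collection1`) are assumptions for illustration, and `solr_select_uri` is a hypothetical helper, not part of Solr or Sunspot.

```ruby
require "uri"

# Hypothetical helper: build the URL for a Solr /select request.
# The base URL (host, port, core name) is an assumption for illustration.
def solr_select_uri(query, base: "http://localhost:8983/solr/collection1", rows: 10)
  uri = URI.parse("#{base}/select")
  # q = query string, wt = response writer (JSON here), rows = page size
  uri.query = URI.encode_www_form(q: query, wt: "json", rows: rows)
  uri
end

uri = solr_select_uri("title:pizza")
puts uri
# Net::HTTP.get(uri) (after require "net/http") would send this to a running Solr.
```

Any language with an HTTP client can issue the same request, which is the point of the REST-like interface.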

Features

  • Full Text Search
  • Faceted search
  • More like this (recommendations) / related searches
  • Spell Suggest/Auto-Complete
  • Custom document ranking/ordering
  • Snippet generation/highlighting
  • And a lot more...

Why/when to use SOLR?

 

  • You want greater control over your website's search.
  • Caching, replication, distributed search.
  • Really fast indexing/searching; indexes can be merged/optimized (index compaction).
  • Great admin interface, usable over HTTP.
  • Awesome community support too.
  • Support for integration with various other products such as Drupal CMS.
  • Can be used in e-commerce sites, CMSs, blog sites.
  • Heavily used by LinkedIn, Twitter, CNET, Netflix, Digg.

Sunspot

 

  • Ruby library for expressive, powerful interaction with the Solr search engine
  • Built on top of the RSolr library, which provides a low-level interface for Solr interaction
  • Provides a simple, intuitive, expressive DSL backed by powerful features for indexing objects and searching for them
  • Easily plugged in to any ORM

Installation

# Add to Gemfile:

gem 'sunspot_rails'
gem 'sunspot_solr' # optional pre-packaged Solr distribution for use in development

# Bundle it!

bundle install

# Generate a default configuration file:

rails generate sunspot_rails:install

# If sunspot_solr was installed, start the packaged Solr distribution with:

bundle exec rake sunspot:solr:start # or sunspot:solr:run to start in foreground

Setting up Objects

    class Post < ActiveRecord::Base
      searchable do
        text :title, :body
        text :comments do
          comments.map { |comment| comment.body }
        end
    
        boolean :featured
        integer :blog_id
        integer :author_id
        integer :category_ids, :multiple => true
        double  :average_rating
        time    :published_at
        time    :expired_at
    
        string  :sort_title do
          title.downcase.sub(/\A(an?|the)\s+/, '')
        end
      end
    end
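
The `sort_title` field exists so that titles sort without their leading article. A small standalone sketch of that normalization follows. `sort_title` here is a hypothetical plain-Ruby helper, not Sunspot code. This variant anchors the pattern with `\A` and requires whitespace after the article, so a title such as "Answers" is left intact.

```ruby
# Hypothetical helper mirroring the sort_title block: lowercase the title
# and drop one leading article ("a", "an", "the") followed by whitespace.
def sort_title(title)
  title.downcase.sub(/\A(an?|the)\s+/, "")
end

titles = ["The Big Pizza", "An Oven", "Answers", "A Slice"]

# Sort by the normalized key while keeping the original titles for display.
sorted = titles.sort_by { |t| sort_title(t) }
p sorted
```

Indexing the normalized value as a `string` field lets Solr do this ordering at query time through `order_by :sort_title`.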

Searching Objects

    Post.search do
      fulltext 'big pizza' do
        fields(:body, :title => 2.0)
        phrase_fields :title => 2.0
        phrase_slop   1
        boost(2.0) { with(:featured, true) }
      end

      with :blog_id, 1
      with :category_ids, 5
      with(:published_at).less_than Time.now
      order_by :published_at, :desc
      paginate :page => 2, :per_page => 15
    end


# Note:
# text fields are full-text searchable.
# Other fields (e.g., integer and string) can be used to scope queries.
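
Solr itself pages results with the offset-based `start` and `rows` request parameters, so a page-based `paginate` call has to be translated into those. A rough sketch of the arithmetic, where `solr_window` is a hypothetical helper and not Sunspot's own code:

```ruby
# Hypothetical helper: translate page-based pagination into Solr's
# offset-based `start` and `rows` request parameters.
def solr_window(page:, per_page:)
  raise ArgumentError, "page must be >= 1" if page < 1

  # Page 1 starts at offset 0, page 2 at per_page, and so on.
  { start: (page - 1) * per_page, rows: per_page }
end

p solr_window(page: 2, per_page: 15)
```

For the search above (page 2, 15 per page) Solr would skip the first 15 hits and return the next 15.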

Configurations

  • sunspot.yml
  • solr.xml
  • solrconfig.xml
  • schema.xml
    
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>


    <dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>

    
    <solrQueryParser defaultOperator="OR"/>
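
To get a feel for what the analyzer chain in schema.xml does to a text field, here is a very rough plain-Ruby imitation. It is only an illustration under loud assumptions: the regexes are crude stand-ins for Solr's HTML stripping and tokenizing, the stopword list is made up (Solr reads the real one from stopwords.txt), and Porter stemming is left out entirely. `analyze` is a hypothetical helper.

```ruby
# Rough, hypothetical imitation of the analyzer chain (stemming omitted).
# The default stopword list is a tiny made-up sample.
def analyze(text, stopwords: %w[a an and of the to])
  text.gsub(/<[^>]*>/, " ")                     # strip HTML tags (char filter)
      .scan(/[[:alnum:]]+/)                     # split into word tokens (tokenizer)
      .map(&:downcase)                          # lowercase filter
      .reject { |tok| stopwords.include?(tok) } # stop filter
end

p analyze("<p>The Big Pizza of Naples</p>")
```

The same chain runs at index time and at query time, which is why a search for "PIZZA" can match a document containing "pizza".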

Lucene Scoring Model

  • tf - Term Frequency. The frequency with which a term appears in a document. Given a search query, the higher the term frequency, the higher the document score.
  • idf - Inverse Document Frequency. The rarer a term is across all documents in the index, the higher its contribution to the score.
  • coord - Coordination Factor. The more query terms that are found in a document, the higher its score.
  • fieldNorm - Field length. The more words a field contains, the lower its score. This factor penalizes documents with longer field values.
  • Boosts - In addition to the scoring factors mentioned above, the primary method of modifying document scores is by boosting.
    • Index-time boosts are applied when adding documents, and apply to the entire document or to specific fields.
    • Query-time boosts are applied when constructing a search query, and apply to specific fields.
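
The factors above can be sketched as small functions. This assumes the definitions used by Lucene's classic TF-IDF similarity (square-root term frequency, logarithmic idf, inverse square-root length norm). The method names are illustrative only, and the lossy encoding Lucene applies when storing norms is ignored.

```ruby
# tf: grows with the term's frequency in the document, but sub-linearly.
def classic_tf(freq)
  Math.sqrt(freq)
end

# idf: rarer terms (low doc_freq relative to num_docs) weigh more.
def classic_idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

# coord: fraction of the query terms that the document matched.
def classic_coord(overlap, max_overlap)
  overlap.to_f / max_overlap
end

# fieldNorm (length part): longer fields get a smaller norm.
def classic_length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

puts classic_tf(4)
puts classic_idf(9, 10)
puts classic_coord(1, 2)
puts classic_length_norm(4)
```

Index-time and query-time boosts are simply extra multipliers applied on top of these factors.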

Lucene Scoring Formula

$$\text{score}(q,d) = \text{coord-factor}(q,d) \cdot \text{query-boost}(q) \cdot \frac{V(q) \cdot V(d)}{|V(q)|} \cdot \text{doc-len-norm}(d) \cdot \text{doc-boost}(d)$$
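
A worked toy example of the conceptual formula may help. In this sketch the query and the document are plain hashes of term weights; the weights, the default factor values and the helper name `conceptual_score` are assumptions for illustration, not how Lucene actually stores or computes anything.

```ruby
# Toy version of the conceptual formula:
#   coord * query_boost * (V(q) . V(d) / |V(q)|) * doc_len_norm * doc_boost
# q_vec and d_vec map terms to weights.
def conceptual_score(q_vec, d_vec, query_boost: 1.0, doc_len_norm: 1.0, doc_boost: 1.0)
  return 0.0 if q_vec.empty?

  matched = q_vec.keys.count { |term| d_vec.key?(term) }
  coord   = matched.to_f / q_vec.size                         # share of query terms found
  dot     = q_vec.sum { |term, w| w * d_vec.fetch(term, 0.0) } # V(q) . V(d)
  q_len   = Math.sqrt(q_vec.values.sum { |w| w * w })          # |V(q)|

  coord * query_boost * (dot / q_len) * doc_len_norm * doc_boost
end

query = { "big" => 1.0, "pizza" => 1.0 }
doc   = { "pizza" => 2.0, "oven" => 1.0 }

puts conceptual_score(query, doc)
puts conceptual_score(query, doc, doc_boost: 2.0)
```

Only "pizza" matches, so coord is 0.5, and doubling `doc_boost` doubles the final score.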

References

  • https://github.com/sunspot/sunspot
  • https://wiki.apache.org/solr/
  • https://cwiki.apache.org/confluence/display/solr/
  • http://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/Similarity.html

Questions?

Thank you

Using SOLR in Rails

By Datt Dongare
