Information Management
api
replication
caching
consistency
persistence
management
Information Retrieval
indexing
querying
suggestions
ranking
sorting
highlighting
Analyzing each document with a tokenizer to find the terms
Can be as simple as splitting on space
Different techniques can be used to improve index quality and reduce index size
Stemming and lemmatization reduce words to a common base form
A stoplist makes the analyzer ignore very common words.
Lucene contains analyzers for different languages.
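The analysis steps above can be sketched in a few lines. This is only an illustration, not how Lucene's analyzers work: the stoplist, the suffix rules, and the function names (`tokenize`, `naive_stem`, `analyze`) are assumptions made up for the example, and the suffix stripping is a crude stand-in for a real stemmer.

```python
import re

# Hypothetical, tiny stoplist; real analyzers ship larger, language-specific lists.
STOPLIST = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "s")


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def naive_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def analyze(text):
    """Tokenize, drop stoplist words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPLIST]


print(analyze("The crocodiles are swimming in the rivers"))
```

The crude stems (for example "swimm") show why real analyzers use proper stemming or lemmatization rules per language.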
Finding documents that match the query
Based on set theory, it powers AND/OR/NOT queries and returns every document whose index entries fulfill the predicate
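As a minimal sketch of the set-theory idea, an inverted index can map each term to the set of document ids containing it, so that AND, OR, and NOT become intersection, union, and complement. The corpus and the `build_index` helper are hypothetical and exist only for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical three-document corpus.
docs = {
    1: "crocodile in the river",
    2: "alligator in the swamp",
    3: "crocodile and alligator",
}
index = build_index(docs)
all_ids = set(docs)

both = index["crocodile"] & index["alligator"]    # AND -> intersection
either = index["crocodile"] | index["alligator"]  # OR  -> union
not_croc = all_ids - index["crocodile"]           # NOT -> complement

print(both, either, not_croc)
```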
Based on the index, we want to be able to autocorrect or suggest alternative query terms
croccodial => crocodile
Straightforward Levenshtein distance can be a problem: each comparison takes O(n*m) time for strings of length n and m, so checking a query term against every term in a really big index becomes expensive
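To make the cost concrete, here is a sketch of the textbook dynamic-programming edit distance together with a brute-force suggester. The nested loops are where the O(n*m) per comparison comes from, and `suggest` repeats that work for every term. The `suggest` helper and the term list are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(query, terms):
    """Brute force: compare the query with every term and keep the closest."""
    return min(terms, key=lambda term: levenshtein(query, term))


# Hypothetical index terms.
terms = ["alligator", "crocus", "crocodile"]
print(suggest("croccodial", terms))
```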
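A Levenshtein automaton can be built in O(n) time for a fixed edit distance, where n is the length of the query term; testing a string then takes time linear in that string's length
A finite state machine is constructed from the query term and a maximum edit distance, and it is then used to test other strings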
Lucene uses a Levenshtein automaton translated to Java from a Python implementation by Jean-Philippe Barrette-LaPierre, because the Lucene committers originally didn't understand how it worked and couldn't get their own implementation working!
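The automaton idea can be illustrated with a simplified sketch that is not Lucene's implementation. Instead of precompiling a deterministic automaton, it simulates the states lazily: each state is a row of edit distances capped at `max_edits + 1`, which keeps the number of distinct states finite. Each step here still costs time proportional to the term length, so this shows the state-machine behaviour rather than the optimal running time. The class and method names are assumptions.

```python
class LevenshteinAutomaton:
    """Lazily simulated automaton accepting strings within max_edits of term."""

    def __init__(self, term, max_edits):
        self.term = term
        self.max_edits = max_edits

    def _cap(self, row):
        # Distances above max_edits are all equally useless, so cap them.
        return tuple(min(v, self.max_edits + 1) for v in row)

    def start(self):
        """State before any character of the candidate has been read."""
        return self._cap(range(len(self.term) + 1))

    def step(self, state, ch):
        """Transition: consume one character of the candidate string."""
        row = [state[0] + 1]
        for i, tc in enumerate(self.term):
            cost = 0 if tc == ch else 1
            row.append(min(row[i] + 1, state[i] + cost, state[i + 1] + 1))
        return self._cap(row)

    def is_match(self, state):
        """Accepting state: the whole term is within max_edits."""
        return state[-1] <= self.max_edits

    def can_match(self, state):
        """False once no continuation of the candidate can be accepted."""
        return min(state) <= self.max_edits

    def accepts(self, candidate):
        state = self.start()
        for ch in candidate:
            state = self.step(state, ch)
            if not self.can_match(state):
                return False  # dead state, stop early
        return self.is_match(state)


automaton = LevenshteinAutomaton("crocodile", max_edits=2)
for word in ["crocodile", "crocodil", "crokodyle", "alligator"]:
    print(word, automaton.accepts(word))
```

The early exit through `can_match` is what makes the automaton useful against a sorted term dictionary: whole ranges of terms sharing a dead prefix can be skipped.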
When we have found documents that match the query we want to sort them by relevance
We represent both the document and the query as vectors
The angle between the two vectors measures their similarity: the smaller the angle, the more similar the document is to the query
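A minimal sketch of the comparison, assuming the document and the query have already been turned into equal-length lists of term weights. It computes the cosine of the angle, so a smaller angle gives a value closer to 1. The vectors below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]  # same direction as the query
doc_b = [0.0, 0.0, 3.0]  # shares no terms with the query

print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```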
We weight the terms of each document so that the document can be represented as a vector
Many different weighting schemes and variations exist
Term frequencies can be raw, boolean, or augmented; augmented frequencies prevent bias toward longer documents
Similar to the stoplist during analysis, the inverse document frequency gives lower weight to terms that are very common in the document corpus.
Since we use the cosine of the angle as the score, the larger the similarity score, the more relevant the document.
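Putting the weighting and the cosine score together, the following sketch ranks a tiny corpus for a query. It assumes one particular weighting variant among the many possible ones: augmented term frequency (0.5 + 0.5 * count / max count) multiplied by idf = log(N / df). The corpus and all function names are hypothetical, and this is not Lucene's scoring formula.

```python
import math
from collections import Counter


def augmented_tf(tokens):
    """Term frequency scaled by the most frequent term in the same document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}


def idf(corpus):
    """log(N / df): terms that occur in every document get weight 0."""
    n = len(corpus)
    df = Counter(t for tokens in corpus for t in set(tokens))
    return {t: math.log(n / d) for t, d in df.items()}


def tfidf(tokens, idf_weights):
    """Sparse tf-idf vector as a dict; unknown terms get weight 0."""
    return {t: w * idf_weights.get(t, 0.0) for t, w in augmented_tf(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical, already tokenized corpus.
corpus = [
    "the crocodile swims in the river".split(),
    "the alligator sleeps in the swamp".split(),
    "the crocodile eats the crocodile".split(),
]
weights = idf(corpus)
query = tfidf("crocodile river".split(), weights)

# Larger cosine means more relevant, so sort in descending order.
ranked = sorted(
    range(len(corpus)),
    key=lambda i: cosine(query, tfidf(corpus[i], weights)),
    reverse=True,
)
print(ranked)
```

Note how "the" ends up with an idf of 0 because it occurs in every document, which is the same effect a stoplist achieves during analysis.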