Hybrid vector search

Text

Keywords

KeyBERT

Using default embedding model
https://maartengr.github.io/KeyBERT/faq.html " the default model in KeyBERT ("all-MiniLM-L6-v2")"

def extract_keywords(text, length=3, threshold=0.5):
    kw_model = KeyBERT()
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, length), stop_words="english", use_mmr=True, top_n=20)
    output = []
    for k in keywords:
        if k[1] > threshold:
            output.append(k[0])
    return output

Example

Time: 134.5 s, Documents processed: 50, Words processed: 20654
max tokens as 512 according to https://huggingface.co/spaces/mteb/leaderboard
But its own model card https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 says "By default, input text longer than 256 word pieces is truncated."
Large document: 18113 words, 49.3 s

Example

You can specify embedding model. Currently labeling is using all-mpnet-base-v2
https://huggingface.co/sentence-transformers/all-mpnet-base-v2 "By default, input text longer than 384 word pieces is truncated." - Need to check input size!
It's the best one on KeyBERT list https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
all-MiniLM-L12-v2 is also available (~1.3% worse, ~3x faster)

Change embedding model

Embedding models leaderboard https://huggingface.co/spaces/mteb/leaderboard

Rank	Model	Size (M)	Memory usage (GB, fp32)	Dimensions	Max tokens	Benchmark (avg 56 datasets)
102	all-mpnet-base-v2	110	0.41	768	514	57.78
98	jina-embeddings-v2-small-en	33	0.12	512	8192	58
111	all-MiniLM-L12-v2	33	0.12	384	512	56.53

Pinecone

https://docs.pinecone.io/reference/api/data-plane/upsert

Hybrid search

https://www.pinecone.io/learn/hybrid-search-intro/

Two types of embeddings - dense vectors, sparse vectors
Dense: house -> text-embedding-3-small -> ~100s-1000s of dimensions

..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...

Sparse: house -> BM25 ->
{'indices': [2406372295], 'values': [0.7539757220045282]}
1 dimension?
Yes and no. It's actually way more dimensions than dense vector
"Sparse vector values can contain up to 1000 non-zero values and 4.2 billion dimensions."
We only store non-zero values.

Sparse vector

Example using BERT tokenizer:

from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
inputs = tokenizer(
   some_text, padding=True, truncation=True,
   max_length=512
)
inputs.keys()

Inputs contain:

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
# Where input_ids are
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]

Tokens seen in text. Transform into:

{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }

BM25

From Pinecone: "We recommend using either BM25 or SPLADE sparse vectors."
BM25 - Best match 25 - term-based ranking model
Scores documents based on their term frequency and document length

Term frequency: number of times a particular term appears.
Inverse document frequency: Importance of the term in the entire corpus.
- Cardiology corpus: "heart" would have low importance, you can't really find document by typing "heart" in the search box
- "atrial natriuretic peptide" specific, higher importance
Document length normalization: Address the impact of document length on relevance scoring. Avoid bias for longer documents.
Query term saturation: Mitigate impact of repeating same term (cat cat example)
Read more: https://archive.is/mlzYJ

BM25

Implementation in Pinecone https://docs.pinecone.io/guides/data/encode-sparse-vectors

from pinecone_text.sparse import BM25Encoder

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)
# Or using default dataset
bm25 = BM25Encoder.default() # parameters fitted based on MS MARCO dataset
sparse_vector = dm25.encode_documents(text)

Output:

{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}

To encode vector for hybrid search you do:

query_space_vector = bm25.encode_queries(text)

Is it slow?

Data: basically all of Moby Dick.
Words: 200k+
Vector size: ~13k non-zero dimensions
Time: 2.1 s

For a few sentences it's basically instant (few ms).

Usage in Pinecone

https://docs.pinecone.io/guides/data/query-sparse-dense-vectors Upserting

upsert_response = index.upsert(
  vectors=[
    {'id': 'vec1',
      'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
      'metadata': {'genre': 'drama'},
      'sparse_values': {
          'indices': [1, 5],
          'values': [0.5, 0.5]
      }},....
  ],
  namespace='example-namespace'
)

You can add weight to the vectors:

def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: Array of floats representing
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs

Query:

sparse_vector = {
   'indices': [10, 45, 16],
   'values':  [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]

hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)

Ok ok ok but why would we care about this? How does this improve search?

Example: https://towardsdatascience.com/the-untold-side-of-rag-addressing-its-challenges-in-domain-specific-searches-808956e3ecc8 or https://archive.is/2aPTG
https://www.carsales.com.au/, search tool that scans tens of thousands of car-related articles

Embedded Google Search
RAG using dense embedding (chunking, overlap, append title)

Problems:

Search: "Mazda CX-9 2018 review"
Top results: "2019 Mazda CX-9: Video review", "Mazda CX-9 2017 Review" - wrong car
Recency of articles overlooked. Searching "Toyota Corolla" returns 10 year old articles.
General questions return specific articles. "What is a hybrid car?" finds specific hybrid models.

Solutions:
- Hybrid search
- Hierarchical document ranking - boost based on title
- Year boosting
- Instructor large dense embedding - task instruction + embedding

Qdrant

https://qdrant.tech/

Open source vector database

vector size: 65535
metadata: no limit
different vectors: same collection can contain different vectors
collections: It is highly recommended not to create many small collections, as it will lead to significant resource consumption overhead.

Qdrant

Fusion vs reranking

Qdrant

Fusion vs reranking