Hybrid vector search

Text

Keywords

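from keybert import KeyBERT  # pip install keybert

def extract_keywords(text, length=3, threshold=0.5):
    """Return keyphrases of up to `length` words whose score exceeds `threshold`."""
    kw_model = KeyBERT()
    # MMR diversifies the candidates; each result is a (phrase, score) tuple
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, length),
        stop_words="english",
        use_mmr=True,
        top_n=20,
    )
    return [phrase for phrase, score in keywords if score > threshold]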

Example

Change embedding model

Rank  Model                        Size (M)  Memory usage (GB, fp32)  Dimensions  Max tokens  Benchmark (avg, 56 datasets)
102   all-mpnet-base-v2            110       0.41                     768         514         57.78
98    jina-embeddings-v2-small-en  33        0.12                     512         8192        58
111   all-MiniLM-L12-v2            33        0.12                     384         512         56.53

Pinecone

Hybrid search

  • Two types of embeddings: dense vectors and sparse vectors
  • Dense: "house" -> text-embedding-3-small -> a vector with hundreds to thousands of dimensions

..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...

Sparse vector

Example using BERT tokenizer:

from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# some_text is the document string to tokenize
inputs = tokenizer(
    some_text, padding=True, truncation=True,
    max_length=512
)
inputs.keys()

Inputs contain:

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
# Where input_ids are
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]

The input_ids are the IDs of the tokens seen in the text. Counting how often each ID occurs gives a sparse vector:

{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
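
The transformation above is a plain frequency count. A minimal sketch using only the standard library, with a short made-up ID list standing in for the tokenizer output (the IDs are illustrative, not the real output for any text):

```python
from collections import Counter

# Illustrative token IDs; in practice this would be inputs['input_ids'] from the tokenizer
input_ids = [101, 16984, 3526, 2331, 1006, 3526, 2331, 102]

# Count how many times each token ID occurs: {token_id: frequency}
sparse_counts = dict(Counter(input_ids))

# Same information in an indices/values layout
sparse_vector = {
    "indices": list(sparse_counts.keys()),
    "values": [float(c) for c in sparse_counts.values()],
}
print(sparse_counts)
```

Only the token IDs that actually occur are stored, which is what makes the vector sparse: the BERT vocabulary has about 30k entries, but a short text touches only a handful of them.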

BM25

  • From Pinecone: "We recommend using either BM25 or SPLADE sparse vectors."
  • BM25 (Best Matching 25): a term-based ranking model
  • Scores documents based on term frequency and document length
  • Term frequency: the number of times a particular term appears in a document.
  • Inverse document frequency: the importance of the term across the entire corpus.
    • Cardiology corpus: "heart" has low importance, because it appears in almost every document; typing "heart" in the search box won't narrow anything down
    • "atrial natriuretic peptide" is specific, so it has higher importance
  • Document length normalization: addresses the impact of document length on relevance scoring and avoids bias towards longer documents.
  • Query term saturation: mitigates the impact of repeating the same term (cat cat example)
  • Read more: https://archive.is/mlzYJ
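
The ingredients listed above can be put together in a few lines. This is a rough sketch of the standard Okapi BM25 formula, not the implementation used by Pinecone's encoder; the whitespace tokenization, the `bm25_scores` helper and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher IDF (the +1 keeps it positive)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # k1 makes term frequency saturate; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "heart disease and heart failure",
    "atrial natriuretic peptide in the heart",
    "heart rate monitoring",
]
scores = bm25_scores("atrial heart", docs)
print(scores)
```

In this toy corpus "heart" appears in every document and contributes little, while "atrial" appears in only one, so the second document ranks first.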

BM25

from pinecone_text.sparse import BM25Encoder  # pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)  # fit term statistics on your own corpus
# Or use the default parameters, fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()
# text is the document string to encode
sparse_vector = bm25.encode_documents(text)

Output:

{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}

To encode the query for hybrid search, use encode_queries instead:

query_sparse_vector = bm25.encode_queries(text)

Is it slow?

  • Data: basically all of Moby Dick.
  • Words: 200k+
  • Vector size: ~13k non-zero dimensions
  • Time: 2.1 s

For a few sentences it's basically instant (few ms).

Usage in Pinecone

upsert_response = index.upsert(
  vectors=[
    {'id': 'vec1',
      'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
      'metadata': {'genre': 'drama'},
      'sparse_values': {
          'indices': [1, 5],
          'values': [0.5, 0.5]
      }},  # ...
  ],
  namespace='example-namespace'
)

You can weight the dense and sparse vectors against each other:

def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: Array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs

Query:

sparse_vector = {
   'indices': [10, 45, 16],
   'values':  [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]

hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
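
Why scaling the query vectors is enough: assuming the index ranks by dot product and adds the dense and sparse contributions, multiplying the dense query by alpha and the sparse query by (1 - alpha) gives alpha * dense_score + (1 - alpha) * sparse_score. A small self-contained check with made-up document vectors (no Pinecone call involved; `dot_dense` and `dot_sparse` are helper names invented for this sketch):

```python
def dot_dense(a, b):
    """Dot product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def dot_sparse(a, b):
    """Dot product of two sparse vectors in indices/values form."""
    b_map = dict(zip(b["indices"], b["values"]))
    return sum(v * b_map.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))

# Made-up query and document vectors, for illustration only
q_dense = [0.1, 0.2, 0.3]
q_sparse = {"indices": [10, 45, 16], "values": [0.5, 0.5, 0.2]}
d_dense = [0.3, 0.1, 0.4]
d_sparse = {"indices": [10, 16, 99], "values": [0.4, 0.6, 0.1]}

alpha = 0.75
# Same scaling as hybrid_score_norm: dense * alpha, sparse * (1 - alpha)
hq_dense = [v * alpha for v in q_dense]
hq_sparse = {"indices": q_sparse["indices"],
             "values": [v * (1 - alpha) for v in q_sparse["values"]]}

# Score with the scaled query vs. the weighted sum of the unscaled scores
combined = dot_dense(hq_dense, d_dense) + dot_sparse(hq_sparse, d_sparse)
expected = alpha * dot_dense(q_dense, d_dense) + (1 - alpha) * dot_sparse(q_sparse, d_sparse)
print(combined, expected)
```

With alpha=1 the search is purely dense, with alpha=0 purely sparse; anything in between trades semantic similarity against exact term matches.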

Ok ok ok but why would we care about this? How does this improve search?

  • Embedded Google Search
  • RAG using dense embedding (chunking, overlap, append title)

Problems:

  • Search: "Mazda CX-9 2018 review"
  • Top results: "2019 Mazda CX-9: Video review", "Mazda CX-9 2017 Review" - wrong car
  • Recency of articles overlooked. Searching "Toyota Corolla" returns 10 year old articles.
  • General questions return specific articles. "What is a hybrid car?" finds specific hybrid models.

Solutions:

  • Hybrid search
  • Hierarchical document ranking - boost based on title
  • Year boosting
  • Instructor large dense embedding - task instruction + embedding
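
One way year boosting could look, as a hypothetical sketch: after retrieval, results whose title mentions a year from the query get their score multiplied by a boost factor. The `boost_by_year` helper, the result layout, the 1.5 factor, the scores and the third title are all assumptions made up for illustration, not the approach actually used here:

```python
import re

YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def boost_by_year(query, results, boost=1.5):
    """Re-rank results, boosting those whose title contains a year found in the query."""
    query_years = set(YEAR_PATTERN.findall(query))
    reranked = []
    for r in results:
        title_years = set(YEAR_PATTERN.findall(r["title"]))
        # Boost only when the query and the title share a year
        factor = boost if query_years & title_years else 1.0
        reranked.append({**r, "score": r["score"] * factor})
    return sorted(reranked, key=lambda r: r["score"], reverse=True)

# Made-up scores; the exact-year article starts out ranked last
results = [
    {"title": "2019 Mazda CX-9: Video review", "score": 0.82},
    {"title": "Mazda CX-9 2017 Review", "score": 0.80},
    {"title": "Mazda CX-9 2018 review", "score": 0.78},
]
reranked = boost_by_year("Mazda CX-9 2018 review", results)
print([r["title"] for r in reranked])
```

Queries without a year leave the ranking untouched, so the boost only applies when the user asked for a specific model year.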

Qdrant

Open source vector database

  • max vector size: 65535 dimensions
  • metadata: no limit
  • different vectors: the same collection can contain different vectors
  • collections: It is highly recommended not to create many small collections, as it will lead to significant resource consumption overhead.

Fusion vs reranking

Hybrid search

By Sasa Trivic
