Hybrid vector search
Text
Keywords

- Using the default embedding model
- https://maartengr.github.io/KeyBERT/faq.html: "the default model in KeyBERT ("all-MiniLM-L6-v2")"
from keybert import KeyBERT  # !pip install keybert

def extract_keywords(text, length=3, threshold=0.5):
    kw_model = KeyBERT()
    # Extract up to 20 candidate keyphrases of 1..length words,
    # diversified with Maximal Marginal Relevance (MMR)
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, length),
        stop_words="english",
        use_mmr=True,
        top_n=20,
    )
    # Keep only keyphrases whose relevance score exceeds the threshold
    return [kw for kw, score in keywords if score > threshold]
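A quick usage sketch (the input string is just a placeholder, not related to the measurements below):

keywords = extract_keywords("Hybrid search combines dense and sparse vector representations.", length=2)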
Example
- Time: 134.5 s, Documents processed: 50, Words processed: 20654
- Max tokens: 512 according to https://huggingface.co/spaces/mteb/leaderboard
- But its own model card https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 says "By default, input text longer than 256 word pieces is truncated."
- Large document: 18113 words, 49.3 s
Example
- You can specify the embedding model (see the sketch after the leaderboard table below). Labeling currently uses all-mpnet-base-v2 - https://huggingface.co/sentence-transformers/all-mpnet-base-v2: "By default, input text longer than 384 word pieces is truncated." - Need to check input size!
- It's the best-performing model on the SBERT pretrained models list https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
- all-MiniLM-L12-v2 is also available (~1.3% worse, ~3x faster)
Change embedding model
- Embedding models leaderboard https://huggingface.co/spaces/mteb/leaderboard
| Rank | Model | Size (M) | Memory usage (GB, fp32) | Dimensions | Max tokens | Benchmark (avg 56 datasets) |
|---|---|---|---|---|---|---|
| 102 | all-mpnet-base-v2 | 110 | 0.41 | 768 | 514 | 57.78 |
| 98 | jina-embeddings-v2-small-en | 33 | 0.12 | 512 | 8192 | 58 |
| 111 | all-MiniLM-L12-v2 | 33 | 0.12 | 384 | 512 | 56.53 |
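A minimal sketch of swapping the embedding model in KeyBERT (any SentenceTransformer model name from the table should work; the variable text is a placeholder):

from keybert import KeyBERT

# Pass a SentenceTransformer model name (or a pre-loaded model) when constructing KeyBERT
kw_model = KeyBERT(model="all-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), stop_words="english")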
Pinecone

Hybrid search
- Two types of embeddings: dense vectors and sparse vectors
- Dense: "house" -> text-embedding-3-small -> ~100s-1000s of dimensions (see the sketch after this list)
  ..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...
- Sparse: "house" -> BM25 ->
  {'indices': [2406372295], 'values': [0.7539757220045282]}
- Only 1 dimension? Yes and no - it actually has far more dimensions than a dense vector:
- "Sparse vector values can contain up to 1000 non-zero values and 4.2 billion dimensions."
- We only store the non-zero values.
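A minimal sketch of producing the dense side, assuming the OpenAI Python client with an API key in the environment (the sparse/BM25 side is covered below):

from openai import OpenAI  # !pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(model="text-embedding-3-small", input="house")
dense_vector = response.data[0].embedding  # 1536 floats by default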
Sparse vector
Example using BERT tokenizer:
from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
inputs = tokenizer(
    some_text,  # some_text is your input string
    padding=True, truncation=True,
    max_length=512
)
inputs.keys()
Inputs contain:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
# Where input_ids are
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]
Token IDs seen in the text. Transform into {token_id: frequency} pairs:
{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
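A minimal sketch of that transform, assuming inputs comes from the tokenizer call above:

from collections import Counter

# Count how often each token ID occurs; only these non-zero counts are kept,
# which is what makes the resulting vector "sparse"
sparse_dict = dict(Counter(inputs["input_ids"]))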

BM25
- Term frequency: number of times a particular term appears in the document.
- Inverse document frequency: importance of the term across the entire corpus.
- Cardiology corpus: "heart" would have low importance - you can't really find a specific document by typing "heart" in the search box
- "atrial natriuretic peptide" is specific, so it gets higher importance
- Document length normalization: address the impact of document length on relevance scoring, avoiding bias toward longer documents.
- Query term saturation: mitigate the impact of repeating the same term ("cat cat cat" shouldn't score much higher than "cat"); these pieces combine into the BM25 score (see the sketch after this list)
- Read more: https://archive.is/mlzYJ
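A minimal sketch of how those components fit together, using the standard Okapi BM25 formula (not Pinecone's exact implementation; k1 and b are the usual tuning parameters):

import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus: list of tokenized documents; doc_terms: the tokenized document being scored
    N = len(corpus)                                   # number of documents
    avgdl = sum(len(d) for d in corpus) / N           # average document length
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                    # term frequency in this document
        df = sum(1 for d in corpus if term in d)      # documents containing the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # inverse document frequency
        # k1 caps term-frequency saturation, b controls document length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score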
BM25
Implementation in Pinecone https://docs.pinecone.io/guides/data/encode-sparse-vectors
from pinecone_text.sparse import BM25Encoder  # !pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)  # fit BM25 statistics on your own corpus
# Or use the default parameters, fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()
sparse_vector = bm25.encode_documents(text)  # text = document string(s) to encode
Output:
{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}
To encode the query vector for hybrid search you use encode_queries instead:
query_sparse_vector = bm25.encode_queries(text)
Is it slow?
- Data: basically all of Moby Dick.
- Words: 200k+
- Vector size: ~13k non-zero dimensions
- Time: 2.1 s
For a few sentences it's basically instant (few ms).
Usage in Pinecone
upsert_response = index.upsert(
    vectors=[
        {'id': 'vec1',
         'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
         'metadata': {'genre': 'drama'},
         'sparse_values': {
             'indices': [1, 5],
             'values': [0.5, 0.5]
         }},
        # ...
    ],
    namespace='example-namespace'
)
You can weight the dense and sparse vectors relative to each other:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: Array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
Query:
sparse_vector = {
    'indices': [10, 45, 16],
    'values': [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]
hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)
query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
Ok ok ok but why would we care about this? How does this improve search?
- Example: https://towardsdatascience.com/the-untold-side-of-rag-addressing-its-challenges-in-domain-specific-searches-808956e3ecc8 or https://archive.is/2aPTG
- https://www.carsales.com.au/ built a search tool that scans tens of thousands of car-related articles
- Embedded Google Search
- RAG using dense embedding (chunking, overlap, append title)
Problems:
- Search: "Mazda CX-9 2018 review"
- Top results: "2019 Mazda CX-9: Video review", "Mazda CX-9 2017 Review" - wrong model year
- Recency of articles is overlooked. Searching "Toyota Corolla" returns 10-year-old articles.
- General questions return overly specific articles. "What is a hybrid car?" finds articles about specific hybrid models.
- Solutions:
- Hybrid search
- Hierarchical document ranking - boost based on title
- Year boosting (see the sketch after this list)
- Instructor-large dense embeddings - prepend a task instruction to the text being embedded
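A minimal sketch of year boosting, assuming each retrieved match carries a year field in its metadata (the helper and field names are hypothetical, not from the article):

import re

def boost_by_year(matches, query, boost=0.2):
    # Bump the score of matches whose metadata year matches a year mentioned in the query
    m = re.search(r"\b(19|20)\d{2}\b", query)
    if not m:
        return matches
    year = m.group()
    for match in matches:
        if str(match.get("metadata", {}).get("year", "")) == year:
            match["score"] += boost
    return sorted(matches, key=lambda x: x["score"], reverse=True)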
Qdrant
Open source vector database
- Max vector size: 65,535 dimensions
- Metadata (payload): no size limit
- Different vectors: the same collection can hold multiple named vectors per point (see the sketch after this list)
- Collections: it is highly recommended not to create many small collections, as that leads to significant resource consumption overhead.
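A minimal sketch of a collection with both a dense and a sparse named vector, assuming the qdrant-client Python package (collection and vector names are made up):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="articles",
    vectors_config={
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(),
    },
)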
Qdrant
Fusion vs reranking
Hybrid search
By Sasa Trivic