Pinecone and hybrid search

Storing text in metadata
For pod-based indexes, Pinecone indexes all metadata by default. Cardinality is an important factor here: when metadata contains many unique values, a pod-based index consumes significantly more memory, which can lead to performance issues, pod fullness, and a reduction in the number of vectors that fit per pod. In short, lots of unique values in indexed metadata are a problem.
What is cardinality?
Cardinality in databases refers to the uniqueness of data values contained in a column. It essentially measures how many distinct values exist in a column compared to the total number of rows in a table.
So does this affect us here, given that we are not filtering by text? The documentation says:
For pod-based indexes, Pinecone indexes all metadata by default.
Is indexing turned off for the text field? Do we really want to index potentially millions of words of text?
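For pod-based indexes, Pinecone supports selective metadata indexing, so the text field can simply be left out of the indexed set. A minimal sketch (the index name, environment, and field names are assumptions for illustration):

from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="docs-index",               # hypothetical index name
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment="us-east1-gcp",  # hypothetical environment
        pod_type="p1.x1",
        # Only these fields are indexed; the raw text field is left unindexed
        metadata_config={"indexed": ["genre", "year"]},
    ),
)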
Resources used
Using text-embedding-ada-002 with 1536 dimensions. All of the figures below assume a metadata size of 500 bytes.
Read units:
- fetch: 10 records = 1 RU
- query: 100k records = 6 - 18 RUs.
- list: 1 call = 1 RU
Write units:
- upsert: 7 WUs
- update: 4-11+ WUs
- delete: 7 WUs
Example usage storing/reading text
Data: ~ 14 kb of text, 4 documents, 20 sections (vectors) in total
| | Start | Only metadata | Metadata that contains text |
|---|---|---|---|
| WU counter | 557 WUs | 689 WUs | 832 WUs |
| WUs used for upsert | – | 132 | 143 |
6000 Tokens, 1536 dimensions - 6 RUs with metadata, 5 RUs without
Overview
Cons:
- Limited size - currently 40 KB of metadata per record https://docs.pinecone.io/guides/data/filter-with-metadata#supported-metadata-size (LLM/SLM contexts are getting longer, so we want to pull in larger and larger chunks)
- Degraded performance if the text field is indexed
- The majority of server resources end up being used for text instead of vectors
Pros:
- You avoid making multiple requests
- Only one database to manage
Keywords

- Using the default embedding model
- https://maartengr.github.io/KeyBERT/faq.html: "the default model in KeyBERT ("all-MiniLM-L6-v2")"
from keybert import KeyBERT  # !pip install keybert

def extract_keywords(text, length=3, threshold=0.5):
    kw_model = KeyBERT()
    keywords = kw_model.extract_keywords(
        text, keyphrase_ngram_range=(1, length),
        stop_words="english", use_mmr=True, top_n=20,
    )
    # Keep only keyphrases whose similarity score exceeds the threshold
    return [phrase for phrase, score in keywords if score > threshold]
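A quick usage sketch (the sample sentence and threshold are made up; the exact keyphrases depend on the model):

text = "Pinecone combines dense and sparse vectors for hybrid semantic search."
print(extract_keywords(text, length=2, threshold=0.3))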
Example
- Time: 134.5 s, Documents processed: 50, Words processed: 20654
- Max tokens is listed as 512 according to https://huggingface.co/spaces/mteb/leaderboard
- But its own model card https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 says "By default, input text longer than 256 word pieces is truncated."
- Large document: 18113 words, 49.3 s
Example
- You can specify the embedding model (see the sketch after this list). Labeling currently uses all-mpnet-base-v2 - https://huggingface.co/sentence-transformers/all-mpnet-base-v2: "By default, input text longer than 384 word pieces is truncated." - Need to check input size!
- It's the best-performing model on the sentence-transformers pretrained models list https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
- all-MiniLM-L12-v2 is also available (~1.3% worse, ~3x faster)
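A minimal sketch of swapping the KeyBERT backend; the model name comes from the bullets above, the sample sentence is made up:

from keybert import KeyBERT

# Pass any sentence-transformers model name as the backend
kw_model = KeyBERT(model="all-mpnet-base-v2")
keywords = kw_model.extract_keywords("Sparse vectors enable keyword-aware retrieval in Pinecone.")
print(keywords)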
Change embedding model
- Embedding models leaderboard https://huggingface.co/spaces/mteb/leaderboard
| Rank | Model | Size (M) | Memory usage (GB, fp32) | Dimensions | Max tokens | Benchmark (avg 56 datasets) |
|---|---|---|---|---|---|---|
| 102 | all-mpnet-base-v2 | 110 | 0.41 | 768 | 514 | 57.78 |
| 98 | jina-embeddings-v2-small-en | 33 | 0.12 | 512 | 8192 | 58 |
| 111 | all-MiniLM-L12-v2 | 33 | 0.12 | 384 | 512 | 56.53 |
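A small sanity-check sketch for two of the models in the table, assuming sentence-transformers is installed; the expected output shapes correspond to the Dimensions column:

from sentence_transformers import SentenceTransformer

for name in ["all-mpnet-base-v2", "all-MiniLM-L12-v2"]:
    model = SentenceTransformer(name)
    vec = model.encode("The quick brown fox jumps over a lazy dog")
    print(name, vec.shape)  # expected: (768,) and (384,) per the table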
Hybrid search
- Two types of embeddings: dense vectors and sparse vectors
- Dense (see the sketch after this list):
  "house" -> text-embedding-3-small -> ~100s-1000s of dimensions
  ..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...
- Sparse:
  "house" -> BM25 ->
  {'indices': [2406372295], 'values': [0.7539757220045282]}
  One dimension? Yes and no. The sparse space actually has far more dimensions than a dense vector; we only store the non-zero values.
  - "Sparse vector values can contain up to 1000 non-zero values and 4.2 billion dimensions."
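For reference, a minimal sketch of producing the dense side with the OpenAI SDK (assumes OPENAI_API_KEY is set; the model name is the one mentioned above):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small", input="house")
dense = resp.data[0].embedding
print(len(dense))  # 1536 dimensions by default for this model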
Sparse vector
Example using BERT tokenizer:
from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
inputs = tokenizer(
    some_text, padding=True, truncation=True,
    max_length=512
)
inputs.keys()
Inputs contain:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
# Where input_ids are
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]
These are the token IDs seen in the text. Count how many times each ID occurs to get a {token_id: count} mapping:
{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
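A minimal sketch of that transform, reusing `inputs` from the snippet above:

from collections import Counter

# Map each token id to its number of occurrences in the text
sparse_counts = dict(Counter(inputs['input_ids']))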

BM25
- Term frequency: the number of times a particular term appears in a document.
- Inverse document frequency: the importance of the term across the entire corpus.
  - In a cardiology corpus, "heart" would have low importance; you can't really find a specific document by typing "heart" into the search box.
  - "atrial natriuretic peptide" is specific, so it gets higher importance.
- Document length normalization: addresses the impact of document length on relevance scoring, avoiding a bias toward longer documents.
- Query term saturation: mitigates the impact of repeating the same term (the "cat cat cat" example: extra repetitions add less and less to the score). Both of these appear as parameters in the formula after this list.
- Read more: https://archive.is/mlzYJ
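Putting these pieces together, the textbook BM25 score looks like this (the pinecone-text encoder is a BM25 variant, so treat this as the standard formula rather than its exact implementation):

$$\text{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ is the term frequency of $q_i$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the corpus, $k_1$ controls term-frequency saturation, and $b$ controls length normalization.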
BM25
Implementation in Pinecone https://docs.pinecone.io/guides/data/encode-sparse-vectors
from pinecone_text.sparse import BM25Encoder  # !pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)
# Or use the default parameters, fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()
sparse_vector = bm25.encode_documents(data[2])
Output:
{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}
To encode the query side for hybrid search:
query_sparse_vector = bm25.encode_queries("quick brown fox")
Is it slow?
- Data: basically all of Moby Dick.
- Words: 200k+
- Vector size: ~13k non-zero dimensions
- Time: 2.1 s
For a few sentences it's basically instant (few ms).
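A sketch of how such a measurement might look (the file path is hypothetical; fitting and encoding the full novel dominate the runtime):

import time
from pinecone_text.sparse import BM25Encoder

moby_dick_text = open("moby_dick.txt").read()  # hypothetical local copy of the full novel

start = time.time()
bm25 = BM25Encoder()
bm25.fit([moby_dick_text])
vec = bm25.encode_documents(moby_dick_text)
print(f"{time.time() - start:.1f} s, {len(vec['indices'])} non-zero dimensions")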
Usage in Pinecone
upsert_response = index.upsert(
    vectors=[
        {
            'id': 'vec1',
            'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
            'metadata': {'genre': 'drama'},
            'sparse_values': {
                'indices': [1, 5],
                'values': [0.5, 0.5]
            }
        },
        # ... more records
    ],
    namespace='example-namespace'
)
You can add weight to the vectors:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: Array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
Query:
sparse_vector = {
    'indices': [10, 45, 16],
    'values': [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]
hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)
query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
OK, but why would we care about this? How does it improve search?
- Example: https://towardsdatascience.com/the-untold-side-of-rag-addressing-its-challenges-in-domain-specific-searches-808956e3ecc8 or https://archive.is/2aPTG
- https://www.carsales.com.au/: a search tool that scans tens of thousands of car-related articles
- Embedded Google Search
- RAG using dense embedding (chunking, overlap, append title)
Problems:
- Search: "Mazda CX-9 2018 review"
- Top results: "2019 Mazda CX-9: Video review", "Mazda CX-9 2017 Review" - right model, wrong year
- Recency of articles is overlooked: searching "Toyota Corolla" returns 10-year-old articles.
- General questions return overly specific articles: "What is a hybrid car?" finds articles about specific hybrid models.
Solutions (see the sketch after this list for the boosting ideas):
- Hybrid search
- Hierarchical document ranking - boost based on the title
- Year boosting
- Instructor-large dense embeddings - task instruction + embedding
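A hypothetical sketch of the title/year boosting idea; the field names, weights, and re-rank logic are made up for illustration and are not the article's actual implementation. `matches` is assumed to be a list of Pinecone query matches represented as dicts:

from datetime import datetime

def rerank(matches, query_terms, title_boost=0.2, year_weight=0.05):
    """Re-score matches by boosting title overlap and recency."""
    current_year = datetime.now().year
    rescored = []
    for m in matches:
        score = m['score']
        title = m['metadata'].get('title', '').lower()
        year = m['metadata'].get('year', current_year)
        # Boost documents whose title contains query terms
        overlap = sum(term.lower() in title for term in query_terms)
        score += title_boost * overlap
        # Penalize older articles
        score -= year_weight * (current_year - year)
        rescored.append((score, m))
    return [m for _, m in sorted(rescored, key=lambda x: x[0], reverse=True)]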
By Sasa Trivic
Storing text inside Pinecone metadata. Keyword extraction. Sparse vectors and hybrid search.