Text
Cardinality is an important factor to consider when managing an index. For pod-based indexes, Pinecone indexes all metadata by default, and when that metadata contains many unique values the index consumes significantly more memory, which can lead to performance issues, pod fullness, and a reduction in the number of vectors that fit per pod.
In short: lots of unique values in metadata are bad, but only if that metadata is actually indexed.
What is cardinality?
Cardinality in databases refers to the uniqueness of data values contained in a column: how many distinct values exist in the column compared to the total number of rows in the table. A user_id column where nearly every row is distinct has high cardinality; a genre column with a handful of repeated values has low cardinality.
So does this affect us here if we are not filtering by the text field? Well, the documentation says:
For pod-based indexes, Pinecone indexes all metadata by default.
Can that indexing be turned off? Every chunk of text is effectively a unique value, so do we really want to index potentially millions of words of text?
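It can: pod-based indexes support selective metadata indexing, so the raw text can be stored in metadata without being indexed. A minimal sketch, assuming the v3+ Python SDK; the index name, environment, and field names here are hypothetical:
```python
from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Index only the fields we actually filter on; "text" is still stored and returned,
# but it is excluded from the metadata index, so its high cardinality costs no memory there.
pc.create_index(
    name="docs-index",
    dimension=1536,  # text-embedding-ada-002
    metric="cosine",
    spec=PodSpec(
        environment="us-east-1-aws",
        pod_type="p1.x1",
        metadata_config={"indexed": ["source", "section"]},  # "text" deliberately left out
    ),
)
```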
All figures below are for text-embedding-ada-002 with 1536 dimensions and a metadata size of 500 bytes.
Write units:
Data: ~14 kB of text, 4 documents, 20 sections (vectors) in total

| | Start | Only metadata | Metadata that contains text |
|---|---|---|---|
| WU counter | 557 | 689 | 832 |
| WUs used | | 132 | 143 |

Read units:
6,000 tokens, 1536 dimensions: 6 RUs with metadata, 5 RUs without
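For context, the only difference between the two upsert variants compared above is whether the raw chunk text rides along in the metadata (field names are hypothetical, and embedding stands for the 1536-dimensional dense vector):
```python
# "Only metadata": small, structured fields only
meta_only = {"source": "doc-1", "section": 3}

# "Metadata that contains text": the same fields plus the raw chunk text,
# which is what the extra write units above are paying for
meta_with_text = {**meta_only, "text": "The full text of this section of the document..."}

index.upsert(vectors=[{"id": "doc-1#3", "values": embedding, "metadata": meta_with_text}])
```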
Cons:
Pros:
"all-MiniLM-L6-v2")"def extract_keywords(text, length=3, threshold=0.5):
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, length), stop_words="english", use_mmr=True, top_n=20)
output = []
for k in keywords:
if k[1] > threshold:
output.append(k[0])
return output
Example
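A quick usage sketch for the function above (the snippet of text is made up and the printed keyphrases are only illustrative; actual output depends on the model and threshold):
```python
sample = ("Pinecone stores vectors together with metadata. Filtering on high-cardinality "
          "metadata fields can consume significant memory in pod-based indexes.")
print(extract_keywords(sample, length=3, threshold=0.5))
# e.g. ['pod based indexes', 'high cardinality metadata']  (illustrative only)
```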
Change embedding model

all-mpnet-base-v2
all-MiniLM-L12-v2 is also available (~1.3% worse, ~3x faster)
| Rank | Model | Size (M params) | Memory usage (GB, fp32) | Dimensions | Max tokens | Benchmark (avg 56 datasets) |
|---|---|---|---|---|---|---|
| 102 | all-mpnet-base-v2 | 110 | 0.41 | 768 | 514 | 57.78 |
| 98 | jina-embeddings-v2-small-en | 33 | 0.12 | 512 | 8192 | 58 |
| 111 | all-MiniLM-L12-v2 | 33 | 0.12 | 384 | 512 | 56.53 |
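Swapping the model is mostly a one-line change when the embeddings come from sentence-transformers; note the index dimension has to match the new model (768 for all-mpnet-base-v2, 384 for all-MiniLM-L12-v2). A minimal sketch:
```python
from sentence_transformers import SentenceTransformer  # !pip install sentence-transformers

model = SentenceTransformer("all-mpnet-base-v2")  # or "all-MiniLM-L12-v2"
vectors = model.encode(["Some text", "This other text"])
print(vectors.shape)  # (2, 768) for all-mpnet-base-v2, (2, 384) for all-MiniLM-L12-v2
```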
house -> text-embedding-3-small -> ~100s-1000s of dimensions: ..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...
house -> BM25 -> {'indices': [2406372295], 'values': [0.7539757220045282]}
1 dimension?

Sparse vector
Example using BERT tokenizer:
```python
from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

some_text = "Some text to tokenize"  # any input string
inputs = tokenizer(
    some_text, padding=True, truncation=True,
    max_length=512
)
inputs.keys()
```
inputs contains:
```
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```
where input_ids are the token IDs seen in the text:
```
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]
```
Transform these into a {token_id: count} mapping:
```
{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
```
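A minimal way to build that mapping, assuming inputs from the tokenizer call above:
```python
from collections import Counter

# Count how often each token ID occurs in the encoded text
token_counts = dict(Counter(inputs["input_ids"]))
# e.g. {101: 1, 16984: 1, 3526: 2, ...} -> usable as the indices/values of a sparse vector
```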
Implementation in Pinecone https://docs.pinecone.io/guides/data/encode-sparse-vectors
```python
from pinecone_text.sparse import BM25Encoder  # !pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)

# Or use the default encoder, with parameters fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()

sparse_vector = bm25.encode_documents(data[0])  # accepts a single string or a list
```
Output:
```
{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}
```
To encode a query for hybrid search you do:
```python
query_sparse_vector = bm25.encode_queries(query_text)  # query_text = the user's search string
```
Is it slow?
For a few sentences it's basically instant (a few ms).
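To check on your own data, a quick timing sketch (numbers vary with hardware and text length):
```python
import time

start = time.perf_counter()
bm25.encode_documents("The quick brown fox jumps over a lazy dog")
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")
```
Once encoded, the dense values and sparse values are upserted together: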
```python
upsert_response = index.upsert(
    vectors=[
        {
            'id': 'vec1',
            'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
            'metadata': {'genre': 'drama'},
            'sparse_values': {
                'indices': [1, 5],
                'values': [0.5, 0.5]
            }
        },
        # ... more vectors
    ],
    namespace='example-namespace'
)
```
You can weight the dense and sparse parts relative to each other:
```python
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination:
    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
```
Query:
```python
sparse_vector = {
    'indices': [10, 45, 16],
    'values': [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]

hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
```
Ok ok ok but why would we care about this? How does this improve search?
Problems: