Text
"all-MiniLM-L6-v2")"def extract_keywords(text, length=3, threshold=0.5):
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, length), stop_words="english", use_mmr=True, top_n=20)
output = []
for k in keywords:
if k[1] > threshold:
output.append(k[0])
return output
Example
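For instance, calling extract_keywords on a short sentence; the sample text and printed keyphrases below are illustrative only, actual output depends on the model and threshold:
text = "Hybrid search combines dense embeddings with sparse BM25 vectors."
print(extract_keywords(text, length=2, threshold=0.5))
# e.g. ['hybrid search', 'bm25 vectors']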
Change embedding model
all-mpnet-base-v2
all-MiniLM-L12-v2 is also available (~1.3% worse, ~3x faster)
| Rank | Model | Size (M) | Memory usage (GB, fp32) | Dimensions | Max tokens | Benchmark (avg 56 datasets) |
|---|---|---|---|---|---|---|
| 102 | all-mpnet-base-v2 | 110 | 0.41 | 768 | 514 | 57.78 |
| 98 | jina-embeddings-v2-small-en | 33 | 0.12 | 512 | 8192 | 58 |
| 111 | all-MiniLM-L12-v2 | 33 | 0.12 | 384 | 512 | 56.53 |
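To switch models, you can pass a different sentence-transformers model name (or instance) when constructing KeyBERT; a sketch based on the models in the table above:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Heavier but more accurate
kw_model = KeyBERT(model=SentenceTransformer("all-mpnet-base-v2"))
# Or the smaller/faster option
# kw_model = KeyBERT(model="all-MiniLM-L12-v2")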
house -> text-embedding-3-small -> ~100s-1000s of dimensions
[..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...]
house -> BM25 -> {'indices': [2406372295], 'values': [0.7539757220045282]}
1 dimension?
Sparse vector
Example using BERT tokenizer:
from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
some_text = "The quick brown fox jumps over a lazy dog"
inputs = tokenizer(
    some_text, padding=True, truncation=True,
    max_length=512
)
inputs.keys()
Inputs contain:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
# Where input_ids are
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]
These are the token IDs seen in the text. Transform them into counts per token ID:
{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
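One way to do this transformation, a sketch using collections.Counter over the input_ids from above:
from collections import Counter

# Map each token id to the number of times it appears in the text
token_counts = dict(Counter(inputs["input_ids"]))
# e.g. {101: 1, 16984: 1, 3526: 2, ...}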
Implementation in Pinecone: https://docs.pinecone.io/guides/data/encode-sparse-vectors
from pinecone_text.sparse import BM25Encoder  # !pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)  # fit IDF statistics on your own corpus
# Or use the default encoder, with parameters fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()
sparse_vector = bm25.encode_documents(data[0])
Output:
{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}
To encode a query vector for hybrid search:
query_sparse_vector = bm25.encode_queries(text)
Is it slow?
For a few sentences it's basically instant (a few milliseconds).
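If you want to check this yourself, a rough timing sketch (numbers depend on your hardware and the fitted encoder):
import time

start = time.perf_counter()
bm25.encode_documents("The quick brown fox jumps over a lazy dog")
print(f"encode took {(time.perf_counter() - start) * 1000:.1f} ms")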
upsert_response = index.upsert(
    vectors=[
        {
            'id': 'vec1',
            'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
            'metadata': {'genre': 'drama'},
            'sparse_values': {
                'indices': [1, 5],
                'values': [0.5, 0.5]
            }
        },
        # ...
    ],
    namespace='example-namespace'
)
You can weight the dense and sparse vectors relative to each other:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
Query:
sparse_vector = {
    'indices': [10, 45, 16],
    'values': [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]
hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
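Note that alpha controls the balance: alpha=1 gives pure dense (semantic) search, alpha=0 gives pure sparse (BM25) search, and alpha=0.75 as above weights the dense score three times more than the sparse one.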
Ok ok ok but why would we care about this? How does this improve search?
Problems:
Open source vector database
Fusion vs reranking