Text
Cardinality is an important factor to consider when managing an index. For pod-based indexes, Pinecone indexes all metadata by default, and when that metadata contains many unique values the index consumes significantly more memory, which can lead to performance issues, pod fullness, and a reduction in the number of vectors that fit per pod.
In short: lots of unique values in metadata are bad, but only if that metadata is actually indexed.
What is cardinality?
Cardinality in databases refers to the uniqueness of data values contained in a column: how many distinct values exist in the column compared to the total number of rows in the table. A user_id column where nearly every row is distinct has high cardinality; a genre column with a handful of repeated values has low cardinality.
So does this affect us here if we are not filtering by the text field? Well, the documentation says:
For pod-based indexes, Pinecone indexes all metadata by default.
Can that indexing be turned off? Every chunk of text is effectively a unique value, so do we really want to index potentially millions of words of text?
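It can: pod-based indexes support selective metadata indexing, so the raw text can be stored in metadata without being indexed. A minimal sketch, assuming the v3+ Python SDK; the index name, environment, and field names here are hypothetical:
```python
from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Index only the fields we actually filter on; "text" is still stored and returned,
# but it is excluded from the metadata index, so its high cardinality costs no memory there.
pc.create_index(
    name="docs-index",
    dimension=1536,  # text-embedding-ada-002
    metric="cosine",
    spec=PodSpec(
        environment="us-east-1-aws",
        pod_type="p1.x1",
        metadata_config={"indexed": ["source", "section"]},  # "text" deliberately left out
    ),
)
```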
All figures below are for text-embedding-ada-002 with 1536 dimensions and a metadata size of 500 bytes.
Write units:
Data: ~14 kB of text, 4 documents, 20 sections (vectors) in total

| | Start | Only metadata | Metadata that contains text |
|---|---|---|---|
| WU counter | 557 | 689 | 832 |
| WUs used | | 132 | 143 |

Read units:
6,000 tokens, 1536 dimensions: 6 RUs with metadata, 5 RUs without
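For context, the only difference between the two upsert variants compared above is whether the raw chunk text rides along in the metadata (field names are hypothetical, and embedding stands for the 1536-dimensional dense vector):
```python
# "Only metadata": small, structured fields only
meta_only = {"source": "doc-1", "section": 3}

# "Metadata that contains text": the same fields plus the raw chunk text,
# which is what the extra write units above are paying for
meta_with_text = {**meta_only, "text": "The full text of this section of the document..."}

index.upsert(vectors=[{"id": "doc-1#3", "values": embedding, "metadata": meta_with_text}])
```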
Cons:
Pros:
"all-MiniLM-L6-v2")"def extract_keywords(text, length=3, threshold=0.5):
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, length), stop_words="english", use_mmr=True, top_n=20)
output = []
for k in keywords:
if k[1] > threshold:
output.append(k[0])
return output
Example
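A quick usage sketch for the function above (the snippet of text is made up and the printed keyphrases are only illustrative; actual output depends on the model and threshold):
```python
sample = ("Pinecone stores vectors together with metadata. Filtering on high-cardinality "
          "metadata fields can consume significant memory in pod-based indexes.")
print(extract_keywords(sample, length=3, threshold=0.5))
# e.g. ['pod based indexes', 'high cardinality metadata']  (illustrative only)
```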
Change embedding model

all-mpnet-base-v2
all-MiniLM-L12-v2 is also available (~1.3% worse, ~3x faster)
| Rank | Model | Size (M params) | Memory usage (GB, fp32) | Dimensions | Max tokens | Benchmark (avg 56 datasets) |
|---|---|---|---|---|---|---|
| 102 | all-mpnet-base-v2 | 110 | 0.41 | 768 | 514 | 57.78 |
| 98 | jina-embeddings-v2-small-en | 33 | 0.12 | 512 | 8192 | 58 |
| 111 | all-MiniLM-L12-v2 | 33 | 0.12 | 384 | 512 | 56.53 |
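Swapping the model is mostly a one-line change when the embeddings come from sentence-transformers; note the index dimension has to match the new model (768 for all-mpnet-base-v2, 384 for all-MiniLM-L12-v2). A minimal sketch:
```python
from sentence_transformers import SentenceTransformer  # !pip install sentence-transformers

model = SentenceTransformer("all-mpnet-base-v2")  # or "all-MiniLM-L12-v2"
vectors = model.encode(["Some text", "This other text"])
print(vectors.shape)  # (2, 768) for all-mpnet-base-v2, (2, 384) for all-MiniLM-L12-v2
```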
house -> text-embedding-3-small -> ~100s-1000s of dimensions: ..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...
house -> BM25 -> {'indices': [2406372295], 'values': [0.7539757220045282]}
1 dimension?

Sparse vector
Example using BERT tokenizer:
```python
from transformers import BertTokenizerFast  # !pip install transformers

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

some_text = "Some text to tokenize"  # any input string
inputs = tokenizer(
    some_text, padding=True, truncation=True,
    max_length=512
)
inputs.keys()
```
inputs contains:
```
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```
where input_ids are the token IDs seen in the text:
```
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]
```
Transform these into a {token_id: count} mapping:
```
{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
```
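A minimal way to build that mapping, assuming inputs from the tokenizer call above:
```python
from collections import Counter

# Count how often each token ID occurs in the encoded text
token_counts = dict(Counter(inputs["input_ids"]))
# e.g. {101: 1, 16984: 1, 3526: 2, ...} -> usable as the indices/values of a sparse vector
```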
Implementation in Pinecone https://docs.pinecone.io/guides/data/encode-sparse-vectors
```python
from pinecone_text.sparse import BM25Encoder  # !pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)

# Or use the default encoder, with parameters fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()

sparse_vector = bm25.encode_documents(data[0])  # accepts a single string or a list
```
Output:
```
{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}
```
To encode a query for hybrid search you do:
```python
query_sparse_vector = bm25.encode_queries(query_text)  # query_text = the user's search string
```
Is it slow?
For a few sentences it's basically instant (a few ms).
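To check on your own data, a quick timing sketch (numbers vary with hardware and text length):
```python
import time

start = time.perf_counter()
bm25.encode_documents("The quick brown fox jumps over a lazy dog")
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")
```
Once encoded, the dense values and sparse values are upserted together: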
```python
upsert_response = index.upsert(
    vectors=[
        {
            'id': 'vec1',
            'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
            'metadata': {'genre': 'drama'},
            'sparse_values': {
                'indices': [1, 5],
                'values': [0.5, 0.5]
            }
        },
        # ... more vectors
    ],
    namespace='example-namespace'
)
```
You can weight the dense and sparse parts relative to each other:
```python
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination:
    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
```
Query:
```python
sparse_vector = {
    'indices': [10, 45, 16],
    'values': [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]

hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
```
Ok ok ok but why would we care about this? How does this improve search?
Problems: