Pinecone and hybrid search


Storing text in metadata

Cardinality is an important factor when managing an index. For pod-based indexes, Pinecone indexes all metadata by default, so metadata containing many unique values makes the index consume significantly more memory. That can lead to performance issues, pod fullness, and a reduction in the number of vectors that fit per pod. In short: lots of unique values in metadata are bad, if they are indexed.

What is cardinality?

Cardinality in databases refers to the uniqueness of the data values contained in a column: how many distinct values exist compared to the total number of rows in a table. A "genre" column with ten distinct values across a million rows has low cardinality; a free-text column, where nearly every value is unique, has cardinality close to the row count.

Does this affect us here, given that we are not filtering by text? The documentation says:

For pod-based indexes, Pinecone indexes all metadata by default.

Can it be turned off? Do we really want to index potentially millions of words of text?
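Yes: pod-based indexes support selective metadata indexing, so the raw text field can be left unindexed. A minimal sketch using the Python SDK (index name, environment, pod type, and field names are placeholders):

from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key='YOUR_API_KEY')
pc.create_index(
    name='example-index',
    dimension=1536,
    metric='cosine',
    spec=PodSpec(
        environment='us-east-1-aws',
        pod_type='p1.x1',
        # Only these fields get indexed; the raw text field stays unindexed
        metadata_config={'indexed': ['genre', 'year']},
    ),
)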

Resources used

Using text-embedding-ada-002 with 1536 dimensions. All figures below assume a metadata size of 500 bytes.

Read units:

  • fetch: 10 records = 1 RU
  • query: 100k records = 6-18 RUs
  • list: 1 call = 1 RU

Write units:

  • upsert: 7 WUs
  • update: 4-11+ WUs
  • delete: 7 WUs

Example usage storing/reading text

Data: ~14 kB of text, 4 documents, 20 sections (vectors) in total

                      Start   Only metadata   Metadata containing text
Total WUs             557     689             832
WUs used by upsert    -       132             143

This is consistent with the pricing above: 20 vectors at ~7 WUs each is roughly 140 WUs per upsert. A query of 6000 tokens at 1536 dimensions costs 6 RUs with metadata, 5 RUs without.

Overview

Cons:

  • Pod-based indexes index all metadata by default, so stored text inflates memory use (see above)
  • Somewhat higher write and read unit costs

Pros:

  • You avoid making multiple requests
  • Only one database to manage

Keywords

from keybert import KeyBERT  # !pip install keybert

def extract_keywords(text, length=3, threshold=0.5):
    """Return keyphrases of 1 to `length` words with a score above `threshold`."""
    kw_model = KeyBERT()
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, length),
        stop_words="english",
        use_mmr=True,
        top_n=20,
    )
    # extract_keywords returns (phrase, score) pairs
    return [phrase for phrase, score in keywords if score > threshold]
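Hypothetical usage (the exact output depends on the underlying model and threshold):

text = "Pinecone is a fully managed vector database for semantic search"
print(extract_keywords(text, length=2, threshold=0.3))
# e.g. ['vector database', 'semantic search']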


Change embedding model

Rank  Model                        Size (M)  Memory (GB, fp32)  Dimensions  Max tokens  Benchmark (avg, 56 datasets)
102   all-mpnet-base-v2            110       0.41               768         514         57.78
98    jina-embeddings-v2-small-en  33        0.12               512         8192        58.00
111   all-MiniLM-L12-v2            33        0.12               384         512         56.53
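Swapping the model is a one-liner with sentence-transformers (a sketch; any model from the table works, though some need extra flags):

from sentence_transformers import SentenceTransformer  # !pip install sentence-transformers

model = SentenceTransformer('all-MiniLM-L12-v2')
embedding = model.encode('house')  # 384-dimensional vector for this model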

Hybrid search

  • Two types of embeddings: dense vectors and sparse vectors
  • Dense: "house" -> text-embedding-3-small -> a vector with hundreds to thousands of dimensions, e.g.:

..., 0.011934157460927963, -0.06418240070343018, -0.013386393897235394, ...
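This is how such a dense embedding might be produced with the OpenAI client (a sketch; assumes OPENAI_API_KEY is set in the environment):

from openai import OpenAI  # !pip install openai

client = OpenAI()
response = client.embeddings.create(model='text-embedding-3-small', input='house')
dense_vector = response.data[0].embedding  # 1536 floats for this model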

Sparse vector

Example using BERT tokenizer:

from transformers import BertTokenizerFast  # !pip install transformers

some_text = "..."  # any input document

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
inputs = tokenizer(
    some_text, padding=True, truncation=True,
    max_length=512
)
inputs.keys()

Inputs contain:

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
# Where input_ids are
[101, 16984, 3526, 2331, 1006, 7473, 2094, ...]

These are the token IDs seen in the text. Transform them into a mapping from token ID to count:

{101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ... }
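A minimal sketch of that transformation, using collections.Counter:

from collections import Counter

# Map each token ID to the number of times it appears in the document
sparse_counts = dict(Counter(inputs['input_ids']))
# e.g. {101: 1, 16984: 1, 3526: 2, 2331: 2, 1006: 10, ...}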

BM25

  • From Pinecone: "We recommend using either BM25 or SPLADE sparse vectors."
  • BM25 (Best Match 25) is a term-based ranking model
  • Scores documents based on their term frequency and document length (see the formula after this list)
  • Term frequency: the number of times a particular term appears in the document.
  • Inverse document frequency: the importance of the term across the entire corpus.
    • In a cardiology corpus, "heart" would have low importance: you can't really find a specific document by typing "heart" into the search box
    • "atrial natriuretic peptide" is specific, so it carries higher importance
  • Document length normalization: addresses the impact of document length on relevance scoring, avoiding a bias toward longer documents.
  • Query term saturation: mitigates the impact of repeating the same term, so "cat cat cat" scores barely higher than "cat".
  • Read more: https://archive.is/mlzYJ
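For reference, the standard BM25 scoring function, where f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length in the corpus, and k_1, b are free parameters (typically k_1 between 1.2 and 2.0, b = 0.75):

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}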

BM25

from pinecone_text.sparse import BM25Encoder  # !pip install pinecone-text

data = ["Some text", "This other text", "The quick brown fox jumps over a lazy dog"]
bm25 = BM25Encoder()
bm25.fit(data)  # fit term statistics on your own corpus
# Or use default parameters, fitted on the MS MARCO dataset
bm25 = BM25Encoder.default()
sparse_vector = bm25.encode_documents(data[2])

Output:

{"indices": [102, 18, 12, ...], "values": [0.21, 0.38, 0.15, ...]}

To encode a query for hybrid search (queries are weighted differently than documents):

query_sparse_vector = bm25.encode_queries("brown fox jumps")

Is it slow?

  • Data: basically all of Moby Dick.
  • Words: 200k+
  • Vector size: ~13k non-zero dimensions
  • Time: 2.1 s

For a few sentences it's basically instant (few ms).
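A sketch of how that measurement can be reproduced (assumes moby_dick_text holds the full book and bm25 is fitted as above):

import time

start = time.perf_counter()
sparse = bm25.encode_documents(moby_dick_text)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f} s, {len(sparse['indices'])} non-zero dimensions")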

Usage in Pinecone

upsert_response = index.upsert(
  vectors=[
    {'id': 'vec1',
     'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
     'metadata': {'genre': 'drama'},
     'sparse_values': {
         'indices': [1, 5],
         'values': [0.5, 0.5]
     }},
    # ... more vectors ...
  ],
  namespace='example-namespace'
)

You can add weight to the vectors:

def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination:

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: array of floats representing the dense vector
        sparse: a dict with `indices` and `values`
        alpha: weight between 0 and 1 (1 = dense only, 0 = sparse only)

    Returns:
        The scaled dense vector and the scaled sparse dict.
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs

Query:

sparse_vector = {
   'indices': [10, 45, 16],
   'values':  [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]

hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)

OK, but why would we care about this? How does it improve search?

  • Embedded Google Search
  • RAG using dense embeddings (chunking, overlap, appending the title)

Problems:

  • Search: "Mazda CX-9 2018 review"
  • Top results: "2019 Mazda CX-9: Video review", "Mazda CX-9 2017 Review" - wrong model year
  • Recency of articles is overlooked: searching "Toyota Corolla" returns 10-year-old articles.
  • General questions return overly specific articles: "What is a hybrid car?" finds specific hybrid models.
  • Solutions:
    • Hybrid search
    • Hierarchical document ranking - boost based on the title
    • Year boosting (see the sketch after this list)
    • Instructor-large dense embeddings - prepend a task instruction to the text before embedding
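A toy sketch of year boosting, rescoring matches returned by the index (assumes each match is a dict with a 'year' metadata field; the weight is arbitrary):

from datetime import datetime

def boost_by_year(matches, weight=0.02):
    # Penalize older articles linearly by age in years, then re-sort
    current_year = datetime.now().year
    for m in matches:
        age = current_year - int(m['metadata'].get('year', current_year))
        m['score'] -= weight * age
    return sorted(matches, key=lambda m: m['score'], reverse=True)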

By Sasa Trivic