RAG
attention, 2014/09
seq2seq model, 2014/09
Transformer, 2017/06
GPT-2, 2019
BERT, 2018/10
Prompt Engineering, 2018/06
Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.
RAG, 2020/05
RAG for knowledge-intensive tasks
1. parametric memory: a pre-trained seq2seq model
2. non-parametric memory: a dense vector index of Wikipedia
e.g. word embedding
embedding of x
Vector search
embedding of document z_i
Top-n results of the vector search
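The retrieval step sketched by the labels above (embed the query x, embed each document z_i, keep the top n by vector search) can be illustrated with a minimal example. The vectors below are made-up toy values and `top_n_documents` is a hypothetical helper name; a real system would use a trained encoder and an approximate nearest-neighbour index instead of a full scan.

```python
def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_n_documents(query_vec, doc_vecs, n=2):
    """Return the indices of the n documents with the highest inner product."""
    scored = [(inner_product(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n]]


# Toy 3-dimensional embeddings (invented for illustration only).
doc_vecs = [
    [1.0, 0.0, 0.0],  # z_0
    [0.0, 1.0, 0.0],  # z_1
    [0.7, 0.7, 0.0],  # z_2
]
query_vec = [0.9, 0.1, 0.0]  # embedding of x

top_indices = top_n_documents(query_vec, doc_vecs, n=2)
print(top_indices)
```

The retrieved documents are then passed to the seq2seq generator together with the original input.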
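Tunable parameters for language model generation (not supported by every model): balancing correctness and creativity
1. Temperature (using OpenAI as an example)
A value between 0 and 2; the higher the value, the more random the generated output.
2. Top-P (Nucleus Sampling, using OpenAI as an example)
p lies between 0 and 1. With p = 0.1, the next token is sampled only from the smallest set of highest-probability tokens whose cumulative probability reaches 10% (the top 10% of probability mass, not the top 10% of tokens by count).
3. Top-K (Top-K Sampling)
k is a positive integer. With k = 32, the next token is sampled only from the 32 highest-probability tokens.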
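The three parameters above can be sketched in plain Python. This is a minimal illustration, not how any particular provider implements sampling; the function names and the toy probabilities are assumptions made for the example, and the temperature function requires a value greater than 0 (a setting of 0 is usually treated as greedy decoding).

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution
print(top_k_filter(probs, k=2))    # only the two most probable tokens survive
print(top_p_filter(probs, p=0.9))  # three tokens are needed to reach 90% of the mass
```

After filtering, the next token is drawn at random from the remaining, renormalized probabilities.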
non-parametric memory: a dense vector index of Wikipedia
Internal knowledge
External knowledge
Language model
vector embedding
or
graph embedding
Question
Answer
parametric memory: a pre-trained seq2seq model
Implementation
Internal knowledge
Factual Knowledge
Language model
vector embedding
or
graph embedding
Implementation
RAG
MCP
External Tool Access
Model Weight Adjustment
Finetune
Procedural Knowledge
Skills
Ingestion
Retrieval
Re-Rank
Generate
Data Ingestion: move data from external sources into an internal repository
Embedding model options: text-embedding-3-small, Cohere Embedding, BGE, or mE5
Reference example (pseudo code)

import hashlib
from datetime import datetime

# Import paths follow recent LangChain releases; older releases expose these under langchain.*
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings  # one possible embedding_model


class VectorIngestionPipeline:
    def __init__(self, vector_db, embedding_model):
        # vector_db is any client that exposes upsert(vectors=...), as called below
        self.vector_db = vector_db
        self.embeddings = embedding_model
        # Keep chunks at a reasonable size
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )

    def process_document(self, file_path, metadata=None):
        # Copy the caller's metadata so None is safe and the caller's dict is not mutated
        base_metadata = dict(metadata or {})
        base_metadata.setdefault("source", file_path)

        # 1. Extract text from the PDF
        loader = PyPDFLoader(file_path)
        documents = loader.load()

        # 2. Split into chunks
        chunks = self.text_splitter.split_documents(documents)

        # Collect the text of every chunk so the embeddings can be generated in one batch
        texts = [chunk.page_content for chunk in chunks]

        # 3. Embed in batch (fewer API round trips than embedding chunk by chunk)
        embeddings = self.embeddings.embed_documents(texts)

        # 4. Prepare the records for a batched write
        vectors_to_upsert = []
        for i, chunk in enumerate(chunks):
            # Compute a content hash
            content_hash = hashlib.sha256(chunk.page_content.encode("utf-8")).hexdigest()

            # Enriched metadata: must include the original text under "text"
            enriched_metadata = {
                **base_metadata,
                "text": chunk.page_content,
                "chunk_id": i,
                "page": chunk.metadata.get("page", 0),  # keep the PDF's original page number
                "timestamp": datetime.now().isoformat(),
                "content_hash": content_hash
            }

            vector_id = f"{base_metadata['source']}_{i}"
            vectors_to_upsert.append((vector_id, embeddings[i], enriched_metadata))

        # 5. Write to the database in a single batch instead of one call per chunk
        self.vector_db.upsert(vectors=vectors_to_upsert)
import spacy


class SmartChunker:
    def __init__(self):
        # Requires the model to be installed: python -m spacy download en_core_web_sm
        self.nlp = spacy.load("en_core_web_sm")

    def semantic_chunking(self, text, max_chunk_size=1000):
        """Chunk by sentences while respecting max size"""
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]

        chunks = []
        current_chunk = ""
        for sentence in sentences:
            if len(current_chunk) + len(sentence) <= max_chunk_size:
                # The sentence still fits: append it to the current chunk
                current_chunk += sentence + " "
            else:
                # The chunk is full: store it and start a new one with this sentence
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = sentence + " "

        if current_chunk:
            chunks.append(current_chunk.strip())
        return chunks

# Example: Contextual chunking with metadata enrichment
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Placeholders not defined in this snippet: documents (loaded documents),
# document (the parent document), extract_parent_heading() and doc_level_summary.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
chunks = splitter.split_documents(documents)

# Enrich each chunk with hierarchical context
for chunk in chunks:
    chunk.metadata["section_title"] = extract_parent_heading(chunk)
    chunk.metadata["doc_summary"] = doc_level_summary
    chunk.metadata["source"] = document.metadata["source"]

1. What is the natural unit of meaning in your corpus?
2. What is the average complexity of your users' queries?
3. What is your latency budget for retrieval?
Data characteristics
Query characteristics
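The chunk_size and chunk_overlap parameters used by the splitters above can be pictured as a sliding window over the text. The sketch below is a deliberate simplification: `sliding_window_chunks` is a hypothetical helper, and RecursiveCharacterTextSplitter additionally tries to break on its separators rather than at fixed character offsets.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, which is the role the overlap plays at larger scale: a sentence cut at a chunk boundary still appears whole in at least one chunk.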
Pavan Belagatti, Vector Embeddings Explained for Developers!
Central to many NLP, recommendation, and search algorithms.
Numeric values
Objects, text, images, ...
Vector Space: semantic similarity
Barančíková, P., & Bojar, O. (2019). In search for linear relations in sentence embedding spaces.
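Semantic similarity in a vector space is commonly measured with cosine similarity. The following is a minimal sketch with invented 3-dimensional vectors; real sentence embeddings have hundreds of dimensions, and the variable names are only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (made up): two weather sentences and one unrelated sentence.
weather_1 = [0.9, 0.1, 0.0]
weather_2 = [0.8, 0.2, 0.1]
stadium = [0.0, 0.2, 0.9]

related = cosine_similarity(weather_1, weather_2)
unrelated = cosine_similarity(weather_1, stadium)
print(related > unrelated)
```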
pip install -U transformers torch

from transformers import AutoTokenizer, AutoModel
import torch


def get_huggingface_embedding(text,
                              model_name='sentence-transformers/all-MiniLM-L6-v2'):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # You can choose how to derive the final embeddings, e.g., mean pooling.
    # The attention mask keeps padding tokens out of the average.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    embeddings = (summed / counts).squeeze().numpy()
    return embeddings


# Example usage
text = "Pavan is a developer evangelist."
embedding_huggingface = get_huggingface_embedding(text)
print(embedding_huggingface)

pip install -U sentence-transformers
Model: all-MiniLM-L6-v2

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
pip install -U sentence-transformers
Model: google/embeddinggemma-300m

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")

# Run inference with queries and documents
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3011, 0.6359, 0.4930, 0.4889]])
intfloat/multilingual-e5-base: suitable for Traditional Chinese

from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize Multilingual-E5
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-base",
    model_kwargs={"device": "cpu"}
)

# E5 requires 'query: ' and 'passage: ' prefixes
query_text = "query: 什麼是區塊鏈?"
doc_texts = [
    "passage: 區塊鏈是一種去中心化的分散式帳本技術。",
    "passage: 台北101是台灣著名的地標建築。"
]
query_vector = embeddings.embed_query(query_text)
doc_vectors = embeddings.embed_documents(doc_texts)
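The 'query: ' and 'passage: ' prefixes in the example above are easy to forget when texts come from user input or a database. One possible way to apply them consistently is a pair of small helpers; `as_e5_query` and `as_e5_passages` are hypothetical names, not part of any library.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_e5_query(text):
    """Add the E5 query prefix unless the text already carries it."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def as_e5_passages(texts):
    """Add the E5 passage prefix to every document that lacks it."""
    return [t if t.startswith(PASSAGE_PREFIX) else PASSAGE_PREFIX + t for t in texts]


query_text = as_e5_query("什麼是區塊鏈?")
doc_texts = as_e5_passages([
    "區塊鏈是一種去中心化的分散式帳本技術。",
    "passage: 台北101是台灣著名的地標建築。",  # already prefixed, left unchanged
])
print(query_text)
print(doc_texts)
```

The prefixed strings can then be passed to embed_query and embed_documents as in the example above.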
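text-embedding-004
Requires a Google AI API key.

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Initialize the Gemini embedding model. text-embedding-004 is used here;
# check Google's current model list, since the recommended model changes over time.
# The API key is read from the GOOGLE_API_KEY environment variable.
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004"
)

# 1. Embed a single search question (query)
query_text = "台灣高鐵的票價如何查詢?"
query_vector = embeddings.embed_query(query_text)

# 2. Embed multiple corpus texts (documents)
documents = [
    "台灣高鐵提供商務車廂與標準車廂,票價依據乘車距離計算。",
    "台北到高雄的標準車廂對號座全票票價為新台幣 1,490 元。",
    "高雄捷運是服務高雄市、屏東縣及台南市的捷運系統。"
]
doc_vectors = embeddings.embed_documents(documents)

# 3. Print sanity-check information
print(f"Vector dimension: {len(query_vector)}")  # text-embedding-004 defaults to 768 dimensions
print(f"Generated {len(doc_vectors)} document vectors.")

* Note for intfloat/multilingual-e5-base: the question must be manually prefixed with "query: " and each corpus text with "passage: "; otherwise accuracy drops sharply.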
Problem: semantic similarity and factual relevance are not the same thing
Vector similarity
Semantic similarity
Solutions:
exact search (lexical search) + vector search
a reranker to filter out off-topic context
+
BM25, TF-IDF
Semantic Search
pip install rank_bm25

# Hybrid search with RRF fusion
from rank_bm25 import BM25Okapi
import numpy as np


# Reciprocal Rank Fusion
def reciprocal_rank_fusion(ranked_lists, k=60):
    fused_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            # rank is 0-based, so the top result contributes 1 / (k + 1)
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(fused_scores, key=fused_scores.get, reverse=True)


# Placeholders not defined in this snippet: tokenized_corpus, tokenized_query,
# vector_store and query.

# BM25 lexical retrieval
bm25 = BM25Okapi(tokenized_corpus)
lexical_scores = bm25.get_scores(tokenized_query)
lexical_top_k = np.argsort(lexical_scores)[-20:][::-1].tolist()

# Dense vector retrieval
vector_results = vector_store.similarity_search(query, k=20)
# Assumption: each stored document keeps its corpus index under a "doc_id" metadata key
vector_doc_ids = [doc.metadata["doc_id"] for doc in vector_results]

final_ranking = reciprocal_rank_fusion([lexical_top_k, vector_doc_ids])

RRF (Reciprocal Rank Fusion) Ranker
Rankings from different search methods
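To see how Reciprocal Rank Fusion combines the rankings from different search methods, here is a small self-contained example. The document IDs and the two input rankings are invented; the scoring rule is the standard RRF sum of 1 / (k + rank) with 1-based ranks and k = 60.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    fused_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused_scores, key=fused_scores.get, reverse=True)


# Toy rankings: one from lexical search (e.g. BM25), one from vector search.
lexical_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists (d1 and d3) rise above documents found by only one method, even though RRF never compares the raw BM25 and similarity scores, which live on different scales.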
Users are terrible at writing queries.
Query Transformation
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
import os
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

# ChatOpenAI reads OPENAI_API_KEY from the environment; fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")

# Or switch to a local LLM here
re_write_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for query rewriting
query_rewrite_template = """You are an AI assistant tasked with reformulating user queries to improve
retrieval in a RAG system. Given the original query, rewrite it to be more specific, detailed, and
likely to retrieve relevant information.
Original query: {original_query}
Rewritten query:"""

query_rewrite_prompt = PromptTemplate(input_variables=["original_query"], template=query_rewrite_template)

# Create a chain for query rewriting (prompt piped into the LLM)
query_rewriter = query_rewrite_prompt | re_write_llm


def rewrite_query(original_query):
    """
    Rewrite the original query to improve retrieval.
    Args:
        original_query (str): The original user query
    Returns:
        str: The rewritten query
    """
    response = query_rewriter.invoke({"original_query": original_query})
    return response.content


# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
rewritten_query = rewrite_query(original_query)
print("Original query:", original_query)
print("\nRewritten query:", rewritten_query)

Demonstration
Original query: What are the impacts of climate change on the environment?
Rewritten query: What are the specific effects of climate change on various ecosystems, including changes in temperature, precipitation patterns, sea levels, and biodiversity?
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate

step_back_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for step-back prompting
step_back_template = """You are an AI assistant tasked with generating broader, more general queries to improve context retrieval in a RAG system.
Given the original query, generate a step-back query that is more general and can help retrieve relevant background information.
Original query: {original_query}
Step-back query:"""

step_back_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=step_back_template
)

# Create a chain for step-back prompting (prompt piped into the LLM)
step_back_chain = step_back_prompt | step_back_llm


def generate_step_back_query(original_query):
    """
    Generate a step-back query to retrieve broader context.
    Args:
        original_query (str): The original user query
    Returns:
        str: The step-back query
    """
    response = step_back_chain.invoke({"original_query": original_query})
    return response.content


Another step-back prompt template:
You are an expert of world knowledge. I am going to ask you a question.
Your response should be comprehensive and not contradicted with the following
context if they are relevant. Otherwise, ignore them if they are not relevant.
{normal_context}
{step_back_context}
Original Question: {question}
Answer:

# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
step_back_query = generate_step_back_query(original_query)
print("Original query:", original_query)
print("\nStep-back query:", step_back_query)

Demonstration
Original query: What are the impacts of climate change on the environment?
Step-back query: What are the general effects of climate change?
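The answer template with {normal_context} and {step_back_context} shown earlier suggests retrieving twice, once with the original query and once with the step-back query, and placing both results in the final prompt. A minimal sketch of that assembly step follows; `build_step_back_prompt` and the sample passages are hypothetical, and the retrieval and LLM calls are left out.

```python
STEP_BACK_ANSWER_TEMPLATE = """You are an expert of world knowledge. I am going to ask you a question.
Your response should be comprehensive and not contradicted with the following
context if they are relevant. Otherwise, ignore them if they are not relevant.
{normal_context}
{step_back_context}
Original Question: {question}
Answer:"""


def build_step_back_prompt(question, normal_docs, step_back_docs):
    """Fill the template with context retrieved for both query variants."""
    return STEP_BACK_ANSWER_TEMPLATE.format(
        normal_context="\n".join(normal_docs),
        step_back_context="\n".join(step_back_docs),
        question=question,
    )


# Invented passages standing in for retrieval results.
normal_docs = ["Rising temperatures shift the ranges of many species."]
step_back_docs = ["Climate change refers to long-term shifts in temperature and weather."]

prompt = build_step_back_prompt(
    "What are the impacts of climate change on the environment?",
    normal_docs,
    step_back_docs,
)
print(prompt)
```

The resulting string is what would be sent to the answering LLM.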
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate

sub_query_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for sub-query decomposition
subquery_decomposition_template = """You are an AI assistant tasked with breaking down complex queries into simpler sub-queries for a RAG system.
Given the original query, decompose it into 2-4 simpler sub-queries that, when answered together, would provide a comprehensive response to the original query.
Original query: {original_query}
example: What are the impacts of climate change on the environment?
Sub-queries:
1. What are the impacts of climate change on biodiversity?
2. How does climate change affect the oceans?
3. What are the effects of climate change on agriculture?
4. What are the impacts of climate change on human health?"""

subquery_decomposition_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=subquery_decomposition_template
)

# Create a chain for sub-query decomposition (prompt piped into the LLM)
subquery_decomposer_chain = subquery_decomposition_prompt | sub_query_llm


def decompose_query(original_query: str):
    """
    Decompose the original query into simpler sub-queries.
    Args:
        original_query (str): The original complex query
    Returns:
        List[str]: A list of simpler sub-queries
    """
    response = subquery_decomposer_chain.invoke({"original_query": original_query}).content
    # Keep non-empty lines and drop the "Sub-queries:" header line
    sub_queries = [q.strip() for q in response.split('\n') if q.strip() and not q.strip().startswith('Sub-queries:')]
    return sub_queries


# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
sub_queries = decompose_query(original_query)
print("\nSub-queries:")
for sub_query in sub_queries:
    print(sub_query)

Demonstration
Original query: What are the impacts of climate change on the environment?
Sub-queries:
1. How does climate change affect biodiversity and ecosystems?
2. What are the impacts of climate change on oceanic conditions and marine life?
3. How does climate change influence weather patterns and extreme weather events?
4. What are the effects of climate change on terrestrial environments, such as forests and deserts?
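Once a query has been decomposed, each sub-query is typically sent to the retriever and the results are merged. The sketch below shows one possible way to do that with de-duplication; `retrieve_for_sub_queries`, `strip_numbering`, the toy corpus and the word-overlap retriever are all hypothetical stand-ins for a real vector store.

```python
import re


def strip_numbering(sub_query):
    """Remove a leading list marker such as '1. ' or '2) '."""
    return re.sub(r"^\s*\d+[.)]\s*", "", sub_query)


def retrieve_for_sub_queries(sub_queries, retrieve, k=2):
    """Run retrieval per sub-query and merge results, keeping first occurrences."""
    merged, seen = [], set()
    for sub_query in sub_queries:
        for doc in retrieve(strip_numbering(sub_query), k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy corpus and retriever: ranks documents by the number of shared words.
CORPUS = [
    "Climate change reduces biodiversity in many ecosystems.",
    "Climate change warms the oceans and harms marine life.",
    "Taipei 101 is a landmark building in Taiwan.",
]


def toy_retrieve(query, k):
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc in CORPUS:
        overlap = len(query_words & set(re.findall(r"\w+", doc.lower())))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


sub_queries = [
    "1. How does climate change affect biodiversity and ecosystems?",
    "2. What are the impacts of climate change on oceanic conditions and marine life?",
]
context = retrieve_for_sub_queries(sub_queries, toy_retrieve)
print(context)
```

Both sub-queries retrieve the same two climate passages, so the merged context contains each of them once and leaves out the unrelated document.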