Introduction to Agentic AI

RAG

What is RAG?

RAG = Retrieval Augmented Generation
Combines an LLM with external data, typically retrieved from a vector database
Used in AI chatbots, recommendation systems, etc.

Retrieval Augmented Generation the origin

attention, 2014/09

seq2seq model, 2014/09

Transformer, 2017/06

Prompt Engineering, 2018/06

BERT, 2018/10

GPT-2, 2019

RAG, 2020/05

Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.

RAG for knowledge intensive tasks

1. parametric memory: a pre-trained seq2seq model

2. non-parametric memory: a dense vector index of Wikipedia

e.g. word embedding

embedding of x

Vector search

embedding of document z_i

Top-n results of the vector search
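
The retrieval step in the diagram above (embed the query x, compare it with each document embedding z_i, keep the top n) can be sketched with NumPy. This is a minimal illustration, not the method of the cited paper: the toy 3-dimensional vectors are made up, the helper name `top_n_by_cosine` is hypothetical, and cosine similarity is an assumed scoring choice.

```python
import numpy as np

def top_n_by_cosine(query_vec, doc_vecs, n=2):
    """Return (indices, scores) of the n documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity of each document with the query
    order = np.argsort(-scores)[:n]    # highest similarity first
    return order.tolist(), scores[order].tolist()

# Toy embeddings, made up for illustration only
doc_vecs = np.array([
    [1.0, 0.0, 0.0],   # z_0
    [0.9, 0.1, 0.0],   # z_1
    [0.0, 1.0, 0.0],   # z_2
])
query_vec = np.array([1.0, 0.0, 0.0])  # embedding of x

top_ids, top_scores = top_n_by_cosine(query_vec, doc_vecs, n=2)
print(top_ids)  # [0, 1]
```

A production system would replace the brute-force matrix product with an approximate nearest-neighbour index in a vector database.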

RAG for Knowledge Intensive Tasks

Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.

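Tunable parameters for language model generation (not supported by every model): balancing correctness and creativity

1. Temperature (using OpenAI as an example)
Ranges from 0 to 2; the higher the value, the more random the output.

2. Top-P (Nucleus Sampling, using OpenAI as an example)

p ranges from 0 to 1. With p = 0.1, when generating the next token the model samples only from the smallest set of top-ranked tokens whose cumulative probability reaches 10%, not from the top 10% of tokens by count.

3. Top-K (Top-K Sampling)

k is a positive integer. With k = 32, when generating the next token the model samples only from the 32 highest-ranked tokens.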
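
To make the three parameters concrete, the sketch below applies temperature scaling, then optional top-k and top-p filtering, to a toy next-token distribution. It is a simplified illustration in NumPy, not how any particular provider implements sampling; the function name `filter_next_token_probs` and the four-token vocabulary are assumptions.

```python
import numpy as np

def filter_next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then optional top-k / top-p filtering; return renormalised probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(-probs)                 # token ids, most probable first
    keep = np.ones(len(probs), dtype=bool)

    if top_k is not None:
        keep[order[top_k:]] = False            # drop everything below rank k

    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # smallest prefix whose cumulative probability reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Toy distribution over four tokens: 0.5, 0.3, 0.15, 0.05
logits = np.log([0.5, 0.3, 0.15, 0.05])
print(filter_next_token_probs(logits, top_k=2))          # only the two best tokens survive
print(filter_next_token_probs(logits, top_p=0.1))        # only the single best token survives
print(filter_next_token_probs(logits, temperature=2.0))  # flatter than the original
```

Note that `top_p=0.1` keeps one token here because that token alone already covers more than 10% of the probability mass.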

RAG for Knowledge Intensive Tasks

Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.

non-parametric memory: a dense vector index of Wikipedia

Internal knowledge

External knowledge

Language model

vector embedding

or

graph embedding

Question

Answer

parametric memory: a pre-trained seq2seq model

Implementation approach

RAG and other techniques for LLM rethink

Internal knowledge

Factual Knowledge

Language model

vector embedding

or

graph embedding

Implementation approach

RAG

MCP

External Tool Access

Model Weight Adjustment

Finetune

Procedural Knowledge

Skills

LLM

Source 1, Image designed by Magnific, Source 3, Source4

RAG and other techniques for LLM rethink

Source: Deep Dive SKILL.md (Part 1/2)

Skill.md: Procedural knowledge of LLM

Source: https://agentskills.io/home

Contents of a Skill folder:

Skill.md: Procedural knowledge of LLM

Source: Deep Dive SKILL.md (Part 1/2)

how to use it

rules: constraints, safety

Skill.md: Procedural knowledge of LLM progressive disclosure

Image source: The Complete Guide to AI Agent Skills: From Prototype to Production

Progressive disclosure
- works at scale
- loads only the required skills, and only when they are actually needed
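
The idea of progressive disclosure can be sketched as a registry that always exposes short skill descriptions but loads the full instructions only on first use. This is a hypothetical illustration using the standard library only; the `Skill` and `SkillRegistry` classes, the skill name, and the loader function are assumptions and do not reflect any specific agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Skill:
    name: str
    description: str                  # short text that is always visible to the model
    load_body: Callable[[], str]      # fetches the full instructions lazily
    _body: Optional[str] = field(default=None, repr=False)

    def body(self) -> str:
        if self._body is None:        # load at most once, on first use
            self._body = self.load_body()
        return self._body

class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def catalog(self) -> str:
        """Compact listing for the prompt: names and descriptions only."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self._skills.values())

    def activate(self, name: str) -> str:
        """Return the full instructions of one skill, loading them on demand."""
        return self._skills[name].body()

loads = []  # records which skill bodies were actually loaded

def load_pdf_skill() -> str:
    loads.append("pdf-summary")
    return "Step 1: extract text. Step 2: summarise each section."

registry = SkillRegistry()
registry.register(Skill("pdf-summary", "Summarise PDF files", load_pdf_skill))

print(registry.catalog())   # the body has not been loaded yet
instructions = registry.activate("pdf-summary")
registry.activate("pdf-summary")   # second call reuses the cached body
print(loads)
```

Only the catalog costs context tokens up front; a skill body enters the context when the agent decides it is needed.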

Threats:

Prompt injection
Tool poisoning
Hidden malware

Legal issues

How to mitigate?

allowed-tools field
license field
...

Implementation Issues

RAG Pipeline architecture 2026

DevQuill Insights, (2026) RAG Beyond the Basics: Five Retrieval Patterns That Turn Chatbots Into Knowledge Engines, Medium

Ingestion

Retrieval

Re-Rank

Generate

Data Ingestion, Data Warehouse & Data Lake

Data Ingestion: moving data from external sources into an internal repository

data warehouse: relational, structured
data lake: original format with schema-on-read capability

Source

Preparation and Ingestion Pipeline

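Extraction Layer
- Extract plain text from PDF, HTML, Word, or Markdown files.
- Common tools: PyMuPDF, Unstructured, LlamaParse
Chunking & Preprocessing
- Split long documents into fixed-length or semantically complete chunks (cleaning: remove HTML tags, special symbols, and duplicated noise).
- Strategies: fixed length (with overlap), split by Markdown headings, or Semantic Chunking
Embedding Generation
- Convert chunks into high-dimensional numeric vectors that capture semantic features.
- Common models: OpenAI text-embedding-3-small, Cohere Embedding, BGE, or mE5
Metadata Enrichment
- Tag each chunk to enable precise filtering later.
- Common tags: source file name, page number, section title, creation date, keywords.
Vector Insertion
- Write each vector to the vector database together with its metadata and original text.
- Common databases: Pinecone, Milvus, Chroma, Qdrant, pgvector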

Ingestion Pipeline: loading & others

import hashlib
from datetime import datetime

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


class VectorIngestionPipeline:
    def __init__(self, vector_db, embedding_model):
        # vector_db: assumed to expose upsert(vectors=[(id, vector, metadata), ...])
        # embedding_model: a LangChain embeddings object, e.g. OpenAIEmbeddings
        self.vector_db = vector_db
        self.embeddings = embedding_model
        # Chunk size and overlap, measured in characters
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )

    def process_document(self, file_path, metadata=None):
        # Copy the caller's metadata so it is never mutated; handles None safely
        base_metadata = dict(metadata or {})
        base_metadata.setdefault("source", file_path)

        # 1. Extract text from the PDF
        loader = PyPDFLoader(file_path)
        documents = loader.load()

        # 2. Split into chunks
        chunks = self.text_splitter.split_documents(documents)

        # Collect the text of every chunk so embeddings can be generated in one batch
        texts = [chunk.page_content for chunk in chunks]

        # 3. Generate embeddings in one batch (fewer API round trips)
        embeddings = self.embeddings.embed_documents(texts)

        # 4. Prepare the records for a batched write
        vectors_to_upsert = []
        for i, chunk in enumerate(chunks):
            # Hash of the chunk content
            content_hash = hashlib.sha256(chunk.page_content.encode()).hexdigest()

            # Enriched metadata: must include the original text under "text"
            enriched_metadata = {
                **base_metadata,
                "text": chunk.page_content,
                "chunk_id": i,
                "page": chunk.metadata.get("page", 0),  # keep the PDF page number
                "timestamp": datetime.now().isoformat(),
                "content_hash": content_hash
            }

            vector_id = f"{base_metadata['source']}_{i}"
            vectors_to_upsert.append((vector_id, embeddings[i], enriched_metadata))

        # 5. Write everything in a single batched upsert instead of one call per chunk
        self.vector_db.upsert(vectors=vectors_to_upsert)

Reference example (pseudocode)
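
One plausible use of the `content_hash` field computed in the pipeline is to skip chunks that were already ingested, so that re-running the pipeline does not pay for embeddings twice. The sketch below assumes the set of known hashes is kept in memory; in practice it would be looked up in the vector database. The helper names are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text, matching the pipeline's hashing step."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def select_new_chunks(chunks, known_hashes):
    """Return the chunks not seen before (order preserved) and record their hashes."""
    fresh = []
    for text in chunks:
        digest = content_hash(text)
        if digest not in known_hashes:
            known_hashes.add(digest)
            fresh.append(text)
    return fresh

known = set()
first_run = select_new_chunks(["alpha", "beta"], known)
second_run = select_new_chunks(["beta", "gamma"], known)
print(first_run)   # ['alpha', 'beta']
print(second_run)  # ['gamma']
```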

Ingestion: Smart Chunking

Naive Chunking: fixed length (512) + overlap
Recursive Chunking (Semantic Chunking): sentence-based or paragraph-based.
Context Chunking: use another LLM to break the content into chunks, then augment each chunk with its surrounding context.
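
As a baseline for comparison, naive chunking can be sketched in a few lines. This version counts characters rather than tokens, and the overlap of 64 is an assumed value (the slide only specifies the length of 512); the function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64):
    """Split text into windows of `size` characters; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):   # this window already reaches the end
            break
        start += step
    return chunks

print(chunk_fixed("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.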

Ingestion: Semantic Chunking

import spacy


class SmartChunker:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        
    def semantic_chunking(self, text, max_chunk_size=1000):
        """Chunk by sentences while respecting max size"""
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]    
        chunks = []
        current_chunk = ""        
        for sentence in sentences:
            if len(current_chunk + sentence) <= max_chunk_size:
                current_chunk += sentence + " "
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = sentence + " "        
        if current_chunk:
            chunks.append(current_chunk.strip())            
        return chunks

DevQuill Insights, (2026) RAG Beyond the Basics: Five Retrieval Patterns That Turn Chatbots Into Knowledge Engines, Medium

Ingestion: Context Chunking

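# Example: Contextual chunking with metadata enrichment
# `documents`, `extract_parent_heading`, and `doc_level_summary` are placeholders
# that the surrounding application is expected to provide.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)

chunks = splitter.split_documents(documents)

# Enrich each chunk with hierarchical context
for chunk in chunks:
    chunk.metadata["section_title"] = extract_parent_heading(chunk)
    chunk.metadata["doc_summary"] = doc_level_summary
    # split_documents copies the parent document's metadata (including "source")
    # onto each chunk; fall back to a marker if the loader did not set it
    chunk.metadata.setdefault("source", "unknown")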

1. What is the natural unit of meaning in your corpus?

2. What is the average complexity of your users’ queries?

3. What is your latency budget for retrieval?

Data characteristics

Query characteristics

DevQuill Insights, (2026) RAG Beyond the Basics: Five Retrieval Patterns That Turn Chatbots Into Knowledge Engines, Medium

Summary of Chunking

DevQuill Insights, (2026) RAG Beyond the Basics: Five Retrieval Patterns That Turn Chatbots Into Knowledge Engines, Medium

Vector Embeddings types

Word embeddings
- used to represent words in NLP
- Word2Vec, GloVe, FastText
Sentence and document embeddings
- semantic meaning of sentences and documents.
- BERT, Doc2Vec
Graph embeddings
- nodes and edges of graphs in vector space
- link prediction, node classification.
Image embeddings
- images in a compact vector form
- image recognition, image classification.

Pavan Belagatti, Vector Embeddings Explained for Developers!

Vector Embeddings

Central to many NLP, recommendation, and search algorithms.

Numeric values

Objects, text, images, ...

Source

Vector Embeddings semantic similarity

Vector Space: semantic similarity

Barančíková, P., & Bojar, O. (2019). In search for linear relations in sentence embedding spaces.

Vector Embeddings creating embeddings using Huggingface

pip install -U transformers torch

from transformers import AutoTokenizer, AutoModel
import torch

def get_huggingface_embedding(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over real tokens only: the attention mask zeroes out padding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    embeddings = (summed / mask.sum(dim=1)).squeeze().numpy()
    return embeddings

# Example usage
text = "Pavan is a developer evangelist."
embedding_huggingface = get_huggingface_embedding(text)
print(embedding_huggingface)

Embedding: Sentence Transformers

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

pip install -U sentence-transformers

all-MiniLM-L6-v2

Embedding: Sentence Transformers

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")
# Run inference with queries and documents
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3011, 0.6359, 0.4930, 0.4889]])

pip install -U sentence-transformers

google/embeddinggemma-300m

Embedding: E5

from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize Multilingual-E5
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-base",
    model_kwargs={"device": "cpu"}
)

# E5 requires 'query: ' and 'passage: ' prefixes
query_text = "query: 什麼是區塊鏈？"
doc_texts = [
    "passage: 區塊鏈是一種去中心化的分散式帳本技術。",
    "passage: 台北101是台灣著名的地標建築。"
]

query_vector = embeddings.embed_query(query_text)
doc_vectors = embeddings.embed_documents(doc_texts)

intfloat/multilingual-e5-base: suitable for Traditional Chinese

Embedding: gemini API

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Initialize the Gemini embedding model (text-embedding-004 in this example).
# The API key is read from the GOOGLE_API_KEY environment variable.
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004"
)
# 1. Embed a single search question (query)
query_text = "台灣高鐵的票價如何查詢？"
query_vector = embeddings.embed_query(query_text)
# 2. Embed multiple database texts (documents)
documents = [
    "台灣高鐵提供商務車廂與標準車廂，票價依據乘車距離計算。",
    "台北到高雄的標準車廂對號座全票票價為新台幣 1,490 元。",
    "高雄捷運是服務高雄市、屏東縣及台南市的捷運系統。"
]
doc_vectors = embeddings.embed_documents(documents)

# 3. Print sanity-check information
print(f"Vector dimension: {len(query_vector)}")  # text-embedding-004 defaults to 768 dimensions
print(f"Generated {len(doc_vectors)} document vectors.")

text-embedding-004

Requires a Google AI API key

Embedding comparison

* With E5 models, the question must be manually prefixed with "query: " and each database text with "passage: "; otherwise accuracy drops substantially.
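
Because the prefixes are easy to forget, they can be added by small helpers before any text reaches the embedding model. This is a minimal sketch for E5-style models; the helper names `as_e5_query` and `as_e5_passages` are hypothetical.

```python
def as_e5_query(text: str) -> str:
    """Add the 'query: ' prefix unless it is already present."""
    return text if text.startswith("query: ") else f"query: {text}"

def as_e5_passages(texts):
    """Add the 'passage: ' prefix to every text that lacks it."""
    return [t if t.startswith("passage: ") else f"passage: {t}" for t in texts]

print(as_e5_query("什麼是區塊鏈？"))
print(as_e5_passages(["區塊鏈是一種去中心化的分散式帳本技術。", "passage: 台北101是台灣著名的地標建築。"]))
```

The results can then be passed to `embed_query` and `embed_documents` as in the E5 example.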

Ranking of Vector DBMS

https://db-engines.com/en/ranking/vector+dbms

Ranking of Vector DBMS

https://db-engines.com/en/ranking/vector+dbms/all

Retriever hybrid search

Hybrid Search: An Overview

Retriever hybrid search

Problem: semantic similarity and factual relevance are not the same thing

Vector similarity

Semantic similarity

Solutions:

exact search (lexical search) + vector search

reranker: filters out off-topic context

BM25, TF-IDF

Semantic Search

Retriever hybrid search

# Hybrid search with RRF fusion
from rank_bm25 import BM25Okapi
import numpy as np

# BM25 lexical retrieval
bm25 = BM25Okapi(tokenized_corpus)
lexical_scores = bm25.get_scores(tokenized_query)
lexical_top_k = np.argsort(lexical_scores)[-20:][::-1]
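# Dense vector retrieval
vector_results = vector_store.similarity_search(query, k=20)
# Map each hit back to its corpus index so both ranked lists share one ID space
# (assumes every document stores that index under metadata["doc_id"])
vector_doc_ids = [doc.metadata["doc_id"] for doc in vector_results]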

# Reciprocal Rank Fusion
def reciprocal_rank_fusion(ranked_lists, k=60):
    fused_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(fused_scores, key=fused_scores.get, reverse=True)

final_ranking = reciprocal_rank_fusion([lexical_top_k, vector_doc_ids])

pip install rank_bm25

DevQuill Insights, (2026) RAG Beyond the Basics: Five Retrieval Patterns That Turn Chatbots Into Knowledge Engines, Medium

Retriever hybrid search

RRF(Reciprocal Rank Fusion) Ranker

https://milvus.io/docs/zh-hant/rrf-ranker.md

Rankings from different search methods
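
A small worked example shows how RRF merges two rankings. Each document receives 1 / (k + rank) from every list it appears in, with rank starting at 1 and k = 60 as in the earlier hybrid search code. The document IDs and the two input rankings are made up for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs; return (doc_id, score) pairs, best first."""
    fused = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

lexical_ranking = ["d1", "d2", "d3"]    # e.g. from BM25
semantic_ranking = ["d3", "d1", "d4"]   # e.g. from vector search

fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
# d1 0.03252   (1/61 + 1/62)
# d3 0.03227   (1/63 + 1/61)
# d2 0.01613   (1/62)
# d4 0.01587   (1/63)
```

Documents that appear in both lists (d1, d3) rise above those found by only one method, without needing the raw BM25 and similarity scores to be on the same scale.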

User's Query Transformation

Users are terrible at writing queries.

https://www.langchain.com/blog/query-transformations

Query Transformation

Query Rewriting
Step-back Prompting
Sub-query Decomposition

Query Transformation query rewriting/ rewrite-retrieve-read

https://www.langchain.com/blog/query-transformations#rewrite-retrieve-read
Ma, Xinbei, et al. "Query rewriting in retrieval-augmented large language models." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

Query Transformation step-back prompting

https://www.langchain.com/blog/query-transformations#rewrite-retrieve-read
Zheng, Huaixiu Steven, et al. "Take a step back: Evoking reasoning via abstraction in large language models." International Conference on Learning Representations. Vol. 2024. 2024.

Pelayo Arbués, (2025), Advanced RAG Techniques: An Illustrated Overview

Query Transformation Subquery decomposition

Query Transformation Implementation

Import libraries and set environment variables

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

import os
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

# ChatOpenAI reads OPENAI_API_KEY from the environment; fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")
# Alternatively, switch to a local LLM

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

Query Transformation Implementation-query rewriting

re_write_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for query rewriting
query_rewrite_template = """You are an AI assistant tasked with reformulating user queries to improve 
retrieval in a RAG system. Given the original query, rewrite it to be more specific, detailed, and 
likely to retrieve relevant information.

Original query: {original_query}

Rewritten query:"""

query_rewrite_prompt = PromptTemplate(input_variables=["original_query"], template=query_rewrite_template)

# Build the query-rewriting chain (prompt piped into the LLM)
query_rewriter = query_rewrite_prompt | re_write_llm

def rewrite_query(original_query):
    """
    Rewrite the original query to improve retrieval.    
    Args:
    original_query (str): The original user query    
    Returns:
    str: The rewritten query
    """
    response = query_rewriter.invoke(original_query)
    return response.content

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

Query Transformation Implementation-query rewriting

# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
rewritten_query = rewrite_query(original_query)
print("Original query:", original_query)
print("\nRewritten query:", rewritten_query)

Demonstration

Original query: What are the impacts of climate change on the environment?

Rewritten query: What are the specific effects of climate change on various ecosystems, including changes in temperature, precipitation patterns, sea levels, and biodiversity?

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

Query Transformation Implementation: Step back prompting

step_back_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)


# Create a prompt template for step-back prompting
step_back_template = """You are an AI assistant tasked with generating broader, more general queries to improve context retrieval in a RAG system.
Given the original query, generate a step-back query that is more general and can help retrieve relevant background information.

Original query: {original_query}

Step-back query:"""

step_back_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=step_back_template
)

# Build the step-back chain (prompt piped into the LLM)
step_back_chain = step_back_prompt | step_back_llm

def generate_step_back_query(original_query):
    """
    Generate a step-back query to retrieve broader context.
    
    Args:
    original_query (str): The original user query
    
    Returns:
    str: The step-back query
    """
    response = step_back_chain.invoke(original_query)
    return response.content

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

Query Transformation Implementation: Step back prompting

You are an expert of world knowledge. I am going to ask you a question. 
Your response should be comprehensive and not contradicted with the following 
context if they are relevant. Otherwise, ignore them if they are not relevant.

{normal_context}
{step_back_context}

Original Question: {question}
Answer:

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

The other step back prompt template
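
The answer-stage template above takes two retrieved contexts: one for the original query and one for the step-back query. A minimal sketch of filling it with plain string formatting follows. The function name `build_step_back_prompt` and the example passages are assumptions; retrieval itself is left out.

```python
STEP_BACK_ANSWER_TEMPLATE = """You are an expert of world knowledge. I am going to ask you a question.
Your response should be comprehensive and not contradicted with the following
context if they are relevant. Otherwise, ignore them if they are not relevant.

{normal_context}
{step_back_context}

Original Question: {question}
Answer:"""

def build_step_back_prompt(question, normal_docs, step_back_docs):
    """Fill the template with passages retrieved for the original and the step-back query."""
    return STEP_BACK_ANSWER_TEMPLATE.format(
        normal_context="\n".join(normal_docs),
        step_back_context="\n".join(step_back_docs),
        question=question,
    )

prompt = build_step_back_prompt(
    question="What are the impacts of climate change on the environment?",
    normal_docs=["Passage retrieved for the original query."],      # placeholder text
    step_back_docs=["Passage retrieved for the step-back query."],  # placeholder text
)
print(prompt)
```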

Query Transformation Implementation: Step back prompting

# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
step_back_query = generate_step_back_query(original_query)
print("Original query:", original_query)
print("\nStep-back query:", step_back_query)

Demonstration

Original query: What are the impacts of climate change on the environment?

Step-back query: What are the general effects of climate change?

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

Query Transformation Sub-query Decomposition

sub_query_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for sub-query decomposition
subquery_decomposition_template = """You are an AI assistant tasked with breaking down complex queries into simpler sub-queries for a RAG system.
Given the original query, decompose it into 2-4 simpler sub-queries that, when answered together, would provide a comprehensive response to the original query.

Original query: {original_query}

example: What are the impacts of climate change on the environment?

Sub-queries:
1. What are the impacts of climate change on biodiversity?
2. How does climate change affect the oceans?
3. What are the effects of climate change on agriculture?
4. What are the impacts of climate change on human health?"""


subquery_decomposition_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=subquery_decomposition_template
)

# Build the sub-query decomposition chain (prompt piped into the LLM)
subquery_decomposer_chain = subquery_decomposition_prompt | sub_query_llm

def decompose_query(original_query: str):
    """
    Decompose the original query into simpler sub-queries.
    
    Args:
    original_query (str): The original complex query
    
    Returns:
    List[str]: A list of simpler sub-queries
    """
    response = subquery_decomposer_chain.invoke(original_query).content
    sub_queries = [q.strip() for q in response.split('\n') if q.strip() and not q.strip().startswith('Sub-queries:')]
    return sub_queries

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium

Query Transformation Implementation: Sub-query Decomposition

# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
sub_queries = decompose_query(original_query)
print("\nSub-queries:")
for i, sub_query in enumerate(sub_queries, 1):
    print(sub_query)

Demonstration

Sub-queries:
Original query: What are the impacts of climate change on the environment?
1. How does climate change affect biodiversity and ecosystems?
2. What are the impacts of climate change on oceanic conditions and marine life?
3. How does climate change influence weather patterns and extreme weather events?
4. What are the effects of climate change on terrestrial environments, such as forests and deserts?

Ashish Kumar Singh, (2025), Query Transformations for Improved Retrieval in RAG Systems, Medium
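
In the demonstration output, the echoed "Original query:" line ends up among the sub-queries because the parser keeps every non-empty line. A stricter alternative, sketched here with the standard library only, keeps just the numbered lines and strips the numbering. The function name `parse_sub_queries` is hypothetical and the sample response is abbreviated from the output above.

```python
import re

# A numbered item such as "1. text" or "2) text"
_NUMBERED_LINE = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")

def parse_sub_queries(response_text: str):
    """Return only the numbered items of an LLM response, without their numbers."""
    sub_queries = []
    for line in response_text.splitlines():
        match = _NUMBERED_LINE.match(line)
        if match:
            sub_queries.append(match.group(1))
    return sub_queries

raw_response = """Sub-queries:
Original query: What are the impacts of climate change on the environment?
1. How does climate change affect biodiversity and ecosystems?
2. What are the impacts of climate change on oceanic conditions and marine life?"""

parsed = parse_sub_queries(raw_response)
print(parsed)
```

Each parsed sub-query can then be sent to the retriever separately and the results merged before generation.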

Divy Yadav, (2025) 9 RAG Architectures Every AI Developer Must Know: A Complete Guide with Examples, Medium

Lesson 6: RAG

By Leuo-Hong Wang

