
Building a RAG Pipeline from Scratch

Retrieval-Augmented Generation (RAG) is the technique behind most production AI assistants today. Instead of asking an LLM to recall facts from training, you give it relevant context at query time. The result is more accurate, grounded, and up-to-date answers.

This article walks through building one from scratch - every layer, every decision.


Why RAG?

LLMs have two fundamental limitations:

  • Knowledge cutoff - they don't know what happened after training
  • Context size - you can't stuff an entire document library into a prompt

RAG solves both. You store your documents externally, retrieve only the relevant pieces per query, and pass those pieces as context to the LLM.

User question
    ↓
Embed question → vector
    ↓
Search vector store → top-k similar chunks
    ↓
LLM(system_prompt + chunks + question)
    ↓
Grounded answer

Step 1: Chunking

Before you can embed anything, you need to split your documents into chunks. This is more important than most tutorials admit.

Why not embed the whole document?

  • Embedding models have token limits (typically 512-8192 tokens)
  • Large chunks dilute signal - a 10-page doc embedded as one vector loses specificity
  • Retrieval precision drops - you want to fetch a paragraph, not a chapter

Chunking strategies:

Strategy                      Best for
Fixed size (e.g. 300 chars)   Simple, works everywhere
By sentence                   Better semantic coherence
By section/heading            Structured docs like resumes, reports
Recursive                     Long-form prose with nested structure

For a resume, splitting by section headings (Work Experience, Skills, Education) gives the cleanest retrieval - each chunk represents a discrete topic.

import re

# Split at the start of each known heading; the lookahead keeps the
# heading text inside the chunk that follows it
section_pattern = re.compile(
    r"\n(?=Work Experience|Education|Skills|Publications)",
    re.IGNORECASE
)
sections = section_pattern.split(text)
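For documents without reliable headings, a fixed-size chunker with overlap is a common fallback. A minimal sketch (the 300/50 defaults are illustrative, not tuned):

```python
def chunk_fixed(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks. Consecutive chunks
    share `overlap` characters, so a sentence cut at one boundary still
    appears whole in the neighboring chunk."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```

The overlap trades a little index size for robustness: facts that straddle a chunk boundary stay retrievable.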

Step 2: Embeddings

An embedding converts text into a dense float vector. Similar text produces similar vectors - that's what makes semantic search possible.

"Python developer" → [0.12, -0.34, 0.88, ...]  (1536 dimensions)
"skilled in Python" → [0.11, -0.31, 0.85, ...]  (very close)
"banana bread recipe" → [-0.72, 0.14, -0.22, ...] (far away)

Model options:

Model                   Provider        Notes
text-embedding-3-small  OpenAI          Best balance of cost/quality
text-embedding-3-large  OpenAI          Higher accuracy, 3x cost
nomic-embed-text        Ollama (local)  Free, runs on your machine
embed-english-v3.0      Cohere          Strong alternative to OpenAI

Important: Use the same model for indexing and querying. Mixing models breaks similarity search.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    return [item.embedding for item in response.data]

Step 3: Vector Store

A vector store indexes your embeddings and lets you search by similarity (cosine distance or dot product) rather than exact match.
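To make that concrete, here is a toy brute-force version of the same idea (real stores add persistence, metadata filtering, and approximate nearest-neighbor indexes; this sketch assumes L2-normalized vectors, so dot product equals cosine similarity):

```python
class ToyVectorStore:
    """Brute-force in-memory vector store: a linear scan ranked by
    dot product. Fine for a few thousand vectors, nothing more."""

    def __init__(self):
        self.vectors: list[list[float]] = []
        self.documents: list[str] = []

    def add(self, ids, embeddings, documents, metadatas) -> None:
        self.vectors += embeddings
        self.documents += documents

    def query(self, embedding: list[float], top_k: int = 3) -> list[str]:
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        # Rank every stored vector by similarity to the query
        ranked = sorted(
            zip(self.vectors, self.documents),
            key=lambda pair: dot(embedding, pair[0]),
            reverse=True,
        )
        return [doc for _, doc in ranked[:top_k]]
```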

Options by use case:

Store     Type                Best for
ChromaDB  Local/embedded      Dev, small datasets, no infra
Pinecone  Managed cloud       Production, scale, no ops
Qdrant    Self-hosted         Production + full control
FAISS     In-memory           Batch research, no persistence
pgvector  Postgres extension  Already using Postgres

For development, ChromaDB with persistence is ideal:

import chromadb
 
client = chromadb.PersistentClient(path=".chroma_db")
collection = client.get_or_create_collection("my_docs")
 
collection.add(
    ids=["chunk_0", "chunk_1"],
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["text of chunk 0", "text of chunk 1"],
    metadatas=[{"section": "Skills"}, {"section": "Experience"}]
)

Persistent storage means you only embed once - check whether the collection is already populated on startup, and subsequent runs skip indexing entirely.


Step 4: Retrieval

At query time, embed the question and find the most similar chunks:

query_vector = embed(["What languages does Manoj know?"])[0]
 
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5   # top_k - tune this to your chunk count
)
 
context = "\n---\n".join(results["documents"][0])

Tuning top_k:

  • Too low: miss relevant chunks (especially when one topic is split across multiple chunks)
  • Too high: flood the LLM with noise, degrade answer quality
  • A good default: match or slightly exceed the number of chunks in your largest section

Step 5: Generation

Pass the retrieved context to the LLM with a tight system prompt:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using ONLY the provided context. "
                "Be concise. Do not explain your reasoning."
            )
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }
    ]
)

System prompt tips:

  • "Answer using ONLY the provided context" - prevents hallucination
  • "Be concise. Do not explain your reasoning." - critical for reasoning models (Qwen3, DeepSeek-R1) which tend to over-explain
  • For Ollama/Qwen3 specifically, also pass "think": False to disable chain-of-thought mode
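As a sketch, the request body for Ollama's /api/chat endpoint might look like the following (assumes a local Ollama server; the model name qwen3 is illustrative, and "think" applies only to thinking-capable models):

```python
import json

# Request body for POST http://localhost:11434/api/chat
# "think": False disables chain-of-thought output on reasoning models
# like Qwen3; it has no effect on non-reasoning models.
payload = {
    "model": "qwen3",
    "messages": [
        {"role": "system", "content": "Answer using ONLY the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
    "think": False,
    "stream": False,
}
print(json.dumps(payload, indent=2))
```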

Making It Modular

A production RAG pipeline should decouple the LLM backend from the vector store backend. Define abstract base classes:

from abc import ABC, abstractmethod

class BaseLLM(ABC):
    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def chat(self, system: str, user: str) -> str: ...

class BaseVectorStore(ABC):
    @abstractmethod
    def add(self, ids, embeddings, documents, metadatas) -> None: ...

    @abstractmethod
    def query(self, embedding, top_k) -> list[str]: ...

Now your pipeline only depends on these interfaces. Swapping OpenAI for Ollama, or ChromaDB for Pinecone, requires zero changes to pipeline logic.
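Concretely, the pipeline itself can then be a small class that only ever calls those two interfaces. A sketch (duck-typed here so it stands alone; the name RAGPipeline is illustrative):

```python
class RAGPipeline:
    """Wires the steps together while depending only on the BaseLLM and
    BaseVectorStore interfaces, never on a concrete backend."""

    def __init__(self, llm, store, top_k: int = 5):
        self.llm = llm
        self.store = store
        self.top_k = top_k

    def index(self, chunks: list[str]) -> None:
        # Embed every chunk once and hand the vectors to the store
        vectors = self.llm.embed(chunks)
        ids = [f"chunk_{i}" for i in range(len(chunks))]
        self.store.add(ids, vectors, chunks, [{}] * len(chunks))

    def ask(self, question: str) -> str:
        # Embed the question, fetch similar chunks, generate a grounded answer
        vector = self.llm.embed([question])[0]
        context = "\n---\n".join(self.store.query(vector, self.top_k))
        system = "Answer using ONLY the provided context. Be concise."
        return self.llm.chat(system, f"Context:\n{context}\n\nQuestion: {question}")
```

Swapping backends is now a constructor argument, not a refactor.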


Common Pitfalls

1. Re-embedding on every run. Use persistent storage and check if the collection exists before indexing. Embeddings are expensive and slow.

2. top_k too low. If one section (e.g. Work Experience) splits into 6 chunks and you only fetch 3, you'll get incomplete answers. Set top_k to cover your densest section.

3. Mixing embedding models. Indexing with text-embedding-3-small and querying with text-embedding-3-large = broken results. Always use the same model end-to-end.

4. Reasoning models in verbose mode. Qwen3, DeepSeek-R1, and similar models think out loud by default. Pass "think": False in the Ollama request or use a /no_think system prompt prefix.

5. Chunk size mismatch. Chunks too small lose context; chunks too large lose precision. For most document types, 200-400 tokens per chunk is the sweet spot.


Full Stack Summary

PDF / Docs
    ↓  loader.py
Chunks (text + metadata)
    ↓  llm.embed()
Vectors (float arrays)
    ↓  vectorstore.add()
ChromaDB / Pinecone
    ↓  vectorstore.query()
Top-K Chunks
    ↓  llm.chat()
Grounded Answer

RAG is not magic - it's a retrieval system feeding a generation system. Get the retrieval right (chunking, top_k, embeddings) and the generation almost takes care of itself.