
Building a RAG Pipeline from Scratch

Retrieval-Augmented Generation (RAG) is the technique behind most production AI assistants today. Instead of asking an LLM to recall facts from training, you give it relevant context at query time. The result is more accurate, grounded, and up-to-date answers.

This article walks through building one from scratch - every layer, every decision.


Why RAG?

LLMs have two fundamental limitations:

  • Knowledge cutoff - they don't know what happened after training
  • Context size - you can't stuff an entire document library into a prompt

RAG solves both. You store your documents externally, retrieve only the relevant pieces per query, and pass those pieces as context to the LLM.

User question
    ↓
Embed question → vector
    ↓
Search vector store → top-k similar chunks
    ↓
LLM(system_prompt + chunks + question)
    ↓
Grounded answer

Step 1: Chunking

Before you can embed anything, you need to split your documents into chunks. This is more important than most tutorials admit.

Why not embed the whole document?

  • Embedding models have token limits (typically 512-8192 tokens)
  • Large chunks dilute signal - a 10-page doc embedded as one vector loses specificity
  • Retrieval precision drops - you want to fetch a paragraph, not a chapter

Chunking strategies:

Strategy                      Best for
Fixed size (e.g. 300 chars)   Simple, works everywhere
By sentence                   Better semantic coherence
By section/heading            Structured docs like resumes, reports
Recursive                     Long-form prose with nested structure

For a resume, splitting by section headings (Work Experience, Skills, Education) gives the cleanest retrieval - each chunk represents a discrete topic.

import re

# Split at the start of each known heading; the lookahead keeps the
# heading text inside the chunk that follows it
section_pattern = re.compile(
    r"\n(?=Work Experience|Education|Skills|Publications)",
    re.IGNORECASE
)
sections = section_pattern.split(text)
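For documents without reliable headings, a fixed-size chunker with overlap is a common fallback. A minimal sketch (the 300/50 defaults are illustrative, not tuned):

```python
def chunk_fixed(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks. Consecutive chunks
    share `overlap` characters, so a sentence cut at one boundary still
    appears whole in the neighboring chunk."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```

The overlap trades a little index size for robustness: facts that straddle a chunk boundary stay retrievable.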

Step 2: Embeddings

An embedding converts text into a dense float vector. Similar text produces similar vectors - that's what makes semantic search possible.

"Python developer" → [0.12, -0.34, 0.88, ...]  (1536 dimensions)
"skilled in Python" → [0.11, -0.31, 0.85, ...]  (very close)
"banana bread recipe" → [-0.72, 0.14, -0.22, ...] (far away)

Model options:

Model                   Provider        Notes
text-embedding-3-small  OpenAI          Best balance of cost/quality
text-embedding-3-large  OpenAI          Higher accuracy, 3x cost
nomic-embed-text        Ollama (local)  Free, runs on your machine
embed-english-v3.0      Cohere          Strong alternative to OpenAI

Important: Use the same model for indexing and querying. Mixing models breaks similarity search.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    return [item.embedding for item in response.data]

Step 3: Vector Store

A vector store indexes your embeddings and lets you search by similarity (cosine distance or dot product) rather than exact match.
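To make that concrete, here is a toy brute-force version of the same idea (real stores add persistence, metadata filtering, and approximate nearest-neighbor indexes; this sketch assumes L2-normalized vectors, so dot product equals cosine similarity):

```python
class ToyVectorStore:
    """Brute-force in-memory vector store: a linear scan ranked by
    dot product. Fine for a few thousand vectors, nothing more."""

    def __init__(self):
        self.vectors: list[list[float]] = []
        self.documents: list[str] = []

    def add(self, ids, embeddings, documents, metadatas) -> None:
        self.vectors += embeddings
        self.documents += documents

    def query(self, embedding: list[float], top_k: int = 3) -> list[str]:
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        # Rank every stored vector by similarity to the query
        ranked = sorted(
            zip(self.vectors, self.documents),
            key=lambda pair: dot(embedding, pair[0]),
            reverse=True,
        )
        return [doc for _, doc in ranked[:top_k]]
```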

Options by use case:

Store     Type                Best for
ChromaDB  Local/embedded      Dev, small datasets, no infra
Pinecone  Managed cloud       Production, scale, no ops
Qdrant    Self-hosted         Production + full control
FAISS     In-memory           Batch research, no persistence
pgvector  Postgres extension  Already using Postgres

For development, ChromaDB with persistence is ideal:

import chromadb
 
client = chromadb.PersistentClient(path=".chroma_db")
collection = client.get_or_create_collection("my_docs")
 
collection.add(
    ids=["chunk_0", "chunk_1"],
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["text of chunk 0", "text of chunk 1"],
    metadatas=[{"section": "Skills"}, {"section": "Experience"}]
)

Persistent storage means you only embed once - check whether the collection is already populated on startup, and subsequent runs skip indexing entirely.


Step 4: Retrieval

At query time, embed the question and find the most similar chunks:

query_vector = embed(["What languages does Manoj know?"])[0]
 
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5   # top_k - tune this to your chunk count
)
 
context = "\n---\n".join(results["documents"][0])

Tuning top_k:

  • Too low: miss relevant chunks (especially when one topic is split across multiple chunks)
  • Too high: flood the LLM with noise, degrade answer quality
  • A good default: match or slightly exceed the number of chunks in your largest section

Step 5: Generation

Pass the retrieved context to the LLM with a tight system prompt:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using ONLY the provided context. "
                "Be concise. Do not explain your reasoning."
            )
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }
    ]
)

System prompt tips:

  • "Answer using ONLY the provided context" - prevents hallucination
  • "Be concise. Do not explain your reasoning." - critical for reasoning models (Qwen3, DeepSeek-R1) which tend to over-explain
  • For Ollama/Qwen3 specifically, also pass "think": False to disable chain-of-thought mode
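As a sketch, the request body for Ollama's /api/chat endpoint might look like the following (assumes a local Ollama server; the model name qwen3 is illustrative, and "think" applies only to thinking-capable models):

```python
import json

# Request body for POST http://localhost:11434/api/chat
# "think": False disables chain-of-thought output on reasoning models
# like Qwen3; it has no effect on non-reasoning models.
payload = {
    "model": "qwen3",
    "messages": [
        {"role": "system", "content": "Answer using ONLY the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
    "think": False,
    "stream": False,
}
print(json.dumps(payload, indent=2))
```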

Making It Modular

A production RAG pipeline should decouple the LLM backend from the vector store backend. Define abstract base classes:

from abc import ABC, abstractmethod

class BaseLLM(ABC):
    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def chat(self, system: str, user: str) -> str: ...

class BaseVectorStore(ABC):
    @abstractmethod
    def add(self, ids, embeddings, documents, metadatas) -> None: ...

    @abstractmethod
    def query(self, embedding, top_k) -> list[str]: ...

Now your pipeline only depends on these interfaces. Swapping OpenAI for Ollama, or ChromaDB for Pinecone, requires zero changes to pipeline logic.
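Concretely, the pipeline itself can then be a small class that only ever calls those two interfaces. A sketch (duck-typed here so it stands alone; the name RAGPipeline is illustrative):

```python
class RAGPipeline:
    """Wires the steps together while depending only on the BaseLLM and
    BaseVectorStore interfaces, never on a concrete backend."""

    def __init__(self, llm, store, top_k: int = 5):
        self.llm = llm
        self.store = store
        self.top_k = top_k

    def index(self, chunks: list[str]) -> None:
        # Embed every chunk once and hand the vectors to the store
        vectors = self.llm.embed(chunks)
        ids = [f"chunk_{i}" for i in range(len(chunks))]
        self.store.add(ids, vectors, chunks, [{}] * len(chunks))

    def ask(self, question: str) -> str:
        # Embed the question, fetch similar chunks, generate a grounded answer
        vector = self.llm.embed([question])[0]
        context = "\n---\n".join(self.store.query(vector, self.top_k))
        system = "Answer using ONLY the provided context. Be concise."
        return self.llm.chat(system, f"Context:\n{context}\n\nQuestion: {question}")
```

Swapping backends is now a constructor argument, not a refactor.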


Common Pitfalls

1. Re-embedding on every run. Use persistent storage and check if the collection exists before indexing. Embeddings are expensive and slow.

2. top_k too low. If one section (e.g. Work Experience) splits into 6 chunks and you only fetch 3, you'll get incomplete answers. Set top_k to cover your densest section.

3. Mixing embedding models. Indexing with text-embedding-3-small and querying with text-embedding-3-large = broken results. Always use the same model end-to-end.

4. Reasoning models in verbose mode. Qwen3, DeepSeek-R1, and similar models think out loud by default. Pass "think": False in the Ollama request or use a /no_think system prompt prefix.

5. Chunk size mismatch. Chunks too small lose context; chunks too large lose precision. For most document types, 200-400 tokens per chunk is the sweet spot.


Full Stack Summary

PDF / Docs
    ↓  loader.py
Chunks (text + metadata)
    ↓  llm.embed()
Vectors (float arrays)
    ↓  vectorstore.add()
ChromaDB / Pinecone
    ↓  vectorstore.query()
Top-K Chunks
    ↓  llm.chat()
Grounded Answer

RAG is not magic - it's a retrieval system feeding a generation system. Get the retrieval right (chunking, top_k, embeddings) and the generation almost takes care of itself.