Most RAG tutorials start with LangChain or LlamaIndex. I wanted to understand what’s actually happening underneath, so I built one from scratch with just an embedding model, a vector store, and a language model.

The architecture

It’s simpler than the frameworks make it seem:

  1. Chunk your documents into passages (~500 tokens each)
  2. Embed each chunk using a sentence transformer
  3. Store the embeddings in a vector store (at this scale, a plain NumPy matrix with one row per chunk is enough)
  4. At query time, embed the question, find the k nearest chunks, and pass them as context to the LLM

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed(texts: list[str]) -> np.ndarray:
    # Unit-length vectors, so a dot product equals cosine similarity.
    return model.encode(texts, normalize_embeddings=True)

def search(query: str, index: np.ndarray, chunks: list[str], k: int = 5):
    # index holds one embedding per row, in the same order as chunks.
    query_vec = embed([query])[0]
    scores = index @ query_vec  # shape (n_chunks,)
    top_k = np.argsort(scores)[-k:][::-1]  # k highest scores, best first
    return [(chunks[i], float(scores[i])) for i in top_k]
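
Step 4 ends with passing the retrieved chunks to the LLM as context. Here is a minimal sketch of that hand-off, assuming the (chunk, score) pairs that search returns. The build_prompt name and the prompt wording are illustrative assumptions, not a fixed recipe:

```python
def build_prompt(question: str, hits: list[tuple[str, float]]) -> str:
    """Format retrieved chunks as numbered context ahead of the question."""
    # Number each chunk so the model can refer back to its sources.
    context = "\n\n".join(
        f"[{n}] {chunk}" for n, (chunk, _score) in enumerate(hits, start=1)
    )
    # Hypothetical prompt template; adjust the wording to your LLM.
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The returned string can go to whatever LLM client you use. Numbering the chunks is a small design choice that makes it easier to ask the model to cite which passage it relied on.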

What I learned

Chunking strategy matters more than the model. Overlapping windows with semantic boundary detection (paragraph breaks, headers) consistently outperform fixed-size chunks.
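
To make the idea concrete, here is a rough sketch of paragraph-aware chunking with overlap. It is a simplification under stated assumptions: it splits only on blank lines (no header detection), counts whitespace-separated words as a stand-in for tokens, and chunk_text with its parameters is a hypothetical helper, not what any library provides:

```python
import re

def chunk_text(text: str, max_words: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, repeating the last `overlap`
    paragraphs of each chunk at the start of the next one."""
    # Blank lines are treated as the semantic boundaries.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > max_words:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than max_words ends up as its own oversized chunk here. A fuller version would split it further, for example on sentence boundaries.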

Re-ranking is worth the cost. Running a simple cross-encoder re-ranker over the top 20 retrieved results, before passing the best of them to the LLM, noticeably improved answer quality.
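
A sketch of what the re-ranking step can look like. The rerank function below is a hypothetical helper that takes the scorer as a parameter, so it runs with any function that scores (query, chunk) pairs. The cross-encoder model named in the comment is one publicly available option and an assumption on my part, not necessarily the right choice for your data:

```python
from typing import Callable, Sequence

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Re-score each (query, chunk) pair jointly and keep the best top_n."""
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(chunk, float(score)) for chunk, score in ranked[:top_n]]

# With sentence-transformers, the scorer could be a cross-encoder's predict method:
#   from sentence_transformers import CrossEncoder
#   reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model choice
#   candidates = [c for c, _ in search(question, index, chunks, k=20)]
#   best = rerank(question, candidates, reranker.predict)
```

The cost is that a cross-encoder reads the query and the chunk together, so it needs one forward pass per candidate. That is why it is applied to a short list of 20 candidates and not to the whole index.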

Evaluation is the hard part. Building the retrieval pipeline takes a day. Figuring out if it’s actually working well takes weeks.
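
One place to start is measuring retrieval on its own, separately from answer quality. The sketch below assumes you have hand-labeled, for each test question, the ids of the chunks that contain the answer. The function names and the choice of hit rate and mean reciprocal rank are illustrative, and neither metric says anything about whether the LLM's final answer is correct:

```python
def hit_rate_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for ids, gold in zip(retrieved, relevant) if gold & set(ids[:k]))
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved: list[list[int]], relevant: list[set[int]]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is retrieved)."""
    total = 0.0
    for ids, gold in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# retrieved[i] is the ranked list of chunk ids returned for query i;
# relevant[i] is the set of chunk ids labeled as correct for query i.
```

Tracking these two numbers while changing the chunker or adding a re-ranker at least shows whether retrieval got better or worse, even before anyone reads the generated answers.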