Most RAG tutorials start with LangChain or LlamaIndex. I wanted to understand what’s actually happening underneath, so I built one from scratch with just an embedding model, a vector store, and a language model.
The architecture
It’s simpler than the frameworks make it seem:
- Chunk your documents into passages (~500 tokens each)
- Embed each chunk using a sentence transformer
- Store the embeddings in a vector database
- At query time, embed the question, find the k nearest chunks, and pass them as context to the LLM
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed(texts: list[str]) -> np.ndarray:
    # Unit-length vectors, so a dot product equals cosine similarity.
    return model.encode(texts, normalize_embeddings=True)

def search(query: str, index: np.ndarray, chunks: list[str], k: int = 5):
    # index holds one embedded chunk per row, in the same order as chunks.
    query_vec = embed([query])[0]            # shape (dim,)
    scores = index @ query_vec               # shape (n_chunks,)
    top_k = np.argsort(scores)[-k:][::-1]    # best match first
    return [(chunks[i], float(scores[i])) for i in top_k]
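The last step of the pipeline, passing the retrieved chunks to the LLM as context, can be sketched as a small prompt builder. This is a minimal sketch, not the prompt I settled on: `build_prompt`, the instruction wording, and the sample results are all assumptions for illustration. It only expects the same (chunk, score) pairs that `search` returns.

```python
def build_prompt(question: str, results: list[tuple[str, float]]) -> str:
    """Format retrieved (chunk, score) pairs and the question into one prompt."""
    # Number each chunk so the model can point at the passage it used.
    context = "\n\n".join(
        f"[{rank}] {chunk}" for rank, (chunk, _score) in enumerate(results, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical results in the same (chunk, score) shape that search() returns.
results = [
    ("Embeddings are normalized to unit length.", 0.82),
    ("The index is a matrix with one row per chunk.", 0.64),
]
prompt = build_prompt("How are embeddings stored?", results)
```

To connect the pieces, `index = embed(chunks)` gives the matrix that `search` expects, and the string from `build_prompt` goes to whichever LLM client you use.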
What I learned
Chunking strategy matters more than the model. Overlapping windows that break on semantic boundaries (paragraph breaks, headers) consistently outperform fixed-size chunks.
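Here is a rough sketch of what boundary-aware chunking with overlap can look like. It makes several assumptions: paragraphs are separated by blank lines, whitespace-separated words stand in for tokens, and the name `chunk_paragraphs` and its parameters are made up for this example. Header detection is left out.

```python
def chunk_paragraphs(text: str, max_words: int = 500, overlap: int = 1) -> list[str]:
    """Greedily pack whole paragraphs into chunks of roughly max_words words.

    Each new chunk starts with the last `overlap` paragraphs of the previous
    one, so text near a boundary keeps some surrounding context. Paragraphs
    are never split, so a chunk can exceed max_words.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap > 0 else []
            current_words = sum(len(p.split()) for p in current)
        current.append(para)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny example: three 3-word paragraphs with a 6-word budget.
doc = "a b c\n\nd e f\n\ng h i"
chunks = chunk_paragraphs(doc, max_words=6, overlap=1)
```

In the tiny example the middle paragraph lands in both chunks, which is the overlap doing its job. A real version would count tokens with the embedding model's tokenizer instead of splitting on whitespace.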
Re-ranking is worth the cost. Running a simple cross-encoder re-ranker over the top 20 results before passing them to the LLM improved answer quality noticeably.
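The re-ranking step itself is small: score every (query, candidate) pair with a stronger model, sort, and keep the best few. The sketch below keeps the scorer pluggable so it runs without downloading anything. `rerank` and `word_overlap` are hypothetical names, and `word_overlap` is only a toy stand-in for a cross-encoder, not a substitute for one.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Re-score each candidate against the query and keep the best top_n."""
    scored = [(chunk, score_pair(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


candidates = [
    "Vector stores hold embeddings.",
    "A cross-encoder scores a query and a passage together.",
    "Chunk size is measured in tokens.",
]
reranked = rerank(
    "how does a cross-encoder score a passage", candidates, word_overlap, top_n=2
)
```

With sentence-transformers, the real scorer would be a `CrossEncoder` whose `predict` method takes a list of (query, passage) pairs. Scoring all pairs in one batched `predict` call is likely faster than the pair-at-a-time callback used here.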
Evaluation is the hard part. Building the retrieval pipeline takes a day. Figuring out if it’s actually working well takes weeks.
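A reasonable place to start is scoring retrieval on its own, against a small hand-labeled set of questions. The sketch below computes two standard retrieval metrics, recall@k and mean reciprocal rank. The chunk ids and labels are invented for illustration, and it assumes each ranked list contains no duplicate ids.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled set: ranked chunk ids per query, plus the ids a human
# marked as relevant for that query.
labeled = [
    (["c3", "c1", "c7"], {"c1"}),
    (["c2", "c5", "c9"], {"c9", "c4"}),
]
mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in labeled) / len(labeled)
mrr = sum(reciprocal_rank(r, rel) for r, rel in labeled) / len(labeled)
```

These numbers only say whether the right chunks were retrieved. Whether the LLM then answers correctly from them is a separate question, and that is where most of the weeks go.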