Local RAG Pipeline: Query Your Documents With AI — No Cloud Required
A local RAG pipeline is a retrieval-augmented generation system that runs entirely on your own hardware, embedding your documents and answering questions from them without sending any data to external cloud services. It combines vector search with a locally-hosted language model so you get the power of document-aware AI with zero data leaving your machine.
If you’ve been watching the AI landscape, you already know that every major cloud provider wants to be the middleman between you and your own documents. OpenAI wants your PDFs. Google wants your notes. AWS wants your contracts. A local RAG pipeline is how you tell all of them to get lost. This guide is going to show you exactly how to build one from scratch — using free, open-source tools — so you can query your own data, on your own terms, with your own hardware doing the work.
Whether you’re a developer protecting client data, a researcher handling sensitive material, or just someone who’s tired of paying per-token to ask questions about files you already own, this walkthrough will get you running. We’ll cover architecture, tooling, real Python code, embedding model selection, and performance tuning. By the end, you’ll have a fully functional local RAG pipeline that costs nothing per query and leaks nothing to anyone.
Key Takeaways
- A local RAG pipeline lets you query your own documents with AI — completely offline, zero API costs
- Stack: Ollama (LLM + embeddings) + ChromaDB (vectors) + LangChain (orchestration) = roughly 100 lines of Python across two scripts
- Best embedding model for local use: nomic-embed-text (fast, 768-dim) or BGE-M3 (63.0 MTEB, multilingual)
- Chunk sizes of 500–1000 tokens with 10–20% overlap (up to about 200 tokens) are a strong starting point for retrieval accuracy
- Total cost: $0. Runs on 16GB RAM. No GPU required (but helps with speed)
Why Build a Local RAG Pipeline Instead of Using Cloud Services
Let’s be direct: the moment you upload your documents to a cloud RAG service, you’ve made a decision about trust. You’ve decided that the company hosting that service — and their subprocessors, and their security team, and their compliance framework — are all good enough to be custodians of your data. For some use cases, that’s a fine tradeoff. For most real-world use cases involving anything sensitive, it’s a terrible one.

Building a local RAG pipeline solves this at the root. Your documents never leave your machine. There’s no API key to rotate after a breach, no data processing agreement to read, no Terms of Service update that quietly grants someone new rights to your content. The privacy argument alone is compelling enough, but the practical arguments stack up fast too.
- Cost: OpenAI’s text-embedding-3-small costs $0.02 per million tokens. Text-embedding-3-large costs $0.13 per million tokens. A local RAG pipeline using nomic-embed-text through Ollama costs exactly $0 per million tokens, forever.
- Speed: Cloud embedding requires a round-trip network request for every query. A local pipeline running on even modest hardware eliminates that latency entirely. For interactive applications, this difference is immediately felt.
- Offline capability: Your local RAG pipeline works in a plane, in a SCIF, in a data center with no internet access, in a remote office with unreliable connectivity. Cloud RAG does not.
- Control: You choose the embedding model. You choose the chunk size. You choose the LLM. You choose what gets indexed. Nobody else gets a vote.
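To put the per-token prices above into concrete terms, here is a minimal sketch of the arithmetic. The corpus size (10,000 pages at roughly 500 tokens each) is a made-up assumption for illustration, and the prices are simply the ones quoted in the list.

```python
# Prices per million tokens, taken from the figures quoted above.
PRICE_PER_MILLION_USD = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "nomic-embed-text (local)": 0.0,
}


def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """Cost in USD to embed `total_tokens` tokens at a per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical corpus: 10,000 pages at roughly 500 tokens per page.
corpus_tokens = 10_000 * 500

for model, price in PRICE_PER_MILLION_USD.items():
    print(f"{model}: ${embedding_cost_usd(corpus_tokens, price):.2f}")
```

This covers only a single embedding pass over the corpus. Re-embedding after every chunking experiment, plus embedding each incoming query, multiplies the cloud figure, while the local figure stays at zero.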
The trust problem with cloud RAG is particularly acute for certain document categories. Legal contracts, medical records, company financials, source code, personal communications — these are exactly the documents you most want to query with AI, and they’re exactly the documents you least want living on someone else’s server. Stanford HAI’s research on AI privacy has consistently highlighted that users systematically underestimate how their data is used once it leaves their control. A local RAG pipeline makes that entire risk category disappear.
If you already know how to run a local LLM, you have most of the foundation you need. RAG is essentially a local LLM with a long-term memory system attached to it — and that memory system is what we’re building today.
How a Local RAG Pipeline Works (Architecture)
Before you write a single line of code, you need a clear mental model of what a local RAG pipeline actually does. The architecture is more straightforward than most people expect, and once you understand each component, the implementation choices will make intuitive sense.

A local RAG pipeline moves through six distinct stages every time it processes documents and answers a question:
- Document Loader: Reads your source files — PDFs, text files, Markdown, HTML, Word docs — and extracts raw text content from them.
- Chunking: Splits that raw text into smaller pieces of manageable size. Language models have context windows; you need your retrieval units to be small enough to be meaningful but large enough to contain useful context.
- Embedding: Converts each text chunk into a high-dimensional numerical vector using an embedding model. This vector represents the semantic meaning of the chunk in a mathematical space where similar meanings are geometrically close to each other.
- Vector Store: Stores those embeddings in a database optimized for similarity search. When you want to find relevant chunks, you search this store.
- Retrieval: At query time, your question gets embedded using the same model, and the vector store returns the K most similar chunks from your document corpus.
- LLM Generation: Those retrieved chunks get injected into a prompt as context, and your local LLM generates an answer grounded in your actual documents rather than hallucinating from training data alone.
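The embedding, vector store, and retrieval stages can be sketched in a few lines of plain Python. This is a toy illustration only: the bag-of-words `embed` function is a stand-in for a real embedding model, the list named `store` is a stand-in for a vector database, and the example sentences are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Embed the question with the same function and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


chunks = [
    "the invoice is due within thirty days",
    "the warranty covers parts and labor for two years",
    "payment of the invoice may be made by bank transfer",
]
store = [(c, embed(c)) for c in chunks]  # the "vector store"
top = retrieve("when is the invoice due", store, k=2)
print(top)
```

A real embedding model captures meaning rather than shared words, but the mechanics are the same: embed once at ingestion, embed the question at query time, and rank by similarity.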
The chunking parameters matter enormously. Practical experience points to 500–1000 tokens as a good chunk size for most document types. Go smaller and you lose context. Go larger and you dilute the semantic signal that makes retrieval accurate. Overlap between chunks should be roughly 10–20% of your chunk size, which means 100–200 tokens for a 1000-token chunk and proportionally less for smaller chunks. This overlap ensures that ideas spanning chunk boundaries don’t get lost.
On embedding dimensions: nomic-embed-text produces 768-dimensional vectors, which are fast to compute and reasonably compact to store. BGE-M3 produces 1024-dimensional vectors, giving you more expressive semantic space at the cost of slightly more compute and storage. For most local RAG pipeline use cases, 768 dimensions is more than sufficient.
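As a rough sense of what the dimension difference costs in storage, the sketch below estimates the raw vector size for a corpus. It assumes each dimension is stored as a 4-byte float32 and ignores index overhead and any stored chunk text, so treat the result as a lower bound. The 100,000-chunk corpus is a hypothetical figure.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def raw_vector_bytes(num_chunks: int, dimensions: int) -> int:
    """Bytes needed for the vectors alone, excluding index and metadata overhead."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


for name, dims in [("nomic-embed-text", 768), ("BGE-M3", 1024)]:
    mib = raw_vector_bytes(100_000, dims) / (1024 ** 2)
    print(f"{name}: {mib:.0f} MiB for 100,000 chunks")
```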
The Tool Stack — Everything You Need (All Free and Open Source)
One of the most common objections to building a local RAG pipeline is that it sounds complicated. It’s not. The modern open-source ecosystem has made this genuinely approachable. Here’s the exact stack that works, is actively maintained, and costs nothing.

- Ollama: Handles both LLM inference and embedding generation in a single, clean package. You pull models like Docker images and hit a local REST API. It runs nomic-embed-text and BGE-M3 natively, and it manages model weights, quantization, and hardware acceleration automatically. This is your inference layer.
- ChromaDB: An open-source vector database that’s Python-native, requires zero configuration to get started, and stores data persistently on disk. You can have it running in three lines of Python. For production use it scales well, but for a local RAG pipeline it’s genuinely zero-friction.
- LangChain: The orchestration layer that connects your document loaders, text splitters, embedding calls, vector store operations, and LLM prompting into coherent pipelines. LlamaIndex is a strong alternative with a different philosophy — more structured and data-centric — but LangChain’s community and documentation are unmatched for getting started quickly.
- Python: The glue holding everything together. If you’re comfortable with pip and basic Python, you have everything you need.
On hardware requirements: 16GB of RAM is the practical minimum for running a useful local RAG pipeline with a capable model like Llama 3.1 8B. 32GB is comfortable. A dedicated GPU (NVIDIA with CUDA support is best, Apple Silicon via Metal also works well) will dramatically accelerate LLM inference and, to a lesser degree, embedding generation, but a modern CPU can absolutely handle this workload for personal-scale use cases. The embedding models especially run fine on CPU; nomic-embed-text in particular is small enough (about 274MB) that CPU inference is practical.
Want to go further than RAG? The same stack that powers a local RAG pipeline can power a full autonomous agent. Check out our guide on how to build an AI agent without OpenAI for the next level up.
Setting Up Your Local RAG Pipeline Step by Step
Enough theory. Here’s the actual implementation. This assumes you’re on Linux or macOS. Windows users can follow along — Ollama has a Windows installer and everything else works the same through PowerShell or WSL.

Step 1 — Install Ollama and Pull Models
First, install Ollama. On Linux, the one-liner installer handles everything (on macOS, the usual route is the desktop app from ollama.com):
curl -fsSL https://ollama.com/install.sh | sh
Once Ollama is running, pull the LLM and embedding model you’ll use for your local RAG pipeline:
ollama pull llama3.1
ollama pull nomic-embed-text
Verify both are available:
ollama list
You should see both models listed. Ollama starts a local server on http://localhost:11434 automatically. That’s the endpoint your Python code will talk to.
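If you would rather confirm this from Python than from the shell, a minimal sketch follows. It assumes the server is at the default address above and uses Ollama's `/api/tags` endpoint, which lists locally available models. The helper names are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned above


def installed_models(tags_payload: dict) -> set[str]:
    """Return model names from an /api/tags response, without the ':tag' suffix."""
    return {m["name"].split(":")[0] for m in tags_payload.get("models", [])}


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """List the required models that the server does not report as installed."""
    have = installed_models(tags_payload)
    return [name for name in required if name not in have]


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models it has (requires Ollama running)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage, with Ollama running:
#   print(missing_models(fetch_tags(), ["llama3.1", "nomic-embed-text"]))
```

An empty list means both models are ready. Anything else names the model you still need to pull.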
Step 2 — Install ChromaDB and Dependencies
Create a virtual environment (strongly recommended) and install the required packages:
python -m venv rag-env
source rag-env/bin/activate # On Windows: rag-env\Scripts\activate
pip install chromadb langchain langchain-community langchain-ollama langchain-text-splitters pypdf
That’s your entire dependency set. No Docker required, no database server to configure, no API keys to manage. This is what a genuinely self-hosted local RAG pipeline looks like in practice.
Step 3 — Ingest Your Documents
This script loads PDFs from a directory, chunks them, embeds them using Ollama’s nomic-embed-text, and stores everything in a persistent ChromaDB collection:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Configuration
DOCS_PATH = "./documents"    # Put your PDFs here
CHROMA_PATH = "./chroma_db"  # Where ChromaDB stores its data
EMBED_MODEL = "nomic-embed-text"
# Note: RecursiveCharacterTextSplitter counts characters by default, not tokens.
# 800 characters is only a few hundred tokens of English; raise these values
# if you want chunks in the 500-1000 token range discussed earlier.
CHUNK_SIZE = 800
CHUNK_OVERLAP = 200


def ingest_documents():
    print("Loading documents...")
    loader = PyPDFDirectoryLoader(DOCS_PATH)
    documents = loader.load()  # one Document per PDF page
    print(f"Loaded {len(documents)} document pages")

    # Split into chunks, preferring paragraph and sentence boundaries
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    if not chunks:
        raise SystemExit(f"No text found in {DOCS_PATH}; nothing to index.")

    # Initialize embeddings via Ollama
    embeddings = OllamaEmbeddings(
        model=EMBED_MODEL,
        base_url="http://localhost:11434"
    )

    # Create and persist ChromaDB vector store
    print("Embedding and storing chunks (this may take a few minutes)...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=CHROMA_PATH,
        collection_name="local_rag_collection"
    )
    print(f"Successfully indexed {len(chunks)} chunks into ChromaDB")
    return vectorstore


if __name__ == "__main__":
    ingest_documents()
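Run this once per document set. ChromaDB persists to disk, so the query script can reuse the index without re-embedding anything. Be aware that running the ingestion script again on the same files adds their chunks a second time rather than skipping them, so delete the chroma_db directory (or assign stable chunk IDs) before re-ingesting.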
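One way to make re-ingestion safer is to give every chunk a deterministic ID. The sketch below derives an ID from the source path, page number, and chunk text. The dictionaries are a hypothetical stand-in for LangChain documents (whose metadata normally carries `source` and `page` for PDFs), and the helper names are illustrative.

```python
import hashlib


def chunk_id(source: str, page: int, text: str) -> str:
    """Deterministic ID: the same chunk from the same file always hashes the same."""
    digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8"))
    return digest.hexdigest()[:32]


def ids_for(chunks: list[dict]) -> list[str]:
    """Build one ID per chunk, in order."""
    return [chunk_id(c["source"], c["page"], c["text"]) for c in chunks]


# Hypothetical chunks standing in for the splitter's output.
example_chunks = [
    {"source": "documents/contract.pdf", "page": 0, "text": "Term of agreement..."},
    {"source": "documents/contract.pdf", "page": 1, "text": "Termination clause..."},
]
ids = ids_for(example_chunks)
print(ids)
```

In the ingestion script, a list like this could be passed as the `ids=` argument to `Chroma.from_documents`. As far as I know, the LangChain wrapper then overwrites existing entries with the same ID instead of appending duplicates, but verify that against the version you have installed. Two identical chunks on the same page would produce the same ID, so deduplicate before passing them in.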
Step 4 — Query Your Local RAG Pipeline
This is the query interface — the part where your local RAG pipeline actually earns its keep. It retrieves relevant chunks and generates a grounded answer:
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate

CHROMA_PATH = "./chroma_db"
EMBED_MODEL = "nomic-embed-text"  # must match the model used at ingestion
LLM_MODEL = "llama3.1"
TOP_K = 5

PROMPT_TEMPLATE = """
You are a helpful assistant answering questions based strictly on the provided context.
If the context doesn't contain enough information to answer the question, say so clearly.
Do not fabricate information.

Context:
{context}

Question: {question}

Answer:"""


def query_rag(question: str) -> str:
    # Load embeddings and the persisted vector store
    embeddings = OllamaEmbeddings(
        model=EMBED_MODEL,
        base_url="http://localhost:11434"
    )
    vectorstore = Chroma(
        persist_directory=CHROMA_PATH,
        embedding_function=embeddings,
        collection_name="local_rag_collection"
    )

    # Retrieve the TOP_K most similar chunks
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": TOP_K}
    )
    relevant_docs = retriever.invoke(question)
    if not relevant_docs:
        return "No relevant documents found for your question."

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(doc.page_content for doc in relevant_docs)

    # Collect the distinct source files for display
    sources = sorted({doc.metadata.get("source", "Unknown") for doc in relevant_docs})

    # Generate answer with local LLM
    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    llm = OllamaLLM(model=LLM_MODEL, base_url="http://localhost:11434")
    chain = prompt | llm
    answer = chain.invoke({"context": context, "question": question})

    print(f"\nSources consulted: {', '.join(sources)}")
    return answer


if __name__ == "__main__":
    while True:
        question = input("\nAsk a question (or 'quit' to exit): ").strip()
        if question.lower() in ["quit", "exit", "q"]:
            break
        if question:
            answer = query_rag(question)
            print(f"\nAnswer: {answer}")
That’s a complete, working local RAG pipeline. Drop PDFs in the documents folder, run the ingestion script, then run the query script and start asking questions. Everything happens on your hardware. Nothing leaves your network.
Choosing the Right Embedding Model
The embedding model is one of the most consequential choices in your local RAG pipeline. It determines how well semantic similarity is captured — which directly determines whether your retrieval step finds the right documents. Here are the three strongest options available through Ollama right now:

nomic-embed-text is the pragmatic default for most local RAG pipeline deployments. It’s lightweight at 274MB, produces 768-dimensional embeddings, runs efficiently on CPU, and performs well across a wide range of English-language document types. If you’re not sure what to use, start here. Ollama installs it with a single pull command and it just works.
BGE-M3 from BAAI is the power user choice. It scores 63.0 on the MTEB benchmark, supports over 100 languages, handles variable-length documents well, and is released under the MIT license. The 1024-dimensional output gives it more expressive capacity than nomic-embed-text. If you’re building a multilingual local RAG pipeline or working with technical documentation that requires precise semantic matching, BGE-M3 is the upgrade worth making.
mxbai-embed-large from Mixedbread AI rounds out the top tier. It scores exceptionally well on retrieval-specific benchmarks — which is exactly the task your local RAG pipeline needs it for. It produces 1024-dimensional embeddings and is particularly strong at asymmetric retrieval (where your query is short but your document chunks are long — the typical RAG scenario).
Here’s a quick reference comparison:
- nomic-embed-text: 768 dimensions | ~274MB | CPU-friendly | Best for: general English docs, fast iteration
- BGE-M3: 1024 dimensions | ~1.2GB | Multilingual | Best for: multilingual content, technical precision
- mxbai-embed-large: 1024 dimensions | ~670MB | Strong retrieval benchmarks | Best for: asymmetric retrieval tasks
One critical rule: whatever embedding model you use for ingestion, you must use the same model for queries. Embeddings from different models live in incompatible vector spaces. If you switch models, you need to re-embed your entire document corpus. Plan your choice before you ingest at scale.
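A cheap guard against mixing models is to record the embedding model at ingestion time and check it before every query. The sketch below keeps that record as a small JSON string. Where you store it (for example, a file next to the chroma_db directory) and the function names are assumptions for illustration.

```python
import json


def make_manifest(embed_model: str, dimensions: int) -> str:
    """Serialize the embedding settings used at ingestion time."""
    return json.dumps({"embed_model": embed_model, "dimensions": dimensions})


def assert_same_model(manifest_json: str, query_model: str) -> None:
    """Raise if the query-time model differs from the one recorded at ingestion."""
    recorded = json.loads(manifest_json)["embed_model"]
    if recorded != query_model:
        raise ValueError(
            f"Index was built with '{recorded}' but queries use '{query_model}'; "
            "re-embed the corpus or switch the query model back."
        )


manifest = make_manifest("nomic-embed-text", 768)
assert_same_model(manifest, "nomic-embed-text")  # passes silently
```

Failing loudly here is preferable to the alternative, where mismatched vectors either error out on dimension or, worse, return plausible-looking but meaningless matches.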
Performance Tuning Your Local RAG Pipeline
A basic local RAG pipeline works. A tuned local RAG pipeline works well. Here are the levers that matter most:

Chunk size optimization is where most performance gains come from. The 500–1000 token range is the empirically validated sweet spot, but the right answer within that range depends on your documents. Dense technical documentation benefits from smaller chunks (500–600 tokens) so each chunk has a tight semantic focus. Narrative text like legal contracts or research papers benefits from larger chunks (800–1000 tokens) that preserve argumentative flow. Run retrieval quality experiments with a sample of real questions before committing to a chunk size for production ingestion.
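A retrieval quality experiment does not need to be elaborate. One simple metric is hit rate: for each test question, did the file you expected show up among the retrieved chunks? The sketch below assumes you can wrap your retriever in a function that returns source names. The questions, file names, and canned results are invented placeholders.

```python
from typing import Callable


def hit_rate(
    eval_set: list[tuple[str, str]],
    retrieve_sources: Callable[[str], list[str]],
) -> float:
    """Fraction of questions whose expected source appears in the retrieved results."""
    if not eval_set:
        return 0.0
    hits = sum(1 for q, expected in eval_set if expected in retrieve_sources(q))
    return hits / len(eval_set)


# Hypothetical stand-in for a real retriever: maps a question to retrieved source files.
fake_results = {
    "What is the notice period?": ["contract.pdf", "handbook.pdf"],
    "Who approves expenses?": ["handbook.pdf"],
}
eval_set = [
    ("What is the notice period?", "contract.pdf"),
    ("Who approves expenses?", "policy.pdf"),
]
score = hit_rate(eval_set, lambda q: fake_results.get(q, []))
print(f"hit rate: {score:.2f}")
```

Re-ingest with each candidate chunk size, run the same question set, and keep the setting with the highest hit rate.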
Overlap settings prevent information loss at chunk boundaries. A 10–20% overlap, meaning 80–160 tokens for an 800-token chunk, is the standard recommendation. The ingestion script above uses 200 on a chunk size of 800 (25%), which is slightly on the generous side. More overlap means more redundancy in your vector store (and slower ingestion) but better coverage of boundary-spanning concepts. Less overlap is faster but risks losing context around split points.
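To make the mechanics of overlap concrete, here is a minimal sliding-window chunker. It is a simplified sketch, not what LangChain's splitter does internally: it works on a pre-split token list and ignores paragraph and sentence boundaries. The tiny sizes are chosen so the result is easy to inspect.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, which is why larger overlaps produce more chunks and a bigger index for the same text.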
Top-K retrieval controls how many chunks get passed to the LLM as context. K=3 to K=5 is optimal for most local RAG pipeline configurations. Going higher (K=10+) floods the LLM context window with potentially irrelevant material, which can actually hurt answer quality. Going lower (K=1 or K=2) risks missing the best chunks. Start at K=5 and tune based on your observed answer quality.
Re-ranking is a significant quality improvement for a modest complexity cost. After initial similarity retrieval, a re-ranker model scores each retrieved chunk for actual relevance to the specific query, reordering them before passing to the LLM. Cross-encoder models like BGE-Reranker-v2 can be run locally and often catch cases where cosine similarity-based retrieval ranks the wrong chunks first.
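The re-ranking step itself is structurally simple: score each retrieved chunk against the question with a second model, then keep the best few. The sketch below shows that shape with a pluggable scoring function. The `lexical_overlap` scorer is a deliberately crude stand-in for a cross-encoder, and the example sentences are invented.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Reorder retrieved chunks by a relevance score and keep the top few."""
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:keep]


def lexical_overlap(question: str, chunk: str) -> float:
    """Toy scorer: share of question words present in the chunk (cross-encoder stand-in)."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


candidates = [
    "shipping takes five business days",
    "refunds are issued within ten days of return",
    "returns require the original receipt",
]
best = rerank("how long do refunds take", candidates, lexical_overlap, keep=1)
print(best)
```

In a real pipeline you would retrieve more chunks than you need (say, 20), replace `lexical_overlap` with a call to a locally hosted re-ranker, and pass only the top few to the LLM.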
GPU acceleration matters most for the LLM inference step, less so for embedding generation. If you have an NVIDIA GPU, Ollama will automatically use CUDA. For Apple Silicon, Ollama uses Metal. Even partial GPU offloading (running some LLM layers on GPU, some on CPU) dramatically improves response time. For embedding specifically, even CPU-only systems can handle several hundred documents per minute with nomic-embed-text.
If you’re running your local RAG pipeline as part of a broader self-hosted infrastructure, pairing it with a self-hosted VPN lets you access it securely from anywhere without exposing it to the public internet — especially useful if you’re deploying on a home server or a private VPS.
When Local RAG Is Enough (and When It’s Not)
Being honest about limitations is part of what makes a local RAG pipeline worth building. It’s not the answer to every problem. Knowing where it wins and where it doesn’t will save you from building the wrong system.

Local RAG pipeline wins clearly in these scenarios:
- Personal document libraries: Notes, books, research papers, saved articles. This is the killer use case. Your data, your hardware, $0 per query.
- Company IP and proprietary documentation: Internal wikis, product specs, client contracts, source code documentation. NDAs, client agreements, and data-protection rules often restrict sending this material to a third-party embedding service.
- Legal documents: Attorney-client privilege and data residency requirements make cloud RAG genuinely problematic for law firms. Local is the easiest option to defend.
- Medical records: HIPAA in the US, GDPR in Europe, and equivalent regulations globally create significant exposure if patient data touches a third-party embedding endpoint.
- Offline and air-gapped environments: Military, government, and certain industrial contexts require systems that work without internet access. A local RAG pipeline is the only option.
- Cost-sensitive applications: High query volume scenarios where cloud costs would compound quickly — internal search tools, developer tooling, automated pipelines.
Cloud or hybrid RAG makes more sense when:
- You’re indexing 100 million+ documents and need distributed processing infrastructure that single-machine hardware can’t provide.
- You have multi-region teams who all need shared, synchronized access to the same vector store with real-time updates.
- You need enterprise SLAs with uptime guarantees, managed backups, and support contracts that someone in procurement will sign off on.
- Your team genuinely has no one capable of maintaining self-hosted infrastructure (though if you’re reading this guide, that probably isn’t you).
The hybrid approach threads this needle: embed locally using Ollama so your data never touches a cloud embedding API, but store the resulting vectors in a self-hosted Qdrant instance on a VPS. You get the privacy of local embedding with the accessibility of a remotely hosted vector store. The raw document content stays on your local machine, provided you store only vectors and IDs remotely rather than the chunk text itself. Treat the embeddings as sensitive too: published embedding-inversion research has shown that text can often be partially reconstructed from its vectors, so the VPS still needs access controls and encryption. This is the architecture pattern worth considering as you scale beyond what a single local machine can serve.
The self-hosting philosophy extends beyond RAG. Once you’ve built a local RAG pipeline, the natural next steps are integrating it with content systems — our guide on WordPress AI content generation self-hosted covers exactly that intersection — or wiring it into automation workflows using our guide on how to automate WordPress without Zapier. If you’re building out a complete self-hosted stack, don’t overlook credential security either; a self-hosted password manager should be the first thing you deploy before anything else.
The central point is this: a local RAG pipeline represents a specific and valuable position on the tradeoff curve between convenience and control. It prioritizes control — over your data, your costs, your privacy, and your infrastructure — at the cost of some convenience in setup and maintenance. For the kinds of documents and use cases where that control genuinely matters, there’s no serious alternative. And for developers who’ve already gone down the path of owning their own infrastructure, it’s the most natural extension of a philosophy that’s already proven its worth.
The tools are free. The hardware you probably already have. The only thing left is to build it.
Pirate Verdict
Every RAG-as-a-service charges you per query to search your own documents. Read that again. Your documents, your questions, their meter running. A local RAG pipeline with Ollama and ChromaDB costs exactly zero per query, runs on hardware you already own, and never sends a single byte of your proprietary data to someone else’s server. The setup takes an afternoon. The savings start on day one. Stop paying a cloud provider to read your own files back to you.
What hardware do I need to run a local RAG pipeline?
A minimum of 16GB RAM and a modern CPU. A GPU is optional but speeds up LLM inference significantly. 16GB is enough for Llama 3.1 8B; 32GB is more comfortable and leaves room for larger models. A basic setup with nomic-embed-text and Llama 3.1 8B runs on most modern laptops that meet those specs.
How many documents can a local RAG pipeline handle?
ChromaDB can handle millions of embeddings on a single machine. A typical consumer setup can process and store 10,000-50,000 document pages without issues. The limiting factor is usually memory and disk space for the vector index rather than processing power.
Is a local RAG pipeline as accurate as cloud-based solutions like ChatGPT with file uploads?
For domain-specific documents, a properly tuned local RAG pipeline often outperforms generic cloud solutions because you control chunking strategy, embedding model selection, and retrieval parameters. The quality depends on your configuration choices rather than being limited by a one-size-fits-all cloud approach.
Can I use a local RAG pipeline for production applications?
Yes. Tools like Qdrant and ChromaDB are production-grade vector databases. Combined with Ollama serving models via API, you can build production applications that serve multiple users. For higher scale, deploy on a dedicated server or VPS rather than a laptop.
What file types can I use with a local RAG pipeline?
PDFs, plain text, markdown, HTML, Word documents, CSVs, and more. LangChain provides document loaders for over 80 file formats. You can also build custom loaders for proprietary formats using Python.