post - 01
Nov 13, 2025
Exploring Local LLMs and RAG
Running large language models locally is something I've been interested in messing with for a while now, and the furthest I'd gone before this was downloading Ollama and running a couple of models on my laptop/PC. I was listening to a podcast recently about the adoption of enterprise-level LLM systems, and the biggest concern for these companies (after cost, of course) is security. They deal with a lot of data they do not want picked up by anyone else, and the alternative to shoveling your dollars over to Google, OpenAI, Anthropic, etc… could be something like running a lightweight local model inside internal tools, on user machines, or directly on company servers (with various pros and cons for each).
I want to explore building an application powered by a local model, aimed at the imagined scenario of a company or person who wants to use AI but is too cheap, or too worried about leaking data, to hand it to a third party. For now I just want to dive deeper into running things locally and test out RAG (retrieval augmented generation), which is essentially organizing your saved documents and feeding the relevant pieces to the model as context, so that even a lighter-weight model can pull facts from your own data and give better answers.
My plan for this is to run a stack that all works together:
- The brain - Ollama
- The library - ChromaDB
- The translator - all-MiniLM-L6-v2
- The orchestrator - LlamaIndex
We will use Ollama because it is easy to get up and running, and honestly I have already done so… it is the simplest, most straightforward way to get introduced to running your own local LLMs.
ChromaDB will be our library, or to be more specific, our vector database, where data is stored as mathematical representations (vectors). We will use ChromaDB because it is very easy to run and we won't need to stand up a separate, complex database server.
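To make the "stored as vectors" idea concrete, here is a tiny throwaway sketch (the collection name and sentences are just made up for illustration) that puts two sentences into an in-memory Chroma collection and asks for the closest match. Chroma's built-in default embedder handles the text-to-vector step here; in the real project that job goes to the model described next.
import chromadb
client = chromadb.Client()  # in-memory client, nothing written to disk
collection = client.create_collection("demo_docs")
# Each document string gets converted into a vector and stored
collection.add(
    documents=["The invoice is due on the 15th.", "Our office cat is named Biscuit."],
    ids=["doc1", "doc2"],
)
# The question is embedded the same way, and the nearest stored vector wins
results = collection.query(query_texts=["When is the invoice due?"], n_results=1)
print(results["documents"])  # expect the invoice sentence to come back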
For the translator I'm starting with all-MiniLM-L6-v2; it was recommended in another post I read about projects like this, so I'm keeping things simple and going with it. This model generates the vectors from our text, which will populate the database that our retrieval process pulls from.
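To get a feel for what the translator actually produces, here is a quick sketch (assuming sentence-transformers is installed, which happens in Step 1 below): each sentence becomes a 384-dimensional vector, and sentences with similar meanings end up with similar vectors.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Two sentences that say roughly the same thing in different words
vectors = model.encode([
    "The client signed the contract.",
    "The customer approved the agreement.",
])
print(vectors.shape)  # (2, 384) -- one 384-dimensional vector per sentence
# Cosine similarity close to 1.0 means "these mean similar things"
print(util.cos_sim(vectors[0], vectors[1]))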
Lastly, LlamaIndex is purpose-built for RAG and is known for being fast and simple at exactly what we need: connecting data, indexing it, and querying it.
Step 1: Setting Up the Environment
First, I needed to get the "brain" running. I used Ollama, which makes running open-source models incredibly easy.
Install Ollama: I downloaded the installer from ollama.com.
Pull the Model: I chose Meta's Llama 3.1 (8B version). It's efficient and runs well on my machine (you generally need about 8GB of RAM).
ollama pull llama3.1:8b
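Before moving on, a quick sanity check from the terminal confirms the model actually loads and responds:
ollama run llama3.1:8b "Explain RAG in one sentence."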
Next, I set up my Python environment by installing the orchestrator, the vector database, and the embedding tools.
pip install llama-index llama-index-vector-stores-chroma llama-index-llms-ollama llama-index-embeddings-huggingface chromadb sentence-transformers
Step 2: The "Indexing" Phase
The first challenge in RAG is teaching the AI what you know. You don't actually "train" the model; instead, you create a searchable index of your documents.
I created a folder called ./data and dropped in a few test files (a PDF of project specs and a text file with dummy client emails). Then, I wrote this script, index.py, to "feed" the database.
index.py
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import chromadb
# Use the local all-MiniLM-L6-v2 embedding model (otherwise LlamaIndex defaults to OpenAI's hosted embeddings)
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 1. Load local documents from the './data' folder
print("Loading documents...")
documents = SimpleDirectoryReader("./data").load_data()
# 2. Create a persistent ChromaDB database on disk
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("my_company_data")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# 3. Create the index
# This processes the documents, turns them into vectors, and stores them locally
print("Vectorizing data...")
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
print("Indexing complete! Your local data is now in the vector database.")
The Result: After running this, a new folder appeared called chroma_db. This contained the mathematical "fingerprints" of my proprietary data.
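If you want to confirm something actually made it in there, a few lines of Python (reusing the same path and collection name from index.py) will report how many chunks got stored:
import chromadb
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_collection("my_company_data")
# Each document gets split into chunks, and each chunk becomes one stored vector
print(f"Stored chunks: {collection.count()}")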
Step 3: The Retrieval Phase (Asking Questions)
Now for the fun part: actually chatting with the data.
I needed a second script that connects to that database and the local Llama 3.1 model simultaneously. Here is query.py:
query.py
from llama_index.core import VectorStoreIndex, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
import chromadb
# The question must be embedded with the same local model used during indexing
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 1. Connect to the existing local database
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_collection("my_company_data")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# 2. Connect to the local LLM via Ollama
llm = Ollama(model="llama3.1:8b", request_timeout=120.0)
# 3. Create the query engine using the local model
query_engine = index.as_query_engine(llm=llm)
# 4. Ask a question based on your documents
question = "What is the name of the client in project-specs.pdf?"
print(f"Asking: {question}")
response = query_engine.query(question)
print(f"Answer: {response}")
How It Actually Works
When I ran the query script, the response was instant and accurate. But what really impressed me was the security of the whole thing: every step stayed on my machine. Here is what happened under the hood (there's a small retrieval-only sketch after this list if you want to poke at the first two steps yourself):
- Vector Search: My question was converted into a vector locally.
- Retrieval: ChromaDB found the specific paragraph in project-specs.pdf that contained the answer.
- Synthesis: LlamaIndex combined my question + that specific paragraph into a prompt.
- Generation: That prompt was sent to the local Llama 3.1 model to generate the final English sentence.
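To watch steps 1 and 2 on their own, you can run the retriever without the LLM at all. This is a small sketch reusing the index object from query.py; it never touches the generation step:
# Retrieval only: embed the question locally, search ChromaDB, print the top matches
retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("What is the name of the client in project-specs.pdf?")
for node in nodes:
    print(f"score={node.score} | {node.node.get_content()[:120]}")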