After one too many headlines about AI companies training on user data, I decided to explore running LLMs locally. What started as a privacy experiment became my preferred way to use AI for sensitive work. Here's everything I learned.
Why Local LLMs?
Three compelling reasons:
- Privacy: Your prompts never leave your machine
- Cost: After the upfront hardware investment, inference costs nothing but electricity
- Offline access: Works on airplanes, in secure facilities, anywhere
The Hardware Reality Check
Let's be honest about what you need:
| Model Size | Minimum RAM | Recommended GPU |
|---|---|---|
| 7B parameters | 8GB | GTX 1080 / M1 Mac |
| 13B parameters | 16GB | RTX 3090 / M1 Pro |
| 70B parameters | 64GB | 2x RTX 4090 / M2 Ultra |
The good news: 7B models are surprisingly capable and run on most modern laptops.
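A rough way to sanity-check that table yourself: memory use is roughly parameter count times bytes per weight, plus some overhead for the KV cache and runtime buffers. Here's a minimal back-of-the-envelope sketch; the 20% overhead factor is my own rough assumption, not a fixed rule.

def approx_model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough estimate: parameters (in billions) times bytes per weight,
    plus ~20% overhead for KV cache and runtime buffers (the 1.2 is a guess)."""
    return params_billion * (bits_per_weight / 8) * overhead

print(approx_model_memory_gb(7, 16))  # ~16.8 GB -- a 7B model at 16-bit needs a beefy machine
print(approx_model_memory_gb(7, 4))   # ~4.2 GB  -- why quantized 7B models fit on ordinary laptops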
Getting Started with Ollama
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Download and run Mistral 7B
ollama run mistral
# Or Llama 3
ollama run llama3
# That's literally it. You're now running AI locally.
Ollama handles model downloading, quantization, and inference optimization automatically. It's the easiest way to get started.
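Under the hood, `ollama run` talks to a local server that listens on port 11434 by default, and you can call that server directly from any language over plain HTTP. Here's a minimal sketch using Python's requests library, assuming the default port and that you've already pulled llama3 (the next section covers the dedicated Python client).

import requests

# Talk to the local Ollama server directly over HTTP (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Give me one fun fact about octopuses"}],
        "stream": False,  # return a single JSON response instead of a stream
    },
)
print(resp.json()["message"]["content"])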
Using Local Models in Your Code
import ollama
# Simple chat
response = ollama.chat(
model='llama3',
messages=[{
'role': 'user',
'content': 'Explain quantum entanglement in simple terms'
}]
)
print(response['message']['content'])
# Streaming for real-time output
for chunk in ollama.chat(
model='mistral',
messages=[{'role': 'user', 'content': 'Write a short story about a robot'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
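The library also accepts an options dict for per-request generation settings. A quick sketch; the specific values below (temperature, context window) are just illustrative starting points.

# Tune generation settings per request via the options dict
response = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Summarize the benefits of local LLMs in two sentences'}],
    options={
        'temperature': 0.2,  # lower = more deterministic output
        'num_ctx': 4096,     # context window size in tokens
    }
)
print(response['message']['content'])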
Advanced: Running with llama.cpp
For maximum performance and control, use llama.cpp directly:
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Download a quantized model (much smaller than original)
# I recommend GGUF format from HuggingFace
# Run inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "What is the capital of France?" -n 100
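If you want llama.cpp's engine but prefer to stay in Python, the llama-cpp-python bindings wrap the same code. A minimal sketch, assuming you've installed llama-cpp-python and downloaded the same GGUF file as above:

from llama_cpp import Llama

# Load the quantized GGUF model; n_gpu_layers=-1 offloads as many layers as possible to the GPU
llm = Llama(
    model_path="models/mistral-7b-v0.1.Q4_K_M.gguf",
    n_ctx=2048,       # context window in tokens
    n_gpu_layers=-1,  # -1 = offload everything the GPU can hold
)

output = llm("Q: What is the capital of France? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])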
Quantization: The Secret to Running Big Models
Quantization reduces the numeric precision of a model's weights (and therefore its size) with minimal quality loss. Approximate sizes for a 7B model:
- FP16 (original): ~14GB
- Q8_0 (high quality): ~7GB
- Q4_K_M (great balance): ~4GB
- Q2_K (aggressive): ~2.5GB
I typically use Q4_K_M. It's about 95% of the original quality at 30% of the size.
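To get a quantized GGUF in the first place, one option is the huggingface_hub library. The repo and filename below are examples of the kind of community-quantized uploads you'll find; double-check that they exist before relying on them.

from huggingface_hub import hf_hub_download

# Download a specific quantization level from a community GGUF repo
# (repo_id/filename are illustrative -- browse HuggingFace for current uploads)
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
    local_dir="models",
)
print(f"Saved to {model_path}")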
Building a Local RAG System
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# All local – nothing leaves your machine
llm = Ollama(model="llama3")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create vector store
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./db")
# Add your documents
vectorstore.add_texts([
"Company policy: Remote work is allowed 3 days per week...",
"Benefits include health insurance and 401k match..."
])
# Query
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever()
)
result = qa.run("What is the remote work policy?")
print(result)
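For real documents you'll usually want to split the text into chunks before indexing, so retrieval returns focused passages instead of whole files. A sketch using LangChain's RecursiveCharacterTextSplitter (the import path varies slightly across LangChain versions); the chunk sizes are arbitrary starting points, and `handbook.txt` is a stand-in for whatever files you actually have.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split a local file into overlapping chunks so retrieval returns focused passages
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

with open("handbook.txt") as f:   # stand-in for your own documents
    chunks = splitter.split_text(f.read())

vectorstore.add_texts(chunks)     # same local Chroma store as above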
Performance Tips I've Learned
- Use Metal on Mac: Make sure GPU acceleration is enabled when building llama.cpp (`LLAMA_METAL=1 make` or `cmake -DLLAMA_METAL=on`); recent builds enable Metal by default on Apple Silicon
- Batch similar requests: Keeping the model loaded is much faster than reloading it for every call (see the keep-alive sketch after this list)
- Start with 7B models: Optimize your prompts before scaling up
- Consider the M2/M3 Macs: Unified memory is perfect for LLMs
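On the "keep the model loaded" point: Ollama unloads idle models after a few minutes by default, and the keep_alive parameter controls how long they stay resident. A sketch, assuming your Ollama version supports keep_alive; the duration is just an example.

import ollama

# keep_alive keeps the model resident in memory between requests,
# so a burst of calls doesn't pay the model-load cost each time
for question in ["What is RAG?", "What is quantization?", "What is a context window?"]:
    response = ollama.chat(
        model='llama3',
        messages=[{'role': 'user', 'content': question}],
        keep_alive='10m',  # keep the model loaded for 10 minutes after the last request
    )
    print(response['message']['content'][:80])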
When Local Makes Sense
Use local LLMs when:
- Data is sensitive (medical, financial, legal)
- You need offline access
- Cost predictability matters (high volume)
- Latency requirements are strict
Stick with cloud APIs when:
- You need cutting-edge capabilities (GPT-4 level)
- Hardware investment isn't feasible
- Uptime requirements are extreme
Local LLMs have gone from "interesting experiment" to "practical tool" in just one year. If privacy matters to you, now is the time to explore them.