After one too many headlines about AI companies training on user data, I decided to explore running LLMs locally. What started as a privacy experiment became my preferred way to use AI for sensitive work. Here's everything I learned.
Why Local LLMs?
Three compelling reasons:
- Privacy: Your prompts never leave your machine
- Cost: After the upfront hardware investment, inference costs nothing but electricity
- Offline access: Works on airplanes, in secure facilities, anywhere
The Hardware Reality Check
Let's be honest about what you need:
| Model Size | Minimum RAM | Recommended GPU |
|---|---|---|
| 7B parameters | 8GB | GTX 1080 / M1 Mac |
| 13B parameters | 16GB | RTX 3090 / M1 Pro |
| 70B parameters | 64GB | 2x RTX 4090 / M2 Ultra |
The good news: 7B models are surprisingly capable and run on most modern laptops.
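A rough way to sanity-check that table yourself: memory use is roughly parameter count times bytes per weight, plus some overhead for the KV cache and runtime buffers. Here's a minimal back-of-the-envelope sketch; the 20% overhead factor is my own rough assumption, not a fixed rule.

def approx_model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough estimate: parameters (in billions) times bytes per weight,
    plus ~20% overhead for KV cache and runtime buffers (the 1.2 is a guess)."""
    return params_billion * (bits_per_weight / 8) * overhead

print(approx_model_memory_gb(7, 16))  # ~16.8 GB -- a 7B model at 16-bit needs a beefy machine
print(approx_model_memory_gb(7, 4))   # ~4.2 GB  -- why quantized 7B models fit on ordinary laptops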
Getting Started with Ollama
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Download and run Mistral 7B
ollama run mistral
# Or Llama 3
ollama run llama3
# That's literally it. You're now running AI locally.
Ollama handles model downloading, quantization, and inference optimization automatically. It's the easiest way to get started.
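Under the hood, `ollama run` talks to a local server that listens on port 11434 by default, and you can call that server directly from any language over plain HTTP. Here's a minimal sketch using Python's requests library, assuming the default port and that you've already pulled llama3 (the next section covers the dedicated Python client).

import requests

# Talk to the local Ollama server directly over HTTP (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Give me one fun fact about octopuses"}],
        "stream": False,  # return a single JSON response instead of a stream
    },
)
print(resp.json()["message"]["content"])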
Using Local Models in Your Code
import ollama
# Simple chat
response = ollama.chat(
model='llama3',
messages=[{
'role': 'user',
'content': 'Explain quantum entanglement in simple terms'
}]
)
print(response['message']['content'])
# Streaming for real-time output
for chunk in ollama.chat(
model='mistral',
messages=[{'role': 'user', 'content': 'Write a short story about a robot'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
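The library also accepts an options dict for per-request generation settings. A quick sketch; the specific values below (temperature, context window) are just illustrative starting points.

# Tune generation settings per request via the options dict
response = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Summarize the benefits of local LLMs in two sentences'}],
    options={
        'temperature': 0.2,  # lower = more deterministic output
        'num_ctx': 4096,     # context window size in tokens
    }
)
print(response['message']['content'])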
Advanced: Running with llama.cpp
For maximum performance and control, use llama.cpp directly:
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Download a quantized model (much smaller than original)
# I recommend GGUF format from HuggingFace
# Run inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "What is the capital of France?" -n 100
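If you want llama.cpp's engine but prefer to stay in Python, the llama-cpp-python bindings wrap the same code. A minimal sketch, assuming you've installed llama-cpp-python and downloaded the same GGUF file as above:

from llama_cpp import Llama

# Load the quantized GGUF model; n_gpu_layers=-1 offloads as many layers as possible to the GPU
llm = Llama(
    model_path="models/mistral-7b-v0.1.Q4_K_M.gguf",
    n_ctx=2048,       # context window in tokens
    n_gpu_layers=-1,  # -1 = offload everything the GPU can hold
)

output = llm("Q: What is the capital of France? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])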
Quantization: The Secret to Running Big Models
Quantization reduces the numeric precision of a model's weights (and therefore its size) with minimal quality loss. Approximate sizes for a 7B model:
- FP16 (original): ~14GB
- Q8_0 (high quality): ~7GB
- Q4_K_M (great balance): ~4GB
- Q2_K (aggressive): ~2.5GB
I typically use Q4_K_M. It's about 95% of the original quality at 30% of the size.
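To get a quantized GGUF in the first place, one option is the huggingface_hub library. The repo and filename below are examples of the kind of community-quantized uploads you'll find; double-check that they exist before relying on them.

from huggingface_hub import hf_hub_download

# Download a specific quantization level from a community GGUF repo
# (repo_id/filename are illustrative -- browse HuggingFace for current uploads)
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
    local_dir="models",
)
print(f"Saved to {model_path}")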
Building a Local RAG System
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# All local – nothing leaves your machine
llm = Ollama(model="llama3")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create vector store
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./db")
# Add your documents
vectorstore.add_texts([
"Company policy: Remote work is allowed 3 days per week...",
"Benefits include health insurance and 401k match..."
])
# Query
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever()
)
result = qa.run("What is the remote work policy?")
print(result)
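For real documents you'll usually want to split the text into chunks before indexing, so retrieval returns focused passages instead of whole files. A sketch using LangChain's RecursiveCharacterTextSplitter (the import path varies slightly across LangChain versions); the chunk sizes are arbitrary starting points, and `handbook.txt` is a stand-in for whatever files you actually have.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split a local file into overlapping chunks so retrieval returns focused passages
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

with open("handbook.txt") as f:   # stand-in for your own documents
    chunks = splitter.split_text(f.read())

vectorstore.add_texts(chunks)     # same local Chroma store as above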
Performance Tips I've Learned
- Use Metal on Mac: Make sure GPU acceleration is enabled when building llama.cpp (`LLAMA_METAL=1 make` or `cmake -DLLAMA_METAL=on`); recent builds enable Metal by default on Apple Silicon
- Batch similar requests: Keeping the model loaded is much faster than reloading it for every call (see the keep-alive sketch after this list)
- Start with 7B models: Optimize your prompts before scaling up
- Consider the M2/M3 Macs: Unified memory is perfect for LLMs
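On the "keep the model loaded" point: Ollama unloads idle models after a few minutes by default, and the keep_alive parameter controls how long they stay resident. A sketch, assuming your Ollama version supports keep_alive; the duration is just an example.

import ollama

# keep_alive keeps the model resident in memory between requests,
# so a burst of calls doesn't pay the model-load cost each time
for question in ["What is RAG?", "What is quantization?", "What is a context window?"]:
    response = ollama.chat(
        model='llama3',
        messages=[{'role': 'user', 'content': question}],
        keep_alive='10m',  # keep the model loaded for 10 minutes after the last request
    )
    print(response['message']['content'][:80])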
When Local Makes Sense
Use local LLMs when:
- Data is sensitive (medical, financial, legal)
- You need offline access
- Cost predictability matters (high volume)
- Latency requirements are strict
Stick with cloud APIs when:
- You need cutting-edge capabilities (GPT-4 level)
- Hardware investment isn't feasible
- Uptime requirements are extreme
Local LLMs have gone from "interesting experiment" to "practical tool" in just one year. If privacy matters to you, now is the time to explore them.