
Running LLMs Locally: A Complete Guide to Privacy-First AI

📅 December 08, 2025 ⏱️ 2 min read 🏷️ Generative AI

After one too many headlines about AI companies training on user data, I decided to explore running LLMs locally. What started as a privacy experiment became my preferred way to use AI for sensitive work. Here's everything I learned.

Why Local LLMs?

Three compelling reasons:

  1. Privacy: Your prompts never leave your machine
  2. Cost: After hardware investment, inference is free
  3. Offline access: Works on airplanes, in secure facilities, anywhere

The Hardware Reality Check

Let's be honest about what you need:

Model Size      | Minimum RAM | Recommended GPU
7B parameters   | 8GB         | GTX 1080 / M1 Mac
13B parameters  | 16GB        | RTX 3090 / M1 Pro
70B parameters  | 64GB        | 2x RTX 4090 / M2 Ultra

The good news: 7B models are surprisingly capable and run on most modern laptops.
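
If you're not sure where your machine lands on that table, a quick memory check is a reasonable first filter. Here's a minimal sketch using the psutil package (my own helper choice, not part of any LLM tooling); the thresholds simply mirror the rows above:

import psutil  # pip install psutil

# Rough heuristic: pick the largest tier from the table above
# whose minimum RAM fits in this machine's total memory.
total_gb = psutil.virtual_memory().total / 1e9

if total_gb >= 64:
    tier = "70B"
elif total_gb >= 16:
    tier = "13B"
elif total_gb >= 8:
    tier = "7B"
else:
    tier = "something smaller than 7B (or a cloud API)"

print(f"~{total_gb:.0f} GB RAM detected -> start with a {tier} model")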

Getting Started with Ollama


# Install Ollama (macOS/Linux)
curl https://ollama.ai/install.sh | sh

# Download and run Mistral 7B
ollama run mistral

# Or Llama 3
ollama run llama3

# That's literally it. You're now running AI locally.

Ollama handles model downloading, quantization, and inference optimization automatically. It's the easiest way to get started.
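
Under the hood, Ollama runs a local HTTP server (on port 11434 by default), so you can call it from any language without an SDK. A minimal sketch with Python's requests library, assuming you've already pulled mistral as above:

import requests

# Ollama's local server listens on http://localhost:11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "In one sentence, why do local LLMs help with privacy?",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])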

Using Local Models in Your Code


import ollama

# Simple chat
response = ollama.chat(
    model='llama3',
    messages=[{
        'role': 'user',
        'content': 'Explain quantum entanglement in simple terms'
    }]
)
print(response['message']['content'])

# Streaming for real-time output
for chunk in ollama.chat(
    model='mistral',
    messages=[{'role': 'user', 'content': 'Write a short story about a robot'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

Advanced: Running with llama.cpp

For maximum performance and control, use llama.cpp directly:


# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a quantized model (much smaller than original)
# I recommend GGUF format from HuggingFace

# Run inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
       -p "What is the capital of France?" \
       -n 100
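
If you'd rather script the download step than click through HuggingFace, the huggingface_hub Python client can fetch a single GGUF file directly. A sketch, with the repo ID and filename as assumptions (they follow TheBloke's naming for the Mistral 7B quants):

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download one quantized GGUF file into ./models.
# Repo and filename are examples; swap in whichever quant you want.
path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
    local_dir="models",
)
print(f"Model saved to {path}")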

Quantization: The Secret to Running Big Models

Quantization reduces model precision (and size) with minimal quality loss. A 7B model:

  • FP16 (original): ~14GB
  • Q8_0 (high quality): ~7GB
  • Q4_K_M (great balance): ~4GB
  • Q2_K (aggressive): ~2.5GB

I typically use Q4_K_M. It's about 95% of the original quality at 30% of the size.
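
Those sizes fall out of simple arithmetic: parameters × bits per weight ÷ 8 bytes. The bits-per-weight figures below are rough averages I'm assuming for each format, but they land close to the numbers in the list:

# Approximate file size = parameters * bits_per_weight / 8 bytes.
# Bits-per-weight values are rough averages, not exact format specs.
params = 7e9  # a 7B-parameter model

formats = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,  # mixed 4- and 6-bit blocks, so a bit above 4
    "Q2_K": 2.6,
}

for name, bits in formats.items():
    size_gb = params * bits / 8 / 1e9
    print(f"{name:7s} ~{size_gb:.1f} GB")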

Building a Local RAG System


from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# All local – nothing leaves your machine
llm = Ollama(model="llama3")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create vector store
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./db")

# Add your documents
vectorstore.add_texts([
    "Company policy: Remote work is allowed 3 days per week...",
    "Benefits include health insurance and 401k match..."
])

# Query
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)

result = qa.run("What is the remote work policy?")
print(result)

Performance Tips I've Learned

  1. Use Metal on Mac: Enable GPU acceleration (`cmake -DLLAMA_METAL=on`)
  2. Batch similar requests: Keeping the model loaded is faster than reloading (see the sketch after this list)
  3. Start with 7B models: Optimize your prompts before scaling up
  4. Consider the M2/M3 Macs: Unified memory is perfect for LLMs
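
On the second tip, Ollama's API has a keep_alive setting that controls how long a model stays resident after a request. Here's a sketch of reusing a loaded model across several prompts, assuming your version of the ollama Python client passes keep_alive through (it's part of the underlying REST API):

import ollama

prompts = [
    "Explain quantization in one sentence.",
    "Explain RAG in one sentence.",
    "Explain context windows in one sentence.",
]

for p in prompts:
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": p}],
        keep_alive="10m",  # keep the model loaded for 10 minutes of idle time
    )
    print(response["message"]["content"])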

When Local Makes Sense

Use local LLMs when:

  • Data is sensitive (medical, financial, legal)
  • You need offline access
  • Cost predictability matters (high volume)
  • Latency requirements are strict

Stick with cloud APIs when:

  • You need cutting-edge capabilities (GPT-4 level)
  • Hardware investment isn't feasible
  • Uptime requirements are extreme

Local LLMs have gone from "interesting experiment" to "practical tool" in just one year. If privacy matters to you, now is the time to explore them.

🏷️ Tags:
local LLM, Ollama, privacy AI, self-hosted AI, llama.cpp
