
🧠 Comprehensive Guide to Running Local AI Models

Version 3.0 — April 2026

A community resource for developers working under internet restrictions

Philosophy: Download once, run forever offline. Everything stays on your own device: no data leaves your machine, and there are no monthly subscriptions.

🇮🇷 Persian Version (نسخه فارسی) | 📋 Contributing Guide


Table of Contents

  1. Introduction
  2. Realistic Expectations
  3. Quick Decision Matrix
  4. Hardware
  5. Inference Tools
  6. Recommended Models
  7. Step-by-Step Setup
  8. Offline Transfer (USB / Sneakernet)
  9. LAN Mirror
  10. Optimization
  11. RAG & Advanced Techniques
  12. Practical Use Cases
  13. Iran Download Sources
  14. Network Issues & Censorship
  15. Quick Reference Tables
  16. FAQ
  17. Troubleshooting
  18. Contributing
  19. Changelog

🎯 1. Introduction

This guide is written for developers working in environments with limited or censored internet access who want to use AI without depending on cloud services.

Why Local AI?

  • Privacy: No data leaves your system
  • Independence: No need for stable internet
  • Cost: No monthly fees after initial download
  • Customization: Models can be fine-tuned
  • Learning: Deeper understanding of how AI works

⚖️ 2. Realistic Expectations

Local models in 2026 have reached a remarkable level. The gap with cloud models (Claude, GPT-5) has narrowed significantly, but still exists.

Local models excel at: everyday chat, coding assistance, translation, boilerplate generation, RAG on your own documents.

Cloud models still lead in: complex multi-step reasoning, high-level creative tasks, very difficult math/algorithmic problems.

Key insight: A good local model + RAG on relevant documents is fully competitive for most daily work tasks.

Real-World Speed Benchmarks

| Hardware | Model | Speed |
|---|---|---|
| Raspberry Pi 5 (8GB) | Gemma 3 1B Q4 | ~7-8 tok/s |
| Laptop CPU (16GB RAM) | Qwen 7B Q4 | ~3-8 tok/s |
| Laptop with 8GB GPU | Qwen 7B Q4 | ~30-50 tok/s |
| Desktop RTX 4090 (24GB) | Qwen 32B Q4 | ~25-40 tok/s |

🗺️ 3. Quick Decision Matrix

| What I Have | What to Install | Which Model |
|---|---|---|
| Raspberry Pi only | llama.cpp | Gemma 3 1B or Qwen 3B |
| Laptop, no GPU (8GB) | LM Studio | Phi-4 Mini 3.8B Q4 |
| Laptop, no GPU (16GB) | Ollama | Qwen3 7B Q4 |
| Gaming laptop (8GB GPU) | Ollama + Continue.dev | Qwen3 7B Q4 |
| Desktop GPU 24GB | Ollama/vLLM + Open WebUI | Qwen3 32B Q4 |
| Multi-GPU / Server | vLLM/SGLang | Qwen3.5-122B-A10B MoE |
| Mac Apple Silicon 32GB+ | Ollama (MLX) | Qwen3.5-35B-A3B |

🖥️ 4. Hardware

🍓 Raspberry Pi

  • Pi 5 with 8GB RAM minimum for acceptable experience
  • SSD via USB 3.0 (not SD Card!) — at least 64GB
  • Active cooling required
  • Models up to 3B run well (4-8 tok/s); 7B and larger are impractical

💻 Laptop

| RAM | Recommended Model | Speed |
|---|---|---|
| 8GB | Phi-4 Mini 3.8B Q4_K_M | 5-10 tok/s |
| 16GB | Qwen3 7B Q4_K_M | 3-8 tok/s |
| 32GB | Qwen3 14B Q4_K_M | 3-6 tok/s |

🎮 Gaming Desktop

  • 8-12GB VRAM: 7-14B models comfortably
  • 24GB VRAM: 32B models, sweet spot for serious work
  • Multi-GPU: Required for large MoE models (DeepSeek-V3, Qwen3.5-397B)

🍎 Apple Silicon

  • Unified Memory: RAM and VRAM shared — can run larger models
  • Ollama 0.19 (April 2026): Built on MLX framework, ~1.6x prefill and ~2x decode speedup
  • M2 Pro/Max 32GB: Qwen3.5-35B-A3B works well
  • M5 chips see the largest improvements

🛠️ 5. Inference Tools

| Tool | Type | Best For | Notes |
|---|---|---|---|
| Ollama | CLI/Server | Quick local setup | Simplest to start |
| LM Studio | GUI | Beginners | Visual interface |
| llama.cpp | Engine | Dense models, CPU/GPU | Fastest for dense; not for MoE |
| vLLM | Engine | Production, MoE | High throughput |
| SGLang | Engine | MoE, DeepSeek, agents | Often faster than vLLM for MoE |
| Open WebUI | Web UI | Team access | Works with any backend |
| Jan | Desktop | Offline use | No Docker needed |
| Llamafile | Standalone | Portable | Single executable |
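
For the portable case, a llamafile is a single self-contained executable that bundles model weights with the llama.cpp runtime. A minimal sketch, assuming you have already downloaded a build (the filename model.llamafile is illustrative):

```bash
# A llamafile bundles model weights and the llama.cpp runtime into one file.
# "model.llamafile" is a placeholder for whichever build you downloaded.
chmod +x model.llamafile
./model.llamafile          # serves a local chat UI in your browser
```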

📦 6. Recommended Models (April 2026)

Small (≤3B)

Gemma 3 1B, Qwen3.5 0.8B/2B, Phi-4 Mini 3.8B, SmolLM2 1.7B

Medium (7B-14B)

Qwen3 7B (best all-rounder), Qwen3.5 9B (multimodal, 262K context), Llama 3.3 8B, Gemma 3 12B, Qwen 2.5 Coder 14B

Large (32B-70B)

Qwen3 32B, Qwen3.5 27B, Gemma 4 31B (new April 2026), Qwen 2.5 Coder 32B

MoE

Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Llama 4 Scout/Maverick, DeepSeek-V3

Code-Specific

Qwen 2.5 Coder (7B/14B/32B), DeepSeek-Coder-V2

⚠️ GPT-OSS Warning

  • 20B version: severe hallucination issues
  • 120B version: impractical without heavy quantization that degrades quality
  • Recommendation: Use Qwen3/3.5 or DeepSeek instead

🚀 7. Step-by-Step Setup

Linux/Mac

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:7b
```

Windows

Download the installer from ollama.com/download/windows, run it, then open a terminal and run `ollama run qwen3:7b`.

Raspberry Pi

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:1b
```
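
Whichever platform you are on, you can verify the install by querying Ollama's local REST API. A minimal check, assuming the server is running on its default port and you pulled qwen3:7b as above:

```bash
# Sanity-check the Ollama server over its local REST API (default port 11434).
# Assumes qwen3:7b has already been pulled as shown above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:7b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```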

📦 8. Offline Transfer

Transfer models via USB: copy ~/.ollama/models/ to USB, set OLLAMA_MODELS=/path/to/usb/models on target machine.
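A minimal sketch of that workflow, assuming the USB drive is mounted at /mnt/usb (paths are illustrative):

```bash
# Source machine: copy the Ollama model store to the USB drive
# (mounted here at /mnt/usb; adjust for your system).
cp -r ~/.ollama/models /mnt/usb/models

# Target machine: point Ollama at the copied store.
# OLLAMA_MODELS must be visible to the *server* process
# (on Linux with systemd, set it in the service unit).
export OLLAMA_MODELS=/mnt/usb/models
ollama list          # should show the transferred models
```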

For GGUF files: download them from HuggingFace and use them directly with llama.cpp, or import them into LM Studio.


⚡ 10. Optimization

Quantization Quick Guide

  • Q4_K_M — the gold standard; start here
  • Q5_K_M — if you have spare RAM/VRAM
  • Q3_K_M — when RAM is very limited
  • FP16 — only for professional GPUs
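
A rough sizing rule: file size ≈ parameters × bits per weight / 8, where Q4_K_M averages roughly 4.5-4.8 bits per weight (an approximation; actual GGUF sizes vary slightly). For example:

```bash
# Back-of-envelope size for a 7B model at Q4_K_M (~4.8 bits/weight, approximate).
awk 'BEGIN { printf "7B @ Q4_K_M = %.1f GB approx.\n", 7e9 * 4.8 / 8 / 1e9 }'
# Prints ~4.2 GB; leave headroom for the KV cache and runtime overhead.
```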

Key Settings

  • Flash Attention: OLLAMA_FLASH_ATTENTION=1
  • KV Cache Quantization for longer contexts
  • Speculative Decoding for 2-3x speed boost
  • Qwen3.5 supports Multi-Token Prediction (MTP)
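
For Ollama, the first two items are environment variables; a minimal sketch (note that KV cache quantization requires flash attention to be enabled):

```bash
# Set these in the environment of the Ollama *server* process.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # options: f16 (default), q8_0, q4_0
ollama serve
```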

🔍 11. RAG & Advanced Techniques

Tools: Open WebUI (built-in RAG), AnythingLLM, PrivateGPT, Kotaemon

IDE Integration: Continue.dev (VS Code/JetBrains), Cline (VS Code), Aider (CLI)
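
As one concrete offline-friendly path, Open WebUI installs via pip (it requires Python 3.11) and auto-detects a local Ollama server; a minimal sketch:

```bash
# Install and launch Open WebUI against a local Ollama backend.
# Its built-in RAG lets you upload documents and chat over them.
pip install open-webui
open-webui serve           # then browse to http://localhost:8080
```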


📊 15. Quick Reference Tables

Quantization Selection

| Situation | Use |
|---|---|
| Plenty of VRAM/RAM | Q5_K_M |
| Best balance | Q4_K_M (default) |
| Very low RAM | Q3_K_M |
| Professional GPU | FP16 or Q8_0 |

❓ 16. FAQ

Q: Minimum hardware to start? A: Laptop with 8GB RAM. Phi-4 Mini 3.8B Q4 runs at 5-10 tok/s on CPU.

Q: Worth it without a GPU? A: Yes! 3-7B models work well on CPU. Slower but fully usable.

Q: Which model for multilingual? A: Qwen3 and Qwen3.5 support 200+ languages.

Q: Does AMD GPU work? A: Yes, with ROCm on Linux. Ollama supports ROCm 7.


📜 License

This project is licensed under CC BY-SA 4.0.


Built with ❤️ by the community

Arman-g7 | April 2026
