Version 3.0 — April 2026
A community resource for developers working under internet restrictions
Philosophy: Download once, run forever offline. Everything on your own device, no data leaves your machine, no monthly subscriptions.
🇮🇷 Persian Version (نسخه فارسی) | 📋 Contributing Guide
- Introduction
- Realistic Expectations
- Quick Decision Matrix
- Hardware
- Inference Tools
- Recommended Models
- Step-by-Step Setup
- Offline Transfer (USB / Sneakernet)
- LAN Mirror
- Optimization
- RAG & Advanced Techniques
- Practical Use Cases
- Iran Download Sources
- Network Issues & Censorship
- Quick Reference Tables
- FAQ
- Troubleshooting
- Contributing
- Changelog
This guide is written for developers working in environments with limited or censored internet access who want to use AI without depending on cloud services.
Why Local AI?
- Privacy: No data leaves your system
- Independence: No need for stable internet
- Cost: No monthly fees after initial download
- Customization: Models can be fine-tuned
- Learning: Deeper understanding of how AI works
Local models in 2026 have reached a remarkable level. The gap with cloud models (Claude, GPT-5) has narrowed significantly, but still exists.
Local models excel at: everyday chat, coding assistance, translation, boilerplate generation, RAG on your own documents.
Cloud models still lead in: complex multi-step reasoning, high-level creative tasks, very difficult math/algorithmic problems.
Key insight: A good local model + RAG on relevant documents is fully competitive for most daily work tasks.
| Hardware | Model | Speed |
|---|---|---|
| Raspberry Pi 5 (8GB) | Gemma 3 1B Q4 | ~7-8 tok/s |
| Laptop CPU (16GB RAM) | Qwen 7B Q4 | ~3-8 tok/s |
| Laptop with 8GB GPU | Qwen 7B Q4 | ~30-50 tok/s |
| Desktop RTX 4090 (24GB) | Qwen 32B Q4 | ~25-40 tok/s |
| What I Have | What to Install | Which Model |
|---|---|---|
| Raspberry Pi only | llama.cpp | Gemma 3 1B or Qwen 3B |
| Laptop, no GPU (8GB) | LM Studio | Phi-4 Mini 3.8B Q4 |
| Laptop, no GPU (16GB) | Ollama | Qwen3 7B Q4 |
| Gaming laptop (8GB GPU) | Ollama + Continue.dev | Qwen3 7B Q4 |
| Desktop GPU 24GB | Ollama/vLLM + Open WebUI | Qwen3 32B Q4 |
| Multi-GPU / Server | vLLM/SGLang | Qwen3.5-122B-A10B MoE |
| Mac Apple Silicon 32GB+ | Ollama (MLX) | Qwen3.5-35B-A3B |
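If you are not sure which row of the matrix applies to your machine, a quick hardware check helps. This is a minimal sketch assuming a Linux shell and, optionally, an NVIDIA GPU; on a machine without a dedicated GPU only the first command matters.

```bash
# Total RAM available to the system
free -h
# Name and VRAM of any NVIDIA GPU (skip if you have no dedicated GPU)
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```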
- Pi 5 with 8GB RAM minimum for acceptable experience
- SSD via USB 3.0 (not SD Card!) — at least 64GB
- Active cooling required
- Models up to 3B work well (4-8 tok/s); 7B and larger is impractical (see the llama.cpp sketch after this list)
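For the Pi, a minimal build-and-run sketch with llama.cpp, assuming the model file has already been copied onto the SSD; the mount point and GGUF filename are illustrative.

```bash
# Build llama.cpp from source (CPU-only build)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4

# Run a small quantized model from the SSD (path and filename are examples)
./build/bin/llama-cli -m /mnt/ssd/models/gemma-3-1b-it-Q4_K_M.gguf -p "Hello from the Pi" -n 128
```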
| RAM | Recommended Model | Speed |
|---|---|---|
| 8GB | Phi-4 Mini 3.8B Q4_K_M | 5-10 tok/s |
| 16GB | Qwen3 7B Q4_K_M | 3-8 tok/s |
| 32GB | Qwen3 14B Q4_K_M | 3-6 tok/s |
- 8-12GB VRAM: 7-14B models comfortably
- 24GB VRAM: 32B models, sweet spot for serious work
- Multi-GPU: Required for large MoE models (DeepSeek-V3, Qwen3.5-397B)
- Unified Memory: RAM and VRAM shared — can run larger models
- Ollama 0.19 (April 2026): Built on MLX framework, ~1.6x prefill and ~2x decode speedup
- M2 Pro/Max 32GB: Qwen3.5-35B-A3B works well (confirm GPU offload with the check after this list)
- M5 chips see the largest improvements
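Whichever chip you have, it is worth confirming that the model actually loaded into unified memory on the GPU rather than falling back to the CPU; `ollama ps` reports this once a model has been started.

```bash
# Start a model briefly, then check where it is running
ollama run qwen3:7b "hello"
ollama ps   # the PROCESSOR column should read "100% GPU" when Metal offload is working
```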
| Tool | Type | Best For | Notes |
|---|---|---|---|
| Ollama | CLI/Server | Quick local setup | Simplest to start |
| LM Studio | GUI | Beginners | Visual interface |
| llama.cpp | Engine | Dense models, CPU/GPU | Fastest for dense; less efficient for large MoE |
| vLLM | Engine | Production, MoE | High throughput |
| SGLang | Engine | MoE, DeepSeek, agents | Often faster than vLLM for MoE |
| Open WebUI | Web UI | Team access | Works with any backend |
| Jan | Desktop | Offline use | No Docker needed |
| Llamafile | Standalone | Portable | Single executable |
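As a sketch for the Open WebUI row above: the standard Docker invocation below assumes Docker is installed, Ollama is already running on the host, and the container image has been pulled (or transferred offline beforehand).

```bash
# Run Open WebUI and let it reach the host's Ollama API
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
# Then browse to http://localhost:3000, or http://<server-ip>:3000 from other machines on the LAN
```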
Tiny (≤4B, CPU / Raspberry Pi): Gemma 3 1B, Qwen3.5 0.8B/2B, Phi-4 Mini 3.8B, SmolLM2 1.7B
Small to mid (7-14B): Qwen3 7B (best all-rounder), Qwen3.5 9B (multimodal, 262K context), Llama 3.3 8B, Gemma 3 12B, Qwen 2.5 Coder 14B
Large dense (27-32B): Qwen3 32B, Qwen3.5 27B, Gemma 4 31B (new April 2026), Qwen 2.5 Coder 32B
Mixture-of-Experts: Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Llama 4 Scout/Maverick, DeepSeek-V3
Coding: Qwen 2.5 Coder (7B/14B/32B), DeepSeek-Coder-V2
- 20B version: severe hallucination issues
- 120B version: impractical without heavy quantization that degrades quality
- Recommendation: Use Qwen3/3.5 or DeepSeek instead
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:7b
```

On Windows: download the installer from ollama.com/download/windows, install it, then run `ollama run qwen3:7b`.
On a machine with internet access:

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:1b
```

Transfer models via USB: copy `~/.ollama/models/` to the USB drive and set `OLLAMA_MODELS=/path/to/usb/models` on the target machine, as in the sketch below.
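A concrete sneakernet sketch, assuming the USB drive is mounted at /media/usb; adjust paths to your system.

```bash
# On the machine that downloaded the models
cp -r ~/.ollama/models /media/usb/ollama-models

# On the offline target machine
export OLLAMA_MODELS=/media/usb/ollama-models   # or copy to local disk first
ollama serve &
ollama list   # the transferred models should now be listed
```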
For GGUF files: download from HuggingFace, use directly with llama.cpp or import into LM Studio.
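For the HuggingFace route, the CLI can fetch a single quantized file instead of the whole repository; the repo name and filename below are illustrative, not a specific recommendation.

```bash
pip install -U "huggingface_hub[cli]"
# Download only the Q4_K_M file (repo/filename are examples)
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF \
  Qwen2.5-7B-Instruct-Q4_K_M.gguf --local-dir ./models
# Run it directly with llama.cpp
./build/bin/llama-cli -m ./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf -p "Hello"
```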
- Q4_K_M: the golden standard; start here (see the pull example after this list)
- Q5_K_M — if you have spare RAM/VRAM
- Q3_K_M — when RAM is very limited
- FP16 — only for professional GPUs
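With Ollama, a specific quantization is selected through the model tag; the tag below is illustrative, and `ollama show` confirms what was actually pulled.

```bash
# Pull an explicit quantization instead of the default tag (tag is an example)
ollama pull qwen2.5:7b-instruct-q4_K_M
# Inspect parameter count, quantization level, and context length
ollama show qwen2.5:7b-instruct-q4_K_M
```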
- Flash Attention: enable with `OLLAMA_FLASH_ATTENTION=1` (see the sketch after this list)
- KV Cache Quantization for longer contexts
- Speculative Decoding for 2-3x speed boost
- Qwen3.5 supports Multi-Token Prediction (MTP)
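A minimal sketch of the first two items using Ollama's environment variables; note that KV cache quantization only takes effect when Flash Attention is enabled.

```bash
# Enable flash attention and an 8-bit quantized KV cache, then start the server
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # options: f16 (default), q8_0, q4_0
ollama serve
```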
Tools: Open WebUI (built-in RAG), AnythingLLM, PrivateGPT, Kotaemon
IDE Integration: Continue.dev (VS Code/JetBrains), Cline (VS Code), Aider (CLI)
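As one example of wiring a CLI assistant to a local model, Aider can talk to Ollama's API. A sketch assuming a locally pulled coder model; use whatever tag you actually have.

```bash
# Point Aider at the local Ollama server and a local coding model
export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama_chat/qwen2.5-coder:14b
```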
| Situation | Use |
|---|---|
| Plenty of VRAM/RAM | Q5_K_M |
| Best balance | Q4_K_M (default) |
| Very low RAM | Q3_K_M |
| Professional GPU | FP16 or Q8_0 |
Q: Minimum hardware to start? A: Laptop with 8GB RAM. Phi-4 Mini 3.8B Q4 runs at 5-10 tok/s on CPU.
Q: Worth it without a GPU? A: Yes! 3-7B models work well on CPU. Slower but fully usable.
Q: Which model for multilingual? A: Qwen3 and Qwen3.5 support 200+ languages.
Q: Does AMD GPU work? A: Yes, with ROCm on Linux. Ollama supports ROCm 7.
This project is licensed under CC BY-SA 4.0.
Built with ❤️ by the community
Arman-g7 | April 2026