
🧠 Comprehensive Guide to Running Local AI Models

Version 3.0 — April 2026

A community resource for developers working under internet restrictions

Philosophy: Download once, run forever offline. Everything stays on your own device: no data leaves your machine, and there are no monthly subscriptions.

🇮🇷 Persian Version (نسخه فارسی) | 📋 Contributing Guide


Table of Contents

  1. Introduction
  2. Realistic Expectations
  3. Quick Decision Matrix
  4. Hardware
  5. Inference Tools
  6. Recommended Models
  7. Step-by-Step Setup
  8. Offline Transfer (USB / Sneakernet)
  9. LAN Mirror
  10. Optimization
  11. RAG & Advanced Techniques
  12. Practical Use Cases
  13. Iran Download Sources
  14. Network Issues & Censorship
  15. Quick Reference Tables
  16. FAQ
  17. Troubleshooting
  18. Contributing
  19. Changelog

🎯 1. Introduction

This guide is written for developers working in environments with limited or censored internet access who want to use AI without depending on cloud services.

Why Local AI?

  • Privacy: No data leaves your system
  • Independence: No need for stable internet
  • Cost: No monthly fees after initial download
  • Customization: Models can be fine-tuned
  • Learning: Deeper understanding of how AI works

⚖️ 2. Realistic Expectations

Local models in 2026 have reached a remarkable level. The gap with cloud models (Claude, GPT-5) has narrowed significantly, but still exists.

Local models excel at: everyday chat, coding assistance, translation, boilerplate generation, RAG on your own documents.

Cloud models still lead in: complex multi-step reasoning, high-level creative tasks, very difficult math/algorithmic problems.

Key insight: A good local model + RAG on relevant documents is fully competitive for most daily work tasks.

Real-World Speed Benchmarks

| Hardware | Model | Speed |
|---|---|---|
| Raspberry Pi 5 (8GB) | Gemma 3 1B Q4 | ~7-8 tok/s |
| Laptop CPU (16GB RAM) | Qwen 7B Q4 | ~3-8 tok/s |
| Laptop with 8GB GPU | Qwen 7B Q4 | ~30-50 tok/s |
| Desktop RTX 4090 (24GB) | Qwen 32B Q4 | ~25-40 tok/s |

🗺️ 3. Quick Decision Matrix

| What I Have | What to Install | Which Model |
|---|---|---|
| Raspberry Pi only | llama.cpp | Gemma 3 1B or Qwen 3B |
| Laptop, no GPU (8GB) | LM Studio | Phi-4 Mini 3.8B Q4 |
| Laptop, no GPU (16GB) | Ollama | Qwen3 7B Q4 |
| Gaming laptop (8GB GPU) | Ollama + Continue.dev | Qwen3 7B Q4 |
| Desktop GPU 24GB | Ollama/vLLM + Open WebUI | Qwen3 32B Q4 |
| Multi-GPU / Server | vLLM/SGLang | Qwen3.5-122B-A10B MoE |
| Mac Apple Silicon 32GB+ | Ollama (MLX) | Qwen3.5-35B-A3B |

🖥️ 4. Hardware

🍓 Raspberry Pi

  • Pi 5 with 8GB RAM minimum for acceptable experience
  • SSD via USB 3.0 (not SD Card!) — at least 64GB
  • Active cooling required
  • Models up to 3B run well (4-8 tok/s); 7B and larger are impractical

💻 Laptop

| RAM | Recommended Model | Speed |
|---|---|---|
| 8GB | Phi-4 Mini 3.8B Q4_K_M | 5-10 tok/s |
| 16GB | Qwen3 7B Q4_K_M | 3-8 tok/s |
| 32GB | Qwen3 14B Q4_K_M | 3-6 tok/s |

🎮 Gaming Desktop

  • 8-12GB VRAM: 7-14B models comfortably
  • 24GB VRAM: 32B models, sweet spot for serious work
  • Multi-GPU: Required for large MoE models (DeepSeek-V3, Qwen3.5-397B)

🍎 Apple Silicon

  • Unified Memory: RAM and VRAM shared — can run larger models
  • Ollama 0.19 (April 2026): Built on MLX framework, ~1.6x prefill and ~2x decode speedup
  • M2 Pro/Max 32GB: Qwen3.5-35B-A3B works well
  • M5 chips see the largest improvements

🛠️ 5. Inference Tools

| Tool | Type | Best For | Notes |
|---|---|---|---|
| Ollama | CLI/Server | Quick local setup | Simplest to start |
| LM Studio | GUI | Beginners | Visual interface |
| llama.cpp | Engine | Dense models, CPU/GPU | Fastest for dense; not for MoE |
| vLLM | Engine | Production, MoE | High throughput |
| SGLang | Engine | MoE, DeepSeek, agents | Often faster than vLLM for MoE |
| Open WebUI | Web UI | Team access | Works with any backend |
| Jan | Desktop | Offline use | No Docker needed |
| Llamafile | Standalone | Portable | Single executable |
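
For the portable case, a llamafile is a single self-contained executable that bundles model weights with the llama.cpp runtime. A minimal sketch, assuming you have already downloaded a build (the filename model.llamafile is illustrative):

```bash
# A llamafile bundles model weights and the llama.cpp runtime into one file.
# "model.llamafile" is a placeholder for whichever build you downloaded.
chmod +x model.llamafile
./model.llamafile          # serves a local chat UI in your browser
```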

📦 6. Recommended Models (April 2026)

Small (≤3B)

Gemma 3 1B, Qwen3.5 0.8B/2B, Phi-4 Mini 3.8B, SmolLM2 1.7B

Medium (7B-14B)

Qwen3 7B (best all-rounder), Qwen3.5 9B (multimodal, 262K context), Llama 3.3 8B, Gemma 3 12B, Qwen 2.5 Coder 14B

Large (32B-70B)

Qwen3 32B, Qwen3.5 27B, Gemma 4 31B (new April 2026), Qwen 2.5 Coder 32B

MoE

Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Llama 4 Scout/Maverick, DeepSeek-V3

Code-Specific

Qwen 2.5 Coder (7B/14B/32B), DeepSeek-Coder-V2

⚠️ GPT-OSS Warning

  • 20B version: severe hallucination issues
  • 120B version: impractical without heavy quantization that degrades quality
  • Recommendation: Use Qwen3/3.5 or DeepSeek instead

🚀 7. Step-by-Step Setup

Linux/Mac

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:7b
```

Windows

Download the installer from ollama.com/download/windows, run it, then open a terminal and run `ollama run qwen3:7b`.

Raspberry Pi

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:1b
```
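
Whichever platform you are on, you can verify the install by querying Ollama's local REST API. A minimal check, assuming the server is running on its default port and you pulled qwen3:7b as above:

```bash
# Sanity-check the Ollama server over its local REST API (default port 11434).
# Assumes qwen3:7b has already been pulled as shown above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:7b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```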

📦 8. Offline Transfer

Transfer models via USB: copy ~/.ollama/models/ to USB, set OLLAMA_MODELS=/path/to/usb/models on target machine.
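A minimal sketch of that workflow, assuming the USB drive is mounted at /mnt/usb (paths are illustrative):

```bash
# Source machine: copy the Ollama model store to the USB drive
# (mounted here at /mnt/usb; adjust for your system).
cp -r ~/.ollama/models /mnt/usb/models

# Target machine: point Ollama at the copied store.
# OLLAMA_MODELS must be visible to the *server* process
# (on Linux with systemd, set it in the service unit).
export OLLAMA_MODELS=/mnt/usb/models
ollama list          # should show the transferred models
```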

For GGUF files: download them from HuggingFace and use them directly with llama.cpp, or import them into LM Studio.


⚡ 10. Optimization

Quantization Quick Guide

  • Q4_K_M — the gold standard; start here
  • Q5_K_M — if you have spare RAM/VRAM
  • Q3_K_M — when RAM is very limited
  • FP16 — only for professional GPUs
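
A rough sizing rule: file size ≈ parameters × bits per weight / 8, where Q4_K_M averages roughly 4.5-4.8 bits per weight (an approximation; actual GGUF sizes vary slightly). For example:

```bash
# Back-of-envelope size for a 7B model at Q4_K_M (~4.8 bits/weight, approximate).
awk 'BEGIN { printf "7B @ Q4_K_M = %.1f GB approx.\n", 7e9 * 4.8 / 8 / 1e9 }'
# Prints ~4.2 GB; leave headroom for the KV cache and runtime overhead.
```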

Key Settings

  • Flash Attention: OLLAMA_FLASH_ATTENTION=1
  • KV Cache Quantization for longer contexts
  • Speculative Decoding for 2-3x speed boost
  • Qwen3.5 supports Multi-Token Prediction (MTP)
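
For Ollama, the first two items are environment variables; a minimal sketch (note that KV cache quantization requires flash attention to be enabled):

```bash
# Set these in the environment of the Ollama *server* process.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # options: f16 (default), q8_0, q4_0
ollama serve
```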

🔍 11. RAG & Advanced Techniques

Tools: Open WebUI (built-in RAG), AnythingLLM, PrivateGPT, Kotaemon

IDE Integration: Continue.dev (VS Code/JetBrains), Cline (VS Code), Aider (CLI)
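
As one concrete offline-friendly path, Open WebUI installs via pip (it requires Python 3.11) and auto-detects a local Ollama server; a minimal sketch:

```bash
# Install and launch Open WebUI against a local Ollama backend.
# Its built-in RAG lets you upload documents and chat over them.
pip install open-webui
open-webui serve           # then browse to http://localhost:8080
```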


📊 15. Quick Reference Tables

Quantization Selection

| Situation | Use |
|---|---|
| Plenty of VRAM/RAM | Q5_K_M |
| Best balance | Q4_K_M (default) |
| Very low RAM | Q3_K_M |
| Professional GPU | FP16 or Q8_0 |

❓ 16. FAQ

Q: Minimum hardware to start? A: Laptop with 8GB RAM. Phi-4 Mini 3.8B Q4 runs at 5-10 tok/s on CPU.

Q: Worth it without a GPU? A: Yes! 3-7B models work well on CPU. Slower but fully usable.

Q: Which model for multilingual? A: Qwen3 and Qwen3.5 support 200+ languages.

Q: Does AMD GPU work? A: Yes, with ROCm on Linux. Ollama supports ROCm 7.


📜 License

This project is licensed under CC BY-SA 4.0.


Built with ❤️ by the community

Arman-g7 | April 2026
