AVA is a local-first AI lab built under extreme hardware constraints: a single 4 GB VRAM laptop GPU (NVIDIA RTX A2000), no cloud budget, no large-cluster training. AVA v2 is the first released model from that work: a QLoRA fine-tune of Qwen3.5-2B that achieves 79% on ARC-Challenge and 48% on GSM8K while training and running inference in under 2 GB of VRAM.
But the repo is broader than one released adapter. AVA is the full stack for building local AI from the ground up: scratch architectures, tokenizers, retrieval and memory systems, system-prompt and routing controls, distillation, post-training, evaluation, and fast local serving.
The entire training pipeline, evaluation harness, experiment history, and model weights are open. This README now documents both the released model and the larger AVA research track.
AVA currently spans four layers of work:
- ground-up model building: compact scratch models, recurrent-depth variants, tokenizer work, and training infrastructure in `src/ava`
- behavior shaping: tool use, compliance, memory transfer, retrieval, and prompt/routing control
- post-training: QLoRA fine-tuning, teacher distillation, corpus curation, and evaluation on stronger base models
- local serving systems: quantization, long-context engineering, hybrid CPU/GPU runtimes, and fast/deep model routing
If you want the shortest summary: AVA is a personal local AI repository for building, training, distilling, steering, and serving useful AI systems on consumer hardware.
```mermaid
flowchart LR
    A["Exp1-3<br/>Scratch AVA systems"] --> B["Exp4<br/>Qwen3.5-2B QLoRA"]
    B --> C["Exp5A<br/>Gemma 4 26B feasibility"]
    C --> D["Exp5B<br/>Gemma 4 E4B deep branch"]
    D --> E["Exp5C<br/>Gemma 4 E2B fast branch"]
    E --> F["Two-tier local runtime<br/>fast + deep routing"]
```
| Version | Core challenge | What we built | Best result / current status | What remains |
|---|---|---|---|---|
| Exp1-3 scratch AVA systems | Prove useful AI behavior can emerge on a tiny local model | 11M-99M scratch configs, byte/tokenizer experiments, retrieval ensembles, memory-transfer flows, tool/compliance patches | scratch baseline hit 24% ARC; support-bank retrieval reached 91/299 ARC; memory-transfer stress suites reached 87/87 | scale the strongest ideas into better tokenizers, recurrent-depth variants, stronger post-training, and distillation targets |
| Exp4 / AVA-v1 | Prove QLoRA works on 4 GB VRAM with a real open base model | Qwen3.5-2B NF4 loading, LoRA pipeline, evaluation harness, agentic harness | 66% ARC, 40% GSM8K on the fast v1 run | stronger corpus, cleaner alignment, better reasoning coverage |
| Exp4 / AVA-v2 | Turn AVA into a released local assistant | curated 20K corpus, Triton/SDPA training speedups, full eval pipeline, HF model card, GGUF export path | 79% ARC, 48% GSM8K, 42 MB adapter | longer context, stronger math, RL/post-training, better tool specialization, student distillation |
| Exp5 / 26B feasibility | Make Gemma 4 26B-A4B run on 32 GB RAM / 4 GB VRAM | streamed int4 loader, TurboQuant bit-packing, MoE offload, cached reload, YaRN experiments | cached reload about 8.6 s; warm exact decode improved to about 0.45-0.50 tok/s | still too slow for default use; keep as systems research branch |
| Exp5 / E4B deep branch | Keep Gemma 4 intelligence practical on the same laptop | bf16 manual split, TurboQuant-enabled stack, YaRN at practical 512K, deep routing path | about 0.79-0.85 tok/s decode and 2.1-2.2 tok/s total on best runs | reduce deep-branch latency and switching cost |
| Exp5 / E2B fast branch + two-tier runtime | Make local chat actually feel fast without giving up a deeper path | E2B fast path, llama.cpp backend, explicit `quick:` / `deep:` / `reason:` controls, lazy deep escalation to E4B | best fast-path result: about 6.39 tok/s decode and 17.74 tok/s total | unify fast/deep UX further and reduce branch escalation overhead |
Experiment logs:
- scratch and fine-tune history: `experiments/exp4_finetune/EXPERIMENT_LOG.md`
- fine-tune report: `experiments/exp4_finetune/RESULTS_REPORT.md`
- Gemma 4 checkpoint log: `experiments/exp5_gemma4/PROGRESS_LOG.md`
- Gemma 4 results write-up: `experiments/exp5_gemma4/RESULTS.md`
```mermaid
---
config:
  xyChart:
    width: 700
    height: 350
---
xychart-beta
    title "Best Measured Local Decode Speed by Branch"
    x-axis ["26B exact", "E4B deep", "E2B Transformers", "E2B llama.cpp"]
    y-axis "tok/s" 0 --> 7
    bar [0.50, 0.85, 2.13, 6.39]
```
| Branch | Role | Practical context target | Best measured decode speed | Notes |
|---|---|---|---|---|
| AVA v2 / Qwen3.5-2B | released fine-tuned assistant | base-model limits | benchmarked via adapter / GGUF | strongest released training result so far |
| Gemma 4 26B-A4B | feasibility and systems research | 256K -> 1M experimental | about 0.50 tok/s warm | proves fit, not speed |
| Gemma 4 E4B | deep branch | 512K practical, 1M experimental | about 0.85 tok/s | dense deep-path model |
| Gemma 4 E2B | fast branch | 512K practical | about 6.39 tok/s via llama.cpp | current recommended fast local path |
Download a GGUF file from Releases or HuggingFace, then:
```shell
ollama create ava-v2 -f Modelfile
ollama run ava-v2
```

Works on CPU, Apple Silicon, AMD GPUs, and NVIDIA GPUs. No Python environment required.
```shell
# Install
pip install -e .[bench]
pip install peft

# Chat (downloads from HuggingFace automatically)
python scripts/chat.py

# Single question
python scripts/chat.py --prompt "Explain why ice floats on water."

# Use a local adapter
python scripts/chat.py --adapter ./experiments/exp4_finetune/models/AVA-v2
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B", quantization_config=bnb_config,
    device_map="auto", dtype=torch.bfloat16, attn_implementation="sdpa",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = PeftModel.from_pretrained(model, "NAME0x0/AVA-v2")
model = model.merge_and_unload()

messages = [{"role": "user", "content": "Explain why ice floats on water."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Requires: Python 3.10+, NVIDIA GPU with 4+ GB VRAM, CUDA support.
| Benchmark | Qwen3.5-2B Base | AVA v1 (5K SFT) | AVA v2 (20K SFT) | Improvement vs Base |
|---|---|---|---|---|
| ARC-Challenge | 66.0% | 66.0% | 79.0% | +13.0pp |
| GSM8K | 28.0% | 40.0% | 48.0% | +20.0pp |
| Metric | AVA v1 | AVA v2 |
|---|---|---|
| Training corpus | 5,237 examples | 20,741 examples |
| Final train loss | 1.0185 | 0.4145 |
| Training time | 251 min | 100.5 min |
| Trainable parameters | 10,911,744 (0.58% of 1.89B) | 10,911,744 (0.58% of 1.89B) |
| Peak VRAM usage | 1.81 GB | 1.81 GB |
| Steps/second | 0.04 | 0.43 |
| Effective batch size | 8 | 8 |
| Learning rate | 2e-4 (cosine) | 1.5e-4 (cosine) |
| LoRA rank | 16 | 16 |
| LoRA alpha | 32 | 32 |
| Max sequence length | 384 tokens | 384 tokens |
| Epochs | 1 | 1 |
AVA v2 trained 10.7x faster than v1 per step thanks to Triton kernel compilation for SDPA attention. The 4x larger corpus with augmented science and reasoning data was the key driver behind the ARC breakthrough (v1 showed zero ARC improvement over base).
All scores from official model cards and technical reports. Evaluation protocols vary by source (shot count, prompting). AVA v2 scores are 0-shot.
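For reference, 0-shot multiple-choice benchmarks like ARC are commonly scored by length-normalized log-likelihood over the answer options. A minimal sketch of that scoring rule (a generic illustration, not necessarily how the repo's `benchmark_full.py` scores):

```python
def option_score(token_logprobs):
    """Length-normalized log-likelihood of one answer option's tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_answer(option_logprobs):
    """option_logprobs: {label: [per-token log-probs under the model]}.
    Returns the label whose continuation the model finds most likely."""
    return max(option_logprobs, key=lambda label: option_score(option_logprobs[label]))

# Toy example: option B has the higher average log-prob, so it is chosen.
chosen = pick_answer({"A": [-2.1, -1.9], "B": [-0.7, -1.1]})
```

Accuracy is then the fraction of questions where `pick_answer` matches the gold label; prompting and normalization details vary across harnesses, which is why cross-source comparisons need care.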
```mermaid
---
config:
  xyChart:
    width: 800
    height: 400
---
xychart-beta
    title "ARC-Challenge Accuracy by Model"
    x-axis ["TinyLlama 1.1B", "SmolLM2 1.7B", "Gemma2 2B", "Qwen2.5 1.5B", "Mistral 7B", "Llama3.2 1B-IT", "Llama3.2 3B", "Llama3.2 3B-IT", "AVA v2 2B", "Phi-4-mini 3.8B", "Phi-3.5-mini 3.8B"]
    y-axis "Accuracy (%)" 0 --> 100
    bar [30.1, 52, 55.7, 54.7, 55.5, 59.4, 69.1, 78.6, 79, 83.7, 84.6]
```
```mermaid
---
config:
  xyChart:
    width: 800
    height: 400
---
xychart-beta
    title "GSM8K Accuracy by Model"
    x-axis ["TinyLlama 1.1B", "Gemma2 2B", "Qwen3.5 2B Base", "Llama3.2 1B-IT", "AVA v2 2B", "SmolLM2 1.7B", "Mistral 7B", "Qwen2.5 1.5B", "Llama3.2 3B-IT", "Qwen2.5 3B", "Phi-3.5 3.8B", "Phi-4 3.8B"]
    y-axis "Accuracy (%)" 0 --> 100
    bar [2, 24.3, 28, 44.4, 48, 48.2, 52.2, 68.5, 77.7, 79.1, 86.2, 88.6]
```
Key takeaways:
- AVA v2's 79% ARC-Challenge at 2B parameters exceeds Llama 3.2 3B-Instruct (78.6% at 3B) and beats Mistral-7B (55.5% at 7B) by 23.5 percentage points
- On GSM8K, AVA v2 reaches 48% -- competitive with SmolLM2-1.7B-Instruct (48.2%) and ahead of Llama 3.2 1B-Instruct (44.4%)
- The ARC result is particularly notable because it was achieved with a 42 MB LoRA adapter, not a full model retrain
AVA v2 loss trajectory over 2,593 training steps:
```text
Step   Loss    LR
  20   1.118   1.47e-5
 100   1.072   5.85e-5
 300   1.046   1.09e-4
 500   1.030   1.39e-4
 700   1.057   1.49e-4
1000   1.002   1.43e-4
1500   0.954   1.12e-4
2000   0.942   6.50e-5
2260   0.937   3.68e-5   <- all-time low
2500   0.971   5.17e-7
2593   0.414   0.00e+0   <- final (epoch average)
```
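The logged learning rates trace the standard warmup-then-cosine shape: a short linear ramp, then cosine decay to zero by the final step. A minimal sketch of that shape (the warmup length here is a guess; only the 1.5e-4 peak and 2,593-step total come from the run, and the HF scheduler's exact values differ):

```python
import math

def lr_at(step, total_steps=2593, peak=1.5e-4, warmup_steps=260):
    """Cosine LR schedule with linear warmup (warmup length is an assumption)."""
    if step < warmup_steps:
        return peak * step / warmup_steps          # linear ramp to the peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0
```

The schedule starts at zero, peaks at the end of warmup, and decays monotonically afterward, matching the qualitative trajectory in the table.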
Most AI progress in 2025-2026 happens on clusters with hundreds or thousands of GPUs. AVA asks a different question: what can you build with a single laptop GPU and no budget?
The answer is more than most people expect:
- 79% ARC-Challenge at 2B params beats Llama 3.2 3B-Instruct (78.6% at 3B) and Mistral 7B (55.5%) -- models that required orders of magnitude more compute to train
- The entire adapter is 42 MB. The full training run uses 1.81 GB VRAM and finishes in 100 minutes
- Nothing here requires special hardware, cloud access, or corporate resources. Anyone with a modern laptop can reproduce these results
This matters because:
- **Democratization is real, not theoretical.** Most "democratize AI" projects still require cloud GPUs. AVA trains and runs on hardware that students, researchers in developing countries, and hobbyists already own.
- **Data quality dominates compute.** AVA v1 (5K examples) showed zero ARC improvement. AVA v2 (20K curated examples) jumped +13pp. The difference was not more compute -- it was better data. This validates the emerging consensus from Phi-4, LIMO, and DeepSeek that careful data curation can substitute for scale.
- **QLoRA makes fine-tuning accessible.** Training 0.58% of parameters in 4-bit precision means a 2B model fits in under 2 GB. This opens a path where anyone can specialize a frontier-class base model for their domain without touching a cloud console.
- **The research is reproducible.** Every script, corpus recipe, config file, and evaluation harness is in this repo. The model card has exact versions of every dependency. Run the scripts, get the numbers.
AVA is not trying to compete with GPT-4 or Claude. It is proving that meaningful AI capability -- strong science reasoning, solid math, reliable tool use -- can emerge from constraints that would have been considered impossible two years ago.
AVA v2 is a QLoRA (Quantized Low-Rank Adaptation) fine-tune of Qwen/Qwen3.5-2B:
- Base model: Qwen3.5-2B (1.89B parameters), loaded in 4-bit NF4 quantization via BitsAndBytes
- Adapter: LoRA rank 16, alpha 32, applied to all attention and MLP projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`)
- Trainable parameters: 10.9M out of 1.89B total (0.58%)
- Adapter size: 42 MB (safetensors format)
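For intuition on where the 10.9M figure comes from: LoRA adds two low-rank factors per targeted weight matrix W (shape d_out x d_in), contributing r * (d_in + d_out) trainable parameters each. A minimal sketch of the arithmetic (the per-layer shapes below are illustrative, not Qwen3.5-2B's actual dimensions):

```python
def lora_param_count(target_shapes, rank):
    """Trainable parameters added by LoRA: rank * (d_in + d_out) per matrix."""
    return sum(rank * (d_in + d_out) for (d_out, d_in) in target_shapes)

# Hypothetical dims for one transformer layer (real models often use
# smaller KV projection widths; this is only to show the mechanics).
hidden, ffn = 2048, 8192
layer = [
    (hidden, hidden),  # q_proj
    (hidden, hidden),  # k_proj
    (hidden, hidden),  # v_proj
    (hidden, hidden),  # o_proj
    (ffn, hidden),     # gate_proj
    (ffn, hidden),     # up_proj
    (hidden, ffn),     # down_proj
]
per_layer = lora_param_count(layer, rank=16)
```

Multiplying a per-layer count like this by the layer count is why a rank-16 adapter over seven projections lands in the ~10M range for a ~2B model.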
The v2 corpus contains 20,741 prompt-response pairs across:
- Math reasoning (GSM8K-style step-by-step solutions)
- Science comprehension (ARC, SciQ, OpenBookQA-style)
- General instruction following
- Tool use and code generation
- Augmented with teacher-distilled examples for harder reasoning chains
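A mix like this is easy to assemble and round-trip as prompt-response JSONL; a minimal sketch (the `prompt`/`response` field names match the training format described later, while the `category` tag is a hypothetical extra):

```python
import json
import os
import tempfile

def write_corpus(records, path):
    """Write prompt/response pairs as one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            assert "prompt" in rec and "response" in rec  # required fields
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_corpus(path):
    """Read a JSONL corpus back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip a tiny two-example mix.
recs = [
    {"prompt": "2+2?", "response": "4", "category": "math"},
    {"prompt": "Why is the sky blue?", "response": "Rayleigh scattering of sunlight.", "category": "science"},
]
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
write_corpus(recs, path)
loaded = load_corpus(path)
```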
- GPU: NVIDIA RTX A2000 Laptop (4 GB VRAM, Ampere GA107, compute capability 8.6)
- Training VRAM: 1.81 GB peak
- Inference VRAM: 1.74 GB
- All training, evaluation, and inference run on a single consumer laptop
- Python 3.10+
- NVIDIA GPU with CUDA support (4 GB+ VRAM)
- Visual Studio with C++ Build Tools (for Triton kernel compilation on Windows)
```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
pip install transformers==5.3.0 peft==0.18.1 bitsandbytes==0.49.2 datasets accelerate
```

On Linux, Triton installs normally. On Windows:

```shell
pip install triton-windows==3.6.0.post26
```

Triton requires a C compiler. Set the `CC` environment variable to your MSVC `cl.exe` path:

```shell
# Find your cl.exe path (example for VS 2022/2026):
# C:\Program Files\Microsoft Visual Studio\18\Community\VC\Tools\MSVC\14.51.36014\bin\Hostx64\x64\cl.exe
set CC="C:\Program Files\Microsoft Visual Studio\18\Community\VC\Tools\MSVC\<version>\bin\Hostx64\x64\cl.exe"
```

```shell
pip install flash-linear-attention==0.4.2
```

FLA requires causal-conv1d. On Windows, use the patched fork:

```shell
git clone https://github.com/sdbds/causal-conv1d-for-windows
cd causal-conv1d-for-windows
# Build with MSVC preprocessor fix and target your GPU arch:
pip install . --no-build-isolation
```

Important: When FLA is installed, you must set `attn_implementation="sdpa"` in `AutoModelForCausalLM.from_pretrained()` to avoid FLA's weight restructuring, which is incompatible with BitsAndBytes 4-bit quantized weights.
```shell
# Download Qwen3.5-2B (or use huggingface_hub)
python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; AutoTokenizer.from_pretrained('Qwen/Qwen3.5-2B'); AutoModelForCausalLM.from_pretrained('Qwen/Qwen3.5-2B')"
```

The training corpus is a JSONL file with `prompt` and `response` fields:

```json
{"prompt": "What causes tides on Earth?", "response": "Tides are primarily caused by the gravitational pull of the Moon..."}
```

```shell
python -u experiments/exp4_finetune/scripts/finetune_v2_full.py > training.log 2>&1
```

Key training configuration:
```python
# BitsAndBytes 4-bit quantization
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Model loading (SDPA required when FLA is installed)
AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
)

# Training arguments
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    eval_strategy="no",  # Eval OOMs on 4 GB VRAM (248K vocab)
)
```

```shell
python -u experiments/exp4_finetune/scripts/benchmark_full.py \
    --adapter experiments/exp4_finetune/models/Qwen3.5-2B-AVA-v2 \
    --arc-limit 100 --gsm8k-limit 50
```

- QLoRA over scratch training: Our previous 14M scratch model hit 24% ARC and 0% GSM8K. Fine-tuning a 2B model immediately reached a 66%/28% baseline, then 79%/48% after SFT.
- Corpus scale matters more than epochs: v1 (5K examples, 1 epoch) showed no ARC improvement. v2 (20K examples, 1 epoch) jumped +13pp. More diverse data beat repeated passes.
- Triton kernel compilation: Installing Triton on Windows for SDPA attention kernels gave a 10.7x speedup (25s/step to 5.8s/step), making the full 20K corpus trainable in 100 minutes.
- Checkpoint resume: HuggingFace Trainer's checkpoint resume saved hours of work across laptop cooldown breaks and crashes.
- FLA with BitsAndBytes: Flash-Linear-Attention tries to restructure attention weights (merging q/k/v into combined projections) which crashes on BnB 4-bit quantized tensors. Workaround: force SDPA mode.
- Inline evaluation: The 248K vocabulary of Qwen3.5 means `logits.float()` during eval OOMs on 4 GB. Evaluation must run as a separate step after training.
- Unsloth on Windows: Unsloth's fast kernels require Linux. We used vanilla HuggingFace Trainer with manual freeze and gradient checkpointing instead.
- Triton C compiler: Triton needs the `CC` env var pointing to MSVC `cl.exe`. The bundled TinyCC fallback doesn't work reliably.
- causal-conv1d: Requires a patched Windows fork with the `/Zc:preprocessor` MSVC flag.
- Output buffering: Python on Windows buffers stdout when redirecting to files. Use `python -u` for real-time training logs.
- expandable_segments: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is not supported on Windows but doesn't cause errors (just a warning).
Pre-built GGUF files are available from GitHub Releases and HuggingFace.
| File | Quantization | Size | Quality |
|---|---|---|---|
| AVA-v2-Q4_K_M.gguf | Q4_K_M | ~1.5 GB | Recommended — best size/quality balance |
| AVA-v2-Q8_0.gguf | Q8_0 | ~2.0 GB | Near-lossless |
To build GGUF locally:
```shell
# Merge adapter and save merged model
python scripts/convert_to_gguf.py --merge-only

# Full pipeline (requires llama.cpp):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j$(nproc) && cd ..
python scripts/convert_to_gguf.py --llama-cpp ./llama.cpp --quants Q4_K_M Q8_0

# Use with Ollama:
ollama create ava-v2 -f Modelfile
ollama run ava-v2
```

The GGUF build also runs automatically in CI; trigger it from Actions or publish a GitHub Release.
The current exp5 local-serving work lives in experiments/exp5_gemma4 and is the serving/systems branch of AVA rather than a replacement for AVA v2.
It has split into three tracks:
- 26B feasibility: streamed `26B-A4B` with TurboQuant/YaRN research and cached reload
- deep dense branch: `E4B` on the existing Transformers runtime at a practical `512K` target
- fast branch: `E2B` through `llama.cpp`, currently the best measured local speed path
Current recommendation on the 4 GB VRAM / 32 GB RAM laptop:
- default fast responder: `E2B`
- explicit deep / reasoning escalation: `E4B`
- keep `26B` for feasibility and systems research, not as the default interactive branch
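The explicit `quick:` / `deep:` / `reason:` controls amount to a small prefix router in front of the two branches. A standalone sketch of that dispatch logic (function and branch names are illustrative, not the repo's runtime code):

```python
def route(message, default="fast"):
    """Route a chat message to the fast (E2B) or deep (E4B) branch.

    Explicit prefixes override the default; anything unprefixed stays on the
    fast branch, so deep escalation only happens when requested (lazy).
    """
    prefixes = {
        "quick:": "fast",   # force the E2B fast path
        "deep:": "deep",    # escalate to the E4B deep branch
        "reason:": "deep",  # reasoning requests also go deep
    }
    for prefix, branch in prefixes.items():
        if message.lower().startswith(prefix):
            return branch, message[len(prefix):].strip()
    return default, message
```

This keeps the common case on the cheap path while leaving a one-token escape hatch to the deeper model.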
What remains inside exp5:
- make deep-branch escalation cheaper than today's reload-heavy path
- preserve `512K` as the practical target while keeping `1M` as an experimental branch
- keep TurboQuant and YaRN as research assets without slowing the default fast path
- decide how much of the custom runtime should stay in Python versus move behind `llama.cpp`-style serving
The detailed checkpoint log is in experiments/exp5_gemma4/PROGRESS_LOG.md, and the full experiment write-up is in experiments/exp5_gemma4/RESULTS.md.
Available on HuggingFace: NAME0x0/AVA-v2
The adapter is also stored in this repo in standard PEFT format:
```text
experiments/exp4_finetune/models/AVA-v2/
  adapter_config.json        # LoRA configuration
  adapter_model.safetensors  # 42 MB adapter weights (Git LFS)
  tokenizer.json             # Qwen3.5 tokenizer (Git LFS)
  tokenizer_config.json
  training_report.json       # Training metrics
  README.md                  # HuggingFace model card
```
| Component | Version | Purpose |
|---|---|---|
| Python | 3.13 | Runtime |
| PyTorch | 2.10.0+cu130 | Tensor computation |
| Transformers | 5.3.0 | Model loading, Trainer |
| PEFT | 0.18.1 | LoRA adapter management |
| BitsAndBytes | 0.49.2 | 4-bit NF4 quantization |
| Triton (Windows) | 3.6.0.post26 | GPU kernel compilation |
| Flash-Linear-Attention | 0.4.2 | Qwen3.5 attention backend |
| causal-conv1d | 1.5.0.post8 | FLA dependency (Windows fork) |
| Datasets | latest | HuggingFace dataset handling |
| Accelerate | latest | Device placement |
```text
AVA/
├── src/ava/                      # Core research library (installed as `ava` package)
│   ├── model.py                  # AVA v3 scratch model (14M param GPT-2 variant)
│   ├── train.py                  # Training loop for scratch models
│   ├── rl.py                     # Verifiable reinforcement learning (REINFORCE)
│   ├── config.py                 # Experiment config loader (YAML-based)
│   ├── cli.py                    # CLI entry point (`python -m ava.cli`)
│   ├── external_benchmarks.py    # ARC-Challenge, GSM8K, PIQA evaluation harness
│   ├── corpus_recipes.py         # Corpus materialization from HuggingFace datasets
│   ├── retrieval.py              # Sparse retrieval for support-bank ensembles
│   ├── dense_retrieval.py        # Dense (embedding-based) retrieval
│   ├── memory.py                 # External memory with surprise-gated writes
│   ├── tokenizer.py              # Byte-level tokenizer + HF tokenizer imports
│   └── ...                       # Sessions, activity logging, tools, inspection
│
├── experiments/
│   ├── exp4_finetune/            # Experiment 4: released QLoRA / AVA v2 branch
│   │   ├── scripts/              # Training, benchmarking, corpus building scripts
│   │   │   ├── finetune_v2_full.py   # AVA v2 training script
│   │   │   ├── benchmark_full.py     # ARC + GSM8K evaluation
│   │   │   ├── upload_to_hf.py       # Push adapter to HuggingFace
│   │   │   └── ...                   # Corpus builders, older experiment scripts
│   │   ├── models/AVA-v2/        # Released adapter weights (42 MB)
│   │   ├── EXPERIMENT_LOG.md     # Per-run history for v1/v2 fine-tuning
│   │   └── results/              # Pipeline state and evaluation outputs
│   └── exp5_gemma4/              # Experiment 5: local Gemma 4 serving branch
│       ├── engine/               # Streaming/offload, routing, and serving runtime
│       ├── scripts/              # 26B, E4B, E2B, and two-tier runners
│       ├── configs/              # Practical presets for each local branch
│       ├── PROGRESS_LOG.md       # Checkpoint log for local serving progress
│       └── RESULTS.md            # Detailed write-up of measured outcomes
│
├── scripts/
│   ├── chat.py                   # Interactive chat with AVA v2
│   ├── convert_to_gguf.py        # Merge adapter + convert to GGUF for Ollama
│   └── generate_*.py             # Corpus generation utilities
│
├── Modelfile                     # Ollama model definition (use with GGUF)
│
├── configs/
│   ├── base.yaml                 # Base model configuration
│   └── experiments/              # YAML configs for each experiment variant
│
├── corpora/                      # Training corpora (JSONL format)
│   ├── ava_v2_*/                 # v2 fine-tuning data (open mix, post-train, repair)
│   └── ava_v3_*/                 # v3 scratch training data
│
├── sessions/                     # Experiment tracking (timestamped packets)
│   ├── YYYY-MM-DD-*/             # Session directories with notes, metrics, configs
│   └── activity/                 # CLI command logs and benchmark result JSONs
│
├── docs/                         # Research documentation
│   ├── ARCHITECTURE.md           # System architecture and module design
│   └── RESEARCH_ROADMAP.md       # arXiv paper survey mapped to AVA experiments
│
├── tests/                        # Test suite (pytest)
│   ├── test_model.py             # Model architecture tests
│   ├── test_train.py             # Training pipeline tests
│   ├── test_experiments.py       # Experiment config validation
│   ├── test_external_benchmarks.py  # Benchmark harness tests
│   └── ...                       # Retrieval, memory, tokenizer, corpus tests
│
├── .github/workflows/ci.yml      # CI: lint, test, adapter integrity checks
└── .github/workflows/gguf.yml    # CI: build GGUF, quantize, upload to HF + releases
```
AVA is a research project exploring how far a single-GPU setup can push AI capability. The roadmap reflects real constraints: everything must train and run on 4 GB VRAM or at least degrade gracefully onto the same local machine.
- Scratch-trained 14M model (experiments 1-3): established baselines, proved architectural ideas
- QLoRA fine-tuning pipeline on Qwen3.5-2B (experiment 4)
- AVA v2: 79% ARC-Challenge, 48% GSM8K with 20K curated examples
- Sparse retrieval ensemble for science reasoning (91/299 ARC with support banks)
- Gemma 4 26B local feasibility branch with streamed load, TurboQuant, and cached reload
- Gemma 4 two-tier local runtime with a fast E2B branch and a deeper E4B branch
- Full reproducibility: open weights, open data, open code
- Extended sequence length training (384 -> 1024+ tokens) for longer reasoning chains
- Make the deep local branch cheaper to enter than the current reload-heavy E2B -> E4B switch
- Stabilize `512K` as the practical dense serving target while keeping `1M` experimental
- Post-training with verifiable RL (math and science verification rewards)
- Improved GSM8K through chain-of-thought curriculum
- DPO/RLHF alignment for instruction following quality
- Tool use specialization (calculator, code interpreter)
- Multimodal extension via compact vision encoder (following Penguin-VL approach)
- Structured external memory for continual learning
- Multi-benchmark evaluation: MMLU, HumanEval, TruthfulQA
- Model distillation: compress AVA gains into a smaller student
- A general-purpose assistant that runs entirely on consumer hardware
- Multilingual support starting with Urdu and Arabic
- On-device deployment (mobile/edge) through further quantization
- Community-driven corpus contributions and benchmark extensions
AVA's earlier experiments are preserved in the repo and still matter to the current direction:
- A compact 11M checkpoint with strong internal tool/compliance behavior
- A memory-transfer system achieving 87/87 on stress suites
- A science-first sparse retrieval ensemble reaching 91/299 on ARC-Challenge
- Tokenizer research showing Qwen's tokenizer compresses AVA data at 0.24x byte ratio
These experiments informed the pivot to fine-tuning and later to Gemma 4 serving: the scratch model's 24% ARC ceiling made it clear that parameter count and pre-trained knowledge matter, but the retrieval/memory/tooling work remains part of AVA's long-term architecture rather than discarded prior work.
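The 0.24x figure above is a byte ratio: tokens emitted per UTF-8 byte of input, where lower means better compression. A minimal sketch of how such a ratio is measured (the whitespace tokenizer here is only a stand-in for the real `tokenizer.encode`):

```python
def byte_ratio(text, tokenize):
    """Tokens per UTF-8 byte of input: lower means the tokenizer
    compresses the corpus better."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(tokenize(text))
    return n_tokens / n_bytes

# Stand-in tokenizer; swap in the real Qwen tokenizer's encode method
# to reproduce the 0.24x measurement on AVA's corpora.
ratio = byte_ratio("ice floats because solid water is less dense", str.split)
```

Averaging this per-document ratio over a corpus gives a single compression figure comparable across tokenizers.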
```shell
# Install
python -m pip install -e .[dev,bench]

# Run tests
python -m pytest tests/ -q

# Chat with AVA v2
python scripts/chat.py

# Run benchmarks
python -u experiments/exp4_finetune/scripts/benchmark_full.py \
    --adapter experiments/exp4_finetune/models/Qwen3.5-2B-AVA-v2
```

This project is open source. The AVA v2 adapter weights are released under the same license as the base model (Qwen License).
```bibtex
@misc{ava-v2-2026,
  title={AVA v2: QLoRA Fine-tuning Under Extreme VRAM Constraints},
  author={Muhammad Afsah Mumtaz},
  year={2026},
  url={https://github.com/NAME0x0/AVA}
}
```