AVA

AVA is a local-first AI lab built under extreme hardware constraints: a single 4 GB VRAM laptop GPU (NVIDIA RTX A2000), no cloud budget, no large-cluster training. AVA v2 is the first released model from that work: a QLoRA fine-tune of Qwen3.5-2B that achieves 79% on ARC-Challenge and 48% on GSM8K while training and running inference in under 2 GB of VRAM.

But the repo is broader than one released adapter. AVA is the full stack for building local AI from the ground up: scratch architectures, tokenizers, retrieval and memory systems, system-prompt and routing controls, distillation, post-training, evaluation, and fast local serving.

The entire training pipeline, evaluation harness, experiment history, and model weights are open. This README now documents both the released model and the larger AVA research track.

What AVA Is

AVA currently spans four layers of work:

ground-up model building: compact scratch models, recurrent-depth variants, tokenizer work, and training infrastructure in src/ava
behavior shaping: tool use, compliance, memory transfer, retrieval, and prompt/routing control
post-training: QLoRA fine-tuning, teacher distillation, corpus curation, and evaluation on stronger base models
local serving systems: quantization, long-context engineering, hybrid CPU/GPU runtimes, and fast/deep model routing

If you want the shortest summary: AVA is a personal local AI repository for building, training, distilling, steering, and serving useful AI systems on consumer hardware.

Experiment Map

flowchart LR
    A["Exp1-3<br/>Scratch AVA systems"] --> B["Exp4<br/>Qwen3.5-2B QLoRA"]
    B --> C["Exp5A<br/>Gemma 4 26B feasibility"]
    C --> D["Exp5B<br/>Gemma 4 E4B deep branch"]
    D --> E["Exp5C<br/>Gemma 4 E2B fast branch"]
    E --> F["Two-tier local runtime<br/>fast + deep routing"]

Version	Core challenge	What we built	Best result / current status	What remains
`Exp1-3` scratch AVA systems	Prove useful AI behavior can emerge on a tiny local model	11M-99M scratch configs, byte/tokenizer experiments, retrieval ensembles, memory-transfer flows, tool/compliance patches	scratch baseline hit `24% ARC`; support-bank retrieval reached `91/299 ARC`; memory-transfer stress suites reached `87/87`	scale the strongest ideas into better tokenizers, recurrent-depth variants, stronger post-training, and distillation targets
`Exp4 / AVA-v1`	Prove QLoRA works on `4 GB` VRAM with a real open base model	Qwen3.5-2B NF4 loading, LoRA pipeline, evaluation harness, agentic harness	`66% ARC`, `40% GSM8K` on the fast v1 run	stronger corpus, cleaner alignment, better reasoning coverage
`Exp4 / AVA-v2`	Turn AVA into a released local assistant	curated 20K corpus, Triton/SDPA training speedups, full eval pipeline, HF model card, GGUF export path	`79% ARC`, `48% GSM8K`, `42 MB` adapter	longer context, stronger math, RL/post-training, better tool specialization, student distillation
`Exp5 / 26B feasibility`	Make Gemma 4 `26B-A4B` run on `32 GB RAM / 4 GB VRAM`	streamed int4 loader, TurboQuant bit-packing, MoE offload, cached reload, YaRN experiments	cached reload about `8.6s`; warm exact decode improved to about `0.45-0.50 tok/s`	still too slow for default use; keep as systems research branch
`Exp5 / E4B deep branch`	Keep Gemma 4 intelligence practical on the same laptop	bf16 manual split, TurboQuant-enabled stack, YaRN at practical `512K`, deep routing path	about `0.79-0.85 tok/s` decode and `2.1-2.2 tok/s` total on best runs	reduce deep-branch latency and switching cost
`Exp5 / E2B fast branch + two-tier runtime`	Make local chat actually feel fast without giving up a deeper path	E2B fast path, `llama.cpp` backend, explicit `quick:` / `deep:` / `reason:` controls, lazy deep escalation to E4B	best fast-path result: about `6.39 tok/s` decode and `17.74 tok/s` total	unify fast/deep UX further and reduce branch escalation overhead

Experiment logs:

scratch and fine-tune history: experiments/exp4_finetune/EXPERIMENT_LOG.md
fine-tune report: experiments/exp4_finetune/RESULTS_REPORT.md
Gemma 4 checkpoint log: experiments/exp5_gemma4/PROGRESS_LOG.md
Gemma 4 results write-up: experiments/exp5_gemma4/RESULTS.md

Progress Snapshot

---
config:
    xyChart:
        width: 700
        height: 350
---
xychart-beta
    title "Best Measured Local Decode Speed by Branch"
    x-axis ["26B exact", "E4B deep", "E2B Transformers", "E2B llama.cpp"]
    y-axis "tok/s" 0 --> 7
    bar [0.50, 0.85, 2.13, 6.39]

Branch	Role	Practical context target	Best measured decode speed	Notes
`AVA v2 / Qwen3.5-2B`	released fine-tuned assistant	base-model limits	benchmarked via adapter / GGUF	strongest released training result so far
`Gemma 4 26B-A4B`	feasibility and systems research	`256K -> 1M` experimental	about `0.50 tok/s` warm	proves fit, not speed
`Gemma 4 E4B`	deep branch	`512K` practical, `1M` experimental	about `0.85 tok/s`	dense deep-path model
`Gemma 4 E2B`	fast branch	`512K` practical	about `6.39 tok/s` via `llama.cpp`	current recommended fast local path

Try AVA v2

Option 1: Ollama (easiest — no Python needed)

Download a GGUF file from Releases or HuggingFace, then:

ollama create ava-v2 -f Modelfile
ollama run ava-v2

Works on CPU, Apple Silicon, AMD GPUs, and NVIDIA GPUs. No Python environment required.

Option 2: Python chat script

# Install
pip install -e .[bench]
pip install peft

# Chat (downloads from HuggingFace automatically)
python scripts/chat.py

# Single question
python scripts/chat.py --prompt "Explain why ice floats on water."

# Use a local adapter
python scripts/chat.py --adapter ./experiments/exp4_finetune/models/AVA-v2

Option 3: Python API

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B", quantization_config=bnb_config,
    device_map="auto", dtype=torch.bfloat16, attn_implementation="sdpa",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = PeftModel.from_pretrained(model, "NAME0x0/AVA-v2")
model = model.merge_and_unload()

messages = [{"role": "user", "content": "Explain why ice floats on water."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Requires: Python 3.10+, NVIDIA GPU with 4+ GB VRAM, CUDA support.

Results

AVA v2 Benchmark Scores

Benchmark	Qwen3.5-2B Base	AVA v1 (5K SFT)	AVA v2 (20K SFT)	Improvement vs Base
ARC-Challenge	66.0%	66.0%	79.0%	+13.0pp
GSM8K	28.0%	40.0%	48.0%	+20.0pp

Training Statistics

Metric	AVA v1	AVA v2
Training corpus	5,237 examples	20,741 examples
Final train loss	1.0185	0.4145
Training time	251 min	100.5 min
Trainable parameters	10,911,744 (0.58% of 1.89B)	10,911,744 (0.58% of 1.89B)
Peak VRAM usage	1.81 GB	1.81 GB
Steps/second	0.04	0.43
Effective batch size	8	8
Learning rate	2e-4 (cosine)	1.5e-4 (cosine)
LoRA rank	16	16
LoRA alpha	32	32
Max sequence length	384 tokens	384 tokens
Epochs	1	1

AVA v2 trained 10.7x faster than v1 per step thanks to Triton kernel compilation for SDPA attention. The 4x larger corpus with augmented science and reasoning data was the key driver behind the ARC breakthrough (v1 showed zero ARC improvement over base).

Comparison to Other Models

All scores from official model cards and technical reports. Evaluation protocols vary by source (shot count, prompting). AVA v2 scores are 0-shot.

ARC-Challenge (Science Reasoning)

---
config:
    xyChart:
        width: 800
        height: 400
---
xychart-beta
    title "ARC-Challenge Accuracy by Model"
    x-axis ["TinyLlama 1.1B", "SmolLM2 1.7B", "Gemma2 2B", "Qwen2.5 1.5B", "Mistral 7B", "Llama3.2 1B-IT", "Llama3.2 3B", "Llama3.2 3B-IT", "AVA v2 2B", "Phi-4-mini 3.8B", "Phi-3.5-mini 3.8B"]
    y-axis "Accuracy (%)" 0 --> 100
    bar [30.1, 52, 55.7, 54.7, 55.5, 59.4, 69.1, 78.6, 79, 83.7, 84.6]

GSM8K (Math Reasoning)

---
config:
    xyChart:
        width: 800
        height: 400
---
xychart-beta
    title "GSM8K Accuracy by Model"
    x-axis ["TinyLlama 1.1B", "Gemma2 2B", "Qwen3.5 2B Base", "Llama3.2 1B-IT", "AVA v2 2B", "SmolLM2 1.7B", "Mistral 7B", "Qwen2.5 1.5B", "Llama3.2 3B-IT", "Qwen2.5 3B", "Phi-3.5 3.8B", "Phi-4 3.8B"]
    y-axis "Accuracy (%)" 0 --> 100
    bar [2, 24.3, 28, 44.4, 48, 48.2, 52.2, 68.5, 77.7, 79.1, 86.2, 88.6]

Key takeaways:

AVA v2's 79% ARC-Challenge at 2B parameters exceeds Llama 3.2 3B-Instruct (78.6% at 3B) and beats Mistral-7B (55.5% at 7B) by 23.5 percentage points
On GSM8K, AVA v2 reaches 48% -- competitive with SmolLM2-1.7B-Instruct (48.2%) and ahead of Llama 3.2 1B-Instruct (44.4%)
The ARC result is particularly notable because it was achieved with a 42 MB LoRA adapter, not a full model retrain

Loss Curve

AVA v2 loss trajectory over 2,593 training steps:

Step     Loss     LR
  20     1.118    1.47e-5
 100     1.072    5.85e-5
 300     1.046    1.09e-4
 500     1.030    1.39e-4
 700     1.057    1.49e-4
1000     1.002    1.43e-4
1500     0.954    1.12e-4
2000     0.942    6.50e-5
2260     0.937    3.68e-5  <- all-time low
2500     0.971    5.17e-7
2593     0.414    0.00e+0  <- final (epoch average)

Why This Matters

Most AI progress in 2025-2026 happens on clusters with hundreds or thousands of GPUs. AVA asks a different question: what can you build with a single laptop GPU and no budget?

The answer is more than most people expect:

79% ARC-Challenge at 2B params beats Llama 3.2 3B-Instruct (78.6% at 3B) and Mistral 7B (55.5%) -- models that required orders of magnitude more compute to train
The entire adapter is 42 MB. The full training run uses 1.81 GB VRAM and finishes in 100 minutes
Nothing here requires special hardware, cloud access, or corporate resources. Anyone with a modern laptop can reproduce these results

This matters because:

Democratization is real, not theoretical. Most "democratize AI" projects still require cloud GPUs. AVA trains and runs on hardware that students, researchers in developing countries, and hobbyists already own.
Data quality dominates compute. AVA v1 (5K examples) showed zero ARC improvement. AVA v2 (20K curated examples) jumped +13pp. The difference was not more compute -- it was better data. This validates the emerging consensus from Phi-4, LIMO, and DeepSeek that careful data curation can substitute for scale.
QLoRA makes fine-tuning accessible. Training 0.58% of parameters in 4-bit precision means a 2B model fits in under 2 GB. This opens a path where anyone can specialize a frontier-class base model for their domain without touching a cloud console.
The research is reproducible. Every script, corpus recipe, config file, and evaluation harness is in this repo. The model card has exact versions of every dependency. Run the scripts, get the numbers.

AVA is not trying to compete with GPT-4 or Claude. It is proving that meaningful AI capability -- strong science reasoning, solid math, reliable tool use -- can emerge from constraints that would have been considered impossible two years ago.

How It Works

Architecture

AVA v2 is a QLoRA (Quantized Low-Rank Adaptation) fine-tune of Qwen/Qwen3.5-2B:

Base model: Qwen3.5-2B (1.89B parameters), loaded in 4-bit NF4 quantization via BitsAndBytes
Adapter: LoRA rank 16, alpha 32, applied to all attention and MLP projections (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
Trainable parameters: 10.9M out of 1.89B total (0.58%)
Adapter size: 42 MB (safetensors format)

Training Data

The v2 corpus contains 20,741 prompt-response pairs across:

Math reasoning (GSM8K-style step-by-step solutions)
Science comprehension (ARC, SciQ, OpenBookQA-style)
General instruction following
Tool use and code generation
Augmented with teacher-distilled examples for harder reasoning chains

Hardware

GPU: NVIDIA RTX A2000 Laptop (4 GB VRAM, Ampere GA107, compute capability 8.6)
Training VRAM: 1.81 GB peak
Inference VRAM: 1.74 GB
All training, evaluation, and inference run on a single consumer laptop

Reproducing the Results

Prerequisites

Python 3.10+
NVIDIA GPU with CUDA support (4 GB+ VRAM)
Visual Studio with C++ Build Tools (for Triton kernel compilation on Windows)

Step 1: Install Dependencies

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
pip install transformers==5.3.0 peft==0.18.1 bitsandbytes==0.49.2 datasets accelerate

Step 2: Install Triton (for SDPA kernel acceleration)

On Linux, Triton installs normally. On Windows:

pip install triton-windows==3.6.0.post26

Triton requires a C compiler. Set the CC environment variable to your MSVC cl.exe path:

# Find your cl.exe path (example for VS 2022/2026):
# C:\Program Files\Microsoft Visual Studio\18\Community\VC\Tools\MSVC\14.51.36014\bin\Hostx64\x64\cl.exe
set CC="C:\Program Files\Microsoft Visual Studio\18\Community\VC\Tools\MSVC\<version>\bin\Hostx64\x64\cl.exe"

Step 3: Install Flash-Linear-Attention (optional, for Qwen3.5 fast path)

pip install flash-linear-attention==0.4.2

FLA requires causal-conv1d. On Windows, use the patched fork:

git clone https://github.com/sdbds/causal-conv1d-for-windows
cd causal-conv1d-for-windows
# Build with MSVC preprocessor fix and target your GPU arch:
pip install . --no-build-isolation

Important: When FLA is installed, you must set attn_implementation="sdpa" in AutoModelForCausalLM.from_pretrained() to avoid FLA's weight restructuring which is incompatible with BitsAndBytes 4-bit quantized weights.

Step 4: Download the Base Model

# Download Qwen3.5-2B (or use huggingface_hub)
python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; AutoTokenizer.from_pretrained('Qwen/Qwen3.5-2B'); AutoModelForCausalLM.from_pretrained('Qwen/Qwen3.5-2B')"

Step 5: Prepare Training Data

The training corpus is a JSONL file with prompt and response fields:

{"prompt": "What causes tides on Earth?", "response": "Tides are primarily caused by the gravitational pull of the Moon..."}

Step 6: Train

python -u experiments/exp4_finetune/scripts/finetune_v2_full.py > training.log 2>&1

Key training configuration:

# BitsAndBytes 4-bit quantization
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Model loading (SDPA required when FLA is installed)
AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
)

# Training arguments
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    eval_strategy="no",  # Eval OOMs on 4GB VRAM (248K vocab)
)

Step 7: Benchmark

python -u experiments/exp4_finetune/scripts/benchmark_full.py \
    --adapter experiments/exp4_finetune/models/Qwen3.5-2B-AVA-v2 \
    --arc-limit 100 --gsm8k-limit 50

Lessons Learned

What worked

QLoRA over scratch training: Our previous 14M scratch model hit 24% ARC and 0% GSM8K. Fine-tuning a 2B model immediately reached 66%/28% baseline, then 79%/48% after SFT.
Corpus scale matters more than epochs: v1 (5K examples, 1 epoch) showed no ARC improvement. v2 (20K examples, 1 epoch) jumped +13pp. More diverse data beat repeated passes.
Triton kernel compilation: Installing Triton on Windows for SDPA attention kernels gave a 10.7x speedup (25s/step to 5.8s/step), making the full 20K corpus trainable in 100 minutes.
Checkpoint resume: HuggingFace Trainer's checkpoint resume saved hours of work across laptop cooldown breaks and crashes.

What didn't work

FLA with BitsAndBytes: Flash-Linear-Attention tries to restructure attention weights (merging q/k/v into combined projections) which crashes on BnB 4-bit quantized tensors. Workaround: force SDPA mode.
Inline evaluation: The 248K vocabulary of Qwen3.5 means logits.float() during eval OOMs on 4 GB. Evaluation must run as a separate step after training.
Unsloth on Windows: Unsloth's fast kernels require Linux. We used vanilla HuggingFace Trainer with manual freeze and gradient checkpointing instead.

Windows-specific issues

Triton C compiler: Triton needs CC env var pointing to MSVC cl.exe. The bundled TinyCC fallback doesn't work reliably.
causal-conv1d: Requires a patched Windows fork with /Zc:preprocessor MSVC flag.
Output buffering: Python on Windows buffers stdout when redirecting to files. Use python -u for real-time training logs.
expandable_segments: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is not supported on Windows but doesn't cause errors (just a warning).

Model Weights

GGUF for Ollama / llama.cpp

Pre-built GGUF files are available from GitHub Releases and HuggingFace.

File	Quantization	Size	Quality
AVA-v2-Q4_K_M.gguf	Q4_K_M	~1.5 GB	Recommended — best size/quality balance
AVA-v2-Q8_0.gguf	Q8_0	~2.0 GB	Near-lossless

To build GGUF locally:

# Merge adapter and save merged model
python scripts/convert_to_gguf.py --merge-only

# Full pipeline (requires llama.cpp):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j$(nproc) && cd ..
python scripts/convert_to_gguf.py --llama-cpp ./llama.cpp --quants Q4_K_M Q8_0

# Use with Ollama:
ollama create ava-v2 -f Modelfile
ollama run ava-v2

The GGUF build also runs automatically in CI — trigger it from Actions or publish a GitHub Release.

Experiment 5: Practical Local Gemma 4 Serving

The current exp5 local-serving work lives in experiments/exp5_gemma4 and is the serving/systems branch of AVA rather than a replacement for AVA v2.

It has split into three tracks:

26B feasibility: streamed 26B-A4B with TurboQuant/YaRN research and cached reload
deep dense branch: E4B on the existing Transformers runtime at a practical 512K target
fast branch: E2B through llama.cpp, currently the best measured local speed path

Current recommendation on the 4 GB VRAM / 32 GB RAM laptop:

default fast responder: E2B
explicit deep / reasoning escalation: E4B
keep 26B for feasibility and systems research, not as the default interactive branch

What remains inside exp5:

make deep-branch escalation cheaper than today's reload-heavy path
preserve 512K as the practical target while keeping 1M as an experimental branch
keep TurboQuant and YaRN as research assets without slowing the default fast path
decide how much of the custom runtime should stay in Python versus move behind llama.cpp-style serving

The detailed checkpoint log is in experiments/exp5_gemma4/PROGRESS_LOG.md, and the full experiment write-up is in experiments/exp5_gemma4/RESULTS.md.

LoRA Adapter (42 MB)

Available on HuggingFace: NAME0x0/AVA-v2

The adapter is also stored in this repo in standard PEFT format:

experiments/exp4_finetune/models/AVA-v2/
  adapter_config.json       # LoRA configuration
  adapter_model.safetensors # 42 MB adapter weights (Git LFS)
  tokenizer.json            # Qwen3.5 tokenizer (Git LFS)
  tokenizer_config.json
  training_report.json      # Training metrics
  README.md                 # HuggingFace model card

Software Stack

Component	Version	Purpose
Python	3.13	Runtime
PyTorch	2.10.0+cu130	Tensor computation
Transformers	5.3.0	Model loading, Trainer
PEFT	0.18.1	LoRA adapter management
BitsAndBytes	0.49.2	4-bit NF4 quantization
Triton (Windows)	3.6.0.post26	GPU kernel compilation
Flash-Linear-Attention	0.4.2	Qwen3.5 attention backend
causal-conv1d	1.5.0.post8	FLA dependency (Windows fork)
Datasets	latest	HuggingFace dataset handling
Accelerate	latest	Device placement

Repository Layout

AVA/
├── src/ava/                    # Core research library (installed as `ava` package)
│   ├── model.py                #   AVA v3 scratch model (14M param GPT-2 variant)
│   ├── train.py                #   Training loop for scratch models
│   ├── rl.py                   #   Verifiable reinforcement learning (REINFORCE)
│   ├── config.py               #   Experiment config loader (YAML-based)
│   ├── cli.py                  #   CLI entry point (`python -m ava.cli`)
│   ├── external_benchmarks.py  #   ARC-Challenge, GSM8K, PIQA evaluation harness
│   ├── corpus_recipes.py       #   Corpus materialization from HuggingFace datasets
│   ├── retrieval.py            #   Sparse retrieval for support-bank ensembles
│   ├── dense_retrieval.py      #   Dense (embedding-based) retrieval
│   ├── memory.py               #   External memory with surprise-gated writes
│   ├── tokenizer.py            #   Byte-level tokenizer + HF tokenizer imports
│   └── ...                     #   Sessions, activity logging, tools, inspection
│
├── experiments/
│   ├── exp4_finetune/          # Experiment 4: released QLoRA / AVA v2 branch
│   │   ├── scripts/            #   Training, benchmarking, corpus building scripts
│   │   │   ├── finetune_v2_full.py     # AVA v2 training script
│   │   │   ├── benchmark_full.py       # ARC + GSM8K evaluation
│   │   │   ├── upload_to_hf.py         # Push adapter to HuggingFace
│   │   │   └── ...                     # Corpus builders, older experiment scripts
│   │   ├── models/AVA-v2/      #   Released adapter weights (42 MB)
│   │   ├── EXPERIMENT_LOG.md   #   Per-run history for v1/v2 fine-tuning
│   │   └── results/            #   Pipeline state and evaluation outputs
│   └── exp5_gemma4/            # Experiment 5: local Gemma 4 serving branch
│       ├── engine/             #   Streaming/offload, routing, and serving runtime
│       ├── scripts/            #   26B, E4B, E2B, and two-tier runners
│       ├── configs/            #   Practical presets for each local branch
│       ├── PROGRESS_LOG.md     #   Checkpoint log for local serving progress
│       └── RESULTS.md          #   Detailed write-up of measured outcomes
│
├── scripts/
│   ├── chat.py                 # Interactive chat with AVA v2
│   ├── convert_to_gguf.py      # Merge adapter + convert to GGUF for Ollama
│   └── generate_*.py           # Corpus generation utilities
│
├── Modelfile                   # Ollama model definition (use with GGUF)
│
├── configs/
│   ├── base.yaml               # Base model configuration
│   └── experiments/            # YAML configs for each experiment variant
│
├── corpora/                    # Training corpora (JSONL format)
│   ├── ava_v2_*/               #   v2 fine-tuning data (open mix, post-train, repair)
│   └── ava_v3_*/               #   v3 scratch training data
│
├── sessions/                   # Experiment tracking (timestamped packets)
│   ├── YYYY-MM-DD-*/           #   Session directories with notes, metrics, configs
│   └── activity/               #   CLI command logs and benchmark result JSONs
│
├── docs/                       # Research documentation
│   ├── ARCHITECTURE.md         #   System architecture and module design
│   └── RESEARCH_ROADMAP.md     #   arXiv paper survey mapped to AVA experiments
│
├── tests/                      # Test suite (pytest)
│   ├── test_model.py           #   Model architecture tests
│   ├── test_train.py           #   Training pipeline tests
│   ├── test_experiments.py     #   Experiment config validation
│   ├── test_external_benchmarks.py  # Benchmark harness tests
│   └── ...                     #   Retrieval, memory, tokenizer, corpus tests
│
├── .github/workflows/ci.yml   # CI: lint, test, adapter integrity checks
└── .github/workflows/gguf.yml # CI: build GGUF, quantize, upload to HF + releases

Roadmap

AVA is a research project exploring how far a single-GPU setup can push AI capability. The roadmap reflects real constraints: everything must train and run on 4 GB VRAM or at least degrade gracefully onto the same local machine.

Completed

Scratch-trained 14M model (experiments 1-3): established baselines, proved architectural ideas
QLoRA fine-tuning pipeline on Qwen3.5-2B (experiment 4)
AVA v2: 79% ARC-Challenge, 48% GSM8K with 20K curated examples
Sparse retrieval ensemble for science reasoning (91/299 ARC with support banks)
Gemma 4 26B local feasibility branch with streamed load, TurboQuant, and cached reload
Gemma 4 two-tier local runtime with a fast E2B branch and a deeper E4B branch
Full reproducibility: open weights, open data, open code

In Progress

Extended sequence length training (384 -> 1024+ tokens) for longer reasoning chains
Make the deep local branch cheaper to enter than the current reload-heavy E2B -> E4B switch
Stabilize 512K as the practical dense serving target while keeping 1M experimental
Post-training with verifiable RL (math and science verification rewards)
Improved GSM8K through chain-of-thought curriculum

Planned

DPO/RLHF alignment for instruction following quality
Tool use specialization (calculator, code interpreter)
Multimodal extension via compact vision encoder (following Penguin-VL approach)
Structured external memory for continual learning
Multi-benchmark evaluation: MMLU, HumanEval, TruthfulQA
Model distillation: compress AVA gains into a smaller student

Long-Term Vision

A general-purpose assistant that runs entirely on consumer hardware
Multilingual support starting with Urdu and Arabic
On-device deployment (mobile/edge) through further quantization
Community-driven corpus contributions and benchmark extensions

Prior Work

AVA's earlier experiments are preserved in the repo and still matter to the current direction:

A compact 11M checkpoint with strong internal tool/compliance behavior
A memory-transfer system achieving 87/87 on stress suites
A science-first sparse retrieval ensemble reaching 91/299 on ARC-Challenge
Tokenizer research showing Qwen's tokenizer compresses AVA data at 0.24x byte ratio

These experiments informed the pivot to fine-tuning and later to Gemma 4 serving: the scratch model's 24% ARC ceiling made it clear that parameter count and pre-trained knowledge matter, but the retrieval/memory/tooling work remains part of AVA's long-term architecture rather than discarded prior work.

Quick Start

# Install
python -m pip install -e .[dev,bench]

# Run tests
python -m pytest tests/ -q

# Chat with AVA v2
python scripts/chat.py

# Run benchmarks
python -u experiments/exp4_finetune/scripts/benchmark_full.py \
    --adapter experiments/exp4_finetune/models/Qwen3.5-2B-AVA-v2

License

This project is open source. The AVA v2 adapter weights are released under the same license as the base model (Qwen License).

Citation

@misc{ava-v2-2026,
  title={AVA v2: QLoRA Fine-tuning Under Extreme VRAM Constraints},
  author={Muhammad Afsah Mumtaz},
  year={2026},
  url={https://github.com/NAME0x0/AVA}
}

Name		Name	Last commit message	Last commit date
Latest commit History 136 Commits
.github		.github
artifacts/tokenizers		artifacts/tokenizers
configs		configs
corpora		corpora
docs		docs
experiments		experiments
scripts		scripts
sessions		sessions
src/ava		src/ava
tests		tests
tokenizers		tokenizers
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
Modelfile		Modelfile
README.md		README.md
VERSION		VERSION
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AVA

What AVA Is

Experiment Map

Progress Snapshot

Try AVA v2

Option 1: Ollama (easiest — no Python needed)

Option 2: Python chat script

Option 3: Python API

Results

AVA v2 Benchmark Scores

Training Statistics

Comparison to Other Models

ARC-Challenge (Science Reasoning)

GSM8K (Math Reasoning)

Loss Curve

Why This Matters

How It Works

Architecture

Training Data

Hardware

Reproducing the Results

Prerequisites

Step 1: Install Dependencies

Step 2: Install Triton (for SDPA kernel acceleration)

Step 3: Install Flash-Linear-Attention (optional, for Qwen3.5 fast path)

Step 4: Download the Base Model

Step 5: Prepare Training Data

Step 6: Train

Step 7: Benchmark

Lessons Learned

What worked

What didn't work

Windows-specific issues

Model Weights

GGUF for Ollama / llama.cpp

Experiment 5: Practical Local Gemma 4 Serving

LoRA Adapter (42 MB)

Software Stack

Repository Layout

Roadmap

Completed

In Progress

Planned

Long-Term Vision

Prior Work

Quick Start

License

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 11

Uh oh!

Contributors

Uh oh!

Languages