Mac-native citation verification research on Apple Silicon with MLX, tool use, and official HALLMARK benchmarking.
hallmark-mlx is a research-oriented repository for building citation verification systems that use external scholarly tools before making a judgment. The project is optimized for Apple Silicon and MLX LoRA workflows, but the evaluation and reporting stack is general enough to support broader experimentation.
The repository is built around a simple idea: bibliographic truth should come from evidence, not from memorized model weights.
Core capabilities:
- structured verification traces for training and debugging
- tool wrappers for sources like BibTeX Updater, Crossref, OpenAlex, DBLP, ACL Anthology, arXiv, and Semantic Scholar
- MLX LoRA training flows for small Qwen-based tool-using policies
- deterministic controller and finalizer paths for benchmark-facing runs
- official HALLMARK split runners and report generation
- reproducible release bundles for datasets and LoRA adapters
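The structured verification traces mentioned above are line-delimited JSON. The exact schema is not shown in this README, so the record below is only a hedged illustration; every field name (`input`, `actions`, `evidence`, `verdict`) is an assumption, not the repository's actual schema:

```python
import json

# Hypothetical trace record: one tool call plus a final verdict.
# Field names are illustrative, not the repository's actual schema.
trace = {
    "input": "Vaswani et al., Attention Is All You Need, NeurIPS 2017",
    "actions": [{"tool": "crossref_search", "query": "Attention Is All You Need"}],
    "evidence": [{"title": "Attention Is All You Need", "venue": "NeurIPS"}],
    "verdict": "verified",
}
line = json.dumps(trace)  # one line of a JSONL trace file
print(json.loads(line)["verdict"])  # → verified
```

Records like this serve double duty: they are training targets for the tool-using policy and a debugging log of which evidence led to which verdict.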
Citation failures are not limited to formatting issues. In practice they include:
- fabricated references
- swapped or partial author lists
- wrong or nonexistent venues
- preprints cited as published papers
- plausible-looking but incorrect metadata
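The most mechanical of these failures, such as incomplete entries, can be caught without any network call. Below is a toy sketch of a strict required-field check in the spirit of a `--strict` BibTeX pass; it is not the repository's implementation, and the regex "parser" is deliberately naive:

```python
import re

REQUIRED = {"author", "title", "year"}  # assumed strict-mode field set

def missing_fields(entry: str) -> set:
    """Return required BibTeX fields absent from a single entry (toy parser)."""
    present = {m.group(1).lower() for m in re.finditer(r"(\w+)\s*=", entry)}
    return REQUIRED - present

entry = """@article{vaswani2017attention,
  title = {Attention Is All You Need},
  year  = {2017},
}"""
print(missing_fields(entry))  # → {'author'}
```

Fabricated references and swapped author lists are the harder cases: they look locally well-formed, which is exactly why the project leans on external evidence.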
This repository treats citation verification as a grounded decision problem:
- parse the input
- decide which verification actions to run
- collect evidence from external tools
- compare candidate records
- abstain when needed
- emit a calibrated verdict
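The steps above can be sketched as a single decision loop. Everything in this snippet is a hypothetical illustration: the tool interface, the field names, and the exact-title matching rule are assumptions, not the repository's API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str      # "verified", "mismatch", or "abstain"
    evidence: list  # records gathered from external tools

def verify_citation(citation: dict, tools: dict) -> Verdict:
    """Grounded decision loop: gather evidence, compare, abstain if unsupported."""
    evidence = []
    # Decide which verification actions to run based on the fields present.
    for field in ("doi", "title"):
        lookup = tools.get(field)
        if lookup and citation.get(field):
            record = lookup(citation[field])
            if record:
                evidence.append(record)
    if not evidence:
        return Verdict("abstain", evidence)  # no external support either way
    # Compare candidate records against the claimed metadata (toy rule).
    matches = all(
        rec.get("title", "").lower() == citation.get("title", "").lower()
        for rec in evidence
    )
    return Verdict("verified" if matches else "mismatch", evidence)

# Toy tool: a dict standing in for a real metadata index like Crossref.
fake_index = {"10.1/abc": {"title": "Attention Is All You Need"}}
tools = {"doi": fake_index.get}
print(verify_citation({"doi": "10.1/abc", "title": "Attention Is All You Need"}, tools).label)  # → verified
```

The key design point is the explicit abstain path: when no tool returns evidence, the loop refuses to guess rather than falling back on memorized weights.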
```
hallmark-mlx/
├── configs/
├── docs/
├── scripts/
├── skills/
├── src/hallmark_mlx/
│   ├── eval/
│   ├── inference/
│   ├── release/
│   ├── tools/
│   └── training/
├── tests/
└── .lab-book/
```
Recommended:

```shell
uv sync --extra dev --extra mlx --extra weco
uv run pre-commit install
```

If you are not on Apple Silicon, skip the mlx extra:

```shell
uv sync --extra dev --extra weco
uv run pre-commit install
```

Build a trace dataset:
```shell
hallmark-mlx build-dataset \
  --config configs/base.yaml \
  --input-path data/raw/traces.jsonl \
  --output-dir data/processed
```

Run the BibTeX checker wrapper:
```shell
hallmark-mlx check-bib references.bib --strict
```

Train the canonical kept 1.5B policy:
```shell
hallmark-mlx train --config configs/train_qwen_1_5b_kept.yaml
```

Run a tracked internal policy eval:
```shell
hallmark-mlx eval-policy \
  --config configs/train_qwen_1_5b_kept.yaml \
  --input-path data/weco/hallmark_dev_compare32_gold_traces.jsonl \
  --output-path artifacts/confirm_qwen_kept_compare32_metrics.json
```

Public benchmark claims in this repository are based on official HALLMARK splits only:

- dev_public
- test_public
- stress_test
Internal Weco splits such as search64 and compare32 remain in the codebase for model selection, but they are not presented as public benchmark results.
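Model selection on those internal splits comes down to a handful of summary numbers. Here is a minimal sketch of the kind of metrics an eval run might aggregate; the per-example records and metric names are a hypothetical schema, not the repository's actual output format:

```python
import json

# Hypothetical per-example results, as an eval run might collect them.
results = [
    {"gold": "verified", "pred": "verified"},
    {"gold": "mismatch", "pred": "abstain"},
    {"gold": "mismatch", "pred": "mismatch"},
]
answered = [r for r in results if r["pred"] != "abstain"]
metrics = {
    "accuracy_when_answered": sum(r["pred"] == r["gold"] for r in answered) / len(answered),
    "abstain_rate": 1 - len(answered) / len(results),
}
print(json.dumps(metrics))
```

Separating accuracy-when-answered from the abstention rate matters for a calibrated verifier: a model that abstains often but answers correctly is a different trade-off than one that always answers and is sometimes wrong.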
Key benchmark artifacts:
- docs/reports/hallmark_official_splits.md
- docs/reports/hallmark_submission_readiness.md
- docs/figures/hallmark_official_vs_bibtexupdater.png
The repository includes a repo-local skill, under skills/, for other coding agents.
Use it to:
- rerun kept training configs
- compare the 1.5B and 3B 4-bit Mac-viable models
- refresh confirmed benchmark reports and figures
- prepare Hugging Face release bundles
The project also keeps a lightweight lab book under .lab-book/.
Use it for dated notes on experiments, benchmark reruns, Weco settings, and promotion decisions.
HF-ready dataset and model bundles can be prepared with:

```shell
uv run python scripts/prepare_hf_release.py
```

The current release bundle root is `artifacts/hf_release/qwen25_1_5b_kept`.
Lint and test:

```shell
uv run ruff check .
uv run pytest
```

Refresh benchmark reports from the confirmed official reruns:

```shell
PYTHONPATH=src uv run python scripts/refresh_confirmed_benchmarks.py
```

Pre-commit currently runs:

- ruff check --fix
- ruff format
- trailing whitespace cleanup
- end-of-file normalization
See CONTRIBUTING.md for setup, style, benchmark rules, and PR expectations.
This project is released under the MIT License.
