Extract executable alpha factors from research PDFs via LLM — Chinese brokerage reports first.
paper2alpha turns a quant research PDF into structured ResearchCard metadata
and generator-ready Python factor code, with automatic static-analysis guardrails
from qtype.
uv sync
export OPENAI_API_KEY=sk-...
uv run p2a run path/to/report.pdf --out ./out
Outputs:
- out/card.json: structured ResearchCard
- out/factor_<name>.py: generator code (qtype-clean)
- out/qtype_report.json: static-analysis report
PDF → PyMuPDF parse → LLM extract (JSON mode) → Pydantic ResearchCard
│
▼
Jinja2 template ← factor stubs ← qtype static check
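
The flow above can be sketched end to end with stdlib stand-ins. This is an illustration under stated assumptions, not paper2alpha's own code: the ResearchCard fields (name, description, lookback_days), the stub layout, and every function name are hypothetical. A dataclass stands in for the Pydantic model, string.Template for Jinja2, and ast.parse for the qtype check. The LLM reply is hard-coded so the block runs offline, and the PDF parsing step is omitted.

```python
import ast
import json
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class ResearchCard:
    """Hypothetical stand-in for the Pydantic ResearchCard model."""
    name: str
    description: str
    lookback_days: int


def parse_card(llm_json: str) -> ResearchCard:
    """Turn a JSON-mode LLM reply into a card, coercing and checking fields."""
    raw = json.loads(llm_json)
    card = ResearchCard(
        name=str(raw["name"]),
        description=str(raw["description"]),
        lookback_days=int(raw["lookback_days"]),
    )
    if not card.name.isidentifier():
        raise ValueError(f"factor name is not a valid identifier: {card.name!r}")
    return card


# Assumed stub layout: constants and docstring filled in, body left unimplemented.
STUB_TEMPLATE = Template('''\
LOOKBACK_DAYS = $lookback_days


def factor_$name(prices):
    """$description"""
    raise NotImplementedError
''')


def render_stub(card: ResearchCard) -> str:
    """Fill the template from the card (Jinja2 in the real pipeline)."""
    return STUB_TEMPLATE.substitute(
        name=card.name,
        description=card.description,
        lookback_days=card.lookback_days,
    )


def check_source(source: str) -> list[str]:
    """Syntax-only check; a stand-in for the qtype static analysis."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


llm_reply = (
    '{"name": "momentum_20d", '
    '"description": "20-day price momentum.", '
    '"lookback_days": 20}'
)
card = parse_card(llm_reply)
source = render_stub(card)
problems = check_source(source)  # an empty list means the stub parsed cleanly
```

Keeping the card as typed data between the LLM call and the template means a malformed reply fails at validation rather than producing a broken factor file.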
Create p2a.toml in your working directory:
[llm]
provider = "openai" # "openai" is the default; Anthropic / local via custom LLMClient
model = "gpt-4o-mini"
Set the matching env var (OPENAI_API_KEY or ANTHROPIC_API_KEY). API keys
are never read from the TOML file.
- Generated factor bodies are stubs (raise NotImplementedError); only the metadata, constants, and docstrings are filled in. LLM-bodied generation is scoped to v0.2.
- Table / formula image extraction is heuristic; numeric metrics may be missed on scanned PDFs.
- Only OpenAI is wired out of the box. Plug in another provider by writing a
small class (≈20 LOC) against the
paper2alpha.core.llm_client.LLMClient Protocol.
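
As a sketch of what such a provider class could look like: the real protocol's method names and signatures are not shown in this README, so the single complete(prompt) -> str method below is an assumption, and EchoClient is a hypothetical offline provider used only to show the structural-typing pattern. Check paper2alpha.core.llm_client for the actual interface before adapting it.

```python
import json
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    """Assumed shape only: one call that returns the model's text reply."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Hypothetical provider that needs no network; returns a small JSON reply."""

    def __init__(self, model: str = "echo-1") -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        # A real provider would send `prompt` to its API here.
        return json.dumps({"model": self.model, "prompt_chars": len(prompt)})


def extract(client: LLMClient, text: str) -> dict:
    """Accepts any object that satisfies the protocol; no inheritance needed."""
    return json.loads(client.complete(f"Extract the factor from: {text}"))


client = EchoClient()
result = extract(client, "report body")
```

Because a Protocol is matched structurally, a provider class does not have to import or subclass anything from the package; it only has to expose the expected methods.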
- v0.2: English (arXiv / SSRN) support + LLM-bodied factor code with context-distiller retrieval augmentation.
- v0.3: batch processing + factor deduplication (rank correlation against an existing library).
- v0.4: paper citation-graph traversal.
MIT — see LICENSE. Vendored components (qtype,
context_distiller) retain their upstream licenses; see
LICENSE-VENDORED.md.