Drop-in replacement for AsyncOpenAI that transparently batches requests via the Batch API. Designed for the Doubleword Inference API where batch pricing saves up to 90%. Support for OpenAI's batch API or other compatible APIs is best effort. If you experience any issues, please open an issue.
Batch APIs offer significant cost savings — up to 90% with the Doubleword Inference API (OpenAI offers 50% off with their 24-hour batch window) — but they require you to restructure your code around file uploads and polling. autobatcher lets you keep your existing async code while getting batch pricing automatically.
```python
# Before: regular async calls (full price)
from openai import AsyncOpenAI

client = AsyncOpenAI()

# After: batched calls (up to 90% off with Doubleword Inference API)
from autobatcher import BatchOpenAI

client = BatchOpenAI(base_url="https://api.doubleword.ai/v1")

# Same interface, same code
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

- Requests are collected over a configurable time window (default: 10 seconds)
- When the window closes or batch size is reached, requests are submitted as a batch
- Results are polled and returned to waiting callers as they complete
- Your code sees normal response objects (`ChatCompletion`, `CreateEmbeddingResponse`, `Response`)

Different request types (chat completions, embeddings, responses) can be mixed in a single batch — each result is parsed with the correct type automatically.
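
The collection step described above can be illustrated with a toy micro-batcher. This is a minimal sketch of the general technique (flush on size or on a timer), not autobatcher's actual implementation; `MicroBatcher`, `fake_submit`, and the tiny window are hypothetical names and values chosen for the demo, and no real Batch API is called.

```python
import asyncio


class MicroBatcher:
    """Toy collector: flush when batch_size is reached or the window elapses."""

    def __init__(self, submit, batch_size=3, window_seconds=0.05):
        self._submit = submit          # async callable: list of items -> list of results
        self._batch_size = batch_size
        self._window = window_seconds
        self._pending = []             # (item, future) pairs waiting for a flush
        self._timer = None
        self._tasks = set()            # keep references so tasks are not garbage collected

    async def request(self, item):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((item, future))
        if len(self._pending) >= self._batch_size:
            self._flush()              # size trigger
        elif self._timer is None:
            # Window trigger: the first request of a batch starts the clock.
            self._timer = loop.call_later(self._window, self._flush)
        return await future

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if batch:
            task = asyncio.create_task(self._run(batch))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    async def _run(self, batch):
        try:
            results = await self._submit([item for item, _ in batch])
        except Exception as exc:
            # Propagate a failed submission to every caller in the batch.
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            return
        # Hand each caller the result at the same position as its item.
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)


async def demo():
    sizes = []

    async def fake_submit(items):      # stand-in for "upload, poll, download"
        sizes.append(len(items))
        return [item.upper() for item in items]

    batcher = MicroBatcher(fake_submit, batch_size=3, window_seconds=0.05)
    results = await asyncio.gather(*(batcher.request(w) for w in ["a", "b", "c", "d"]))
    return results, sizes


results, batch_sizes = asyncio.run(demo())
```

In the demo, the first three requests fill a batch immediately and the fourth is flushed when its window expires, so each caller still awaits a single ordinary result.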
```bash
pip install autobatcher
```

```python
import asyncio
from autobatcher import BatchOpenAI

async def main():
    client = BatchOpenAI(
        api_key="sk-...",  # or set OPENAI_API_KEY env var
    )
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)
    await client.close()

asyncio.run(main())
```

```python
async def embed(client: BatchOpenAI):
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input="Hello, world!",
    )
    print(response.data[0].embedding[:5])
```

```python
async def respond(client: BatchOpenAI):
    response = await client.responses.create(
        model="gpt-4o",
        input="Explain quantum computing in one sentence.",
    )
    print(response.output[0].content[0].text)
```

The real power comes when you have many requests:
```python
async def process_many(prompts: list[str]) -> list[str]:
    client = BatchOpenAI(batch_size=500, batch_window_seconds=5.0)

    async def get_response(prompt: str) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # All requests are batched together automatically
    results = await asyncio.gather(*[get_response(p) for p in prompts])
    await client.close()
    return results
```

Different request types are automatically mixed into the same batch:
```python
async def mixed(client: BatchOpenAI):
    chat, embedding = await asyncio.gather(
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello!"}],
        ),
        client.embeddings.create(
            model="text-embedding-3-small",
            input="Hello!",
        ),
    )
```

```python
async with BatchOpenAI() as client:
    response = await client.chat.completions.create(...)
```

`autobatcher serve` runs a local OpenAI-compatible HTTP proxy. This is useful
when you want to transparently batch traffic from tools that already support an
OpenAI-style `base_url`, such as evaluation frameworks, SDK consumers, or local
benchmark runners.

```bash
autobatcher serve \
  --base-url https://api.doubleword.ai/v1 \
  --api-key "$DOUBLEWORD_API_KEY" \
  --host 127.0.0.1 \
  --port 8080 \
  --batch-size 1024 \
  --batch-window 60 \
  --poll-interval 10 \
  --completion-window 24h
```

Then point your OpenAI-compatible client at the proxy:

```bash
export OPENAI_BASE_URL=http://127.0.0.1:8080/v1
export OPENAI_API_KEY=dummy
```

Use your real Doubleword credential for the proxy's upstream `--api-key`. The
downstream client still uses a dummy `OPENAI_API_KEY` because it is only talking
to the local OpenAI-compatible proxy.

Supported proxy routes:

| Route | Upstream batched endpoint |
|---|---|
| `/v1/chat/completions` | `/v1/chat/completions` |
| `/v1/embeddings` | `/v1/embeddings` |
| `/v1/responses` | `/v1/responses` |
| `/health` | local healthcheck |

In serve mode, autobatcher emits structured JSON lines to stdout for batch
lifecycle events. These are intended for log collection systems such as
Kubernetes logs, Loki, or Cloud Logging.
Example event:
```json
{
  "batch_id": "batch_123",
  "completion_window": "24h",
  "endpoint": "/v1/chat/completions",
  "event": "batch_submitted",
  "input_file_id": "file_123",
  "metadata": {
    "benchmark_id": "bench-2026-04-14",
    "github_run_id": "24393857047"
  },
  "models": ["Qwen/Qwen3.5-397B-A17B-FP8"],
  "request_count": 872,
  "source": "autobatcher",
  "ts": 1776163751.821
}
```

Emitted events currently include:

- `batch_submitted`
- `batch_progress`
- `batch_completed`
- `batch_terminal`
- `batch_cancel_requested`
- `batch_cancelled_upstream`
- `batch_cancel_failed`
- `client_closing`

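
If you collect these logs yourself, a small parser is often enough to summarize a run. The sketch below assumes each event arrives as a single JSON object per line and that unrelated log output may be interleaved. Only the fields shown in the `batch_submitted` example above are taken from this document; the `batch_completed` sample line and its fields are made up for illustration, and `parse_events` and `summarize` are hypothetical helper names.

```python
import json
from collections import Counter


def parse_events(lines):
    """Yield autobatcher events from log lines, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: non-JSON log output may be interleaved
        if isinstance(record, dict) and record.get("source") == "autobatcher":
            yield record


def summarize(lines):
    """Count events by name and total the requests in submitted batches."""
    counts = Counter()
    submitted_requests = 0
    for event in parse_events(lines):
        name = event.get("event", "unknown")
        counts[name] += 1
        if name == "batch_submitted":
            submitted_requests += event.get("request_count", 0)
    return counts, submitted_requests


# Sample input: the first line mirrors the documented example (abridged),
# the last line is a hypothetical event used only to exercise the parser.
sample = [
    '{"batch_id": "batch_123", "event": "batch_submitted", '
    '"request_count": 872, "source": "autobatcher", "ts": 1776163751.821}',
    "plain text log line",
    '{"batch_id": "batch_123", "event": "batch_completed", "source": "autobatcher"}',
]
counts, submitted = summarize(sample)
```

In practice you would feed `summarize` the proxy's stdout (for example `sys.stdin` when piping logs into the script).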
You can stamp correlation metadata onto every upstream batch:
```bash
autobatcher serve \
  --base-url https://api.doubleword.ai/v1 \
  --api-key "$DOUBLEWORD_API_KEY" \
  --batch-metadata benchmark_id=bench-2026-04-14 \
  --batch-metadata github_run_id=24393857047 \
  --batch-metadata k8s_job=perf-1234
```

This metadata is passed through to the upstream `batches.create(...)` call and
is also included in the emitted lifecycle events.

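
When driving the library from your own script rather than the CLI, you may want to build the same metadata from `key=value` strings. The helper below is a hypothetical illustration of that mapping, not the CLI's actual parsing code; it assumes the `batch_metadata` parameter accepts a plain dictionary of strings.

```python
def parse_metadata(pairs):
    """Turn ["key=value", ...] into a dict, rejecting malformed entries."""
    metadata = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        metadata[key] = value
    return metadata


batch_metadata = parse_metadata([
    "benchmark_id=bench-2026-04-14",
    "github_run_id=24393857047",
    "k8s_job=perf-1234",
])

# Assumed usage (not executed here):
# client = BatchOpenAI(batch_metadata=batch_metadata)
```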
By default, serve mode best-effort cancels any still-active upstream batches
when the proxy shuts down. This is useful for short-lived pods or CI jobs where
the proxy lifetime should own the batch lifetime.
If you want upstream batches to continue running after the proxy exits, use:
```bash
autobatcher serve --keep-active-batches-on-close
```

| Parameter | Default | Description |
|---|---|---|
| `api_key` | `None` | OpenAI API key (falls back to `OPENAI_API_KEY` env var) |
| `base_url` | `None` | API base URL (for proxies or compatible APIs) |
| `batch_size` | `1000` | Submit batch when this many requests are queued |
| `batch_window_seconds` | `10.0` | Submit batch after this many seconds |
| `poll_interval_seconds` | `5.0` | How often to poll for batch completion |
| `completion_window` | `"1h"` | Completion deadline (see below) |
| `batch_metadata` | `None` | Optional metadata attached to each upstream batch |
| `cancel_active_batches_on_close` | `False` | Best-effort cancel active upstream batches when closing the client |

The `completion_window` controls the deadline and pricing tier:

- `"1h"` (default): async inference. Faster turnaround than batch mode, still significantly cheaper than real-time. Supported by the Doubleword Inference API only.
- `"24h"`: batch inference. Maximum cost savings (up to 90% with the Doubleword Inference API, 50% with OpenAI). Use for background jobs like evals, data processing, or bulk extraction where latency doesn't matter. This is the only window OpenAI supports.

| Method | Endpoint | Return type |
|---|---|---|
| `client.chat.completions.create()` | Chat completions | `ChatCompletion` |
| `client.embeddings.create()` | Embeddings | `CreateEmbeddingResponse` |
| `client.responses.create()` | Responses API | `Response` |

- Not suitable for real-time or interactive use cases — batch mode adds latency from the collection window and polling cycle.
- Streaming is not supported. Requests that would normally stream are forced to non-streaming; the serve proxy can re-wrap results as SSE for consuming clients.
- OpenAI only supports `completion_window="24h"`. The `"1h"` window is a Doubleword-specific feature.
- No automatic escalation to real-time if the completion window elapses; the batch will be marked as expired.

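
Because there is no built-in escalation, callers who need an upper bound on latency can add their own fallback. The following is a minimal sketch using only `asyncio`; `with_realtime_fallback` is a hypothetical helper, and the two callables stand in for a batched call and a regular real-time call. Note the assumption: the timeout only abandons the local wait. Whether the upstream batched request is cancelled or still billed is not something this sketch controls.

```python
import asyncio


async def with_realtime_fallback(batched_call, realtime_call, timeout_seconds):
    """Await the batched call, switching to the real-time call on timeout."""
    try:
        return await asyncio.wait_for(batched_call(), timeout_seconds)
    except asyncio.TimeoutError:
        # The local wait is cancelled; the upstream batch may keep running.
        return await realtime_call()


async def demo():
    async def slow_batched():          # stand-in for a batched request
        await asyncio.sleep(1.0)
        return "batched"

    async def fast_batched():
        return "batched"

    async def realtime():              # stand-in for a regular client call
        return "realtime"

    late = await with_realtime_fallback(slow_batched, realtime, timeout_seconds=0.01)
    on_time = await with_realtime_fallback(fast_batched, realtime, timeout_seconds=0.5)
    return late, on_time


late_result, on_time_result = asyncio.run(demo())
```

With real clients, `batched_call` could wrap `BatchOpenAI.chat.completions.create(...)` and `realtime_call` the same request on `AsyncOpenAI`; a request that falls back may end up being paid for on both paths.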
MIT