AState: High-performance State Management for RL

AState is a general-purpose state data management system for RL workloads. It is designed to address several core challenges in RL:

  • Low I/O efficiency in training and inference;
  • Insufficient weight synchronization performance;
  • The lack of robust state recovery for multi-turn conversations and external environment calls.

Within Ant Group, AState has been deployed as a key component of ASystem and is already supporting high-performance weight exchange for large-scale RL training in production. At trillion-parameter scale, AState can complete weight synchronization in about 6 seconds, compared to the minute-level latency seen in many industry solutions.


System Architecture

[Figure 1: AState's three-layer architecture (API, Service, and Transport layers)]

To address the challenges of large-scale RL weight synchronization, AState provides a unified weight management API for RL workloads. It can support arbitrary model architectures and various deployment and pipeline patterns, without intrusive changes or extra adaptation in the RL framework itself. As shown in Figure 1, AState’s architecture is organized into three layers:

  • API Layer

    Provides tensor-native interfaces that integrate quickly with different training/inference frameworks. The APIs expose one-sided read/write semantics, naturally supporting more complex or asynchronous computation and data exchange flows.

  • Service Layer

    To support multiple deployment patterns and pipeline modes with a single API, AState introduces a middle service layer that offers different weight sync services and protocols, shielding upper layers from the complexity of exchange and scheduling. For example:

    • In co-located training/inference, inference nodes can pull weights on demand.
    • In off-policy setups, training and inference can perform fully asynchronous weight updates.

    The service layer also provides zero-redundancy tensor sharding management and synchronization plans that are aware of the RL topology and cluster layout.
  • Transport Layer

    This foundational layer provides efficient, scalable data transfer capabilities, including:

    • NUMA topology and affinity awareness (a minimal affinity-selection sketch follows after this list);
    • Multiple transport backends (PCIe / NVLink / RoCE / InfiniBand, etc.) to exploit the full potential of underlying hardware bandwidth and topology.
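
As a concrete illustration of the affinity awareness mentioned above, the snippet below sketches how a sender might pick the RDMA NIC that shares a NUMA node with the source GPU. This is not AState code: the topology table, device names, and selection logic are assumptions made for the example; a real system would discover the topology (e.g. via hwloc or `nvidia-smi topo`) rather than hard-code it.

```python
# Illustrative only: pick an RDMA NIC on the same NUMA node as the source GPU.
# The topology below is a made-up example, not real hardware discovery.
GPU_NUMA = {0: 0, 1: 0, 2: 1, 3: 1}        # gpu_id -> NUMA node
NIC_NUMA = {"mlx5_0": 0, "mlx5_1": 1}      # RDMA NIC -> NUMA node

def pick_nic(gpu_id: int) -> str:
    """Prefer a NIC local to the GPU's NUMA node; fall back to any NIC."""
    node = GPU_NUMA[gpu_id]
    for nic, nic_node in NIC_NUMA.items():
        if nic_node == node:
            return nic                     # same-node NIC avoids a cross-socket hop
    return next(iter(NIC_NUMA))            # degraded path: any available NIC

print(pick_nic(2))                         # -> "mlx5_1"
```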

Features

  • Unified tensor-level API

    • One-sided read/write semantics (put/get/multi_put/multi_get/complete; sketched in the example after this list)
    • Decoupled from specific training/inference frameworks
  • High-performance weight synchronization

    • Zero-redundancy transfer for row-parallel & column-parallel tensors
    • RDMA-based P2P data paths with DMA zero-copy
    • In-place weight update on inference side (no extra GPU memory copy)
  • Topology-aware, scalable design

    • NUMA-aware scheduling and CPU/GPU/RDMA affinity
    • Multi-channel transport: RDMA/RoCE/InfiniBand, shared memory, NCCL, etc.
    • Global execution planning to avoid hotspots and long-tail latency
  • RL-native state management (ongoing)

    • KV cache persistence & recovery for long-context RL
    • Activation caching & memory offload
    • Agent state (multi-turn dialog, tool calls) caching and recovery
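
As a rough sketch only, the snippet below shows how the tensor-level API surface listed above (put/get/multi_put/multi_get/complete) might be driven from the training and inference sides. The module name `astate`, the `Client` constructor, the tensor key names, and the call signatures are assumptions for illustration, not the library's documented interface.

```python
import torch
import astate  # hypothetical module name; adjust to the actual package

client = astate.Client()  # assumption: a handle to the AState service

# Training side: one-sided writes, no collective with the inference side.
qkv  = torch.randn(4096, 12288, dtype=torch.bfloat16, device="cuda")
up   = torch.randn(4096, 16384, dtype=torch.bfloat16, device="cuda")
down = torch.randn(16384, 4096, dtype=torch.bfloat16, device="cuda")

client.put("layers.0.attn.qkv.weight", qkv)
client.multi_put({
    "layers.0.mlp.up.weight": up,
    "layers.0.mlp.down.weight": down,
})
client.complete()  # publish the writes as a consistent snapshot

# Inference side: pull whenever convenient; sharding/resharding is handled
# by the service layer, so the reader just asks for tensors by name.
qkv = client.get("layers.0.attn.qkv.weight")
mlp = client.multi_get(["layers.0.mlp.up.weight", "layers.0.mlp.down.weight"])
```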

Getting Started

git clone https://github.com/inclusionAI/asystem-astate.git
cd asystem-astate
make deps install      # installs as a Python library

# development build:
make develop      # build in develop mode + install as Python library
make test

Simple Usage Example

The library currently supports only RDMA NICs and does not support soft-RoCE (TCP). A TCP-based demo will be tested once UCX integration lands.

# start the training process
cd python/example
bash setup_trainer_env.sh

# start the inference process
bash setup_infer_env.sh

Typical usage in RL (sketched in the example below):

  • Training side writes updated weights to AState every K steps or per iteration.
  • Inference side periodically or asynchronously pulls the latest weights via get/multi_get.
  • AState handles sharding, resharding, transport, and in-place update for you.
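
A minimal sketch of that loop, assuming the same hypothetical `client` as above; `train_one_step`, `load_in_place`, `latest_version`, and the `version` argument are placeholders for hooks your RL framework and the service would actually provide.

```python
SYNC_EVERY_K = 8  # assumption: publish weights every K optimizer steps

# --- training process ---
for step in range(1, num_steps + 1):
    loss = train_one_step(model)              # placeholder: your framework's step
    if step % SYNC_EVERY_K == 0:
        for name, param in model.named_parameters():
            client.put(name, param.data)      # one-sided write from the trainer
        client.complete(version=step)         # hypothetical: tag a consistent snapshot

# --- inference process (runs independently, pulls asynchronously) ---
latest = client.latest_version()              # hypothetical: newest complete snapshot
if latest > loaded_version:
    names = [n for n, _ in infer_model.named_parameters()]
    tensors = client.multi_get(names)         # resharded for the inference layout
    load_in_place(infer_model, tensors)       # placeholder: in-place weight update
    loaded_version = latest
```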

Performance

| Method | Supported Deployment Modes | Supported Pipeline Modes | Data Redundancy | External Services Required | End-to-End Sync (100B¹) | End-to-End Sync (1T¹) |
| --- | --- | --- | --- | --- | --- | --- |
| Load data from FS | All | On-policy | Yes | Yes | ~10 min | N/A² |
| Layer-by-layer gather → NCCL P2P | All³ | All³ | Yes | Yes | ~30 s | ~1 min |
| AState: simultaneous RDMA P2P | All | All | No | No | ~4 s | ~6 s |

¹ Training-inference parallel configurations: 100B (FP16), train tp=1, etp=1, infer tp=4; 1T (FP8), train tp=1, etp=1, infer tp=16.

² The synchronization time becomes so large relative to training that this approach is effectively unusable at this scale.

³ Not necessarily optimal in every mode. For example, the co-located mode introduces an additional GPU memory offload pipeline; moreover, due to NCCL's two-sided semantics, training and inference must be synchronized during data transfer, which can effectively cause an off-policy scheme to degenerate into on-policy.

Roadmap (High-level)

  • Support UCX as the underlying RDMA dependency, for compatibility with NCCL and the broader industry ecosystem.
  • KV cache persistence & fast failover for long-context RL.
  • Activation caching & GPU memory extension.
  • Elastic training runtime support.
  • Unified agent state (dialog + tools) cache & recovery.

Community & Links

Contributions via Issues / PRs / Discussions are very welcome. If AState helps your RL system, please consider leaving a ⭐ Star!
