
Add environment_factory to GRPOTrainer #5093

Open

qgallouedec wants to merge 22 commits into main from env-grpo

Conversation

@qgallouedec
Member

@qgallouedec qgallouedec commented Feb 13, 2026

What does this PR do?

Adds environment_factory to GRPOTrainer: a new way to train agents with stateful, per-rollout environments.

Until now, training agents against environments has required providing a rollout_func. This demands tricky logic from the user: tool masking, log-prob handling, tokenization/decoding, and so on. Moreover, in our current examples, prompts are always processed sequentially, which is not ideal for training efficiency.

With environment_factory, we provide a simple way to train on environments.

  1. The key design choice is to treat an environment as a stateful collection of tools. This lets us leverage the existing tool-calling training integration directly.
  2. It also inherently processes prompts in parallel: the trainer creates one environment instance per rollout in the batch and exposes environment methods as tools. Generation and tool calling happen in parallel across the batch, just like standard tool calling: no sequential loop over prompts.
  3. In the longer term, environment_factory may replace rollout_func entirely, as it provides a simpler, more efficient, and more composable interface for environment-based agent training.

This PR also includes an example script (echo_env_factory.py) demonstrating the integration with OpenEnv.

How does it work?

  • Pass a class (or any callable) as environment_factory. The trainer instantiates one environment per rollout slot.
  • Public methods (excluding reset and _-prefixed ones) are automatically exposed as tools; a rough sketch of this discovery step follows the example below.
  • reset() is called before each generation batch to clear state.
  • Environment instances are passed to reward functions via the environments kwarg, so rewards can inspect environment state if needed.

Example:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()
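
As referenced in the list above, the method-to-tool discovery could look roughly like the following minimal sketch (the helper name _discover_tools is hypothetical; the actual trainer internals may differ):

import inspect

def _discover_tools(env):
    # hypothetical helper: every public method except `reset` becomes a tool
    tools = {}
    for name, method in inspect.getmembers(env, predicate=inspect.ismethod):
        if name == "reset" or name.startswith("_"):
            continue
        tools[name] = method
    return tools

# For the IncrementEnv above, only `increment` is exposed:
env = IncrementEnv()
env.reset()
print(list(_discover_tools(env)))  # ['increment']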

Note

environment_factory requires transformers>=5.2.0.dev0. This feature is experimental and may change or be removed at any time.

@qgallouedec changed the title from "feat: add environment to GRPOTrainer" to "Add environment_factory to GRPOTrainer" on Feb 13, 2026.
@qgallouedec marked this pull request as ready for review on February 13, 2026 at 20:11.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +1163 to +1164
if self.environments is not None:
    reward_kwargs["environments"] = self.environments
Member Author


We pass the environments to the reward functions; this makes it possible to access their internal state.
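
For instance, a reward function can read per-rollout state directly from this kwarg (a minimal sketch; counter is the attribute from the IncrementEnv example above, and environments[i] lines up with completions[i]):

def reward_from_env_state(completions, environments, **kwargs):
    # environments[i] is the instance used for completions[i]
    return [float(env.counter) for env in environments]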

Comment on lines -428 to -429
self._sync_tool_dict = {}
self._async_tool_dict = {}
Member Author

@qgallouedec qgallouedec Feb 13, 2026


self._sync_tool_dict and self._async_tool_dict are now lists of dicts, with one entry per rollout, indexed by rollout position.
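
An illustrative sketch of the new layout (not the actual trainer code; IncrementEnv is the class from the example above):

num_rollouts = 4
envs = [IncrementEnv() for _ in range(num_rollouts)]
for env in envs:
    env.reset()

# one tool dict per rollout, indexed by rollout position
sync_tool_dicts = [{"increment": env.increment} for env in envs]

# a tool call generated in rollout 2 is dispatched to that rollout's own instance
sync_tool_dicts[2]["increment"](step=5)
assert envs[2].counter == 5 and envs[0].counter == 0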

@burtenshaw
Collaborator

Thanks for this work @qgallouedec, really cool. A few thoughts, mostly for future PRs so let me know and I'll take another pass.

For context: on OpenEnv at the moment, our latest work is mainly about isolating all env logic inside the env, so that trainer libraries can basically just consume them, act, and get state. This comment has that in mind:

From the pov of OpenEnv users, the wrapper class shouldn't be necessary. Right now users can't pass EchoEnv directly to environment_factory because TRL discovers tools by introspecting wrapper methods (inspect.getmembers), but OpenEnv clients have a different interface (typed actions, result models, context manager lifecycle). I get that the MyEchoEnv wrapper bridges that gap, but I think it'll trip users up.

OpenEnv already solves this on its side via MCP:

  • tools/list returns {name, description, input_schema}; the same tool schemas TRL builds from method introspection
  • tools/call with {name, arguments}; the same dispatch TRL does when routing generated tool calls to methods.
  • Each WebSocket connection creates its own isolated session with its own environment instance, so TRL wouldn't need N factory-created objects, just N connections

If TRL consumed MCP tool schemas directly instead of introspecting methods, I think we could skip the wrapper and let users pass an OpenEnv env instance directly.
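
A rough sketch of what consuming the MCP schemas could look like (hypothetical: mcp_client, list_tools, and call_tool stand in for whatever OpenEnv/MCP client interface is actually exposed):

def tools_from_mcp(mcp_client):
    # tools/list already returns {name, description, input_schema}, which maps
    # one-to-one onto the tool specs TRL builds from method introspection
    specs, dispatch = [], {}
    for tool in mcp_client.list_tools():
        specs.append({
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": tool["input_schema"],
            },
        })
        # routing a generated tool call becomes a tools/call request
        dispatch[tool["name"]] = lambda name=tool["name"], **arguments: mcp_client.call_tool(name, arguments)
    return specs, dispatch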

One open question: rewards. OpenEnv carries rewards via StepResult.reward in simulation mode (where reset, step, and mcp messages coexist on the same WebSocket), but TRL's reward function currently expects to query the environment object directly. We would need logic to consume those rewards, and it's best to align that with the rubric efforts.

In short, I think this is a good agnostic start, but if we take more out of the latest (main) OpenEnv features, we'll unlock a lot of functionality and convenience for users.

@lukehinds
Contributor

Thanks for the detailed explanation @qgallouedec. I've been looking at how environment_factory would work for my use case and I think it could be a good fit.

Currently I'm tracking tool execution state per-generation to compute security-related penalties in reward functions (e.g., detecting access to sensitive files, flagging certain HTTP requests). With my PR, I was passing a completion index to tools so they could record which generation triggered each action, then the reward function queries that state.

With environment_factory, the isolation is built-in since each environment instance is scoped to a single rollout. So instead of a shared tracker with index filtering, each environment just accumulates its own state and the reward function reads directly from environments. If I understand it correctly, that's a lot cleaner.
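
For illustration, a per-rollout tracker along those lines could look like the sketch below (hypothetical tool and attribute names, not code from this PR):

class ShellEnv:
    SENSITIVE_PATHS = ("/etc/passwd", "/root/.ssh")

    def reset(self):
        self.violations = 0

    def read_file(self, path: str) -> str:
        """Read a file, recording whether it touches a sensitive location."""
        if path.startswith(self.SENSITIVE_PATHS):
            self.violations += 1
        return f"<contents of {path}>"  # stubbed out for the sketch

def security_penalty(environments, **kwargs):
    # each environment is scoped to one rollout, so no index filtering is needed
    return [-1.0 * env.violations for env in environments]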

Just one question, for the environments kwarg passed to reward functions - is the ordering guaranteed to match the completions list?

I will try to get time to test this out and give more feedback if I find anything useful.

@qgallouedec
Member Author

Just one question, for the environments kwarg passed to reward functions - is the ordering guaranteed to match the completions list?

Yes!

# See the License for the specific language governing permissions and
# limitations under the License.

# pip install git+https://github.com/huggingface/transformers.git@main
Member


Suggested change
# pip install git+https://github.com/huggingface/transformers.git@main
# /// script
# dependencies = [
# "trl",
# "openenv-echo-env @ git+https://huggingface.co/spaces/qgallouedec/echo_env",
# ]
# ///

Member


We could just rename this file to echo.py and I'll open a follow-up PR updating all of them to the new structure.
