
Add environment_factory to GRPOTrainer #5093

Open

qgallouedec wants to merge 22 commits into main from env-grpo

Conversation

@qgallouedec
Member

@qgallouedec qgallouedec commented Feb 13, 2026

What does this PR do?

Adds environment_factory to GRPOTrainer: a new way to train agents with stateful, per-rollout environments.

Until now, training agents against environments has required providing a rollout_func. This demands tricky logic from the user: tool masking, log-prob handling, tokenization/decoding, and so on. Moreover, in our current examples, prompts are always processed sequentially, which is not ideal for training efficiency.

With environment_factory, we provide a simple way to train on environments.

  1. The key design choice is to treat an environment as a stateful collection of tools. This lets us leverage the existing tool-calling training integration directly.
  2. It also inherently processes prompts in parallel: the trainer creates one environment instance per rollout in the batch and exposes environment methods as tools. Generation and tool calling happen in parallel across the batch, just like standard tool calling: no sequential loop over prompts.
  3. In the longer term, environment_factory may replace rollout_func entirely, as it provides a simpler, more efficient, and more composable interface for environment-based agent training.

This PR also includes an example script (echo_env_factory.py) demonstrating the integration with OpenEnv.

How does it work?

  • Pass a class (or any callable) as environment_factory. The trainer instantiates one environment per rollout slot.
  • Public methods (excluding reset and _-prefixed ones) are automatically exposed as tools; a rough sketch of this discovery step follows the example below.
  • reset() is called before each generation batch to clear state.
  • Environment instances are passed to reward functions via the environments kwarg, so rewards can inspect environment state if needed.

Example:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()
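
As referenced in the list above, the method-to-tool discovery could look roughly like the following minimal sketch (the helper name _discover_tools is hypothetical; the actual trainer internals may differ):

import inspect

def _discover_tools(env):
    # hypothetical helper: every public method except `reset` becomes a tool
    tools = {}
    for name, method in inspect.getmembers(env, predicate=inspect.ismethod):
        if name == "reset" or name.startswith("_"):
            continue
        tools[name] = method
    return tools

# For the IncrementEnv above, only `increment` is exposed:
env = IncrementEnv()
env.reset()
print(list(_discover_tools(env)))  # ['increment']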

Note

environment_factory requires transformers>=5.2.0.dev0. This feature is experimental and may change or be removed at any time.

@qgallouedec changed the title from "feat: add environment to GRPOTrainer" to "Add environment_factory to GRPOTrainer" on Feb 13, 2026.
@qgallouedec marked this pull request as ready for review on February 13, 2026 at 20:11.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +1163 to +1164
if self.environments is not None:
    reward_kwargs["environments"] = self.environments
Member Author


We pass the environments to the reward functions; this makes it possible to access their internal state.
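
For instance, a reward function can read per-rollout state directly from this kwarg (a minimal sketch; counter is the attribute from the IncrementEnv example above, and environments[i] lines up with completions[i]):

def reward_from_env_state(completions, environments, **kwargs):
    # environments[i] is the instance used for completions[i]
    return [float(env.counter) for env in environments]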

Comment on lines -428 to -429
self._sync_tool_dict = {}
self._async_tool_dict = {}
Member Author

@qgallouedec qgallouedec Feb 13, 2026


self._sync_tool_dict and self._async_tool_dict are now lists of dicts, with one entry per rollout, indexed by rollout position.
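
An illustrative sketch of the new layout (not the actual trainer code; IncrementEnv is the class from the example above):

num_rollouts = 4
envs = [IncrementEnv() for _ in range(num_rollouts)]
for env in envs:
    env.reset()

# one tool dict per rollout, indexed by rollout position
sync_tool_dicts = [{"increment": env.increment} for env in envs]

# a tool call generated in rollout 2 is dispatched to that rollout's own instance
sync_tool_dicts[2]["increment"](step=5)
assert envs[2].counter == 5 and envs[0].counter == 0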

@burtenshaw
Collaborator

Thanks for this work @qgallouedec, really cool. A few thoughts, mostly for future PRs so let me know and I'll take another pass.

For context: on OpenEnv at the moment, our latest work is mainly about isolating all env logic inside the env, so that trainer libraries can basically just consume them, act, and get state. This comment has that in mind:

From the pov of OpenEnv users, the wrapper class shouldn't be necessary. Right now users can't pass EchoEnv directly to environment_factory because TRL discovers tools by introspecting wrapper methods (inspect.getmembers), but OpenEnv clients have a different interface (typed actions, result models, context manager lifecycle). I get that the MyEchoEnv wrapper bridges that gap, but I think it'll trip users up.

OpenEnv already solves this on its side via MCP:

  • tools/list returns {name, description, input_schema}; the same tool schemas TRL builds from method introspection
  • tools/call with {name, arguments}; the same dispatch TRL does when routing generated tool calls to methods.
  • Each WebSocket connection creates its own isolated session with its own environment instance, so TRL wouldn't need N factory-created objects, just N connections

If TRL consumed MCP tool schemas directly instead of introspecting methods, I think we could skip the wrapper and let users pass an OpenEnv env instance directly.
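
A rough sketch of what consuming the MCP schemas could look like (hypothetical: mcp_client, list_tools, and call_tool stand in for whatever OpenEnv/MCP client interface is actually exposed):

def tools_from_mcp(mcp_client):
    # tools/list already returns {name, description, input_schema}, which maps
    # one-to-one onto the tool specs TRL builds from method introspection
    specs, dispatch = [], {}
    for tool in mcp_client.list_tools():
        specs.append({
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": tool["input_schema"],
            },
        })
        # routing a generated tool call becomes a tools/call request
        dispatch[tool["name"]] = lambda name=tool["name"], **arguments: mcp_client.call_tool(name, arguments)
    return specs, dispatch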

One open question: rewards. OpenEnv carries rewards via StepResult.reward in simulation mode (where reset, step, and mcp messages coexist on the same WebSocket), but TRL's reward function currently expects to query the environment object directly. We would need logic to consume those rewards, and it's best to align that with the rubric efforts.

In short, I think this is a good agnostic start, but if we take more out of the latest (main) OpenEnv features, we'll unlock a lot of functionality and convenience for users.

@lukehinds
Contributor

Thanks for the detailed explanation @qgallouedec. I've been looking at how environment_factory would work for my use case and I think it could be a good fit.

Currently I'm tracking tool execution state per-generation to compute security-related penalties in reward functions (e.g., detecting access to sensitive files, flagging certain HTTP requests). With my PR, I was passing a completion index to tools so they could record which generation triggered each action, then the reward function queries that state.

With environment_factory, the isolation is built-in since each environment instance is scoped to a single rollout. So instead of a shared tracker with index filtering, each environment just accumulates its own state and the reward function reads directly from environments. If I understand it correctly, that's a lot cleaner.
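
For illustration, a per-rollout tracker along those lines could look like the sketch below (hypothetical tool and attribute names, not code from this PR):

class ShellEnv:
    SENSITIVE_PATHS = ("/etc/passwd", "/root/.ssh")

    def reset(self):
        self.violations = 0

    def read_file(self, path: str) -> str:
        """Read a file, recording whether it touches a sensitive location."""
        if path.startswith(self.SENSITIVE_PATHS):
            self.violations += 1
        return f"<contents of {path}>"  # stubbed out for the sketch

def security_penalty(environments, **kwargs):
    # each environment is scoped to one rollout, so no index filtering is needed
    return [-1.0 * env.violations for env in environments]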

Just one question, for the environments kwarg passed to reward functions - is the ordering guaranteed to match the completions list?

I will try to get time to test this out and give more feedback if I find anything useful.

@qgallouedec
Member Author

Just one question, for the environments kwarg passed to reward functions - is the ordering guaranteed to match the completions list?

Yes!

# See the License for the specific language governing permissions and
# limitations under the License.

# pip install git+https://github.com/huggingface/transformers.git@main
Member


Suggested change
# pip install git+https://github.com/huggingface/transformers.git@main
# /// script
# dependencies = [
# "trl",
# "openenv-echo-env @ git+https://huggingface.co/spaces/qgallouedec/echo_env",
# ]
# ///

Member


We could just rename this file to echo.py and I'll open a follow-up PR updating all of them to the new structure.
