Add `environment_factory` to `GRPOTrainer`

Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…ling for missing tools
```python
if self.environments is not None:
    reward_kwargs["environments"] = self.environments
```
We pass the environments to the reward functions; it's useful to be able to access the inner state of the environments.
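For illustration, a minimal sketch of a reward function reading the `environments` kwarg; the `state` attribute used here is hypothetical, not a fixed API:

```python
# Minimal sketch: a GRPO reward function that receives the environments via the
# `environments` kwarg. The `state` attribute is hypothetical.
def env_state_reward(completions, environments=None, **kwargs):
    # environments[i] is the environment instance used for the i-th rollout
    return [1.0 if getattr(env, "state", None) else 0.0 for env in environments]
```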
```python
self._sync_tool_dict = {}
self._async_tool_dict = {}
```
`self._sync_tool_dict` and `self._async_tool_dict` are now lists of dicts, and each rollout has its own index.
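A rough illustration of that structure, outside the trainer and with assumed names:

```python
# Rough illustration of per-rollout tool registries: one dict per rollout index,
# so a tool registered for one rollout never leaks into another.
num_rollouts = 4
sync_tool_dict = [{} for _ in range(num_rollouts)]
async_tool_dict = [{} for _ in range(num_rollouts)]

# Rollout 0 registers a tool; the other rollouts' registries stay empty.
sync_tool_dict[0]["echo"] = lambda text: text
assert "echo" not in sync_tool_dict[1]
print(sync_tool_dict[0]["echo"]("hello"))  # -> "hello"
```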
Thanks for this work @qgallouedec, really cool. A few thoughts, mostly for future PRs, so let me know and I'll take another pass.

For context: on OpenEnv atm, our latest work is mainly trying to isolate all env logic in the env, so that trainer libraries can basically just consume them, act, and get state. This comment has that in mind: from the pov of OpenEnv users, the wrapper class shouldn't be necessary. Right now users can't pass an OpenEnv env instance directly; OpenEnv already solves this on its side via MCP.

If TRL consumed MCP tool schemas directly instead of introspecting methods, I think we could skip the wrapper and users could pass an OpenEnv env instance. One open question: rewards. OpenEnv carries rewards via […]. In short, I think this is a good agnostic start, but if we take more out of the latest […]
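For illustration only, a sketch of what consuming an MCP-style tool schema (rather than introspecting Python methods) could look like; the schema shape and the conversion below are assumptions, not part of this PR:

```python
# Sketch: map an MCP-style tool schema to the JSON-schema "function" tool format
# commonly used by chat templates. The field names below are assumptions.
mcp_tool = {
    "name": "echo",
    "description": "Echo the input text back to the caller.",
    "inputSchema": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}

chat_template_tool = {
    "type": "function",
    "function": {
        "name": mcp_tool["name"],
        "description": mcp_tool["description"],
        "parameters": mcp_tool["inputSchema"],
    },
}
```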
Thanks for the detailed explanation @qgallouedec. I've been looking at how […]. Currently I'm tracking tool execution state per-generation to compute security-related penalties in reward functions (e.g., detecting access to sensitive files, flagging certain HTTP requests). With my PR, I was passing a completion index to tools so they could record which generation triggered each action, then the reward function queries that state. With […]

Just one question about the `environments` kwarg passed to reward functions: is the ordering guaranteed to match the `completions` list? I will try to get time to test this out and give more feedback if I find anything useful.
Yes!
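A sketch of the use case above, relying on that ordering guarantee (the i-th environment corresponds to the i-th completion); the attribute names `accessed_paths` and `http_requests` are illustrative, not part of this PR:

```python
# Sketch: security-penalty reward that queries per-rollout environment state.
SENSITIVE_PREFIXES = ("/etc/", "/root/.ssh/")

def security_penalty(completions, environments=None, **kwargs):
    rewards = []
    for completion, env in zip(completions, environments):
        touched_sensitive = any(
            path.startswith(SENSITIVE_PREFIXES)
            for path in getattr(env, "accessed_paths", [])
        )
        flagged_http = any(
            request.get("flagged", False)
            for request in getattr(env, "http_requests", [])
        )
        rewards.append(-1.0 if (touched_sensitive or flagged_http) else 0.0)
    return rewards
```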
… external prompt file
```python
# See the License for the specific language governing permissions and
# limitations under the License.

# pip install git+https://github.com/huggingface/transformers.git@main
```
Suggested change:

```diff
-# pip install git+https://github.com/huggingface/transformers.git@main
+# /// script
+# dependencies = [
+# "trl",
+# "openenv-echo-env @ git+https://huggingface.co/spaces/qgallouedec/echo_env",
+# ]
+# ///
```
We could just rename this file to echo.py and I'll open a follow-up PR updating all of them to the new structure.
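For context, the `# /// script` header in the suggestion above is PEP 723 inline script metadata, so tools that understand it (for example `uv run echo.py`) can install the listed dependencies and run the example directly.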
What does this PR do?
Adds `environment_factory` to `GRPOTrainer`: a new way to train agents with stateful, per-rollout environments.

So far, training agents against environments requires providing a `rollout_func`. This demands tricky logic from the user, including tool masking, log-probs handling, tokenization/decoding, etc. Plus, in our current examples, prompts are always processed sequentially, which is not ideal for training efficiency.

With `environment_factory`, we provide a simple way to train on environments. `environment_factory` may replace `rollout_func` entirely, as it provides a simpler, more efficient, and more composable interface for environment-based agent training. This PR also includes an example script (`echo_env_factory.py`) demonstrating the integration with OpenEnv.

How it works?
- Users pass an `environment_factory` callable; the trainer instantiates one environment per rollout slot.
- The environment's public methods (excluding `reset` and `_`-prefixed methods) are automatically exposed as tools.
- `reset()` is called before each generation batch to clear state.
- Environments are passed to the reward functions via the `environments` kwarg, so rewards can inspect environment state if needed.

Example:
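A minimal sketch of what usage could look like, assuming `environment_factory` is accepted as a `GRPOTrainer` keyword argument; the toy environment, dataset, and reward function below are illustrative, not the actual `echo_env_factory.py` script:

```python
# Minimal sketch, assuming `environment_factory` is a `GRPOTrainer` keyword
# argument; the toy environment and reward function are illustrative.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


class CounterEnv:
    """Toy stateful environment: public methods (other than `reset`) become tools."""

    def __init__(self):
        self.calls = []

    def reset(self):
        # Called before each generation batch to clear state.
        self.calls = []

    def echo(self, text: str) -> str:
        # Exposed as a tool the model can call during generation.
        self.calls.append(text)
        return text


def used_tool_reward(completions, environments=None, **kwargs):
    # environments[i] corresponds to completions[i]
    return [1.0 if env.calls else 0.0 for env in environments]


dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=used_tool_reward,
    args=GRPOConfig(output_dir="grpo-echo-env"),
    train_dataset=dataset,
    environment_factory=CounterEnv,  # one environment instantiated per rollout slot
)
trainer.train()
```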
Note
`environment_factory` requires `transformers>=5.2.0.dev0`. This feature is experimental and may change or be removed at any time.