add support for distributed Offline Eval #708

Closed
alexnikulkov wants to merge 1 commit into facebookresearch:main from alexnikulkov:export-D42407669

Conversation

@alexnikulkov
Contributor

Summary:
Adding support for distributed Offline Eval. This requires maintaining local buffers in each trainer instance and syncing them across all trainers periodically (see the sketch after this summary). The sync happens under one of two conditions:

  1. When the "critical" weight of data has been consumed (this threshold will be set approximately equal to the size of a 1-hour partition)
  2. At the end of the training epoch (if any data has been consumed since the last sync)

Also, updating the FREE pipeline to remove the restriction on the number of nodes for Offline Eval runs.
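For illustration, here is a minimal sketch of the local-buffer/periodic-sync scheme described above, assuming one process per trainer instance in a PyTorch `torch.distributed` process group. All names (`LocalEvalBuffer`, `sync_threshold`, `add_batch`) are hypothetical and not taken from the ReAgent codebase; the sketch also assumes all trainers cross the threshold at the same step, since `all_reduce` is a collective call.

```python
import torch
import torch.distributed as dist


class LocalEvalBuffer:
    """Per-trainer eval buffer, periodically summed across all trainers."""

    def __init__(self, sync_threshold: float):
        # Weight of data consumed locally since the last sync.
        self.sum_weight = torch.zeros(1)
        # Weighted sum of rewards accumulated locally since the last sync.
        self.sum_reward = torch.zeros(1)
        # Condition 1 threshold: roughly the weight of a 1-hour data partition.
        self.sync_threshold = sync_threshold

    def add_batch(self, rewards: torch.Tensor, weights: torch.Tensor) -> None:
        self.sum_weight += weights.sum()
        self.sum_reward += (rewards * weights).sum()
        # Condition 1: the "critical" weight of data has been consumed.
        if self.sum_weight.item() >= self.sync_threshold:
            self.sync()

    def on_epoch_end(self) -> None:
        # Condition 2: end of epoch, flush anything consumed since the last sync.
        if self.sum_weight.item() > 0:
            self.sync()

    def sync(self) -> None:
        # Sum the local buffers across all trainer instances, then reset them.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(self.sum_weight, op=dist.ReduceOp.SUM)
            dist.all_reduce(self.sum_reward, op=dist.ReduceOp.SUM)
        # ...fold the globally summed totals into the running eval metrics...
        self.sum_weight.zero_()
        self.sum_reward.zero_()
```

Syncing raw weighted sums (rather than per-trainer averages) keeps the aggregated estimate exact even when data is sharded unevenly across trainers.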

Differential Revision: D42407669

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D42407669

alexnikulkov pushed a commit to alexnikulkov/ReAgent that referenced this pull request Jan 10, 2023
Summary:
Pull Request resolved: facebookresearch#708 (same summary as the PR description above)

Differential Revision: D42407669

fbshipit-source-id: b48ce0fee5f3b8155cb0189e51988986c169d08f
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D42407669

Summary:
Pull Request resolved: facebookresearch#708 (same summary as the PR description above)

Differential Revision: D42407669

fbshipit-source-id: 634c94a594bedbd98d175d0c41371a717bab0306
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D42407669

@codecov-commenter

Codecov Report

Base: 87.72% // Head: 87.73% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (5aac164) compared to base (517a67f).
Patch coverage: 92.45% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #708      +/-   ##
==========================================
+ Coverage   87.72%   87.73%   +0.01%     
==========================================
  Files         373      373              
  Lines       24042    24078      +36     
  Branches       44       44              
==========================================
+ Hits        21091    21125      +34     
- Misses       2925     2927       +2     
  Partials       26       26              
| Impacted Files | Coverage Δ |
|---|---|
| reagent/training/cb/base_trainer.py | 83.33% <60.00%> (-1.12%) ⬇️ |
| reagent/evaluation/cb/base_evaluator.py | 92.98% <93.75%> (-0.50%) ⬇️ |
| reagent/evaluation/cb/policy_evaluator.py | 97.22% <95.00%> (+1.38%) ⬆️ |
| reagent/test/evaluation/cb/test_integration.py | 100.00% <100.00%> (ø) |
| ...eagent/test/evaluation/cb/test_policy_evaluator.py | 97.61% <100.00%> (+0.32%) ⬆️ |
| reagent/gym/tests/test_gym.py | 95.93% <0.00%> (-0.82%) ⬇️ |
| reagent/core/utils.py | 87.23% <0.00%> (+2.12%) ⬆️ |



@facebook-github-bot

This pull request has been merged in 89519d7.

xuruiyang pushed a commit that referenced this pull request Sep 20, 2025
Summary:
Pull Request resolved: #708 (same summary as the PR description above)

Differential Revision: D42407669

fbshipit-source-id: ce436b42b1bb01f3688c6f1f80c52a3d66a47b22