Tkurth/rl tests #15
Conversation
… internally from the RL system and the user does not have to bother with it
romerojosh left a comment
Added mostly nitpick comments here. Otherwise changes LGTM. Thanks for adding some tests!
…, enabled building examples by default
I have addressed all of your comments and agree with all of them. Thanks for the careful review.
romerojosh left a comment
LGTM!
This PR fixes a bug in SAC where rho was read from the file and used as the entropy coefficient instead of alpha. This made training unstable, since alpha should be around 0.1 while rho is close to 1.0. In addition to this fix, this PR adds a trainable entropy coefficient, as it can improve training stability.
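For reference, a trainable entropy coefficient in SAC is usually implemented by optimizing log(alpha) against a target entropy. The following is a minimal PyTorch sketch of that idea, not the code added in this PR; the names `log_alpha`, `target_entropy`, and `update_alpha` are hypothetical.

```python
import torch

# Hedged sketch: trainable entropy coefficient (alpha) for SAC.
# Common heuristic target entropy: -|action_dim|.
action_dim = 4
target_entropy = -float(action_dim)

# Optimize log(alpha) so that alpha = exp(log_alpha) stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob: torch.Tensor) -> torch.Tensor:
    """One gradient step on alpha given log-probabilities of sampled actions."""
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    # The detached alpha is then used to weight the entropy term in the
    # actor and critic losses.
    return log_alpha.exp().detach()

# Example usage with placeholder log-probs from the policy:
alpha = update_alpha(torch.randn(256))
```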
Furthermore, this PR adds a test suite covering many parts of the RL stack, plus a full training test for each algorithm (DDPG, TD3, SAC, PPO) on simple test environments. These tests take some time to run, but they are helpful for verifying that the environment works. The only test currently failing is the state-dependent action reward test for DDPG, and only for DDPG. This may be related to the fact that DDPG sometimes overfits rapidly on simple environments, so I would not conclude that the DDPG implementation itself is wrong. In any case, since TD3 is essentially always superior to DDPG and appears to work in all cases, I recommend using that instead.
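To illustrate the kind of sanity check described above, here is a hypothetical sketch of a "state-dependent action reward" toy environment; the actual test environments in this PR may differ in dimensions, reward shaping, and interface.

```python
import numpy as np

class StateDependentActionEnv:
    """Toy environment: reward is highest when the action matches a simple
    function of the current state (here, the negated state)."""

    def __init__(self, dim: int = 2, seed: int = 0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self) -> np.ndarray:
        self.state = self.rng.uniform(-1.0, 1.0, size=self.dim)
        return self.state

    def step(self, action: np.ndarray):
        # Optimal action is the negated state; reward penalizes the deviation.
        target = -self.state
        reward = -float(np.sum((action - target) ** 2))
        self.state = self.rng.uniform(-1.0, 1.0, size=self.dim)
        return self.state, reward, False, {}

# A converged agent should drive the average reward close to zero:
env = StateDependentActionEnv()
obs = env.reset()
obs, rew, done, info = env.step(-obs)  # the optimal action yields reward 0
```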
In order for the tests to work, I included a gtest dependency and also added some simple RL-related models as base models to TorchFort. These include a SAC policy and an actor-critic model in which the actor and critic share the same feature extractor.
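As a rough illustration of the shared-feature-extractor layout mentioned above, here is a minimal PyTorch sketch; layer sizes, activations, and names are assumptions rather than the definitions added to TorchFort in this PR.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor-critic module where both heads read from one shared trunk."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Shared feature extractor used by both the actor and the critic.
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, action_dim)  # policy head
        self.critic = nn.Linear(hidden, 1)          # value head

    def forward(self, state: torch.Tensor):
        h = self.features(state)
        return torch.tanh(self.actor(h)), self.critic(h)

# Example usage:
model = SharedActorCritic(state_dim=8, action_dim=2)
actions, values = model(torch.randn(16, 8))
```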