Conversation
…opt, fishr and erm
…rain epoch. Reverted prior backpack changes
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@           Coverage Diff            @@
##  mhof_dev_merge    #843      +/-  ##
========================================
+ Coverage    90.77%  90.90%   +0.12%
========================================
  Files          137     137
  Lines         5853    5858       +5
========================================
+ Hits          5313    5325      +12
+ Misses         540     533       -7
Flags with carried-forward coverage won't be shown.
smilesun
left a comment
I think we need a comment on "flag_info" so that the code reader/reviewer knows what this variable does in general.
Has this yaml file been tested?
Yes, it was a separate issue, 831, which is also linked in this PR.
I tested it, and it resulted in an error. Will paste it below.
Looks like nothing big: zdata does not have pacs yet.
domainlab/zdata/pacs/PACS/art_painting
Now I get:
OutOfMemoryError in file
/ictstr01/home/aih/xudong.sun/domainlab_master/domainlab/exp_protocol/benchmark.smk, line 154:
zoutput/slurm_logs/run_experiment/run_experiment-index=14-21649209.err-251-CUDA out of memory. Tried to allocate 1.98 GiB. GPU 0 has a total capacty of 19.50 GiB of which 221.88 MiB is free. Including
non-PyTorch memory, this process has 19.24 GiB memory in use. Process 1322808 has 19.24 GiB memory in use. Of the allocated memory 18.93 GiB is allocated by PyTorch, and 71.99 MiB is rese
Is it because some GPUs have larger memory, so your run went through? @MatteoWohlrapp
Does it say which of the two experiments in the yaml it was? We could try a different dataset to see if it works then. I do remember that it ran on the cluster.
You introduced 'flag_info' in your mhof_dev branch. Can you give a brief explanation? I don't think I fully understand the naming.

I added it because otherwise training was not possible. It is set to self.flag_setpoint_updated in train_fbopt_b.py.
Added functionality to use ERM with the hyperparameter scheduling. As an alternative to adding the hyper init and hyper update methods to ERM, we could also add them to the a_model superclass, or check whether the methods exist before invoking them in the scheduler.
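The last option (checking whether a model provides the hooks before the scheduler invokes them) can be sketched with plain duck typing. All class and function names below are hypothetical illustrations, not domainlab's actual API:

```python
# Hypothetical sketch: the scheduler probes for optional hyperparameter
# hooks instead of requiring every model (e.g. plain ERM) to implement them.
# `ErmModel`, `TradeoffModel`, and `schedule_step` are illustrative names.


class ErmModel:
    """Plain ERM model: no hyperparameter-scheduling hooks."""


class TradeoffModel:
    """Model exposing the optional scheduling hooks."""

    def __init__(self):
        self.mu = 0.0

    def hyper_init(self, value):
        self.mu = value

    def hyper_update(self, epoch):
        # toy schedule: ramp the trade-off weight up toward 1.0
        self.mu = min(1.0, 0.1 * epoch)


def schedule_step(model, epoch):
    # Invoke the hook only if the model actually provides it, so ERM can
    # pass through the same training loop unchanged.
    if hasattr(model, "hyper_update"):
        model.hyper_update(epoch)


erm = ErmModel()
schedule_step(erm, 3)  # no-op for ERM

model = TradeoffModel()
model.hyper_init(0.0)
schedule_step(model, 3)  # model.mu is now 0.1 * 3
```

Compared with adding no-op hooks to the a_model superclass, this keeps ERM untouched but puts the burden on the scheduler; the superclass variant keeps the scheduler simple at the cost of widening the base-class interface.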