TFMC updates + unbinned v6 fits w/ backgrounds #122
Conversation
---
Main items still to be implemented:
---
Pull request overview
This PR updates the TFMC training workflow to support the same UID-based train/validation split used for PNNs (needed for upcoming unbinned v6 fits with backgrounds), adds explicit validation-loss computation/plot outputs, and wires a TFMC classifier job into the 2016 unbinned v6-rate config. It also extends the YAML defaults logic so TFMC “classifier” jobs inherit the global splitting defaults.
Changes:
- Added a TFMC validation-loss path (`compute_loss`) and updated TFMC training to produce train/val losses and separate train/val convergence plots.
- Implemented UID-based train/validation splitting in `tfmc_training.py`, plus updated IC weight accumulation to run on the training partition.
- Updated the unbinned v6-rate 2016 YAML to use a TFMC classifier with backgrounds and added a corresponding TFMC classifier job; updated YAML loader defaults to apply splitting to TFMC classifier jobs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| ML/TFMC/TFMC.py | Adds a no-update loss computation method for validation/test. |
| ML/TFMC/tfmc_training.py | Introduces UID train/val split handling, val loss logging/plotting, and adjusts IC accumulation strategy. |
| configs/unbinned_v6_rate/unbinned_2016_rate.yaml | Switches likelihood classifier to TFMC with backgrounds and adds a TFMC classifier job definition. |
| common/yaml_loader.py | Applies defaults.splitting to TFMC classifier jobs (type: classifier, framework: tfmc). |
Comments suppressed due to low confidence (1)
ML/TFMC/tfmc_training.py:100
With `--small`, the output path is modified twice: `_small` is appended to `cfg_base` when constructing `model_dir`/`plot_dir`, and then `_small` is appended again to the final directory name. This creates a different directory layout than other trainers (and than TFMC previously), and may break resume/lookup logic. Consider using a single consistent suffix strategy (either in `cfg_base` or at the end, but not both).
```python
cfg_base = os.path.join(cfg.get("version", "default"), J['region'])
model_dir = os.path.join(user.model_directory, cfg_base + ("_small" if args.small else ""), "TFMC", J["id"])
plot_dir  = os.path.join(user.plot_directory,  cfg_base + ("_small" if args.small else ""), "TFMC", J["id"])
from common.helpers import copyIndexPHP
copyIndexPHP(plot_dir)
if args.small:
    model_dir += "_small"  # "_small" is appended a second time here
    plot_dir  += "_small"
os.makedirs(model_dir, exist_ok=True)
```
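One way to resolve this (a sketch of the reviewer's suggestion, not a committed fix) is to apply the suffix exactly once, in `cfg_base`:

```python
cfg_base = os.path.join(cfg.get("version", "default"), J["region"])
if args.small:
    cfg_base += "_small"   # the only place the suffix is applied

model_dir = os.path.join(user.model_directory, cfg_base, "TFMC", J["id"])
plot_dir  = os.path.join(user.plot_directory,  cfg_base, "TFMC", J["id"])

from common.helpers import copyIndexPHP
copyIndexPHP(plot_dir)

os.makedirs(model_dir, exist_ok=True)
```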
From `common/yaml_loader.py`:

```diff
 if jtyp not in {"scaler", "pnn", "bit", "classifier"}:
     continue
 if jtyp == "classifier" and j.get("framework") != "tfmc":
     continue

 # splitting default (only pnn for now; keep bit/tfmc as comments)
 # if jtyp in {"pnn", "bit", "tfmc"} and default_splitting is not None:
-if jtyp in {"pnn", "bit", "dnn_c2st"} and default_splitting is not None:
+if jtyp in {"pnn", "bit", "dnn_c2st", "classifier"} and default_splitting is not None:
     if "splitting" not in j:
         j["splitting"] = default_splitting
```
---
Implemented a flag for the train-val splitting in `iterate_epoch`, so that the cross-section calculation (plus the yield check) is still done with all the events, as previously, while keeping the train-val split in the training.
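A minimal sketch of what such a flag could look like (the function signature, batch layout, and per-event split variable `u` are assumptions, not the actual TFMC code):

```python
import numpy as np

def iterate_epoch(batches, train_fraction=0.8, split=True, partition="train"):
    """Yield (features, weights) per batch.

    split=False reproduces the original behaviour, using every event,
    e.g. for the cross-section calculation and the yield check.
    split=True restricts each batch to the train or validation partition,
    selected here by an assumed per-event uniform value u in [0, 1).
    """
    for batch in batches:
        if not split:
            yield batch["x"], batch["w"]
            continue
        in_train = batch["u"] < train_fraction
        sel = in_train if partition == "train" else ~in_train
        yield batch["x"][sel], batch["w"][sel]
```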
---
Given the issues seen with convergence in the presence of a large class imbalance (both in terms of weighted and unweighted numbers of events), I am implementing a new feature: give the inclusive cross-section ratios as prior probabilities (via the softmax logits) and have the network learn around that. The first core implementation is in, with hardcoded event-number ratios; the next step is to connect it with the ratio calculation done to get the class weights.
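The usual way to do this (a generic sketch, not the TFMC implementation; the `priors` values are placeholders) is to initialize the bias of the final logits layer to the log of the class priors:

```python
import numpy as np
import tensorflow as tf

priors = np.array([0.90, 0.07, 0.03])   # placeholder inclusive class fractions

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(len(priors)),  # logits layer, softmax applied in the loss
])

# softmax(log(p) + c) = p for any constant c, so with the logit biases set to
# log(priors) the untrained network already predicts the inclusive ratios.
model.layers[-1].bias.assign(np.log(priors).astype("float32"))
```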
---
The implementation of setting the logit priors to the inclusive XS ratio is postponed, as we are now experimenting with simpler ways to get stable trainings. To avoid noise in the code, I removed that in commit 2060baa. We can return to it if we have exhausted all other options.
---
The main new feature implemented in the TFMC training is early stopping. @Dorhand can you have a look?
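For reference, roughly the standard patience-based scheme (a sketch, not the exact TFMC code; the callables are placeholders, and `compute_val_loss` stands in for the no-update `compute_loss` path added in this PR):

```python
def train_with_early_stopping(model, train_one_epoch, compute_val_loss,
                              save_checkpoint, max_epochs=500, patience=10):
    """Patience-based early stopping on the validation loss.

    Returns (best_epoch, last_epoch) so both can be stored for later resuming.
    """
    best_val, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = compute_val_loss(model)        # loss without weight updates

        if val_loss < best_val:
            best_val, best_epoch, wait = val_loss, epoch, 0
            save_checkpoint(model, tag="best")    # keep the best epoch
        else:
            wait += 1
            if wait >= patience:
                break                             # no improvement for `patience` epochs
        save_checkpoint(model, tag="last")        # keep the last epoch for resuming
    return best_epoch, epoch
```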
---
Additionally, I'm also storing the last- and best-epoch information. This will allow us to restart trainings from the last epoch in cases where we e.g. ran into OOM or the max walltime. I also implemented a feature, in both the TFMC and PNN training, to avoid restarting trainings which have properly finished. PS: I also implemented access to the RDataLoader `n_split` option in TFMC training.
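One simple way to implement the "don't restart finished trainings" guard (a sketch; the `status.json` marker-file convention is an assumption, not necessarily what the PR does):

```python
import json
import os

def training_state(model_dir):
    """Return 'finished', 'resumable', or 'new' for a model directory."""
    status_file = os.path.join(model_dir, "status.json")
    if not os.path.exists(status_file):
        return "new"
    with open(status_file) as f:
        status = json.load(f)
    # 'finished' is written only after the last epoch (or early stop) completed
    # cleanly; otherwise we can resume from the stored last epoch after
    # e.g. OOM or hitting the max walltime.
    return "finished" if status.get("finished") else "resumable"
```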
In view of the upcoming v6 fits with backgrounds, I updated the TFMC code to include the train and validation split used for the PNNs. I mostly took the code from `pnn_training.py` and adapted it to the TFMC specifics.
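A deterministic UID-based split could look like this (a sketch of the idea; the actual PNN code may differ):

```python
import numpy as np

def uid_train_mask(uid, train_fraction=0.8):
    """Deterministic per-event train/val assignment from the event UID.

    Hashing the UID (rather than shuffling indices) makes the split
    reproducible across jobs and independent of the event ordering.
    """
    uid = np.asarray(uid, dtype=np.uint64)
    # cheap multiplicative integer hash -> uniform value in [0, 1)
    u = ((uid * np.uint64(2654435761)) % np.uint64(2**32)) / 2.0**32
    return u < train_fraction
```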
The original TFMC code builds the training sample by combining the events from each class into a single sample (keeping the class labels) and then shuffling them. To do something similar while taking the train-validation split into account, I take the training fraction of events from each class sample, combine them into a single training sample, and then shuffle it. I do the same for the validation sample. My goal was to keep the same relative fraction of events from each class in both samples.
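In code, the per-class split followed by combine-and-shuffle could look like this (a numpy sketch reusing the `uid_train_mask` helper above, with array indices as stand-in UIDs; the real code works on the RDataLoader batches):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_partitions(class_samples, train_fraction=0.8):
    """class_samples: list of (features, weights) pairs, one per class.

    The training fraction is taken from each class first, so train and
    validation keep the same relative class composition; each combined
    sample is then shuffled, as in the original single-sample approach.
    """
    train_parts, val_parts = [], []
    for label, (x, w) in enumerate(class_samples):
        mask = uid_train_mask(np.arange(len(x)), train_fraction)  # stand-in UIDs
        y = np.full(len(x), label)
        train_parts.append((x[mask], y[mask], w[mask]))
        val_parts.append((x[~mask], y[~mask], w[~mask]))

    def combine(parts):
        x, y, w = (np.concatenate(a) for a in zip(*parts))
        idx = rng.permutation(len(x))
        return x[idx], y[idx], w[idx]

    return combine(train_parts), combine(val_parts)
```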
To get the inclusive cross-sections, which we need to convert classifier output probabilities into differential cross-section ratios, the original TFMC code iterated through the entire sample and accumulated the weights for each class. Now that we are no longer using the entire sample, I modified the code to accumulate the weights over the training fraction of events. This will not give us the correct absolute values, but it should keep the relative cross-section differences correct, which in my mind should lead to the correct DCR. I will double-check this with pen and paper, and if I find something I'll change it. One way to get closer to the correct values would be to divide the obtained values by the training-partition fraction; this relies on the splitting and shuffling being random enough that a fraction x of the sample has a sum of weights ≈ x times the full cross-section.

Update: changed the code to accumulate the cross-sections from the entire sample, but to only use the training fraction for training. WIP, so keeping this as a draft for now.
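The rescaling argument in code form (a sketch; `f_train` is the training-partition fraction, and the function name is illustrative):

```python
import numpy as np

def inclusive_xs_sums(weights_per_class, f_train=0.8, from_train_only=True):
    """Per-class sums of weights, optionally corrected for the train/val split.

    If the sums were accumulated over the training partition only, dividing by
    f_train approximately recovers the full-sample values, assuming the split
    is random enough that a fraction x of the events carries ~x of the weight.
    """
    sums = np.array([np.sum(w) for w in weights_per_class])
    return sums / f_train if from_train_only else sums
```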