TFMC updates + unbinned v6 fits w/ backgrounds #122

Open

rbarrue wants to merge 26 commits into sbi-pdf-CMS from dev-rbarrue_unbinned_bkgs

Conversation

rbarrue commented May 5, 2026

In view of the upcoming v6 fits with backgrounds, I updated the TFMC code to include the same train/validation split used for the PNNs. I mostly took the code from pnn_training.py and adapted it to the TFMC specifics.

The original TFMC code builds the training sample by combining the events from each class into a single sample (keeping class labels) and then shuffling them. To do something similar while taking the train/validation split into account, I take the training fraction of events from each class sample, combine them into a single training sample, and then shuffle it; I do the same for the validation sample. The goal is to keep the same relative fraction of events from each class in both partitions.
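The per-class split-then-combine scheme could be sketched as follows (a minimal illustration with hypothetical names and signatures, not the PR's actual code):

```python
# Hypothetical sketch: split each class sample into train/validation
# before combining, so both partitions keep the per-class proportions.
import numpy as np

def split_classes(class_samples, train_fraction=0.8, seed=0):
    """class_samples: list of (features, labels) arrays, one per class."""
    rng = np.random.default_rng(seed)
    train_parts, val_parts = [], []
    for x, y in class_samples:
        n_train = int(round(train_fraction * len(x)))
        idx = rng.permutation(len(x))          # shuffle within the class
        tr, va = idx[:n_train], idx[n_train:]  # per-class partition
        train_parts.append((x[tr], y[tr]))
        val_parts.append((x[va], y[va]))

    def combine(parts):
        x = np.concatenate([p[0] for p in parts])
        y = np.concatenate([p[1] for p in parts])
        order = rng.permutation(len(x))        # shuffle the combined sample
        return x[order], y[order]

    return combine(train_parts), combine(val_parts)
```

Because the partitioning happens per class before combining, each class contributes the same fraction of its events to both the training and validation samples.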

To get the inclusive cross-sections, which we need to convert classifier output probabilities into differential cross-section ratios, the original TFMC code iterated through the entire sample and accumulated the weights for each class. Since we are no longer using the entire sample, I initially modified the code to accumulate the weights over the training fraction only. That does not give the correct absolute values, but it should preserve the relative cross-section differences, which in my mind should still lead to the correct DCR; I will double-check this with pen and paper and fix it if I find something. One way to get closer to the correct values would be to divide the accumulated sums by the training fraction, which relies on the splitting and shuffling being random enough that a fraction x of the sample has a sum of weights of roughly x times the full cross-section. In the end, I changed the code to accumulate the cross-sections from the entire sample while only using the training fraction for training.
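The accumulation options discussed above can be illustrated with a toy example (function names are hypothetical, not from the PR):

```python
# Toy illustration: per-class inclusive cross-sections as sums of event
# weights. Summing over only a random training fraction f underestimates
# each class sum by roughly f, so dividing by f approximates the full-sample
# value; summing over the entire sample avoids the approximation entirely.
import numpy as np

def inclusive_xsec(weights, labels, n_classes):
    """Sum of event weights per class, over whatever sample is passed in."""
    weights, labels = np.asarray(weights), np.asarray(labels)
    return np.array([weights[labels == c].sum() for c in range(n_classes)])

def approx_xsec_from_partition(weights, labels, n_classes, train_fraction):
    """Estimate the full-sample sums from the training partition alone."""
    return inclusive_xsec(weights, labels, n_classes) / train_fraction
```

Note that the ratio between class sums is unchanged by the common 1/f factor, which is why the relative cross-section differences survive even without the correction.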

WIP so keeping as draft for now.

rbarrue (author) commented May 5, 2026

Main items to still be implemented:

  • implement access to the last and best epochs (via text files)
  • implement early stopping (will require changing yaml_loader._apply_defaults_and_checks so that the default can be used)

Copilot AI left a comment

Pull request overview

This PR updates the TFMC training workflow to support the same UID-based train/validation split used for PNNs (needed for upcoming unbinned v6 fits with backgrounds), adds explicit validation-loss computation/plot outputs, and wires a TFMC classifier job into the 2016 unbinned v6-rate config. It also extends the YAML defaults logic so TFMC “classifier” jobs inherit the global splitting defaults.

Changes:

  • Added a TFMC validation-loss path (compute_loss) and updated TFMC training to produce train/val losses and separate train/val convergence plots.
  • Implemented UID-based train/validation splitting in tfmc_training.py, plus updated IC weight accumulation to run on the training partition.
  • Updated the unbinned v6-rate 2016 YAML to use a TFMC classifier with backgrounds and added a corresponding TFMC classifier job; updated YAML loader defaults to apply splitting to TFMC classifier jobs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File descriptions:

  • ML/TFMC/TFMC.py: Adds a no-update loss computation method for validation/test.
  • ML/TFMC/tfmc_training.py: Introduces UID train/val split handling, val loss logging/plotting, and adjusts the IC accumulation strategy.
  • configs/unbinned_v6_rate/unbinned_2016_rate.yaml: Switches the likelihood classifier to TFMC with backgrounds and adds a TFMC classifier job definition.
  • common/yaml_loader.py: Applies defaults.splitting to TFMC classifier jobs (type: classifier, framework: tfmc).
Comments suppressed due to low confidence (1)

ML/TFMC/tfmc_training.py:100

  • With --small, the output path is modified twice: _small is appended to cfg_base when constructing model_dir/plot_dir, and then _small is appended again to the final directory name. This creates a different directory layout than other trainers (and than TFMC previously), and may break resume/lookup logic. Consider using a single consistent suffix strategy (either in cfg_base or at the end, but not both).
```python
cfg_base = os.path.join( cfg.get("version", "default"), J['region'] )

model_dir = os.path.join(user.model_directory, cfg_base+("_small" if args.small else ""), "TFMC", J["id"])
plot_dir  = os.path.join(user.plot_directory,  cfg_base+("_small" if args.small else ""), "TFMC", J["id"])

from common.helpers import copyIndexPHP
copyIndexPHP( plot_dir )
if args.small:
    model_dir += "_small"
    plot_dir  += "_small"
os.makedirs(model_dir, exist_ok=True)
```
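One possible consolidation along the lines the comment suggests, applying the suffix once in cfg_base so both directories share one consistent layout (a sketch with a hypothetical helper, not the PR's code):

```python
# Sketch of a single-suffix strategy for the "_small" variant flagged
# above: the suffix is applied exactly once, in cfg_base, so model_dir
# and plot_dir cannot end up with it twice.
import os

def build_dirs(model_root, plot_root, version, region, job_id, small=False):
    cfg_base = os.path.join(version, region) + ("_small" if small else "")
    model_dir = os.path.join(model_root, cfg_base, "TFMC", job_id)
    plot_dir  = os.path.join(plot_root,  cfg_base, "TFMC", job_id)
    return model_dir, plot_dir
```

Keeping the suffix in one place also makes any resume/lookup logic that reconstructs these paths trivially consistent with the trainer.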


Comment threads: ML/TFMC/TFMC.py (1), ML/TFMC/tfmc_training.py (6, of which 4 outdated), common/yaml_loader.py (1)
Comment on lines 250 to 259:

```diff
 if jtyp not in {"scaler", "pnn", "bit", "classifier"}:
     continue
 if jtyp == "classifier" and j.get("framework") != "tfmc":
     continue

 # splitting default (only pnn for now; keep bit/tfmc as comments)
 # if jtyp in {"pnn", "bit", "tfmc"} and default_splitting is not None:
-if jtyp in {"pnn", "bit", "dnn_c2st"} and default_splitting is not None:
+if jtyp in {"pnn", "bit", "dnn_c2st", "classifier"} and default_splitting is not None:
     if "splitting" not in j:
         j["splitting"] = default_splitting
```
rbarrue (author) commented May 6, 2026

Implemented a flag in iterate_epoch to control the train/val splitting, so that the cross-section calculation (and yield check) can still be done with all the events, as previously, while keeping the train/val split for the training itself.
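A minimal sketch of such a flag, assuming a UID-based split as described in the review summary (names and signatures are hypothetical):

```python
# Hypothetical iterate_epoch with a partition flag: "all" yields every
# event (for cross-section / yield accumulation), while "train" and "val"
# yield only the corresponding side of a UID-based split.
def iterate_epoch(events, train_uids, partition="all"):
    """events: iterable of (uid, event); partition: 'all' | 'train' | 'val'."""
    for uid, event in events:
        if partition == "all":
            yield event
        elif partition == "train" and uid in train_uids:
            yield event
        elif partition == "val" and uid not in train_uids:
            yield event
```

With this shape, the cross-section pass runs with partition="all" and the training loop with partition="train", so both see a consistent view of the same sample.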

rbarrue (author) commented May 7, 2026

Given the convergence issues seen in the presence of a large class imbalance (both in terms of weighted and unweighted event counts), I am implementing a new feature: give the inclusive cross-section ratios to the network as prior probabilities (via the softmax logits) and have it learn around that.

A first core implementation is in, with hardcoded event-number ratios; the next step is to connect it to the ratio calculation used to get the class weights.
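The idea can be illustrated generically: initializing the final-layer bias to the log of the class priors makes the softmax output reproduce the inclusive ratios at initialization, so the network only has to learn deviations from them (a sketch of the concept, not the PR's implementation):

```python
# Generic illustration: a logit bias of log(p) makes softmax(bias) == p,
# so the class priors (e.g. inclusive cross-section ratios) are the
# network's starting prediction.
import numpy as np

def prior_logit_bias(class_xsecs):
    p = np.asarray(class_xsecs, dtype=float)
    p = p / p.sum()          # normalize cross-sections to prior probabilities
    return np.log(p)         # softmax(log p) recovers p exactly

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()
```

This is a common trick for imbalanced classification: with a zero-initialized bias the initial loss is dominated by the network "discovering" the class rates, whereas a prior-matched bias removes that transient.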

rbarrue (author) commented May 13, 2026

The implementation of setting the logit priors to the inclusive XS ratios is postponed, as we are now experimenting with simpler ways to get stable trainings. To avoid noise in the code, I removed it in commit 2060baa.

We can return to that if we have exhausted all other options.

rbarrue (author) commented May 13, 2026

The main new feature implemented in the TFMC training is early stopping. @Dorhand, can you have a look?
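For reference, a generic patience-based early-stopping criterion looks roughly like this (a self-contained sketch; the actual implementation lives in tfmc_training.py and may differ):

```python
# Minimal patience-based early stopping: stop when the validation loss
# has not improved by at least min_delta for `patience` consecutive epochs,
# while remembering the best epoch for checkpoint selection.
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.best_epoch, self.bad_epochs = float("inf"), -1, 0

    def update(self, epoch, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Tracking best_epoch alongside the stop decision is what lets the trainer both halt early and reload the best checkpoint afterwards.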

rbarrue (author) commented May 13, 2026

Additionally, I'm storing the last- and best-epoch information. This will allow us to restart trainings from the last epoch in cases where we e.g. ran into OOM or the maximum walltime.

I also implemented a feature in both the TFMC and PNN trainings to avoid restarting trainings that have finished properly, without needing the --overwrite flag.

PS: I also implemented access to the RDataLoader n_split setting in the TFMC training.
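A minimal sketch of the epoch bookkeeping described above, with hypothetical file names and helpers (the PR stores this information via text files; the exact layout may differ):

```python
# Hypothetical bookkeeping: persist last/best epoch to small text files so
# a killed training (OOM, walltime) can resume from the last epoch, and a
# properly finished training can be skipped without --overwrite.
import os

def save_epoch_info(model_dir, last_epoch, best_epoch):
    with open(os.path.join(model_dir, "last_epoch.txt"), "w") as f:
        f.write(str(last_epoch))
    with open(os.path.join(model_dir, "best_epoch.txt"), "w") as f:
        f.write(str(best_epoch))

def training_finished(model_dir, max_epochs):
    """True if a stored last epoch shows the training ran to completion."""
    path = os.path.join(model_dir, "last_epoch.txt")
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return int(f.read()) >= max_epochs - 1
```

The trainer would call training_finished at startup and skip the job when it returns True, unless an explicit overwrite is requested.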

@rbarrue rbarrue marked this pull request as ready for review May 13, 2026 14:23
@rbarrue rbarrue requested review from Dorhand and Copilot May 13, 2026 14:26
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Comment threads: ML/TFMC/TFMC.py (2, both outdated), ML/TFMC/tfmc_training.py (5, of which 3 outdated)