Skip to content

Pseudodata in diagonal basis and saving rotation#2455

Open
jacoterh wants to merge 22 commits into
masterfrom
diag_covmat_reproducibility
Open

Pseudodata in diagonal basis and saving rotation#2455
jacoterh wants to merge 22 commits into
masterfrom
diag_covmat_reproducibility

Conversation

@jacoterh
Copy link
Copy Markdown
Collaborator

@jacoterh jacoterh commented Apr 14, 2026

WIP

  • make sure nnfit_theory_covmat is ordered consistently with the experimental covmat before converting both to a dataframe
  • Is it save to read the fitting covmat back in during n3fit using dataset_inputs_covmat_t0_considered?

@jacoterh jacoterh marked this pull request as ready for review April 14, 2026 16:08
@jacoterh jacoterh linked an issue Apr 14, 2026 that may be closed by this pull request
@jacoterh jacoterh changed the title writing pseudodata in diagonal basis and saving eigvecs and eigvals as part of vp-setupfit Pseudodata in diagonal basis and saving rotation Apr 14, 2026
@jacoterh jacoterh marked this pull request as draft April 14, 2026 16:10
@jacoterh jacoterh force-pushed the diag_covmat_reproducibility branch 2 times, most recently from 7cae6d8 to e05e7ed Compare April 21, 2026 11:56
@scarlehoff scarlehoff marked this pull request as ready for review April 21, 2026 12:04
@scarlehoff scarlehoff requested a review from achiefa April 21, 2026 12:04
Copy link
Copy Markdown
Contributor

@achiefa achiefa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a closer look tomorrow morning. For the moment, this is the only thing that confuses me. But I'm surely missing something.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@achiefa
Copy link
Copy Markdown
Contributor

achiefa commented Apr 24, 2026

So just to make the point here and remind the myself of the future. @jacoterh has successfully implemented the possiblity of storing the eigenvectors and relative eigenvalues in vp-setupfit for reproducibility. This required us to default use_t0 to True in Validphys, in a similar way to what is currently done in n3fit. If I'm understanding it correctly, an independent diagonalisation takes place also in n3fit. It would be great to use the stored values from vp-setupfit instead of recomputing them in n3fit (which, anyway, caches them if run on gpus), but I'd say this is not a priority for the moment.

There is one open problem with the current branch, namely that vp-setupfit sets use_thcovmat_in_fitting to False by default. Hence the diagonalisation in vp-setupfit doesn't take into account the theory covmat, unlike the n3fit case. So I was just wondering whether there's a reason why use_thcovmat_in_fitting is defaulted to False in first place. Do we break something if we set it to True? @scarlehoff

Am I missing something @jacoterh ?

@scarlehoff
Copy link
Copy Markdown
Member

I guess by validphys you mean vp-setupfit. For use_thcovmat_in_fitting, since we are always using it we might as well set it true everywhere.

It would be great to use the stored values from vp-setupfit instead of recomputing them in n3fit (which, anyway, caches them if run on gpus),

Indeed, I think that, given that now vp-setupfit stores the diagonalization, it can be treated as the theory covmat and n3fit should read it instead of recalculating it.

@achiefa
Copy link
Copy Markdown
Contributor

achiefa commented Apr 27, 2026

Indeed, I think that, given that now vp-setupfit stores the diagonalization, it can be treated as the theory covmat and n3fit should read it instead of recalculating it.

So do we want to implement this now, or shall we keep it for a new PR?

@jacoterh
Copy link
Copy Markdown
Collaborator Author

Yes, let's do this here - I was already working on it on Friday but was encountering some issues. I continue with it this afternoon so we can hopefully merge this PR by tomorrow at the CM.

Comment thread validphys2/src/validphys/covmats.py Outdated
@@ -484,6 +484,7 @@ def dataset_inputs_t0_total_covmat(dataset_inputs_t0_exp_covmat, loaded_theory_c
"""
covmat = dataset_inputs_t0_exp_covmat
covmat += loaded_theory_covmat
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scarlehoff Is there a way to have vp-setupfit write the theory covmat csv files to the table directory first before attempting to load them? We need the theory covmat already at the stage of the diagonalisation in vp-setupfit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to use whatever is in memory. But I remember we explicitly decide to forbid that a long time ago and I don't remember right now how easy / possible it is to revert that.

If you didn't already, try simply swapping the order of the theory covmat and the diagonalization in the vp_setupfit script. I don't remember when are they actually written down, but if it is upon creation that should be enough.

Comment thread validphys2/src/validphys/config.py
Comment thread n3fit/src/n3fit/scripts/vp_setupfit.py Outdated
'datacuts::theory::theorycovmatconfig nnfit_theory_covmat'
)

SETUPFIT_FIXED_CONFIG['actions_'] += [rotation_action]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scarlehoff I changed the order of the actions in the hope the theory covmat csv would get written first, but again the same FileNotFoundError occurs. Maybe reportengine first resolves all dependencies regardless of their order? Not sure.

Copy link
Copy Markdown
Member

@scarlehoff scarlehoff Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is very likely.

One option is to make it so the diagonalization can only be run from vp-setupfit and so it doesn't depend on the covmat that goes to the csv but on nnfit_theory_covmat, but then you need to also take into consideration the transformations that happen on the csv.

Another possibility is to change nnfit_theory_covmat so that, when it is coming from vp-setupfit it works normally, but when it is running from n3fit it just returns None (or perhaps lambda : None or whatever, not sure if it needs to be a function).
Then the .csv function can depend on nnfit_theory_covmat as well. When it is None it means that it is coming from n3fit (or from whatever that makes it return None) and needs to read the data from the .csv, otherwise use that nnfit_theory_covmat that just arrived.

As I said, we decided back in the day not to have the option to run without vp_setupfit first so I'm not sure this second option will work ootb.

Edit: can you stack two produce rules one of which is an explicit_node? I'm not sure :(
You can always mix and match these options. For instance, making sure that the order is fixed already when the covmat is written to the .csv file so that you don't need to think about that later.

Copy link
Copy Markdown
Collaborator Author

@jacoterh jacoterh Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Juan! I have gone for the for the first option you suggested, i.e. call nnfit_theory_covmat during vp-setupfit. Further checks are necessary but we have something working at the moment. For instance, I haven't paid close attention to the transformation you mentioned? Is this a reindexing of the pandas data frames when converting them to numpy arrays?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is aiming to support running/saving pseudodata in the diagonal (eigenmode) basis by persisting the diagonal-basis rotation/eigensystem (or fitting covmat) at vp-setupfit time, and then reusing it during n3fit (including saving pseudodata indexed by eigenmodes).

Changes:

  • Save fitting covmat / diagonal-basis eigensystem via a new fitting_covmat_table table action triggered from vp-setupfit.
  • Load the saved eigensystem/covmat inside fitting_data_dict, and save pseudodata in eigenmode indexing when diagonal_basis: true.
  • Minor formatting/whitespace tidy-ups in pseudodata generation and docs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
validphys2/src/validphys/pseudodata.py Reformat call site for replica generation (no functional change).
validphys2/src/validphys/n3fit_data.py Adds covmat/eigensystem persistence + diagonal-basis pseudodata saving; refactors inverse-covmat preparation.
validphys2/src/validphys/covmats.py Minor whitespace change.
validphys2/src/validphys/config.py Removes fitting-covmat selector arg; adds loader for saved fitting covmat; adjusts defaults for theory-covmat loading.
n3fit/src/n3fit/scripts/vp_setupfit.py Adds validphys.n3fit_data providers and schedules the new fitting_covmat_table action.
doc/sphinx/source/n3fit/runcard_detailed.rst Documents diagonal-basis pseudodata saving and the persisted rotation table.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
Comment thread validphys2/src/validphys/n3fit_data.py Outdated
Comment thread validphys2/src/validphys/n3fit_data.py Outdated
Comment thread validphys2/src/validphys/n3fit_data.py
Comment thread validphys2/src/validphys/config.py Outdated
Comment thread validphys2/src/validphys/config.py Outdated
Comment thread n3fit/src/n3fit/scripts/vp_setupfit.py Outdated
Comment thread doc/sphinx/source/n3fit/runcard_detailed.rst
@scarlehoff
Copy link
Copy Markdown
Member

Before trying to solve the failing test, rebase on top of master to avoid doing the work twice (there are things added there like a more strict check on the documentation)

@jacoterh jacoterh force-pushed the diag_covmat_reproducibility branch from c06505d to c95f151 Compare April 28, 2026 12:30
@jacoterh
Copy link
Copy Markdown
Collaborator Author

jacoterh commented Apr 28, 2026

@scarlehoff some worrying news perhaps. I found that the line

covmat += loaded_theory_covmat

is called twice on master. As a result the theory covmat gets added twice, which then explains why I found different eigenvalues in the presence of a theory covmat.

To reproduce this behaviour, use the runcard below and turn on the pdb debugger inside this function and you'll see that it gets called twice. Not sure why! I hope I'm wrong... maybe I'm going mad at this point

260414-jth-diagonal_basis_test.yml

@scarlehoff
Copy link
Copy Markdown
Member

scarlehoff commented Apr 28, 2026

What do you mean by twice? It doesn't (shouldn't) matter how many times you call it, it is producing the sum of the experimental and theory covmat.

It would be inefficient (perhaps a leftover of having two covmats in the fit, the t0 and the normal) but it should be fine.

@jacoterh
Copy link
Copy Markdown
Collaborator Author

Together with @achiefa we have checked that if you print dataset_inputs_t0_exp_covmat you will see that it changes between the two calls, and we suspect it already includes the theory covmat after the first time

@scarlehoff
Copy link
Copy Markdown
Member

That said, regardless how many times it gets called, this function should be just return dataset_inputs_t0_exp_covmat +loaded_theory_covmat

Change it to that and see whether results change. The += is asking for trouble if pandas or numpy suddenly decide that means "in place" (if this is the issue then it is horrendous but luckily we have been compatible from pandas 3 only since january I think)

@jacoterh
Copy link
Copy Markdown
Collaborator Author

Yes this is the issue indeed

@scarlehoff
Copy link
Copy Markdown
Member

Then, if you think this PR is close to finish, change it here so that it gets merged.

Otherwise, please open a new PR with this change so that we merge it to master asap (and perhaps even release 4.1.4)

@achiefa
Copy link
Copy Markdown
Contributor

achiefa commented Apr 28, 2026

So does this mean that all the fits performed since January are basically wrong?

@jacoterh
Copy link
Copy Markdown
Collaborator Author

jacoterh commented Apr 28, 2026

I'm starting to fear yes, the point is that covmat += loaded_theory_covmat modifies covmat in place (what else could it mean?), which links by reference to dataset_inputs_t0_exp_covmat so that this also changes...

Let's address this in a separate PR to fix this asap and check meanwhile which fits got affected by this

@scarlehoff
Copy link
Copy Markdown
Member

scarlehoff commented Apr 29, 2026

So does this mean that all the fits performed since January are basically wrong?

I hope it is only since January.

Otherwise since we started using t0 for both sampling and fit.*

That function has been like that since forever and as a @jacoterh said it does say explicitly "modify covmat in place", my hope is that older versions of pandas/numpy were just creating a new object (so the original covmat is safe).

*unless going through it twice only happened recently, or only for the diagonal case. This is also a possibility.

@jacoterh jacoterh force-pushed the diag_covmat_reproducibility branch from 3bb2b7f to 56d001c Compare April 30, 2026 14:52
@jacoterh jacoterh requested a review from Copilot May 1, 2026 10:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

validphys2/src/validphys/pseudodata.py:131

  • make_replica now depends on dataset_inputs_covmat_t0_considered, but the docstring still documents the old dataset_inputs_sampling_covmat argument (and mentions sampling from exp/theory covmat). Please update the docstring/parameter docs to match the new input and its semantics to avoid misleading API users.
def make_replica(
    central_values_array,
    group_replica_mcseed,
    dataset_inputs_covmat_t0_considered,
    group_multiplicative_errors=None,
    group_positivity_mask=None,
    sep_mult=False,
    genrep=True,
    max_tries=int(1e6),

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread validphys2/src/validphys/config.py Outdated
Comment thread n3fit/src/n3fit/scripts/vp_setupfit.py
Comment on lines 123 to +126
def make_replica(
central_values_array,
group_replica_mcseed,
dataset_inputs_sampling_covmat,
dataset_inputs_covmat_t0_considered,
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching make_replica's injected covmat from dataset_inputs_sampling_covmat to dataset_inputs_covmat_t0_considered appears to drop support for runcard options like use_t0_sampling and use_thcovmat_in_sampling (the sampling covmat selector was removed from config). This likely changes replica generation behavior and breaks existing tests/docs that rely on those flags; consider reintroducing a sampling-covmat provider keyed off use_t0_sampling/use_thcovmat_in_sampling, or explicitly mapping those flags onto this new input.

Copilot uses AI. Check for mistakes.
Comment thread validphys2/src/validphys/n3fit_data.py Outdated
Comment thread validphys2/src/validphys/n3fit_data.py
Comment on lines +508 to +509
# TODO: how to know for sure if the index matches the covmat value ordering?
return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataset_inputs_t0_exp_covmat now returns a pd.DataFrame, but the docstring and downstream consumers typically treat these providers as returning np.ndarray. This type change can easily break callers that expect ndarray methods (e.g. .tobytes() for hashing, or plain NumPy broadcasting). If the goal is ordering/label alignment, consider returning the reindexed .values (and optionally returning the index separately), or update the contract/docs and audit callers accordingly.

Suggested change
# TODO: how to know for sure if the index matches the covmat value ordering?
return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)
if len(procs_index_matched) != covmat.shape[0]:
raise ValueError(
"procs_index_matched length does not match the covariance matrix dimensions"
)
return covmat

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need a dataframe? Is this because nnfit_theory_covmat is a dataframe too?

RE the returned type: my understanding of this function is that it serves at an intermediate level in the computation graph. In fact, there are other functions in vp that return the covmat as an indexed dataframe. So I'm wondering why we need to return a dataframe at this stage.

@scarlehoff, what do you think?

Copy link
Copy Markdown
Collaborator Author

@jacoterh jacoterh May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly, I'd argue we need a dataframe here because it's later combined with nnfit_theory_covmat that is already a dataframe and I wanted to avoid any possible misalignment between the two, see dataset_inputs_t0_total_covmat.

@achiefa
Copy link
Copy Markdown
Contributor

achiefa commented May 11, 2026

Hi @jacoterh, what's the status of this? Can I review it?

@jacoterh
Copy link
Copy Markdown
Collaborator Author

Yes, sorry, this is ready for review. Please pay extra attention to the first point raised by copilot above. I don't think it breaks anything, but better safe than sorry

@achiefa achiefa force-pushed the diag_covmat_reproducibility branch from eba8fe9 to 3db249d Compare May 13, 2026 08:07
Copy link
Copy Markdown
Contributor

@achiefa achiefa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jacoterh for this. Overall I'd say that the PR solves the issue. However, there are some aspects that I don't completely understand and I'd like to clarify before we merge.

def make_replica(
central_values_array,
group_replica_mcseed,
dataset_inputs_sampling_covmat,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing the function dataset_inputs_sampling_covmat in favour of dataset_inputs_covmat_t0_considered? The problem I see with this is partially exposed in Copilot's comment. As far as I can tell, dataset_inputs_covmat_t0_considered can only include or not t0, but does not consider any theory covmat. If we then use dataset_inputs_covmat_t0_considered in make_replica, we're then sampling without theory covmat, even if use_thcovmat_in_sampling is set to True in the runcard (at least this is what I gather from the function). Furthermore, I'm not entirely sure that use_t0_sampling is effective as of now: it was used by dataset_inputs_sampling_covmat, while dataset_inputs_covmat_t0_considered's signature is use_t0. I'm pretty sure that use_t0 and use_t0_sampling must be decoupled.

So the code won't break if we use use_t0_sampling and use_thcovmat_in_sampling, but it doesn't lead to the expected behavour. Am I missing something?

Copy link
Copy Markdown
Collaborator Author

@jacoterh jacoterh May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this, you're right. The issue is the following. We are now storing the covmat as part of vp-setupfit and we want n3fit to load these back in to avoid doing things twice. However, make_replica generates pseudodata in the data basis regardless of whether the user requests the diagonal basis or not, which means that this function needs access to the full covmat (exp + possibly th) in the data basis. But this we don't store when the diagonal basis is requested - we store the eigenvectors and eigenvalues. We can reconstruct it from here of course, but then we're going around in circles... The whole design with storing/loading needs some more thought!

return True

@configparser.explicit_node
def produce_dataset_inputs_sampling_covmat(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the other comment/



def dataset_inputs_t0_total_covmat_separate(
dataset_inputs_t0_exp_covmat_separate, loaded_theory_covmat
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite sure that this was discussed and I forgot. But why aren't we loading the theory covmat anymore? nnfit_theory_covmat computes the theory covmat from scratch. Is this what we want? And why?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this function is now called already during vp-setupfit and there is no stored covmat available to load at that point yet!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my bad!

Comment on lines +508 to +509
# TODO: how to know for sure if the index matches the covmat value ordering?
return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need a dataframe? Is this because nnfit_theory_covmat is a dataframe too?

RE the returned type: my understanding of this function is that it serves at an intermediate level in the computation graph. In fact, there are other functions in vp that return the covmat as an indexed dataframe. So I'm wondering why we need to return a dataframe at this stage.

@scarlehoff, what do you think?

Comment on lines +334 to +335
output_path : Path
Path to output directory containing diagonal basis data if needed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output_path : Path
Path to output directory containing diagonal basis data if needed.
output_path : Path
Path to output directory containing diagonal basis data if needed.
use_thcovmat_in_fitting: bool, optional
If True, load the total covariance matrix, which includes the theory covariance matrix. If False, load only the experimental covariance matrix. Default is False.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is False the desired default option for use_thcovmat_in_fitting?

return covmat, inv_total, diag_rot, eig_vals


def _fiting_covmat(dataset_inputs_fitting_covmat, data_input, diagonal_basis=True):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _fiting_covmat(dataset_inputs_fitting_covmat, data_input, diagonal_basis=True):
def _fiting_covmat(dataset_inputs_fitting_covmat, diagonal_basis=True):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reproducibility of diagonal covmat fits (clarify or bug?)

4 participants