Pseudodata in diagonal basis and saving rotation by jacoterh · Pull Request #2455 · NNPDF/nnpdf

jacoterh · 2026-04-14T16:07:43Z

WIP

make sure nnfit_theory_covmat is ordered consistently with the experimental covmat before converting both to a dataframe
Is it safe to read the fitting covmat back in during n3fit using dataset_inputs_covmat_t0_considered?

achiefa

I'll have a closer look tomorrow morning. For the moment, this is the only thing that confuses me. But I'm surely missing something.

achiefa · 2026-04-24T14:40:16Z

So, just to make the point here and remind my future self: @jacoterh has successfully implemented the possibility of storing the eigenvectors and corresponding eigenvalues in vp-setupfit for reproducibility. This required us to default use_t0 to True in Validphys, in a similar way to what is currently done in n3fit. If I'm understanding it correctly, an independent diagonalisation also takes place in n3fit. It would be great to use the stored values from vp-setupfit instead of recomputing them in n3fit (which, anyway, caches them if run on GPUs), but I'd say this is not a priority for the moment.

There is one open problem with the current branch, namely that vp-setupfit sets use_thcovmat_in_fitting to False by default. Hence the diagonalisation in vp-setupfit doesn't take the theory covmat into account, unlike the n3fit case. So I was just wondering whether there's a reason why use_thcovmat_in_fitting is defaulted to False in the first place. Do we break something if we set it to True? @scarlehoff

Am I missing something @jacoterh ?

scarlehoff · 2026-04-24T14:44:23Z

I guess by validphys you mean vp-setupfit. For use_thcovmat_in_fitting, since we are always using it, we might as well set it to True everywhere.

It would be great to use the stored values from vp-setupfit instead of recomputing them in n3fit (which, anyway, caches them if run on gpus),

Indeed, I think that, given that now vp-setupfit stores the diagonalization, it can be treated as the theory covmat and n3fit should read it instead of recalculating it.

achiefa · 2026-04-27T09:57:13Z

Indeed, I think that, given that now vp-setupfit stores the diagonalization, it can be treated as the theory covmat and n3fit should read it instead of recalculating it.

So do we want to implement this now, or shall we keep it for a new PR?

jacoterh · 2026-04-27T10:47:28Z

Yes, let's do this here. I was already working on it on Friday but ran into some issues. I'll continue with it this afternoon so we can hopefully merge this PR by tomorrow at the CM.

jacoterh · 2026-04-27T11:22:58Z

@@ -484,6 +484,7 @@ def dataset_inputs_t0_total_covmat(dataset_inputs_t0_exp_covmat, loaded_theory_c
    """
    covmat = dataset_inputs_t0_exp_covmat
    covmat += loaded_theory_covmat


@scarlehoff Is there a way to have vp-setupfit write the theory covmat csv files to the table directory first before attempting to load them? We need the theory covmat already at the stage of the diagonalisation in vp-setupfit.

I think it would be best to use whatever is in memory. But I remember we explicitly decided to forbid that a long time ago and I don't remember right now how easy / possible it is to revert that.

If you haven't already, try simply swapping the order of the theory covmat and the diagonalization in the vp_setupfit script. I don't remember when they are actually written down, but if it is upon creation that should be enough.

jacoterh · 2026-04-27T12:11:48Z

                'datacuts::theory::theorycovmatconfig nnfit_theory_covmat'
            )

+        SETUPFIT_FIXED_CONFIG['actions_'] += [rotation_action]


@scarlehoff I changed the order of the actions in the hope the theory covmat csv would get written first, but again the same FileNotFoundError occurs. Maybe reportengine first resolves all dependencies regardless of their order? Not sure.

Yes, it is very likely.

One option is to make it so the diagonalization can only be run from vp-setupfit and so it doesn't depend on the covmat that goes to the csv but on nnfit_theory_covmat, but then you need to also take into consideration the transformations that happen on the csv.

Another possibility is to change nnfit_theory_covmat so that, when it is coming from vp-setupfit it works normally, but when it is running from n3fit it just returns None (or perhaps lambda : None or whatever, not sure if it needs to be a function).
Then the .csv function can depend on nnfit_theory_covmat as well. When it is None, it means that it is coming from n3fit (or from whatever makes it return None) and needs to read the data from the .csv; otherwise it uses the nnfit_theory_covmat that just arrived.

As I said, we decided back in the day not to have the option to run without vp_setupfit first so I'm not sure this second option will work ootb.

Edit: can you stack two produce rules one of which is an explicit_node? I'm not sure :(
You can always mix and match these options. For instance, making sure that the order is fixed already when the covmat is written to the .csv file so that you don't need to think about that later.
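The second option above (use the in-memory covmat when it is available, fall back to the .csv when it is None) can be sketched in isolation as follows. This is only an illustration under stated assumptions: `resolve_theory_covmat` is a hypothetical helper, the flat two-group index is a simplification of the real table layout, and the reportengine provider wiring (including whether two produce rules can be stacked) is not modelled at all.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def resolve_theory_covmat(in_memory_covmat, csv_path):
    """Hypothetical helper (not part of validphys).

    Return the covmat that was just computed if there is one, otherwise
    read the copy previously written to disk.
    """
    if in_memory_covmat is not None:
        return in_memory_covmat
    # Assumption: the table was written with DataFrame.to_csv and a flat index.
    return pd.read_csv(csv_path, index_col=0)


labels = ["DIS", "DY"]
computed = pd.DataFrame(
    np.array([[1.0, 0.1], [0.1, 2.0]]), index=labels, columns=labels
)

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = Path(tmpdir) / "theory_covmat.csv"

    # "vp-setupfit-like" call: the file does not exist yet, but nothing is read.
    from_memory = resolve_theory_covmat(computed, csv_path)

    # The table is written afterwards, as it would be at the end of the setup.
    computed.to_csv(csv_path)

    # "n3fit-like" call: nothing in memory, so the stored table is loaded.
    from_disk = resolve_theory_covmat(None, csv_path)
```

With this shape the diagonalisation never triggers a read of a file that has not been written yet, which is the FileNotFoundError scenario described earlier in the thread.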

Thanks Juan! I have gone for the first option you suggested, i.e. calling nnfit_theory_covmat during vp-setupfit. Further checks are necessary but we have something working at the moment. For instance, I haven't paid close attention to the transformation you mentioned. Is this a reindexing of the pandas data frames when converting them to numpy arrays?

Copilot

Pull request overview

This PR is aiming to support running/saving pseudodata in the diagonal (eigenmode) basis by persisting the diagonal-basis rotation/eigensystem (or fitting covmat) at vp-setupfit time, and then reusing it during n3fit (including saving pseudodata indexed by eigenmodes).

Changes:

Save fitting covmat / diagonal-basis eigensystem via a new fitting_covmat_table table action triggered from vp-setupfit.
Load the saved eigensystem/covmat inside fitting_data_dict, and save pseudodata in eigenmode indexing when diagonal_basis: true.
Minor formatting/whitespace tidy-ups in pseudodata generation and docs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.


File	Description
`validphys2/src/validphys/pseudodata.py`	Reformat call site for replica generation (no functional change).
`validphys2/src/validphys/n3fit_data.py`	Adds covmat/eigensystem persistence + diagonal-basis pseudodata saving; refactors inverse-covmat preparation.
`validphys2/src/validphys/covmats.py`	Minor whitespace change.
`validphys2/src/validphys/config.py`	Removes fitting-covmat selector arg; adds loader for saved fitting covmat; adjusts defaults for theory-covmat loading.
`n3fit/src/n3fit/scripts/vp_setupfit.py`	Adds `validphys.n3fit_data` providers and schedules the new `fitting_covmat_table` action.
`doc/sphinx/source/n3fit/runcard_detailed.rst`	Documents diagonal-basis pseudodata saving and the persisted rotation table.


scarlehoff · 2026-04-28T10:15:58Z

Before trying to solve the failing test, rebase on top of master to avoid doing the work twice (there are things added there, like a stricter check on the documentation).

jacoterh · 2026-04-28T17:49:18Z

@scarlehoff some worrying news perhaps. I found that the line

nnpdf/validphys2/src/validphys/covmats.py

Line 486 in 44a5ddf

covmat += loaded_theory_covmat

is called twice on master. As a result the theory covmat gets added twice, which then explains why I found different eigenvalues in the presence of a theory covmat.

To reproduce this behaviour, use the runcard below and turn on the pdb debugger inside this function and you'll see that it gets called twice. Not sure why! I hope I'm wrong... maybe I'm going mad at this point

260414-jth-diagonal_basis_test.yml

scarlehoff · 2026-04-28T18:21:52Z

What do you mean by twice? It doesn't (shouldn't) matter how many times you call it, it is producing the sum of the experimental and theory covmat.

It would be inefficient (perhaps a leftover of having two covmats in the fit, the t0 and the normal) but it should be fine.

jacoterh · 2026-04-28T18:32:38Z

Together with @achiefa we have checked that if you print dataset_inputs_t0_exp_covmat you will see that it changes between the two calls, and we suspect it already includes the theory covmat after the first time

scarlehoff · 2026-04-28T18:32:45Z

That said, regardless of how many times it gets called, this function should just return dataset_inputs_t0_exp_covmat + loaded_theory_covmat

Change it to that and see whether the results change. The += is asking for trouble if pandas or numpy suddenly decide that it means "in place" (if this is the issue then it is horrendous, but luckily we have been compatible with pandas 3 only since January, I think).

jacoterh · 2026-04-28T18:40:59Z

Yes this is the issue indeed

scarlehoff · 2026-04-28T18:48:52Z

Then, if you think this PR is close to being finished, change it here so that it gets merged.

Otherwise, please open a new PR with this change so that we merge it to master asap (and perhaps even release 4.1.4)

achiefa · 2026-04-28T20:54:48Z

So does this mean that all the fits performed since January are basically wrong?

jacoterh · 2026-04-28T21:57:33Z

I'm starting to fear so. The point is that covmat += loaded_theory_covmat modifies covmat in place (what else could it mean?), and covmat is just another reference to dataset_inputs_t0_exp_covmat, so that changes as well...

Let's address this in a separate PR to fix this asap and check meanwhile which fits got affected by this
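The aliasing problem discussed above can be reproduced in isolation. The following is a minimal sketch with plain NumPy arrays; the two functions are hypothetical stand-ins for the pattern in question, not the validphys providers themselves. Binding `covmat` to the input and then applying `+=` mutates the caller's array, so a second call adds the theory covmat a second time, whereas returning the sum leaves the input untouched.

```python
import numpy as np


def total_covmat_inplace(exp_covmat, th_covmat):
    """Hypothetical stand-in for the problematic pattern.

    `covmat` is only a second name for `exp_covmat`, so `+=` modifies the
    array owned by the caller.
    """
    covmat = exp_covmat
    covmat += th_covmat
    return covmat


def total_covmat_safe(exp_covmat, th_covmat):
    """Suggested pattern: build a new array, leave the inputs untouched."""
    return exp_covmat + th_covmat


exp = np.eye(2)
th = np.full((2, 2), 0.5)

# Two calls with the in-place version on the same input object.
exp_buggy = exp.copy()
first = total_covmat_inplace(exp_buggy, th).copy()
second = total_covmat_inplace(exp_buggy, th).copy()

# Two calls with the safe version on the same input object.
exp_safe = exp.copy()
safe_first = total_covmat_safe(exp_safe, th)
safe_second = total_covmat_safe(exp_safe, th)

print("in-place, 1st call:\n", first)    # exp + th
print("in-place, 2nd call:\n", second)   # exp + 2 * th
print("input after in-place calls:\n", exp_buggy)
print("safe calls agree:", np.allclose(safe_first, safe_second))
```

The sketch only demonstrates the NumPy behaviour; whether older pandas versions behaved differently for DataFrames is exactly the open question in this thread and is not settled by it.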

scarlehoff · 2026-04-29T05:24:00Z

So does this mean that all the fits performed since January are basically wrong?

I hope it is only since January.

Otherwise since we started using t0 for both sampling and fit.*

That function has been like that since forever and, as @jacoterh said, it does say explicitly "modify covmat in place". My hope is that older versions of pandas/numpy were just creating a new object (so the original covmat is safe).

*unless going through it twice only happened recently, or only for the diagonal case. This is also a possibility.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

validphys2/src/validphys/pseudodata.py:131

make_replica now depends on dataset_inputs_covmat_t0_considered, but the docstring still documents the old dataset_inputs_sampling_covmat argument (and mentions sampling from exp/theory covmat). Please update the docstring/parameter docs to match the new input and its semantics to avoid misleading API users.

def make_replica(
    central_values_array,
    group_replica_mcseed,
    dataset_inputs_covmat_t0_considered,
    group_multiplicative_errors=None,
    group_positivity_mask=None,
    sep_mult=False,
    genrep=True,
    max_tries=int(1e6),


Copilot · 2026-05-01T11:02:18Z

 def make_replica(
    central_values_array,
    group_replica_mcseed,
-    dataset_inputs_sampling_covmat,
+    dataset_inputs_covmat_t0_considered,


Switching make_replica's injected covmat from dataset_inputs_sampling_covmat to dataset_inputs_covmat_t0_considered appears to drop support for runcard options like use_t0_sampling and use_thcovmat_in_sampling (the sampling covmat selector was removed from config). This likely changes replica generation behavior and breaks existing tests/docs that rely on those flags; consider reintroducing a sampling-covmat provider keyed off use_t0_sampling/use_thcovmat_in_sampling, or explicitly mapping those flags onto this new input.

Copilot · 2026-05-01T11:02:18Z

+    # TODO: how to know for sure if the index matches the covmat value ordering?
+    return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)


dataset_inputs_t0_exp_covmat now returns a pd.DataFrame, but the docstring and downstream consumers typically treat these providers as returning np.ndarray. This type change can easily break callers that expect ndarray methods (e.g. .tobytes() for hashing, or plain NumPy broadcasting). If the goal is ordering/label alignment, consider returning the reindexed .values (and optionally returning the index separately), or update the contract/docs and audit callers accordingly.

Suggested change

-    # TODO: how to know for sure if the index matches the covmat value ordering?
-    return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)
+    if len(procs_index_matched) != covmat.shape[0]:
+        raise ValueError(
+            "procs_index_matched length does not match the covariance matrix dimensions"
+        )
+    return covmat

Why do you need a dataframe? Is this because nnfit_theory_covmat is a dataframe too?

RE the returned type: my understanding of this function is that it serves at an intermediate level in the computation graph. In fact, there are other functions in vp that return the covmat as an indexed dataframe. So I'm wondering why we need to return a dataframe at this stage.

@scarlehoff, what do you think?

Yes, exactly, I'd argue we need a dataframe here because it's later combined with nnfit_theory_covmat that is already a dataframe and I wanted to avoid any possible misalignment between the two, see dataset_inputs_t0_total_covmat.
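The misalignment risk mentioned here can be illustrated with a small sketch. The labels and numbers are made up for the example and the layout is much simpler than the real covmat index. Adding two labelled DataFrames pairs entries by row and column label, while adding the underlying arrays pairs them by position, which silently gives a different matrix when the two inputs are ordered differently.

```python
import numpy as np
import pandas as pd

# Same three (made-up) process groups, stored in two different orders.
labels_exp = ["DIS", "DY", "JETS"]
labels_th = ["JETS", "DIS", "DY"]

exp_df = pd.DataFrame(
    np.diag([1.0, 2.0, 3.0]), index=labels_exp, columns=labels_exp
)
# Diagonal entries follow labels_th: JETS -> 30, DIS -> 10, DY -> 20.
th_df = pd.DataFrame(
    np.diag([30.0, 10.0, 20.0]), index=labels_th, columns=labels_th
)

# DataFrame addition aligns on labels. The label order of the result is not
# guaranteed to match either input, so reindex explicitly before using values.
total_df = (exp_df + th_df).reindex(index=labels_exp, columns=labels_exp)

# Adding raw arrays ignores the labels and pairs entries positionally.
total_positional = exp_df.to_numpy() + th_df.to_numpy()

# Equivalent array-based route: bring the theory covmat to the experimental
# ordering first, then add the values.
th_aligned = th_df.reindex(index=labels_exp, columns=labels_exp)
total_explicit = exp_df.to_numpy() + th_aligned.to_numpy()

print(np.diag(total_df.to_numpy()))   # label-aligned sum
print(np.diag(total_positional))      # positional sum, mismatched groups
```

Either keeping the DataFrame or reindexing once before converting to arrays avoids the mismatch; which of the two fits the provider graph better is the design question raised in the comments above.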

achiefa · 2026-05-11T15:33:09Z

Hi @jacoterh, what's the status of this? Can I review it?

jacoterh · 2026-05-12T09:47:16Z

Yes, sorry, this is ready for review. Please pay extra attention to the first point raised by Copilot above. I don't think it breaks anything, but better safe than sorry.


achiefa

Thank you @jacoterh for this. Overall I'd say that the PR solves the issue. However, there are some aspects that I don't completely understand and I'd like to clarify before we merge.

achiefa · 2026-05-13T08:47:41Z

 def make_replica(
    central_values_array,
    group_replica_mcseed,
-    dataset_inputs_sampling_covmat,


Why are we removing the function dataset_inputs_sampling_covmat in favour of dataset_inputs_covmat_t0_considered? The problem I see with this is partially exposed in Copilot's comment. As far as I can tell, dataset_inputs_covmat_t0_considered can only include t0 or not; it does not consider any theory covmat. If we then use dataset_inputs_covmat_t0_considered in make_replica, we are sampling without the theory covmat, even if use_thcovmat_in_sampling is set to True in the runcard (at least this is what I gather from the function). Furthermore, I'm not entirely sure that use_t0_sampling has any effect as of now: it was used by dataset_inputs_sampling_covmat, while dataset_inputs_covmat_t0_considered takes use_t0 in its signature. I'm pretty sure that use_t0 and use_t0_sampling must be decoupled.

So the code won't break if we use use_t0_sampling and use_thcovmat_in_sampling, but it doesn't lead to the expected behaviour. Am I missing something?

Thanks for spotting this, you're right. The issue is the following. We are now storing the covmat as part of vp-setupfit and we want n3fit to load it back in to avoid doing things twice. However, make_replica generates pseudodata in the data basis regardless of whether the user requests the diagonal basis or not, which means that this function needs access to the full covmat (exp + possibly th) in the data basis. But we don't store this when the diagonal basis is requested: we store the eigenvectors and eigenvalues instead. We can of course reconstruct it from these, but then we're going around in circles... The whole design with storing/loading needs some more thought!
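For reference, the reconstruction mentioned here is cheap once the eigensystem is available. The sketch below uses a random symmetric positive-definite matrix as a stand-in for the total covmat and assumes the eigenvectors are stored as the columns of `diag_rot`, as `numpy.linalg.eigh` returns them; how the PR actually lays out the stored table is not assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the total covmat: symmetric and positive definite.
a = rng.normal(size=(4, 4))
covmat = a @ a.T + 4.0 * np.eye(4)

# Eigensystem as it could be stored at setup time.
# Assumption: columns of diag_rot are the eigenvectors.
eig_vals, diag_rot = np.linalg.eigh(covmat)

# Data-basis covmat and its inverse rebuilt from the stored eigensystem:
# C = V diag(w) V^T and C^-1 = V diag(1/w) V^T.
covmat_rebuilt = (diag_rot * eig_vals) @ diag_rot.T
inv_rebuilt = (diag_rot / eig_vals) @ diag_rot.T

# The same quadratic form evaluated in the data basis and in the diagonal basis.
data = rng.normal(size=4)
data_diag = diag_rot.T @ data
chi2_data_basis = data @ np.linalg.solve(covmat, data)
chi2_diag_basis = np.sum(data_diag**2 / eig_vals)

print(np.allclose(covmat_rebuilt, covmat))
print(np.isclose(chi2_data_basis, chi2_diag_basis))
```

So the stored eigenvectors and eigenvalues are in principle enough to serve both the diagonal-basis fit and data-basis sampling in make_replica; whether rebuilding is preferable to storing the data-basis covmat as well is the design choice left open above.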

achiefa · 2026-05-13T08:48:07Z

        return True

-    @configparser.explicit_node
-    def produce_dataset_inputs_sampling_covmat(


See the other comment.

achiefa · 2026-05-13T08:55:30Z



 def dataset_inputs_t0_total_covmat_separate(
-    dataset_inputs_t0_exp_covmat_separate, loaded_theory_covmat


I'm quite sure that this was discussed and I forgot. But why aren't we loading the theory covmat anymore? nnfit_theory_covmat computes the theory covmat from scratch. Is this what we want? And why?

Because this function is now called already during vp-setupfit and there is no stored covmat available to load at that point yet!

Right, my bad!

achiefa · 2026-05-13T08:59:55Z

+    # TODO: how to know for sure if the index matches the covmat value ordering?
+    return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)


Why do you need a dataframe? Is this because nnfit_theory_covmat is a dataframe too?

RE the returned type: my understanding of this function is that it serves at an intermediate level in the computation graph. In fact, there are other functions in vp that return the covmat as an indexed dataframe. So I'm wondering why we need to return a dataframe at this stage.

@scarlehoff, what do you think?

achiefa · 2026-05-13T09:07:14Z

+    output_path : Path
+        Path to output directory containing diagonal basis data if needed.


Suggested change

-    output_path : Path
-        Path to output directory containing diagonal basis data if needed.
+    output_path : Path
+        Path to output directory containing diagonal basis data if needed.
+    use_thcovmat_in_fitting: bool, optional
+        If True, load the total covariance matrix, which includes the theory covariance matrix. If False, load only the experimental covariance matrix. Default is False.

Is False the desired default option for use_thcovmat_in_fitting?

achiefa · 2026-05-13T09:21:47Z

+    return covmat, inv_total, diag_rot, eig_vals
+
+
+def _fiting_covmat(dataset_inputs_fitting_covmat, data_input, diagonal_basis=True):


Suggested change

-def _fiting_covmat(dataset_inputs_fitting_covmat, data_input, diagonal_basis=True):
+def _fiting_covmat(dataset_inputs_fitting_covmat, diagonal_basis=True):

jacoterh marked this pull request as ready for review April 14, 2026 16:08

jacoterh linked an issue Apr 14, 2026 that may be closed by this pull request

Reproducibility of diagonal covmat fits (clarify or bug?) #2445

Open

jacoterh changed the title ~~writing pseudodata in diagonal basis and saving eigvecs and eigvals as part of vp-setupfit~~ Pseudodata in diagonal basis and saving rotation Apr 14, 2026

jacoterh marked this pull request as draft April 14, 2026 16:10

jacoterh force-pushed the diag_covmat_reproducibility branch 2 times, most recently from 7cae6d8 to e05e7ed Compare April 21, 2026 11:56

scarlehoff marked this pull request as ready for review April 21, 2026 12:04

scarlehoff requested a review from achiefa April 21, 2026 12:04

achiefa reviewed Apr 22, 2026


Comment thread validphys2/src/validphys/n3fit_data.py Outdated

jacoterh commented Apr 27, 2026


Comment thread validphys2/src/validphys/config.py

jacoterh commented Apr 27, 2026


scarlehoff mentioned this pull request Apr 27, 2026

Fail if the runcard doesn't match the md5 as generated by vp-setupfit #2461

Closed

jacoterh requested a review from Copilot April 27, 2026 20:38

Copilot AI reviewed Apr 27, 2026


jacoterh force-pushed the diag_covmat_reproducibility branch from c06505d to c95f151 Compare April 28, 2026 12:30

jacoterh force-pushed the diag_covmat_reproducibility branch from 3bb2b7f to 56d001c Compare April 30, 2026 14:52

jacoterh requested a review from Copilot May 1, 2026 10:55

Copilot started reviewing on behalf of jacoterh May 1, 2026 10:55

Copilot AI reviewed May 1, 2026


jacoterh and others added 22 commits May 13, 2026 09:07

writing pseudodata in diagonal basis and saving eigvecs and eigvals a…

972a638

…s part of vp-setupfit

using table decorator

3fbae4d

more cleaning

20c117f

more cleaning

b3a4cc8

setting use_t0 = True in vp-setupfit

3f3233c

Needed to compute the covmat for diagonalisation durin vp-setupfit

wip on loading diag rot from table dir in vp-setupfit

53e12a3

wip on loading diag rot from table dir in vp-setupfit

b95f9fa

updating theory cov defaults

72dfa1f

swapping order rotation and theory covmat in vp-setupfit

aedb9cc

vp-setupfit now runs

8f791ca

calling nnfit_theory_covmat instead of reading the csv

7efff86

vp-setupfit stores covmat in diagonal and non-diagonal case

27a960b

caching inverse covmat in non diagonal basis

dfe28f3

cleaning

9c21aff

fixing covmat by reference issue

7f97264

n3fit runs again

4a7503e

extending number of headers

ed5c332

passing covmat in data basis to make_replica for correct sampling

bf39e94

indexing by process group

63c9c2d

make rotation action condition on presence of theorycovmatconfig in r…

0f30ac1

…uncard

Update validphys2/src/validphys/config.py

c547b96

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

making csv file name theorycovmat dependent + no longer modify global…

3db249d

… setupfitconfig

achiefa force-pushed the diag_covmat_reproducibility branch from eba8fe9 to 3db249d Compare May 13, 2026 08:07

achiefa reviewed May 13, 2026


