Pseudodata in diagonal basis and saving rotation #2455
Force-pushed from 7cae6d8 to e05e7ed.
achiefa left a comment
I'll have a closer look tomorrow morning. For the moment, this is the only thing that confuses me. But I'm surely missing something.
So just to make the point here and to remind myself in the future: @jacoterh has successfully implemented the possibility of storing the eigenvectors and corresponding eigenvalues in vp-setupfit for reproducibility. This required us to default … There is one open problem with the current branch, namely that vp-setupfit sets … Am I missing something @jacoterh?
I guess by validphys you mean vp-setupfit. For …
Indeed, I think that, given that vp-setupfit now stores the diagonalization, it can be treated like the theory covmat, and n3fit should read it instead of recalculating it.

So do we want to implement this now, or shall we keep it for a new PR?
Yes, let's do this here - I was already working on it on Friday but was encountering some issues. I'll continue with it this afternoon so we can hopefully merge this PR by tomorrow at the CM.
@@ -484,6 +484,7 @@ def dataset_inputs_t0_total_covmat(dataset_inputs_t0_exp_covmat, loaded_theory_c
    """
    covmat = dataset_inputs_t0_exp_covmat
    covmat += loaded_theory_covmat
@scarlehoff Is there a way to have vp-setupfit write the theory covmat csv files to the table directory first before attempting to load them? We need the theory covmat already at the stage of the diagonalisation in vp-setupfit.
I think it would be best to use whatever is in memory. But I remember we explicitly decide to forbid that a long time ago and I don't remember right now how easy / possible it is to revert that.
If you didn't already, try simply swapping the order of the theory covmat and the diagonalization in the vp_setupfit script. I don't remember when they are actually written down, but if it is upon creation that should be enough.
'datacuts::theory::theorycovmatconfig nnfit_theory_covmat'
)

SETUPFIT_FIXED_CONFIG['actions_'] += [rotation_action]
@scarlehoff I changed the order of the actions in the hope the theory covmat csv would get written first, but again the same FileNotFoundError occurs. Maybe reportengine first resolves all dependencies regardless of their order? Not sure.
Yes, it is very likely.
One option is to make it so the diagonalization can only be run from vp-setupfit and so it doesn't depend on the covmat that goes to the csv but on nnfit_theory_covmat, but then you need to also take into consideration the transformations that happen on the csv.
Another possibility is to change nnfit_theory_covmat so that, when it is coming from vp-setupfit it works normally, but when it is running from n3fit it just returns None (or perhaps lambda : None or whatever, not sure if it needs to be a function).
Then the .csv function can depend on nnfit_theory_covmat as well. When it is None it means that it is coming from n3fit (or from whatever that makes it return None) and needs to read the data from the .csv, otherwise use that nnfit_theory_covmat that just arrived.
As I said, we decided back in the day not to have the option to run without vp_setupfit first so I'm not sure this second option will work ootb.
Edit: can you stack two produce rules one of which is an explicit_node? I'm not sure :(
You can always mix and match these options. For instance, making sure that the order is fixed already when the covmat is written to the .csv file so that you don't need to think about that later.
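A minimal sketch of the second option above, with all names and signatures illustrative rather than the actual reportengine API: the theory-covmat provider computes the covmat when called from vp-setupfit but returns None from n3fit, and a downstream consumer falls back to reading the csv when it receives None.

```python
import numpy as np

def nnfit_theory_covmat(running_from_setupfit: bool):
    # Hypothetical simplification of the provider: compute the theory
    # covmat only under vp-setupfit; signal "read the csv instead" by
    # returning None when running from n3fit.
    if running_from_setupfit:
        return np.diag([0.1, 0.2, 0.3])  # stand-in for the real computation
    return None

def loaded_theory_covmat(nnfit_theory_covmat, csv_loader):
    # If the provider returned None we are running from n3fit and must
    # fall back to the covmat that vp-setupfit previously wrote to disk.
    if nnfit_theory_covmat is None:
        return csv_loader()
    return nnfit_theory_covmat

# Usage: from n3fit (False) the csv fallback is taken.
fake_csv = lambda: np.diag([0.1, 0.2, 0.3])
covmat = loaded_theory_covmat(nnfit_theory_covmat(False), fake_csv)
```

As noted above, any transformations applied before writing the csv would still need to be replicated on the in-memory path for the two branches to agree.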
Thanks Juan! I have gone for the first option you suggested, i.e. call nnfit_theory_covmat during vp-setupfit. Further checks are necessary but we have something working at the moment. For instance, I haven't paid close attention to the transformation you mentioned. Is this a reindexing of the pandas data frames when converting them to numpy arrays?
Pull request overview
This PR aims to support running/saving pseudodata in the diagonal (eigenmode) basis by persisting the diagonal-basis rotation/eigensystem (or fitting covmat) at vp-setupfit time, and then reusing it during n3fit (including saving pseudodata indexed by eigenmodes).
Changes:
- Save the fitting covmat / diagonal-basis eigensystem via a new `fitting_covmat_table` action triggered from `vp-setupfit`.
- Load the saved eigensystem/covmat inside `fitting_data_dict`, and save pseudodata with eigenmode indexing when `diagonal_basis: true`.
- Minor formatting/whitespace tidy-ups in pseudodata generation and docs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| validphys2/src/validphys/pseudodata.py | Reformat call site for replica generation (no functional change). |
| validphys2/src/validphys/n3fit_data.py | Adds covmat/eigensystem persistence + diagonal-basis pseudodata saving; refactors inverse-covmat preparation. |
| validphys2/src/validphys/covmats.py | Minor whitespace change. |
| validphys2/src/validphys/config.py | Removes fitting-covmat selector arg; adds loader for saved fitting covmat; adjusts defaults for theory-covmat loading. |
| n3fit/src/n3fit/scripts/vp_setupfit.py | Adds validphys.n3fit_data providers and schedules the new fitting_covmat_table action. |
| doc/sphinx/source/n3fit/runcard_detailed.rst | Documents diagonal-basis pseudodata saving and the persisted rotation table. |
Before trying to solve the failing test, rebase on top of master to avoid doing the work twice (there are things added there, like a stricter check on the documentation).
Force-pushed from c06505d to c95f151.
@scarlehoff some worrying news perhaps. I found that the line at validphys2/src/validphys/covmats.py, line 486 (as of 44a5ddf), is called twice on master. As a result the theory covmat gets added twice, which then explains why I found different eigenvalues in the presence of a theory covmat. To reproduce this behaviour, use the runcard below and turn on the …
What do you mean by twice? It doesn't (shouldn't) matter how many times you call it; it is producing the sum of the experimental and theory covmats. It would be inefficient (perhaps a leftover of having two covmats in the fit, the t0 and the normal one) but it should be fine.
Together with @achiefa we have checked that if you print dataset_inputs_t0_exp_covmat you will see that it changes between the two calls, and we suspect it already includes the theory covmat after the first time.
That said, regardless of how many times it gets called, this function should just be … Change it to that and see whether results change. The …
Yes, this is the issue indeed.
Then, if you think this PR is close to finished, change it here so that it gets merged. Otherwise, please open a new PR with this change so that we merge it to master asap (and perhaps even release 4.1.4).
So does this mean that all the fits performed since January are basically wrong?
I'm starting to fear yes; the point is that … Let's address this in a separate PR to fix this asap and check meanwhile which fits got affected by this.

I hope it is only since January. Otherwise, since we started using t0 for both sampling and fit.* That function has been like that since forever and, as @jacoterh said, it does say explicitly "modify covmat in place"; my hope is that older versions of pandas/numpy were just creating a new object (so the original covmat is safe).

*unless going through it twice only happened recently, or only for the diagonal case. This is also a possibility.
Force-pushed from 3bb2b7f to 56d001c.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
validphys2/src/validphys/pseudodata.py:131
make_replica now depends on dataset_inputs_covmat_t0_considered, but the docstring still documents the old dataset_inputs_sampling_covmat argument (and mentions sampling from the exp/theory covmat). Please update the docstring/parameter docs to match the new input and its semantics to avoid misleading API users.
def make_replica(
central_values_array,
group_replica_mcseed,
dataset_inputs_covmat_t0_considered,
group_multiplicative_errors=None,
group_positivity_mask=None,
sep_mult=False,
genrep=True,
max_tries=int(1e6),
def make_replica(
    central_values_array,
    group_replica_mcseed,
    dataset_inputs_sampling_covmat,
    dataset_inputs_covmat_t0_considered,
Switching make_replica's injected covmat from dataset_inputs_sampling_covmat to dataset_inputs_covmat_t0_considered appears to drop support for runcard options like use_t0_sampling and use_thcovmat_in_sampling (the sampling covmat selector was removed from config). This likely changes replica generation behavior and breaks existing tests/docs that rely on those flags; consider reintroducing a sampling-covmat provider keyed off use_t0_sampling/use_thcovmat_in_sampling, or explicitly mapping those flags onto this new input.
# TODO: how to know for sure if the index matches the covmat value ordering?
return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)
dataset_inputs_t0_exp_covmat now returns a pd.DataFrame, but the docstring and downstream consumers typically treat these providers as returning np.ndarray. This type change can easily break callers that expect ndarray methods (e.g. .tobytes() for hashing, or plain NumPy broadcasting). If the goal is ordering/label alignment, consider returning the reindexed .values (and optionally returning the index separately), or update the contract/docs and audit callers accordingly.
# TODO: how to know for sure if the index matches the covmat value ordering?
return pd.DataFrame(covmat, index=procs_index_matched, columns=procs_index_matched)

Suggested change:
if len(procs_index_matched) != covmat.shape[0]:
    raise ValueError(
        "procs_index_matched length does not match the covariance matrix dimensions"
    )
return covmat
Why do you need a dataframe? Is this because nnfit_theory_covmat is a dataframe too?
RE the returned type: my understanding of this function is that it serves at an intermediate level in the computation graph. In fact, there are other functions in vp that return the covmat as an indexed dataframe. So I'm wondering why we need to return a dataframe at this stage.
@scarlehoff, what do you think?
Yes, exactly, I'd argue we need a dataframe here because it's later combined with nnfit_theory_covmat that is already a dataframe and I wanted to avoid any possible misalignment between the two, see dataset_inputs_t0_total_covmat.
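To illustrate the alignment argument with a toy example (the index labels are made up; the real code uses the process/dataset index built by validphys): DataFrame addition aligns on labels, so an ordering mismatch between the experimental and theory covmats is harmless, whereas raw ndarray addition would silently pair the wrong entries.

```python
import numpy as np
import pandas as pd

# Hypothetical per-point labels; same points, different orderings.
idx_exp = pd.Index(["p1", "p2", "p3"])
idx_th = pd.Index(["p3", "p1", "p2"])

exp = pd.DataFrame(np.diag([1.0, 2.0, 3.0]), index=idx_exp, columns=idx_exp)
th = pd.DataFrame(np.diag([0.3, 0.1, 0.2]), index=idx_th, columns=idx_th)
# th carries 0.1 for p1, 0.2 for p2, 0.3 for p3

# DataFrame addition aligns on the labels, so ordering does not matter...
total_df = exp + th

# ...whereas positional ndarray addition pairs mismatched entries.
total_np = exp.to_numpy() + th.to_numpy()

print(np.diag(total_df.loc[idx_exp, idx_exp]))  # [1.1 2.2 3.3]
print(np.diag(total_np))                        # [1.3 2.1 3.2]
```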
Hi @jacoterh, what's the status of this? Can I review it?
Yes, sorry, this is ready for review. Please pay extra attention to the first point raised by Copilot above. I don't think it breaks anything, but better safe than sorry.
…s part of vp-setupfit
Needed to compute the covmat for diagonalisation during vp-setupfit
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed from eba8fe9 to 3db249d.
def make_replica(
    central_values_array,
    group_replica_mcseed,
    dataset_inputs_sampling_covmat,
Why are we removing the function dataset_inputs_sampling_covmat in favour of dataset_inputs_covmat_t0_considered? The problem I see with this is partially exposed in Copilot's comment. As far as I can tell, dataset_inputs_covmat_t0_considered can only include or not t0, but does not consider any theory covmat. If we then use dataset_inputs_covmat_t0_considered in make_replica, we're then sampling without theory covmat, even if use_thcovmat_in_sampling is set to True in the runcard (at least this is what I gather from the function). Furthermore, I'm not entirely sure that use_t0_sampling is effective as of now: it was used by dataset_inputs_sampling_covmat, while dataset_inputs_covmat_t0_considered's signature is use_t0. I'm pretty sure that use_t0 and use_t0_sampling must be decoupled.
So the code won't break if we use use_t0_sampling and use_thcovmat_in_sampling, but it doesn't lead to the expected behaviour. Am I missing something?
Thanks for spotting this, you're right. The issue is the following. We are now storing the covmat as part of vp-setupfit and we want n3fit to load these back in to avoid doing things twice. However, make_replica generates pseudodata in the data basis regardless of whether the user requests the diagonal basis or not, which means that this function needs access to the full covmat (exp + possibly th) in the data basis. But this we don't store when the diagonal basis is requested - we store the eigenvectors and eigenvalues. We can reconstruct it from here of course, but then we're going around in circles... The whole design with storing/loading needs some more thought!
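For reference, the "going around in circles" reconstruction mentioned above is cheap: given the stored eigenvectors V and eigenvalues λ, the data-basis covmat is V diag(λ) Vᵀ. A minimal numpy sketch with a toy symmetric positive-definite matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the fitting covmat (symmetric positive definite).
a = rng.normal(size=(4, 4))
covmat = a @ a.T + 4 * np.eye(4)

# What vp-setupfit would store when the diagonal basis is requested:
eig_vals, eig_vecs = np.linalg.eigh(covmat)

# n3fit could recover the data-basis covmat as V diag(lambda) V^T.
reconstructed = eig_vecs @ np.diag(eig_vals) @ eig_vecs.T
print(np.allclose(reconstructed, covmat))  # True
```

Whether it is cleaner to reconstruct like this or to additionally store the data-basis covmat is exactly the design question raised above.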
return True


@configparser.explicit_node
def produce_dataset_inputs_sampling_covmat(
def dataset_inputs_t0_total_covmat_separate(
    dataset_inputs_t0_exp_covmat_separate, loaded_theory_covmat
I'm quite sure that this was discussed and I forgot. But why aren't we loading the theory covmat anymore? nnfit_theory_covmat computes the theory covmat from scratch. Is this what we want? And why?
Because this function is now called already during vp-setupfit and there is no stored covmat available to load at that point yet!
output_path : Path
    Path to output directory containing diagonal basis data if needed.

Suggested change:
output_path : Path
    Path to output directory containing diagonal basis data if needed.
use_thcovmat_in_fitting : bool, optional
    If True, load the total covariance matrix, which includes the theory covariance matrix. If False, load only the experimental covariance matrix. Default is False.
Is False the desired default option for use_thcovmat_in_fitting?
return covmat, inv_total, diag_rot, eig_vals


def _fiting_covmat(dataset_inputs_fitting_covmat, data_input, diagonal_basis=True):

Suggested change:
def _fiting_covmat(dataset_inputs_fitting_covmat, diagonal_basis=True):
WIP
- nnfit_theory_covmat is ordered consistently with the experimental covmat before converting both to a dataframe
- dataset_inputs_covmat_t0_considered?