Tr-vl-split replicas #1618
Conversation
…o fit new structure
…ng the partial losses
Thanks @goord, I will have a look.
Found an issue with the integrability constraints in the NNLO fit, btw.
What do you mean? (thanks for the PR!)
scarlehoff left a comment
Thank you very much for this PR. I've been looking through it and left some comments as I went.
If I understand correctly, the idea is to separate each replica so that instead of having
<pdf> X (<fktable> X <mask>)
and, in the case of many replicas:
<many pdfs> X (<fktable> X <mask>)
one has
<pdf> X (<fktable> X <mask>) for each replica
The problem with that approach, as you well say, is that it increases the memory requirement by quite a bit (since right now (<fktable> X <mask>) is a singleton). In principle it is a growth by a factor of n_replicas (and eventually we would like to do 50/100 replicas in parallel).
And as you say in some of the comments the solution would indeed be to factor out the fktable and the mask.
The (potential) problem with that approach is that it might be heavy on memory for the single-replica fit (which we would like to avoid, as it is the main use case). But I think I would start there, meaning: try the factorization in single-replica fits, profile that (in memory and time), and then apply said factorization to this branch.
If the factorization works in single-replica fits without impacting performance, it would be fantastic tbh.
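To put a number on the memory argument above, here is a minimal sketch (all shapes are illustrative, not the real NNPDF ones): keeping (<fktable> X <mask>) as a singleton stores it once and broadcasts it over replicas, while duplicating it per replica multiplies its footprint by n_replicas.

```python
import numpy as np

# Illustrative sizes only; real fktables are much larger.
n_replicas, n_flavours, n_x, n_data = 100, 14, 50, 500

# (<fktable> X <mask>) as a singleton: stored once, shared by all replicas.
fk_masked = np.random.rand(n_data, n_flavours, n_x)

shared_bytes = fk_masked.nbytes
# If each replica carried its own copy, memory would grow by a factor n_replicas.
per_replica_bytes = n_replicas * fk_masked.nbytes

assert per_replica_bytes == n_replicas * shared_bytes
```

This is exactly the factor-of-n_replicas growth mentioned above, and why factoring out the fktable and the mask is attractive for the 50/100-replica parallel case.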
    return x

def aggregate_replicas(dictionary, method="append", rep_token_index=1):
Could you add some docstrings to this function? What is it used for, what is the expected input/output, etc.?
And what is the difference between append and sum?
Yes, I found myself hard-coding loops to either make lists of partial losses or, in other cases, sum them, which is why I created this method. I think this entire loss-aggregation business should have a round of code improvement, because as it is it makes very specific assumptions about the structure of the output-layer names. I would like to adopt a more robust design.
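For reference, a minimal sketch of what such a docstring plus implementation could look like. Only the signature comes from the diff; the assumption that output-layer names look like `<dataset>_<replica>_...` with the replica index at position rep_token_index is a guess at the naming convention discussed here, not the actual NNPDF code:

```python
def aggregate_replicas(dictionary, method="append", rep_token_index=1):
    """Group per-replica entries of ``dictionary`` by dataset name.

    Keys are assumed to look like ``"<dataset>_<replica>_..."`` with the
    replica index at token position ``rep_token_index`` after splitting
    on ``_`` (hypothetical convention, for illustration).
    With ``method="append"`` the per-replica values are collected into a
    list; with ``method="sum"`` they are summed into a single value.
    """
    out = {}
    for key, value in dictionary.items():
        tokens = key.split("_")
        # Drop the replica token to recover the dataset-level key.
        base = "_".join(t for i, t in enumerate(tokens) if i != rep_token_index)
        out.setdefault(base, []).append(value)
    if method == "sum":
        out = {k: sum(v) for k, v in out.items()}
    return out
```

With this shape, `aggregate_replicas({"dat_0_loss": 1, "dat_1_loss": 2}, method="sum")` would collapse the two replica losses into a single `dat_loss` entry.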
But note that you don't need partial losses and that, moving forward, there will be no partial losses (as all experiments will be grouped together).
About the names, for better or for worse there are several layers of nnpdf code that rely on the names for different things (positivity, integrability, etc) so I guess you can rely on that as well here.
n_replicas = next(iter(names.values()))
for name in names:
    if names[name] != n_replicas:
        raise ValueError(f"Dataset {name} has an unexpected number of replica output layers: {names[name]}")
Is this error ever expected? If so, and depending on the situation, I would prefer to make sure it is caught before the fit starts loading the data; otherwise it is a very expensive crash.
Also, I'm not entirely sure I understand the logic of this loop. What is it used for? If it is just for the check, I would rather remove it. By the time the fit gets to computing the loss there should be no doubt whatsoever that all replicas are working; if there is a problem it should have been caught and fixed earlier.
The loop is simply there to compose the out_names, and this entire block, including losses_fun(), can be simplified by saving some information at initialization. Actually this should be handled by an 'aggregate_losses'-type utility method, but it works at the level of tensors, which is why it's still there...
But why have the out_names changed with respect to the previous version?
Also, is the error ever expected? As I said, if it is, then it should be handled elsewhere.
total_losses.append(tf.reduce_sum(rep_losses, axis=0))
predictions = stacked_predictions
result = [tf.concat(total_losses, axis=0)] + predictions
return dict(zip(out_names, result))
I'd rather lose some granularity on the loss than complicate (with quite a few loops) something that is going to be computed once per epoch. Take into account that we don't really care about the individual loss of the datasets (actually, when we do the uncertainties, there will be no individual groups).
The loss per replica might be interesting, but again, it is something that can be recovered at the end.
# so we need to convert the tensors
return _to_numpy_or_python_type(ret)
ret2 = _to_numpy_or_python_type(ret)
# import pdb; pdb.set_trace()
Suggested change: remove this leftover debug line (`# import pdb; pdb.set_trace()`).
# if not same_trvl_per_replica:
#     raise CheckError(
#         "Replicas cannot be run in parallel with different training/validation "
#         " masks, please set `same_trvl_per_replica` to True in the runcard"
#     )
Suggested change: remove this commented-out check.
def observable_generator(
    spec_dict, positivity_initial=1.0, integrability=False, name_suffix=''
I'm guessing the name_suffix is there so that different observables are treated differently, so one can have different trvl masks. Please add it to the docstring.
    return False

# Step 2. Compute the validation metrics
# import pdb; pdb.set_trace()
Suggested change: remove this leftover debug line (`# import pdb; pdb.set_trace()`).
| """ | ||
| import copy | ||
| import logging | ||
| import pdb |
splitted_pdfs = []
for layer in all_replicas_pdf:
    split_layers = splitting_op(layer)
    splitted_pdfs.append([op.expand(sl) for sl in split_layers])
I'm confused about these changes. The way it is done now, full_pdf_per_replica and splitting_layer are not needed.
The original logic was to have just one (very big) PDF that would convolute with the fktable in the many-replica case. Have you tested that the changes don't lead to tf doing many convolutions (one per replica)? We would want to avoid that (and it should have no effect on having a different trvl per replica).
If you have (and they don't), then please remove splitting_layer and full_pdf_per_replica (since this is actually cleaner than what I had anyway).
Yes, those are remnants of earlier code and will be cleaned up.
I don't think tf is doing more convolutions per se in this new architecture, but it probably cannot do them as efficiently as in the earlier version, where all replica tensors were essentially stacked, no?
Well, it is the difference between doing (pseudo code)

pdf_{ri} X fktable_{i} = result_{r}

or

result = []
for i in pdf_list:
    result.append(pdf_{i} X fktable_{i})

Especially on a GPU, the first version will be much faster than the second, as there is only one copy of the PDF for all replicas (even if it is a bigger array) instead of many copies of smaller PDFs.
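The two versions of the pseudocode above can be made concrete with numpy (using a simplified DIS-like rank-3 fktable; all names and shapes are illustrative): stacking the replica PDFs and contracting once gives the same numbers as looping over replicas, but as a single batched operation.

```python
import numpy as np

# Illustrative sizes only.
n_replicas, n_flav, n_x, n_data = 8, 3, 6, 10

pdf_list = [np.random.rand(n_flav, n_x) for _ in range(n_replicas)]
fktable = np.random.rand(n_data, n_flav, n_x)

# Version 1: stack the replica PDFs and do one big contraction.
stacked = np.stack(pdf_list)                      # (n_replicas, n_flav, n_x)
result_batched = np.einsum("rfx,dfx->rd", stacked, fktable)

# Version 2: one small contraction per replica, in a Python loop.
result_looped = np.stack(
    [np.einsum("fx,dfx->d", pdf, fktable) for pdf in pdf_list]
)

# Numerically identical; on a GPU, version 1 is one kernel launch
# over a bigger array instead of n_replicas small ones.
assert np.allclose(result_batched, result_looped)
```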
In your second example, the allocation for the convolution will probably be the bottleneck on the GPU. But then again, tensorflow uses a big memory pool, so this isn't an issue. As for the extra copies: you're right, this architecture has more layers, more tensors being copied, and probably lower utilization of the device. But we will measure this to avoid speculation.
As I said, this was exploratory from my side, and I am going to focus my attention on getting the tr-vl split into the masked convolutions at the level of the Observable implementations. Currently the masks are only there to denote active flavours (uniform per replica, of course); the aim will be to mask out data points per replica, right?
No, there are two masks. The flavour mask is applied to the pdf/luminosity, and it is always the same for all replicas for a given fktable.
The training/validation mask, instead, is different per replica. Currently it is applied to the fktable, since that reduces its size considerably. However, by doing that, a different trvl mask generates a new fktable, which is a bottleneck for parallelization over the number of replicas.
Instead, the trvl mask would need to be applied to the data, so that all replicas use the same fktable and generate the same (in shape and operations) output vector, and only at the end is the validation data masked out.
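The two places the trvl mask can act can be sketched as follows (a simplified DIS-like case with a rank-3 fktable; names and shapes are illustrative). Masking the fktable before the convolution and masking the output data points after it select exactly the same training entries, which is what makes the proposed move possible:

```python
import numpy as np

# Illustrative sizes only.
n_data, n_flav, n_x = 8, 3, 5
fk = np.random.rand(n_data, n_flav, n_x)
lumi = np.random.rand(n_flav, n_x)   # stand-in for the masked luminosity
tr_mask = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)

# Option A (current): mask the fktable. A different trvl mask per replica
# then means a different (smaller) fktable per replica.
fk_training = fk[tr_mask]
obs_a = np.einsum("dfx,fx->d", fk_training, lumi)

# Option B (proposed): every replica convolutes the same full fktable,
# and the mask is applied to the output data points at the end.
obs_full = np.einsum("dfx,fx->d", fk, lumi)
obs_b = obs_full[tr_mask]

assert np.allclose(obs_a, obs_b)
```

Option B keeps the fktable shared across replicas at the cost of convoluting the validation points too, which is the trade-off discussed here.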
Ok, so where the observable class now contains a list of fk-tables (I assume of length n_replicas, each entry a masked version of the full fk-table), this will be only the full one. When you say 'at the end the validation data is masked out', you mean the observable layer should perform an extra masking of the fk-table convolution. So, for instance, DY will first execute a masked convolution, then a tensor product with the fk-tables (uniform across replicas), and finally mask out the validation entries per replica.
The fitting process is as follows (for a DY-type observable, which is the most complicated one). The exact order of the shapes might not be correct in the example below, though.
x enters as an input to the PDF model, input shape (n_x,)
The PDF model is a concatenation of PDFs, one per replica, so its output is (n_replicas, n_flavours, n_x)
The result of the PDF models is then given to the observable. The observable can be formed by several fktables (for instance, the observable can be the measure of a ratio between two cross sections).
Inside the observable, first a luminosity tensor is constructed to account for the two colliding protons, so that we have a luminosity such that:
L = PDF x PDF, shape (n_replicas, n_flavours, n_flavours, n_x, n_x)
However, not all flavour combinations contribute to all observables, so this luminosity tensor is masked so that only active flavours remain (which also makes the calculation lighter).
The new masked luminosity is then: mL = (n_replicas, n_active_flavour_combinations, n_x, n_x)
Now this masked luminosity is convoluted with every fktable in the observable. The output of this convolution is a vector of data points (n_data).
If the observable is composed (for instance, a ratio between two cross sections or whatever) then an operation is applied to all the output arrays.
The final result of the observable is then (n_replicas, n_data).
However, in order to do the fit correctly, we apply a training/validation mask. This training/validation mask acts at the level of the data points, so the natural way would be to apply it after the computation is finished, so that one gets:
training_observable = observable X mask
validation_observable = observable X ~mask
But, for backpropagation, we are only interested in the training_observable, and the fktable is a massive object. Therefore what we do is apply the mask directly to the fktable, so that we have
fk_training = fk X mask
fk_validation = fk X ~mask
And then, using those two fktables, we generate two observables (so we no longer have to apply the mask at the end).
Note, btw, that this training/validation mask is applied to the fktable before the training even starts.
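The DY walkthrough above can be sketched end to end in numpy (all sizes and the flavour mask are illustrative; the real shapes and the exact axis order may differ, as noted above):

```python
import numpy as np

# Illustrative sizes only.
n_replicas, n_flav, n_x, n_active, n_data = 4, 3, 6, 5, 9

# PDF model output: one PDF per replica, shape (n_replicas, n_flavours, n_x).
pdf = np.random.rand(n_replicas, n_flav, n_x)

# Luminosity L = PDF x PDF: (n_replicas, n_flav, n_flav, n_x, n_x).
lumi = np.einsum("rfx,rgy->rfgxy", pdf, pdf)

# Keep only the active flavour combinations (a boolean mask over (f, g));
# this collapses the two flavour axes into one of size n_active.
flav_mask = np.zeros((n_flav, n_flav), dtype=bool)
flav_mask.ravel()[:n_active] = True   # arbitrary choice, for illustration
masked_lumi = lumi[:, flav_mask, :, :]   # (n_replicas, n_active, n_x, n_x)

# Convolute with an fktable of shape (n_data, n_active, n_x, n_x):
# the result is one vector of data points per replica.
fk = np.random.rand(n_data, n_active, n_x, n_x)
observable = np.einsum("raxy,daxy->rd", masked_lumi, fk)

assert observable.shape == (n_replicas, n_data)
```

For a composed observable (e.g. a ratio of cross sections) the same contraction would be repeated per fktable and the results combined by the observable's operation before the (n_replicas, n_data) output is produced.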
ok thanks for your explanation @scarlehoff, I was a bit confused by the list of fk-tables within the observable class, but your comment clarifies this.
Closing this branch in favour of #1661
Hi @scarlehoff and @Radonirinaunimi, this is a branch containing my rebased changes onto the latest master. Initial tests seem to indicate things still work as intended, but the NNPDF-4 reproduction has yet to be tested.