
Tr-vl-split replicas #1618

Closed
goord wants to merge 7 commits into NNPDF:master from LHC-NLeSC:tr-vl-split-replicas

Conversation

@goord (Collaborator) commented Oct 24, 2022

Hi @scarlehoff and @Radonirinaunimi, this branch contains my changes rebased onto the latest master. Initial tests seem to indicate things still work as intended, but again the NNPDF-4 reproduction has yet to be tested.

@Radonirinaunimi self-requested a review October 24, 2022 21:39
@Radonirinaunimi (Member)

Thanks @goord, I will have a look.

@goord (Collaborator, Author) commented Oct 25, 2022

Found an issue with the integrability constraints in the NNLO fit, btw.

@scarlehoff (Member)

> Found an issue with the integrability constraints in the NNLO fit, btw.

What do you mean?

(thanks for the PR!)

@scarlehoff self-requested a review October 25, 2022 06:07
@scarlehoff (Member) left a comment

Thank you very much for this PR. I've been looking through it and left some comments as I went.

If I understand correctly the idea is to separate each replica so that instead of having

<pdf> X (<fktable> X <mask>)

and, in the case of many replicas:

<many pdfs> X (<fktable> X <mask>)

one has

<pdf> X (<fktable> X <mask>) for each replica

The problem with that approach, as you well say, is that it increases the memory requirement by quite a bit (since right now (<fktable> X <mask>) is a singleton). In principle it is a growth by a factor of nreplicas (and eventually we would like to do 50/100 replicas in parallel).

And as you say in some of the comments the solution would indeed be to factor out the fktable and the mask.

The (potential) problem with that approach is that it might be heavy on memory for the single replica fit (which we would like to avoid, as that is the main use-case). But I think I would start there, meaning, trying the factorization in single replica fits, profiling that (in memory and time) and then applying said factorization to this branch.

If the factorization works in single replica fits without an impact on performance, it would be fantastic tbh.
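
To make the two layouts concrete, here is a minimal sketch (invented shapes and tensors, not the n3fit code) of the current per-replica (fktable X mask) layout versus a factored layout where the fktable stays a singleton and only the output data vector is masked:

```python
import tensorflow as tf

# Hypothetical sizes, for illustration only
n_rep, n_fl, n_x, n_dat = 4, 9, 50, 200
pdfs = tf.random.uniform((n_rep, n_fl, n_x))     # one PDF per replica
fk = tf.random.uniform((n_dat, n_fl, n_x))       # shared fktable
masks = tf.cast(tf.random.uniform((n_rep, n_dat)) > 0.25, tf.float32)

# Current layout: a different (fktable X mask) per replica, so the big fktable
# is replicated n_rep times in memory.
per_replica = tf.stack(
    [tf.einsum("fx,dfx->d", pdfs[r], fk * masks[r][:, None, None]) for r in range(n_rep)]
)

# Factored layout: one convolution with the singleton fktable, masking only
# the (much smaller) (n_rep, n_dat) output.
factored = tf.einsum("rfx,dfx->rd", pdfs, fk) * masks
```

Both give the same (n_rep, n_dat) result; the difference is whether the fktable or only the data vector carries the replica dimension in memory.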

return x


def aggregate_replicas(dictionary, method="append", rep_token_index=1):
Member

Could you add some docstrings to this function? What is it used for, what is the expected input/output, etc.?
And what is the difference between append and sum?

Collaborator Author

Yes, I found myself hard-coding loops to either make lists of partial losses or sum them in other cases, which is why I created this method. I think this entire loss-aggregation business should have a round of code improvement because, as it is, it makes very specific assumptions about the structure of the output layer names. I would like to adopt a more robust design.
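
For reference, a minimal sketch of what such a helper could look like (the underscore-separated key convention and the use of rep_token_index here are assumptions for illustration, not the actual n3fit layer naming):

```python
from collections import defaultdict


def aggregate_replicas(dictionary, method="append", rep_token_index=1):
    """Group per-replica entries of ``dictionary`` under a common key.

    The replica index is assumed to sit at position ``rep_token_index`` of the
    underscore-split key (a hypothetical naming convention). With
    ``method="append"`` the per-replica values are collected into lists; with
    ``method="sum"`` they are reduced to a single value per key.
    """
    grouped = defaultdict(list)
    for key, value in dictionary.items():
        tokens = key.split("_")
        # drop the replica token so all replicas share the same base key
        base_key = "_".join(t for i, t in enumerate(tokens) if i != rep_token_index)
        grouped[base_key].append(value)
    if method == "sum":
        return {k: sum(v) for k, v in grouped.items()}
    return dict(grouped)
```

With these (made-up) names, `aggregate_replicas({"loss_0_DIS": 1.0, "loss_1_DIS": 2.0}, method="sum")` would return `{"loss_DIS": 3.0}`, while `method="append"` would return `{"loss_DIS": [1.0, 2.0]}`.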

Member

But note that you don't need partial losses and that, moving forward, there will be no partial losses (as all experiments will be grouped together).

About the names, for better or for worse there are several layers of nnpdf code that rely on the names for different things (positivity, integrability, etc) so I guess you can rely on that as well here.

n_replicas = next(iter(names.values()))
for name in names:
    if names[name] != n_replicas:
        raise ValueError(f"Dataset {name} has an unexpected number of replica output layers: {names[name]}")
Member

Is this error ever expected? If so, and depending on the situation, I would prefer to make sure that it is caught before the fit starts loading the data; otherwise it is a very expensive crash.

Also, I'm not entirely sure I understand the logic of this loop. What is this used for? If it is just for the check I would rather remove it. When the fit gets to the point of computing the loss there should be no doubt whatsoever that all replicas are working; if there is a problem, it should've been caught and fixed earlier.

Collaborator Author

The loop is simply there to compose the out_names, and this entire block, including the losses_fun(), can be simplified by saving some information at initialization. Actually this should be handled by an 'aggregate_losses'-type utility method, but it works at the level of tensors, which is why it's still there...

Member

But why have the out_names changed with respect to the previous version?

Also, is the error ever expected? As I said, if it is, then it should be handled elsewhere.

total_losses.append(tf.reduce_sum(rep_losses, axis=0))
predictions = stacked_predictions
result = [tf.concat(total_losses, axis=0)] + predictions
return dict(zip(out_names, result))
Member

I'd rather lose some granularity on the loss than complicate (with quite a few loops) something that is going to be computed once per epoch. Take into account that we don't really care about the individual loss of the datasets (actually, when we do the uncertainties, there will be no individual groups).
The loss per replica might be interesting, but again, it is something that can be recovered at the end.

# so we need to convert the tensors
return _to_numpy_or_python_type(ret)
ret2 = _to_numpy_or_python_type(ret)
# import pdb; pdb.set_trace()
Member

Suggested change
# import pdb; pdb.set_trace()

Comment thread n3fit/src/n3fit/checks.py
Comment on lines +362 to +366
# if not same_trvl_per_replica:
# raise CheckError(
# "Replicas cannot be run in parallel with different training/validation "
# " masks, please set `same_trvl_per_replica` to True in the runcard"
# )
Member

Suggested change
# if not same_trvl_per_replica:
# raise CheckError(
# "Replicas cannot be run in parallel with different training/validation "
# " masks, please set `same_trvl_per_replica` to True in the runcard"
# )


def observable_generator(
spec_dict, positivity_initial=1.0, integrability=False
spec_dict, positivity_initial=1.0, integrability=False, name_suffix=''
Member

I'm guessing the name_suffix is so that different observables are treated differently, so one can have different trvl masks. Please add it to the docstring.

return False

# Step 2. Compute the validation metrics
# import pdb; pdb.set_trace()
Member

Suggested change
# import pdb; pdb.set_trace()

Comment thread n3fit/src/n3fit/model_trainer.py Outdated
"""
import copy
import logging
import pdb
Member

Suggested change
import pdb

splitted_pdfs = []
for layer in all_replicas_pdf:
    split_layers = splitting_op(layer)
    splitted_pdfs.append([op.expand(sl) for sl in split_layers])
Member

I'm confused about these changes. The way it is done now, full_pdf_per_replica and splitting_layer are not needed.

The original logic was to have just one (very big) pdf that would convolute with the fktable for the many-replica case. Have you tested that the changes don't lead to tf doing many convolutions (one per replica)? We would want to avoid that (and it should have no effect on having a different trvl split per replica).

If you have (and they don't), then please remove splitting_layer and full_pdf_per_replica (since this is actually cleaner than what I had anyway).

Collaborator Author

Yes, those are remnants of earlier code and will be cleaned.

I don't think tf is doing more convolutions per se in this new architecture, but it probably cannot do them as efficiently as in the earlier version, where all replica tensors were essentially stacked, no?

Member

Well, it is a difference between doing (pseudo code)

pdf_{ri} X fktable_{i} = result_{r}

or

result = []
for i in pdf_list:
    result.append(pdf_{i} X fktable_{i})

Especially on a GPU, the first version will be much faster than the second, as there is only one copy of the PDF for all replicas (even if it is a bigger array) instead of many copies of smaller PDFs.
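
A quick way to check this empirically would be something along these lines (a hedged sketch in eager mode with invented shapes; the real observables also involve the luminosity and x-grids):

```python
import time

import tensorflow as tf

n_rep, n_fl, n_x, n_dat = 50, 9, 50, 1000
pdf = tf.random.uniform((n_rep, n_fl, n_x))
fk = tf.random.uniform((n_dat, n_fl, n_x))


def batched():
    # pdf_{rfx} x fk_{dfx} -> result_{rd}: one contraction for all replicas
    return tf.einsum("rfx,dfx->rd", pdf, fk)


def looped():
    # one small contraction per replica, launched from Python
    return tf.stack([tf.einsum("fx,dfx->d", pdf[r], fk) for r in range(n_rep)])


for fn in (batched, looped):
    fn().numpy()  # warm-up; .numpy() forces the result to materialise
    start = time.perf_counter()
    for _ in range(20):
        fn().numpy()
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
```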

Collaborator Author

In your second example, the allocation for the convolution will probably be the bottleneck on the GPU. But then again, tensorflow uses a big memory pool so this isn't an issue. As for the extra copies: you're right, this architecture has more layers, more tensors being copied and probably less utilization of the device. But we will measure this to avoid speculation.

As I said, this was exploratory from my side and I am going to focus my attention on getting the tr-vl split into the masked convolutions at the level of the Observable implementations. Currently the masks are only there to denote active flavors (uniform per replica of course); the aim will be to mask out data points per replica, right?

Member

No, there are two masks. The flavour mask is applied to the pdf/luminosity and it is always the same for all replicas for a given fktable.

Instead, the training/validation mask is different per replica. Currently it is applied on the fktable since it reduces its size considerably. However, by doing that, a different trvl mask generates a new fktable, which is a bottleneck for parallelization over the number of replicas.

Instead, the trvl mask would need to be applied to the data, so that all replicas use the same fktable and generate the same (in shape and operations) output vector, and only at the end is the validation data masked out.
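
A minimal sketch of that last step, assuming a (n_replicas, n_data) prediction tensor coming out of the shared-fktable convolution and one boolean trvl mask per replica (shapes invented for illustration):

```python
import tensorflow as tf

n_rep, n_dat = 4, 200
predictions = tf.random.uniform((n_rep, n_dat))      # same fktable for every replica
tr_mask = tf.random.uniform((n_rep, n_dat)) > 0.25   # True = training point, per replica

# Masking at the data level keeps the shape identical for every replica, so the
# convolution itself is unchanged and only the final vector is split.
zeros = tf.zeros_like(predictions)
training_obs = tf.where(tr_mask, predictions, zeros)
validation_obs = tf.where(tr_mask, zeros, predictions)
```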

Collaborator Author

Ok, so where the observable class now contains a list of fk-tables (I assume of length n_replicas, each entry a masked version of the full fk-table), this will become only the full one. When you say 'at the end the validation data is masked out', you mean the observable layer should perform an extra masking of the fk-table convolution. So, for instance, DY will first execute a masked convolution, then a tensor product with the fk-tables (uniform across replicas), and finally mask out the validation entries per replica.

Member

The fitting process is as follows (for a DY-type observable, which is the most complicated one). The exact ordering of the shapes might not be correct in the example below, though.

x enters as an input to the PDF model, input shape (n_x,)

The PDF model is a concatenation of PDFs, one per replica; its output is then (n_replicas, n_flavours, n_x)

The result of the PDF model is then given to the observable. The observable can be formed by several fktables (for instance, the observable can be the measurement of a ratio between two cross sections).
Inside the observable, first a luminosity tensor is constructed to account for the two colliding protons, so that we have a luminosity such that:

L = PDF x PDF, shape (n_replicas, n_flavours, n_flavours, n_x, n_x)

However, not all flavour combinations contribute to all observables, so this luminosity tensor is masked such that only active flavours remain (which means the calculation to be done is also lighter).

The new masked luminosity is then: mL = (n_replicas, n_active_flavour_combinations, n_x, n_x)

Now this masked luminosity is convoluted with every fktable in the observable. The output of this convolution is a vector of data points (n_data).

If the observable is composed (for instance, a ratio between two cross sections or whatever) then an operation is applied to all the output arrays.

The final result of the observable is then (n_replicas, n_data).


However, in order to do the fit correctly, we apply a training/validation mask. This training/validation mask is applied at the level of the data points, so the natural way would be to apply the mask after the computation is finished, so that one gets:

training_observable = observable X mask
validation_observable = observable X ~mask

But, for backpropagation, we are only interested in the training_observable, and the fktable is a massive object. Therefore what we do is to apply the mask directly to the fktable so that we have

fk_training = fk X mask
fk_validation = fk X ~mask

And then, using those two fktables, we generate two observables (so we no longer have to apply the mask at the end).

Note, btw, that this training/validation mask is applied to the fktable before the training even starts.
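
Put together, the steps above would look roughly like this (a sketch with invented shapes and a single fktable, not the n3fit implementation):

```python
import tensorflow as tf

n_rep, n_fl, n_x, n_dat = 3, 9, 40, 100

pdf = tf.random.uniform((n_rep, n_fl, n_x))          # output of the PDF model

# Luminosity L = PDF x PDF -> (n_replicas, n_flavours, n_flavours, n_x, n_x)
lumi = tf.einsum("rix,rjy->rijxy", pdf, pdf)

# Flavour mask: keep only the active flavour combinations (same for all replicas)
flav_mask = tf.random.uniform((n_fl, n_fl)) > 0.5        # invented pattern
masked_lumi = tf.boolean_mask(lumi, flav_mask, axis=1)   # (n_rep, n_active, n_x, n_x)
n_active = masked_lumi.shape[1]

# DY fktable: one entry per data point and active flavour combination
fk = tf.random.uniform((n_dat, n_active, n_x, n_x))

# Convolution with the fktable -> one number per replica and data point
observable = tf.einsum("rcxy,dcxy->rd", masked_lumi, fk)  # (n_rep, n_dat)

# Training/validation split as currently done: mask the fktable before the fit
# starts (here with one common mask), producing two observables directly.
tr_mask = tf.random.uniform((n_dat,)) > 0.25
fk_training = tf.boolean_mask(fk, tr_mask, axis=0)
fk_validation = tf.boolean_mask(fk, tf.logical_not(tr_mask), axis=0)
training_obs = tf.einsum("rcxy,dcxy->rd", masked_lumi, fk_training)
validation_obs = tf.einsum("rcxy,dcxy->rd", masked_lumi, fk_validation)
```

In the per-replica trvl case this PR is after, fk_training would differ for every replica, which is exactly the duplication that masking at the data level (as in the earlier sketch) would avoid.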

Collaborator Author

Ok, thanks for your explanation @scarlehoff. I was a bit confused by the list of fk-tables within the observable class, but your comment clarifies this.

@Radonirinaunimi marked this pull request as draft November 9, 2022 09:06
@scarlehoff (Member)

Closing this branch in favour of #1661

@scarlehoff closed this Jan 20, 2023
