Tr-vl-split replicas #1618
Conversation
…o fit new structure
…ng the partial losses
Thanks @goord, I will have a look.
Found an issue with the integrability constraints in the NNLO fit, btw.
What do you mean? (thanks for the PR!)
scarlehoff left a comment
Thank you very much for this PR. I've been looking through it and left some comments as I went.
If I understand correctly, the idea is to separate each replica so that instead of having
<pdf> X (<fktable> X <mask>)
and, in the case of many replicas:
<many pdfs> X (<fktable> X <mask>)
one has
<pdf> X (<fktable> X <mask>) for each replica
The problem with that approach, as you well say, is that it increases the memory requirement by quite a bit (since right now (<fktable> X <mask>) is a singleton). In principle it is a growth by a factor of n_replicas (and eventually we would like to do 50/100 replicas in parallel).
And as you say in some of the comments the solution would indeed be to factor out the fktable and the mask.
The (potential) problem with that approach is that it might be heavy on memory for the single-replica fit (which we would like to avoid, as it is the main use case). But I think I would start there, meaning: try the factorization in single-replica fits, profile that (in memory and time), and then apply said factorization to this branch.
If the factorization works in single-replica fits without impacting performance, it would be fantastic tbh.
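To put a number on the memory argument above, here is a minimal sketch (all shapes are illustrative, not the real NNPDF ones): keeping (<fktable> X <mask>) as a singleton stores it once and broadcasts it over replicas, while duplicating it per replica multiplies its footprint by n_replicas.

```python
import numpy as np

# Illustrative sizes only; real fktables are much larger.
n_replicas, n_flavours, n_x, n_data = 100, 14, 50, 500

# (<fktable> X <mask>) as a singleton: stored once, shared by all replicas.
fk_masked = np.random.rand(n_data, n_flavours, n_x)

shared_bytes = fk_masked.nbytes
# If each replica carried its own copy, memory would grow by a factor n_replicas.
per_replica_bytes = n_replicas * fk_masked.nbytes

assert per_replica_bytes == n_replicas * shared_bytes
```

This is exactly the factor-of-n_replicas growth mentioned above, and why factoring out the fktable and the mask is attractive for the 50/100-replica parallel case.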
    return x

def aggregate_replicas(dictionary, method="append", rep_token_index=1):
Could you add some docstrings to this function? What is it used for, what is the expected input/output, etc.?
And what is the difference between append and sum?
Yes, I found myself hard-coding loops to either make lists of partial losses or, in other cases, sum them, which is why I created this method. I think this entire loss-aggregation business should have a round of code improvement, because as it is it makes very specific assumptions about the structure of the output-layer names. I would like to adopt a more robust design.
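For reference, a minimal sketch of what such a docstring plus implementation could look like. Only the signature comes from the diff; the assumption that output-layer names look like `<dataset>_<replica>_...` with the replica index at position rep_token_index is a guess at the naming convention discussed here, not the actual NNPDF code:

```python
def aggregate_replicas(dictionary, method="append", rep_token_index=1):
    """Group per-replica entries of ``dictionary`` by dataset name.

    Keys are assumed to look like ``"<dataset>_<replica>_..."`` with the
    replica index at token position ``rep_token_index`` after splitting
    on ``_`` (hypothetical convention, for illustration).
    With ``method="append"`` the per-replica values are collected into a
    list; with ``method="sum"`` they are summed into a single value.
    """
    out = {}
    for key, value in dictionary.items():
        tokens = key.split("_")
        # Drop the replica token to recover the dataset-level key.
        base = "_".join(t for i, t in enumerate(tokens) if i != rep_token_index)
        out.setdefault(base, []).append(value)
    if method == "sum":
        out = {k: sum(v) for k, v in out.items()}
    return out
```

With this shape, `aggregate_replicas({"dat_0_loss": 1, "dat_1_loss": 2}, method="sum")` would collapse the two replica losses into a single `dat_loss` entry.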
But note that you don't need partial losses and that, moving forward, there will be no partial losses (as all experiments will be grouped together).
About the names, for better or for worse there are several layers of nnpdf code that rely on the names for different things (positivity, integrability, etc) so I guess you can rely on that as well here.
n_replicas = next(iter(names.values()))
for name in names:
    if names[name] != n_replicas:
        raise ValueError(f"Dataset {name} has an unexpected number of replica output layers: {names[name]}")
Is this error ever expected? If so, and depending on the situation, I would prefer to make sure it is caught before the fit starts loading the data; otherwise it is a very expensive crash.
Also, I'm not entirely sure I understand the logic of this loop. What is it used for? If it is just for the check, I would rather remove it. By the time the fit gets to computing the loss there should be no doubt whatsoever that all replicas are working; if there is a problem it should have been caught and fixed earlier.
The loop is simply there to compose the out_names, and this entire block, including losses_fun(), can be simplified by saving some information at initialization. Actually this should be handled by an 'aggregate_losses'-type utility method, but it works at the level of tensors, which is why it's still there...
But why have the out_names changed with respect to the previous version?
Also, is the error ever expected? As I said, if it is, then it should be handled elsewhere.
total_losses.append(tf.reduce_sum(rep_losses, axis=0))
predictions = stacked_predictions
result = [tf.concat(total_losses, axis=0)] + predictions
return dict(zip(out_names, result))
I'd rather lose some granularity on the loss than complicate (with quite a few loops) something that is going to be computed once per epoch. Take into account that we don't really care about the individual loss of the datasets (actually, when we do the uncertainties, there will be no individual groups).
The loss per replica might be interesting, but again, it is something that can be recovered at the end.
# so we need to convert the tensors
return _to_numpy_or_python_type(ret)
ret2 = _to_numpy_or_python_type(ret)
# import pdb; pdb.set_trace()
Suggested change: remove this leftover debug line (`# import pdb; pdb.set_trace()`).
# if not same_trvl_per_replica:
#     raise CheckError(
#         "Replicas cannot be run in parallel with different training/validation "
#         " masks, please set `same_trvl_per_replica` to True in the runcard"
#     )
Suggested change: remove this commented-out check.
def observable_generator(
    spec_dict, positivity_initial=1.0, integrability=False, name_suffix=''
I'm guessing the name_suffix is there so that different observables are treated differently, so one can have different trvl masks. Please add it to the docstring.
    return False

# Step 2. Compute the validation metrics
# import pdb; pdb.set_trace()
Suggested change: remove this leftover debug line (`# import pdb; pdb.set_trace()`).
| """ | ||
| import copy | ||
| import logging | ||
| import pdb |
splitted_pdfs = []
for layer in all_replicas_pdf:
    split_layers = splitting_op(layer)
    splitted_pdfs.append([op.expand(sl) for sl in split_layers])
I'm confused about these changes. The way it is done now, full_pdf_per_replica and splitting_layer are not needed.
The original logic was to have just one (very big) PDF that would convolute with the fktable in the many-replica case. Have you tested that the changes don't lead to tf doing many convolutions (one per replica)? We would want to avoid that (and it should have no effect on having a different trvl per replica).
If you have (and they don't), then please remove splitting_layer and full_pdf_per_replica (since this is actually cleaner than what I had anyway).
Yes, those are remnants of earlier code and will be cleaned up.
I don't think tf is doing more convolutions per se in this new architecture, but it probably cannot do them as efficiently as in the earlier version, where all replica tensors were essentially stacked, no?
Well, it is the difference between doing (pseudo code)

pdf_{ri} X fktable_{i} = result_{r}

or

result = []
for i in pdf_list:
    result.append(pdf_{i} X fktable_{i})

Especially on a GPU, the first version will be much faster than the second, as there is only one copy of the PDF for all replicas (even if it is a bigger array) instead of many copies of smaller PDFs.
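The two versions of the pseudocode above can be made concrete with numpy (using a simplified DIS-like rank-3 fktable; all names and shapes are illustrative): stacking the replica PDFs and contracting once gives the same numbers as looping over replicas, but as a single batched operation.

```python
import numpy as np

# Illustrative sizes only.
n_replicas, n_flav, n_x, n_data = 8, 3, 6, 10

pdf_list = [np.random.rand(n_flav, n_x) for _ in range(n_replicas)]
fktable = np.random.rand(n_data, n_flav, n_x)

# Version 1: stack the replica PDFs and do one big contraction.
stacked = np.stack(pdf_list)                      # (n_replicas, n_flav, n_x)
result_batched = np.einsum("rfx,dfx->rd", stacked, fktable)

# Version 2: one small contraction per replica, in a Python loop.
result_looped = np.stack(
    [np.einsum("fx,dfx->d", pdf, fktable) for pdf in pdf_list]
)

# Numerically identical; on a GPU, version 1 is one kernel launch
# over a bigger array instead of n_replicas small ones.
assert np.allclose(result_batched, result_looped)
```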
In your second example, the allocation for the convolution will probably be the bottleneck on the GPU. But then again, tensorflow uses a big memory pool, so this isn't an issue. As for the extra copies: you're right, this architecture has more layers, more tensors being copied, and probably lower utilization of the device. But we will measure this to avoid speculation.
As I said, this was exploratory from my side, and I am going to focus my attention on getting the tr-vl split into the masked convolutions at the level of the Observable implementations. Currently the masks are only there to denote active flavours (uniform per replica, of course); the aim will be to mask out data points per replica, right?
No, there are two masks. The flavour mask is applied to the pdf/luminosity, and it is always the same for all replicas for a given fktable.
The training/validation mask, instead, is different per replica. Currently it is applied to the fktable, since that reduces its size considerably. However, by doing that, a different trvl mask generates a new fktable, which is a bottleneck for parallelization over the number of replicas.
Instead, the trvl mask would need to be applied to the data, so that all replicas use the same fktable and generate the same (in shape and operations) output vector, and only at the end is the validation data masked out.
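The two places the trvl mask can act can be sketched as follows (a simplified DIS-like case with a rank-3 fktable; names and shapes are illustrative). Masking the fktable before the convolution and masking the output data points after it select exactly the same training entries, which is what makes the proposed move possible:

```python
import numpy as np

# Illustrative sizes only.
n_data, n_flav, n_x = 8, 3, 5
fk = np.random.rand(n_data, n_flav, n_x)
lumi = np.random.rand(n_flav, n_x)   # stand-in for the masked luminosity
tr_mask = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)

# Option A (current): mask the fktable. A different trvl mask per replica
# then means a different (smaller) fktable per replica.
fk_training = fk[tr_mask]
obs_a = np.einsum("dfx,fx->d", fk_training, lumi)

# Option B (proposed): every replica convolutes the same full fktable,
# and the mask is applied to the output data points at the end.
obs_full = np.einsum("dfx,fx->d", fk, lumi)
obs_b = obs_full[tr_mask]

assert np.allclose(obs_a, obs_b)
```

Option B keeps the fktable shared across replicas at the cost of convoluting the validation points too, which is the trade-off discussed here.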
Ok, so where the observable class now contains a list of fk-tables (I assume of length n_replicas, each entry a masked version of the full fk-table), this will be only the full one. When you say 'at the end the validation data is masked out', you mean the observable layer should perform an extra masking of the fk-table convolution. So, for instance, DY will first execute a masked convolution, then a tensor product with the fk-tables (uniform across replicas), and finally mask out the validation entries per replica.
The fitting process is as follows (for a DY-type observable, which is the most complicated one). The exact order of the shapes might not be correct in the example below, though.
x enters as an input to the PDF model, input shape (n_x,)
The PDF model is a concatenation of PDFs, one per replica, so its output is (n_replicas, n_flavours, n_x)
The result of the PDF models is then given to the observable. The observable can be formed by several fktables (for instance, the observable can be the measure of a ratio between two cross sections).
Inside the observable, first a luminosity tensor is constructed to account for the two colliding protons, so that we have a luminosity such that:
L = PDF x PDF, shape (n_replicas, n_flavours, n_flavours, n_x, n_x)
However, not all flavour combinations contribute to all observables, so this luminosity tensor is masked so that only active flavours remain (which also makes the calculation lighter).
The new masked luminosity is then: mL = (n_replicas, n_active_flavour_combinations, n_x, n_x)
Now this masked luminosity is convoluted with every fktable in the observable. The output of this convolution is a vector of data points (n_data).
If the observable is composed (for instance, a ratio between two cross sections or whatever) then an operation is applied to all the output arrays.
The final result of the observable is then (n_replicas, n_data).
However, in order to do the fit correctly, we apply a training/validation mask. This training/validation mask acts at the level of the data points, so the natural way would be to apply it after the computation is finished, so that one gets:
training_observable = observable X mask
validation_observable = observable X ~mask
But, for backpropagation, we are only interested in the training_observable, and the fktable is a massive object. Therefore what we do is apply the mask directly to the fktable, so that we have
fk_training = fk X mask
fk_validation = fk X ~mask
And then, using those two fktables, we generate two observables (so we no longer have to apply the mask at the end).
Note, btw, that this training/validation mask is applied to the fktable before the training even starts.
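The DY walkthrough above can be sketched end to end in numpy (all sizes and the flavour mask are illustrative; the real shapes and the exact axis order may differ, as noted above):

```python
import numpy as np

# Illustrative sizes only.
n_replicas, n_flav, n_x, n_active, n_data = 4, 3, 6, 5, 9

# PDF model output: one PDF per replica, shape (n_replicas, n_flavours, n_x).
pdf = np.random.rand(n_replicas, n_flav, n_x)

# Luminosity L = PDF x PDF: (n_replicas, n_flav, n_flav, n_x, n_x).
lumi = np.einsum("rfx,rgy->rfgxy", pdf, pdf)

# Keep only the active flavour combinations (a boolean mask over (f, g));
# this collapses the two flavour axes into one of size n_active.
flav_mask = np.zeros((n_flav, n_flav), dtype=bool)
flav_mask.ravel()[:n_active] = True   # arbitrary choice, for illustration
masked_lumi = lumi[:, flav_mask, :, :]   # (n_replicas, n_active, n_x, n_x)

# Convolute with an fktable of shape (n_data, n_active, n_x, n_x):
# the result is one vector of data points per replica.
fk = np.random.rand(n_data, n_active, n_x, n_x)
observable = np.einsum("raxy,daxy->rd", masked_lumi, fk)

assert observable.shape == (n_replicas, n_data)
```

For a composed observable (e.g. a ratio of cross sections) the same contraction would be repeated per fktable and the results combined by the observable's operation before the (n_replicas, n_data) output is produced.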
ok thanks for your explanation @scarlehoff, I was a bit confused by the list of fk-tables within the observable class, but your comment clarifies this.
Closing this branch in favour of #1661
Hi @scarlehoff and @Radonirinaunimi, this is a branch containing my rebased changes onto the latest master. Initial tests seem to indicate things still work as intended, but the NNPDF-4 reproduction has yet to be tested.