
Train on 1-point datasets#1636

Merged
scarlehoff merged 3 commits into master from bugfix_to_1pointDS on Dec 1, 2022

Conversation

@scarlehoff
Member

@scarlehoff scarlehoff commented Nov 25, 2022

closes #1634

(creating the PR just to send the bot)

@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Nov 25, 2022
@RoyStegeman
Member

RoyStegeman commented Nov 25, 2022

In that case, why not just do this for all datasets? e.g.
`(np.random.rand() < ndat*frac - int(ndat*frac)) + int(ndat*frac)`
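Spelled out, this proposal is a stochastic rounding of `ndat*frac` (a minimal sketch using the newer Generator API; the function name is illustrative, not from the codebase):

```python
import numpy as np

def stochastic_round(ndat, frac, rng):
    # Round ndat * frac down, then add 1 with probability equal to the
    # discarded fractional part, so the expectation equals ndat * frac.
    base = int(ndat * frac)
    return base + int(rng.random() < ndat * frac - base)
```

For `ndat == 1` and `frac = 0.75` this returns 1 with probability 0.75 and 0 otherwise; for e.g. `ndat == 8` the fractional part is zero and the result is always 6.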

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff
Member Author

scarlehoff commented Nov 26, 2022

https://vp.nnpdf.science/sq18cmllT9-9udqY5Nb2sg==

Absolutely nothing changes :)

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@RoyStegeman
Member

Sorry, but why not do this for all datasets?

@scarlehoff
Member Author

Sorry, but why not do this for all datasets?

You mean doing `mask = np.random.rand(ndata) < frac`? If so, the reason for using shuffle instead is to ensure 75% for all datasets; otherwise you can occasionally have, by chance, all points masked away.

Not that it would be wrong (at some point I even wanted to apply the mask per dataset but that was not received well), but that's the reason.
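The two masking strategies being contrasted here can be sketched as follows (illustrative snippet, not the actual `n3fit_data.py` code):

```python
import numpy as np

rng = np.random.default_rng(0)
ndata, frac = 20, 0.75

# Shuffle approach: exactly int(ndata * frac) training points, guaranteed.
mask_shuffle = np.zeros(ndata, dtype=bool)
mask_shuffle[: int(ndata * frac)] = True
rng.shuffle(mask_shuffle)

# Per-point approach: each point is kept with probability frac, so the
# training count fluctuates and, for small datasets, can even be zero.
mask_bernoulli = rng.random(ndata) < frac
```

The shuffle-based mask always selects exactly 15 of the 20 points for training; the Bernoulli mask only does so on average.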

@RoyStegeman
Member

No, I mean the same proposal I made before. For ndat==1 you now make an exception such that there is a probability of either including the point or not, and as more replicas are generated the fraction will approach whatever you set (so 0.75 in this case). Why not do this for all datasets instead of always rounding down? It's weird to have an if statement when the best thing to do would be to treat all datasets the same.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@scarlehoff
Member Author

scarlehoff commented Nov 26, 2022

Is your worry that, for some fractions, ndata > 1 datasets would suffer from the same problem?

(Because other than that I don't really see the difference between the new line and the old one: you are implicitly doing the if inside the parentheses, but with int(ndata*frac) instead, which is more correct than doing it only for ndata == 1.)

In any case I've committed your version, since the previous one was only correct for frac > 0.5.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@RoyStegeman
Member

Is your worry that, for some fractions, ndata > 1 datasets would suffer from the same problem?

That is one example, indeed; I don't see when we would go to very low training fractions, but you never know. Also, even in general, for a dataset of e.g. 10 points it's better to have 7 training points half of the time and 8 the other half, rather than always having 7 (or at least that's closer to what we say/intend to do).

It's not a big complaint or anything (I never bothered to change it before), but I think this way of calculating trmax is more correct for all datasets, so the exception for ndat==1 is a bit strange.

It's not really for any practical reason, if that's what you're wondering
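A quick numerical check of the 10-point example above (an illustrative sketch of stochastic rounding, not the actual n3fit code):

```python
import numpy as np

rng = np.random.default_rng(0)
ndata, frac = 10, 0.75  # ndata * frac = 7.5

# With stochastic rounding, each replica draws either 7 or 8 training
# points, and the average over replicas converges to 7.5.
base = int(ndata * frac)
draws = base + (rng.random(10000) < ndata * frac - base)
```

Always rounding down would instead give 7 for every replica, biasing the effective training fraction below 0.75.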

@scarlehoff
Member Author

It's not really for any practical reason, if that's what you're wondering

It was indeed, yes.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@RoyStegeman
Member

RoyStegeman commented Nov 26, 2022

Is this ready to be merged? (once test completes)

@scarlehoff
Member Author

scarlehoff commented Nov 26, 2022

Yes, but regression tests (and probably data rebuilding etc.) will need to be regenerated, since we are effectively changing the random seed.

@scarlehoff
Member Author

scarlehoff commented Nov 28, 2022

Now it should be ready for review. Sending also the bot just in case (and because with the change of seeds it should be updated).

@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Nov 28, 2022
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Contributor

@Zaharid Zaharid left a comment


I think we should discuss this a bit. As it is now, it is a change in behaviour. We used to have a fixed number of datapoints for every dataset in either split and now it is variable. Even if the change was fine, we should be sure the splits are reproducible.

@scarlehoff
Member Author

They should be reproducible, as the seed affects this in the same way.

The discussion, I guess, is whether to never take the "leftover point" (current behaviour, i.e. with the if) or to take it with probability trvl (Roy's proposal).

Note that the if should be on int(trvl*ndata), and that ceil doesn't work because then 2 points would never be split with a trvl of 0.51.

@Zaharid
Contributor

Zaharid commented Nov 28, 2022

They should be reproducible, as the seed will affect the same.

No, it isn't: the call to np.random.rand uses the global, uninitialized, uncontrolled random state.

@scarlehoff
Member Author

Are you sure? I think np.random.seed(x) affects np.random.rand() as well.
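Both statements are compatible: seeding the legacy global state does make np.random.rand reproducible, but only if nothing else draws from that state in between. A minimal illustration (hypothetical snippet, not code from the repository):

```python
import numpy as np

# Legacy API: np.random.seed controls np.random.rand, but the state is
# global, so any other code drawing from it shifts the stream.
np.random.seed(42)
a = np.random.rand()
np.random.seed(42)
b = np.random.rand()  # same value as a, because the seed was reset

# Generator API: the state lives inside the rng object, isolated from
# the rest of the program.
rng = np.random.default_rng(42)
c = rng.random()
```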

@Zaharid
Contributor

Zaharid commented Nov 28, 2022

Uh, sorry, I assumed we already had done the todo saying to use random generators rather than dealing with the global state elsewhere.

@Zaharid
Contributor

Zaharid commented Nov 28, 2022

@scarlehoff
Member Author

Ah, no.
Maybe we should, given that we are already dealing with this.

Then, the two possibilities are:

  • Everything gets a 0.75 chance of taking the leftover point.
  • We isolate the cases in which int(ndata*frac) equals 0; in those special cases trmax is manually set to 1 if rand() < 0.75 (closer to the current implementation).

In both cases changing to a generator is a possibility.

I don't have a strong opinion, but would default to the first one, since I already have some of the regression fits ready (tomorrow I'll check why the tests didn't work in the CI... maybe I forgot vp-upload...)

@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label Nov 29, 2022
Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@scarlehoff
Member Author

@Zaharid @RoyStegeman

Please have a look at the current version. I've upgraded to a random number generator, and I've decided to go for the modification that changes the behaviour the least: datasets always get the same number of points masked as they do now, but if that number happens to be 0 (because of the combination of the size of the dataset and the trvl fraction) then the dataset gets one point with probability frac.
Discussing with @Zaharid, apparently this was already debated at some point within the collaboration, so I'd rather leave it like that.

Please, let me know if you are happy with it and when you are I'll update the tests.
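The behaviour described above can be sketched as follows (an illustrative function, not the actual code in validphys2/src/validphys/n3fit_data.py):

```python
import numpy as np

def n_training_points(ndata, frac, rng):
    # Unchanged behaviour: a fixed number of training points per dataset.
    trmax = int(ndata * frac)
    # Exception: a count of 0 is promoted to 1 with probability frac,
    # so 1-point datasets are no longer always sent to validation.
    if trmax == 0:
        trmax = int(rng.random() < frac)
    return trmax
```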

@RoyStegeman
Member

I mean, I still don't understand what makes the situation of trmax==0 so special, but since it won't affect the outcome I'm not going to push this further. If it has been decided within the collaboration (where I would be surprised if anyone even cared), and you and Zahari are happy, then I'm also fine.

@Zaharid
Contributor

Zaharid commented Nov 30, 2022

Personally I still slightly prefer my thing, mostly owing to the aesthetics of not having special cases, but similarly I am not sure it matters enough anyway. I guess the remaining concern is that this will break the rebuilding of pseudodata, which various things rely on.

@scarlehoff
Member Author

scarlehoff commented Nov 30, 2022

What is your thing? This is your thing (but without singling out ndata == 1, which would be wrong).

This is reproducible; the fact that I need to update the tests is proof enough, I'd say x)

@scarlehoff
Member Author

I will squash the commits and send the fit bot and then merge (probably tomorrow I guess, the bot will take a few hours).

@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Nov 30, 2022
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

after the change of the trvl mask in #1636 the fitbot changes and so it must be updated
@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label Dec 1, 2022
@scarlehoff
Member Author

The last commit is an attempt to have a bit more stability. From the numpy docs, compatibility between versions is not guaranteed, but I hope that algorithm in particular is stable enough (since it is currently their default).
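For reference, one way to make that choice explicit (assuming the intent is to pin the bit generator rather than rely on the default of `default_rng`, which is currently PCG64):

```python
import numpy as np

# Constructing the Generator from an explicit PCG64 bit generator pins
# the random stream even if numpy ever changes the default algorithm.
rng = np.random.Generator(np.random.PCG64(42))
```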

@scarlehoff scarlehoff merged commit c3432de into master Dec 1, 2022
@scarlehoff scarlehoff deleted the bugfix_to_1pointDS branch December 1, 2022 11:28
@Zaharid
Contributor

Zaharid commented Dec 1, 2022

Hmm. I am going to want to version tag this…

@scarlehoff
Member Author

Feel free. Since 4.0.4 we have had a roller coaster of backward-compatibility-breaking changes... but this should have stopped now (famous last words)

@Zaharid
Contributor

Zaharid commented Dec 1, 2022

Which are these changes?

@scarlehoff
Member Author

I'm not sure but I remember having to update regression tests more than once.

@RoyStegeman RoyStegeman mentioned this pull request Jan 13, 2023


Development

Successfully merging this pull request may close these issues.

One point datasets always end up in validation split
