Train on 1-point datasets #1636
Conversation
|
In that case, why not just do this for all datasets? e.g. …
|
Greetings from your nice fit 🤖 !
Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!
|
https://vp.nnpdf.science/sq18cmllT9-9udqY5Nb2sg==

Absolutely nothing changes :)
|
Sorry, but why not do this for all datasets?
You mean doing … ? Not that it would be wrong (at some point I even wanted to apply the mask per dataset, but that was not received well), but that's the reason.
|
No, I mean the same proposal I made before. For …
|
Is your worry having a fraction such that also ndata > 1 will suffer from the same problem? (Other than that, I don't really see the difference between the new line and the old one; you are implicitly doing the if inside the parenthesis, but with …) In any case, I've committed your version, since the previous one was only correct for …
That is one example indeed; I don't see when we would go to very low training fractions, but you never know. Also, even in general, for datasets of e.g. 10 points it's better to have 7 training points 75% of the time and 8 training points the other 25%, rather than always having 7 (or at least that's closer to what we say/intend to do). It's not a big complaint or anything (I never bothered to change it before), but I think this way of calculating trmax is more correct for all datasets, so I just think the exception for ndat == 1 is a bit strange. It's not really for any practical reason, if that's what you're wondering.
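The stochastic-rounding idea described above can be sketched as follows (a minimal illustration; `stochastic_trmax` and its signature are hypothetical names, not the actual NNPDF code):

```python
import numpy as np

def stochastic_trmax(ndata: int, frac: float, rng: np.random.Generator) -> int:
    # Floor ndata * frac, then promote one extra training point with
    # probability equal to the leftover fractional part, so that the
    # *expected* training size is exactly ndata * frac for every
    # dataset, including ndata == 1 (no special case needed).
    exact = ndata * frac
    base = int(np.floor(exact))
    return base + int(rng.random() < (exact - base))
```

With ndata = 10 and a fraction whose product has fractional part 0.25 (e.g. frac = 0.725), this yields 7 training points 75% of the time and 8 the remaining 25%, matching the example above; for ndata == 1 it reduces to taking the single point with probability frac.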
It was indeed, yes. |
|
Is this ready to be merged? (once the tests complete)
|
Yes, but regression tests and probably data rebuilding etc. will need to be regenerated, since we are effectively changing the random seed.
|
Now it should be ready for review. Sending also the bot just in case (and because with the change of seeds it should be updated). |
|
Greetings from your nice fit 🤖 !
Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!
I think we should discuss this a bit. As it is now, it is a change in behaviour. We used to have a fixed number of datapoints for every dataset in either split and now it is variable. Even if the change was fine, we should be sure the splits are reproducible.
|
They should be reproducible, as the seed will affect them in the same way. The discussion, I guess, is whether to never take the "leftover point" (current behaviour, i.e. with the if) or to take it with trvl probability (Roy's proposal). Note that the if should be on …
No it isn't: the call to np.random.rand uses the global, uninitialized, uncontrolled random state.
|
Are you sure? I think …
|
Uh, sorry, I assumed we already had done the todo saying to use random generators rather than dealing with the global state elsewhere. |
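For reference, the distinction being discussed is roughly the following (a sketch; the seed value is arbitrary):

```python
import numpy as np

# np.random.rand() draws from NumPy's *global* RandomState; unless
# something has called np.random.seed(), the result is not
# reproducible, and any other code touching the global state shifts
# the stream.
a = np.random.rand()

# A dedicated Generator (what the "use random generators" todo above
# refers to) keeps the stream local and reproducible:
rng = np.random.default_rng(42)   # 42 is an arbitrary example seed
b = rng.random()
assert np.random.default_rng(42).random() == b  # same seed, same draw
```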
|
As an aside, I find how this works pretty interesting: https://jax.readthedocs.io/en/latest/jax.random.html (based on http://www.thesalmons.org/john/random123/papers/random123sc11.pdf).
|
Ah, no. Then, the two possibilities are …
In both cases, changing to a generator is a possibility. I don't have any strong opinion, but would default to the first one, since I already have some of the regression fits ready (tomorrow I'll check why the tests didn't work in the CI... maybe I forgot vp-upload...).
|
Please have a look at the current version. I've upgraded to a random number generator, and I've decided to go for the modification that changes the behaviour the least: datasets always get the same number of points masked as they do now, but if that number happens to be 0 (because of the combination of the size of the dataset and the trvl mask), then the dataset gets one point with probability frac. Please let me know if you are happy with it, and when you are I'll update the tests.
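A minimal sketch of the behaviour described above (function and variable names are illustrative, not the actual code):

```python
import numpy as np

def training_mask_size(ndata: int, frac: float, rng: np.random.Generator) -> int:
    # Same deterministic count as before...
    trmax = int(ndata * frac)
    # ...except that a dataset which would otherwise end up with zero
    # training points now gets one point with probability frac.
    if trmax == 0:
        trmax = int(rng.random() < frac)
    return trmax
```

For ndata == 1 this takes the single point with probability frac; for every other dataset the count is unchanged from the old behaviour.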
|
I mean, I still don't understand what makes the situation of trmax == 0 so special, but since it won't affect the outcome I'm not going to push this further. If it has been decided within the collaboration (where I would be surprised if anyone even cared), and you and Zahari are happy, then I'm also fine.
|
Personally, I still slightly prefer my thing, mostly owing to the aesthetics of not having special cases, but similarly I'm not sure it matters enough anyway. I guess the remaining concern now is that this will break the rebuilding of pseudodata, which various things rely on.
|
What is your thing? This is your thing (but without singling out ndata == 1, which would be wrong). And this is reproducible; the fact that I need to update the tests is proof enough, I'd say x)
|
I will squash the commits, send the fit bot and then merge (probably tomorrow, I guess; the bot will take a few hours).
…th probability frac, fixes #1634
66bac16 to ec0f497
|
Greetings from your nice fit 🤖 !
Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!
After the change of the trvl mask in #1636, the fitbot changes, and so it must be updated.
|
The last commit is an attempt to have a bit more stability. From the numpy docs, compatibility between versions is not guaranteed, but I hope that this algorithm in particular is stable enough (since right now it is their default).
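One way to reduce the exposure to such a change (an assumption on my part, not necessarily what the commit does) is to name the bit generator explicitly rather than relying on whatever np.random.default_rng currently picks:

```python
import numpy as np

# default_rng currently uses PCG64 under the hood; naming it explicitly
# pins that choice even if NumPy ever changes the default.
rng = np.random.Generator(np.random.PCG64(42))  # 42 is an example seed
x = rng.random()
```

This removes one source of variation (the choice of bit generator); NumPy still reserves the right to change how Generator methods consume the bits between versions, which is the caveat mentioned above.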
|
Hmm. I am going to want to version tag this…
|
Feel free. From 4.0.4 we have had a roller coaster of backward-compatibility-breaking changes... but this should have stopped now (famous last words).
|
Which are these changes?
|
I'm not sure, but I remember having to update the regression tests more than once.
closes #1634
(creating the PR just to send the bot)