
Train on 1-point datasets#1636

Merged
scarlehoff merged 3 commits into master from bugfix_to_1pointDS on Dec 1, 2022

Conversation

@scarlehoff
Member

@scarlehoff scarlehoff commented Nov 25, 2022

closes #1634

(creating the PR just to send the bot)

@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Nov 25, 2022
@RoyStegeman
Member

RoyStegeman commented Nov 25, 2022

In that case, why not just do this for all datasets? e.g.
`(np.random.rand() < ndat*frac - int(ndat*frac)) + int(ndat*frac)`
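Spelled out, this proposal is a stochastic rounding of `ndat*frac` (a minimal sketch using the newer Generator API; the function name is illustrative, not from the codebase):

```python
import numpy as np

def stochastic_round(ndat, frac, rng):
    # Round ndat * frac down, then add 1 with probability equal to the
    # discarded fractional part, so the expectation equals ndat * frac.
    base = int(ndat * frac)
    return base + int(rng.random() < ndat * frac - base)
```

For `ndat == 1` and `frac = 0.75` this returns 1 with probability 0.75 and 0 otherwise; for e.g. `ndat == 8` the fractional part is zero and the result is always 6.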

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff
Member Author

scarlehoff commented Nov 26, 2022

https://vp.nnpdf.science/sq18cmllT9-9udqY5Nb2sg==

Absolutely nothing changes :)

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@RoyStegeman
Member

Sorry, but why not do this for all datasets?

@scarlehoff
Member Author

Sorry, but why not do this for all datasets?

You mean doing `mask = np.random.rand(ndata) < frac`? If so, the reason for using shuffle instead is to ensure 75% for all datasets; otherwise you can occasionally have, by chance, all points masked away.

Not that it would be wrong (at some point I even wanted to apply the mask per dataset but that was not received well), but that's the reason.
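The two masking strategies being contrasted here can be sketched as follows (illustrative snippet, not the actual `n3fit_data.py` code):

```python
import numpy as np

rng = np.random.default_rng(0)
ndata, frac = 20, 0.75

# Shuffle approach: exactly int(ndata * frac) training points, guaranteed.
mask_shuffle = np.zeros(ndata, dtype=bool)
mask_shuffle[: int(ndata * frac)] = True
rng.shuffle(mask_shuffle)

# Per-point approach: each point is kept with probability frac, so the
# training count fluctuates and, for small datasets, can even be zero.
mask_bernoulli = rng.random(ndata) < frac
```

The shuffle-based mask always selects exactly 15 of the 20 points for training; the Bernoulli mask only does so on average.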

@RoyStegeman
Member

No, I mean the same proposal I made before. For ndat==1 you now make an exception such that there is a probability of either including the point or not, and as more replicas are generated the fraction will approach whatever you set (so 0.75 in this case). Why not do this for all datasets instead of always rounding down? It's weird to have an if statement when the best thing to do would be to treat all datasets the same.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@scarlehoff
Member Author

scarlehoff commented Nov 26, 2022

Is your worry that, for some fractions, ndata > 1 datasets would suffer from the same problem?

(Because other than that I don't really see the difference between the new line and the old one: you are implicitly doing the if inside the parentheses, but with int(ndata*frac) instead, which is more correct than doing it only for ndata == 1.)

In any case I've committed your version, since the previous one was only correct for frac > 0.5.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@RoyStegeman
Member

Is your worry that, for some fractions, ndata > 1 datasets would suffer from the same problem?

That is one example, indeed; I don't see when we would go to very low training fractions, but you never know. Also, even in general, for a dataset of e.g. 10 points it's better to have 7 training points half of the time and 8 the other half, rather than always having 7 (or at least that's closer to what we say/intend to do).

It's not a big complaint or anything (I never bothered to change it before), but I think this way of calculating trmax is more correct for all datasets, so the exception for ndat==1 is a bit strange.

It's not really for any practical reason, if that's what you're wondering
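A quick numerical check of the 10-point example above (an illustrative sketch of stochastic rounding, not the actual n3fit code):

```python
import numpy as np

rng = np.random.default_rng(0)
ndata, frac = 10, 0.75  # ndata * frac = 7.5

# With stochastic rounding, each replica draws either 7 or 8 training
# points, and the average over replicas converges to 7.5.
base = int(ndata * frac)
draws = base + (rng.random(10000) < ndata * frac - base)
```

Always rounding down would instead give 7 for every replica, biasing the effective training fraction below 0.75.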

@scarlehoff
Member Author

It's not really for any practical reason, if that's what you're wondering

It was indeed, yes.

Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@RoyStegeman
Member

RoyStegeman commented Nov 26, 2022

Is this ready to be merged? (once test completes)

@scarlehoff
Member Author

scarlehoff commented Nov 26, 2022

Yes, but regression tests (and probably data rebuilding etc.) will need to be regenerated, since we are effectively changing the random seed.

@scarlehoff
Member Author

scarlehoff commented Nov 28, 2022

Now it should be ready for review. Sending also the bot just in case (and because with the change of seeds it should be updated).

@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Nov 28, 2022
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Contributor

@Zaharid Zaharid left a comment


I think we should discuss this a bit. As it is now, it is a change in behaviour. We used to have a fixed number of datapoints for every dataset in either split and now it is variable. Even if the change was fine, we should be sure the splits are reproducible.

@scarlehoff
Member Author

They should be reproducible, as the seed affects this in the same way.

The discussion, I guess, is whether to never take the "leftover point" (current behaviour, i.e. with the if) or to take it with probability trvl (Roy's proposal).

Note that the if should be on int(trvl*ndata), and that ceil doesn't work because then 2 points would never be split with a trvl of 0.51.

@Zaharid
Contributor

Zaharid commented Nov 28, 2022

They should be reproducible, as the seed will affect the same.

No, it isn't: the call to np.random.rand uses the global, uninitialized, uncontrolled random state.

@scarlehoff
Member Author

Are you sure? I think np.random.seed(x) affects np.random.rand() as well.
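Both statements are compatible: seeding the legacy global state does make np.random.rand reproducible, but only if nothing else draws from that state in between. A minimal illustration (hypothetical snippet, not code from the repository):

```python
import numpy as np

# Legacy API: np.random.seed controls np.random.rand, but the state is
# global, so any other code drawing from it shifts the stream.
np.random.seed(42)
a = np.random.rand()
np.random.seed(42)
b = np.random.rand()  # same value as a, because the seed was reset

# Generator API: the state lives inside the rng object, isolated from
# the rest of the program.
rng = np.random.default_rng(42)
c = rng.random()
```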

@Zaharid
Contributor

Zaharid commented Nov 28, 2022

Uh, sorry, I assumed we already had done the todo saying to use random generators rather than dealing with the global state elsewhere.

@Zaharid
Contributor

Zaharid commented Nov 28, 2022

@scarlehoff
Member Author

Ah, no.
Maybe we should, given that we are already dealing with this.

Then, the two possibilities are:

  • Everything gets a 0.75 chance of taking the leftover point.
  • We isolate the cases in which int(ndata*frac) equals 0; in those special cases trmax is manually set to 1 if rand() < 0.75 (closer to the current implementation).

In both cases changing to a generator is a possibility.

I don't have a strong opinion, but would default to the first one, since I already have some of the regression fits ready (tomorrow I'll check why the tests didn't work in the CI... maybe I forgot vp-upload...)

@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label Nov 29, 2022
Comment thread validphys2/src/validphys/n3fit_data.py Outdated
@scarlehoff
Member Author

@Zaharid @RoyStegeman

Please have a look at the current version. I've upgraded to a random number generator, and I've decided to go for the modification that changes the behaviour the least: datasets always get the same number of points masked as they do now, but if that number happens to be 0 (because of the combination of the size of the dataset and the trvl fraction) then the dataset gets one point with probability frac.
Discussing with @Zaharid, apparently this was already debated at some point within the collaboration, so I'd rather leave it like that.

Please, let me know if you are happy with it and when you are I'll update the tests.
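The behaviour described above can be sketched as follows (an illustrative function, not the actual code in validphys2/src/validphys/n3fit_data.py):

```python
import numpy as np

def n_training_points(ndata, frac, rng):
    # Unchanged behaviour: a fixed number of training points per dataset.
    trmax = int(ndata * frac)
    # Exception: a count of 0 is promoted to 1 with probability frac,
    # so 1-point datasets are no longer always sent to validation.
    if trmax == 0:
        trmax = int(rng.random() < frac)
    return trmax
```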

@RoyStegeman
Member

I mean, I still don't understand what makes the situation of trmax==0 so special, but since it won't affect the outcome I'm not going to push this further. If it has been decided within the collaboration (where I would be surprised if anyone even cared), and you and Zahari are happy, then I'm also fine.

@Zaharid
Contributor

Zaharid commented Nov 30, 2022

Personally I still slightly prefer my thing, mostly owing to the aesthetics of not having special cases, but similarly I am not sure it matters enough anyway. I guess the remaining concern is that this will break the rebuilding of pseudodata, which various things rely on.

@scarlehoff
Member Author

scarlehoff commented Nov 30, 2022

What is your thing? This is your thing (but without singling out ndata == 1, which would be wrong).

This is reproducible; the fact that I need to update the tests is proof enough, I'd say x)

@scarlehoff
Member Author

I will squash the commits and send the fit bot and then merge (probably tomorrow I guess, the bot will take a few hours).

@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Nov 30, 2022
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

after the change of the trvl mask in #1636 the fitbot changes and so it must be updated
@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label Dec 1, 2022
@scarlehoff
Member Author

The last commit is an attempt to have a bit more stability. From the numpy docs, compatibility between versions is not guaranteed, but I hope that algorithm in particular is stable enough (since it is currently their default).
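For reference, one way to make that choice explicit (assuming the intent is to pin the bit generator rather than rely on the default of `default_rng`, which is currently PCG64):

```python
import numpy as np

# Constructing the Generator from an explicit PCG64 bit generator pins
# the random stream even if numpy ever changes the default algorithm.
rng = np.random.Generator(np.random.PCG64(42))
```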

@scarlehoff scarlehoff merged commit c3432de into master Dec 1, 2022
@scarlehoff scarlehoff deleted the bugfix_to_1pointDS branch December 1, 2022 11:28
@Zaharid
Contributor

Zaharid commented Dec 1, 2022

Hmm. I am going to want to version tag this…

@scarlehoff
Member Author

Feel free. Since 4.0.4 we have had a roller coaster of backward-compatibility-breaking changes... but this should have stopped now (famous last words)

@Zaharid
Contributor

Zaharid commented Dec 1, 2022

Which are these changes?

@scarlehoff
Member Author

I'm not sure but I remember having to update regression tests more than once.

@RoyStegeman RoyStegeman mentioned this pull request Jan 13, 2023


Development

Successfully merging this pull request may close these issues.

One point datasets always end up in validation split
