[WIP] Experiment API #476
Conversation
Before going too deep/fast, @Zaharid @scarlehoff @wilsonmr could you please let me know if you agree with the current idea?
I have a bunch of review comments, but first of all, could you please summarise what this is supposed to replace? Is this supposed to provide backwards compatibility, or rather to be the new thing? (I can see it doing the first but not the second.)
Also I think that parsing commondata files should go in a separate place (like a new file). |
Zaharid left a comment
I think most of the comments should be relevant either way.
| """ | ||
| # read raw commondata file | ||
| dataset_file = commondata_folder / f'DATA_{dataset_name}.dat' | ||
| table = pd.read_csv(dataset_file, sep=r'\s+|\t', skiprows=1, header=None, engine='python') |
Can you try the same call as in fkparser? Should be faster than the python engine.
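A possible sketch of that suggestion (the file content here is made up for illustration): since the regex `\s+` already matches tabs, the `|\t` alternative is redundant, and pandas special-cases plain `sep=r'\s+'` so the fast C engine can be used instead of `engine='python'`.

```python
import io

import pandas as pd

# Hypothetical commondata-like content: one header row to skip, then
# whitespace/tab-separated columns (the layout is an assumption).
raw = "header line to skip\n1 0.5\t10.0\n2 1.5\t20.0\n"

# r'\s+' already covers tabs, so the '|\t' alternative (which forces the
# slow python engine) is unnecessary; this call uses the C engine.
table = pd.read_csv(io.StringIO(raw), sep=r'\s+', skiprows=1, header=None)
print(table.shape)  # (2, 3)
```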
Sure, thanks. I have just realized how inconsistent these files are...
```python
# remove NaNs
# TODO: replace commondata files with bad formatting
table.dropna(axis='columns', inplace=True)
```
Are those the semantics we want? In any case this looks a bit dangerous, in that we are removing whole sets of systematics or central values or whatever... I think we should try to do better here (but yeah, having tabs at the end of the row sucks). I guess we are at least checking a bit when setting the columns below.
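One way to make this less dangerous, sketched on a toy table (the column count and layout are assumptions, not the real commondata format): only drop columns that are *entirely* NaN, which is what a trailing separator produces, and then fail loudly if the shape still disagrees with expectations rather than silently discarding real systematics or central values.

```python
import numpy as np
import pandas as pd

# Toy table imitating a badly formatted commondata file where trailing
# tabs produced an all-NaN column (the layout here is an assumption).
table = pd.DataFrame([[1, 0.5, 10.0, np.nan],
                      [2, 1.5, 20.0, np.nan]])

expected_ncols = 3  # hypothetical: known from the header/systype information

# Drop only columns that are entirely NaN (the trailing-separator artifact);
# dropna's default how='any' could silently remove a real column that merely
# contains a missing value.
table = table.dropna(axis='columns', how='all')
if table.shape[1] != expected_ncols:
    raise ValueError(f'expected {expected_ncols} columns, got {table.shape[1]}')
print(table.shape)  # (2, 3)
```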
Absolutely not; as the TODO says, I would prefer to fix the commondata files...
```python
table.columns = header

# replace datapoint column
table['dataset_point'] = f'{dataset_name}_' + table['dataset_point'].astype(str)
```
The idiomatic way of doing this would be multiindexes. I am not saying we should do multiindexes because they are a pain. Just thinking loud.
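For reference, a minimal sketch of the MultiIndex alternative (the table contents and names are made up): instead of mangling the dataset name into a string label like `'DATASET_0'`, the dataset and the point number stay as separate, queryable index levels.

```python
import pandas as pd

# Toy table standing in for a parsed commondata file (names are made up).
table = pd.DataFrame({'data': [10.0, 20.0, 30.0]})
dataset_name = 'EXAMPLE_DATASET'

# A MultiIndex keeps (dataset, point) as separate levels, so selecting all
# points of one dataset is a single .loc lookup rather than string parsing.
table.index = pd.MultiIndex.from_product(
    [[dataset_name], range(len(table))], names=['dataset', 'dataset_point'])

print(table.loc[dataset_name].shape)  # (3, 1)
```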
```python
required when loading data.
"""
self.experiments = None
self.loader = loader.Loader()
```
This will need to somehow mind the environment loader.
Also, can this not take a list of dataset_inputs?
Yes, absolutely, we should build an object detached from a specific initialization method.
| """ | ||
| self.experiments = None | ||
| self.loader = loader.Loader() | ||
| if experiments_list: |
Do we want experiments_list to be possibly None or rather the empty list?
It also shouldn't be a list but rather a tuple, because we want this thing to be immutable.
In fact I wouldn't make experiments_list an attribute at all. Let it return something that is not bound.
Yes, sounds good, but if we drop the list then we have to change the list of dicts extracted from the runcard (which is fine); see e.g. test_experimentapi.py.
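A rough sketch of the "immutable, unbound" suggestion (the function name and dict fields are illustrative, not the actual validphys API): a free function returns a tuple of experiment specs instead of binding a mutable list to an object attribute.

```python
from typing import Dict, Tuple


def load_experiments(experiments_list) -> Tuple[Dict, ...]:
    """Return an immutable tuple of experiment specs rather than storing a
    mutable list on self. (Illustrative sketch, not the real validphys API.)"""
    return tuple(dict(exp) for exp in experiments_list)


# Usage: the list of dicts extracted from the runcard converts transparently.
experiments = load_experiments([{'experiment': 'NMC', 'datasets': ['NMC']}])
print(len(experiments))  # 1
```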
```python
import pandas as pd


def load_dataset(dataset_name, commondata_folder):
```
I think we have a commondataspec thing that this should be using.
Actually, maybe not. Commondataspec also knows about systypes, which this doesn't care about...
Yes, sorry, I would like to check if everybody agrees with this development direction. As it is now, this is not replacing anything; however, it:
All these points look good to me. I guess I was confused because I was expecting this to be loading
Ok, great, thanks. I will then continue in this direction and polish all these points.
Thinking a bit more (but just a bit more, so take it with a grain of salt), we have that
I haven't had enough experience / thought enough about what is the right way of doing this. All I can say is I don't see anything obviously wrong.
Then in config.py there is, And I guess @wilsonmr will want to argue that Other than that, it is trivial to get a commondata specification from a dataset_input.
Just to avoid misunderstandings, do we agree that the concept of
I am saying that because
Umm, thinking about the mechanism you are proposing: if I understand correctly, the ExperimentAPI class doesn't need to exist, but instead there is just a tuple and several adjustments in
I think that is what I was thinking. It should be possible to generate a list of experiments, looking a bit (or a lot) like the current ones, from
We can use the
That would be pretty useful. I guess it either would be used by, for example, nnpdf/validphys2/src/validphys/config.py (line 768 in cd2c521), or even somehow replace it?
I don't see a structure that "applies conversions between different groupings" because the groupings don't belong to the structure. That said, if you want to have a class that does something like:

```python
class Data:
    def to_experiments(self) -> List[Dict]:
        ...
```

then that in python is pretty much the same as:

```python
def data_to_experiments(data: Data) -> List[Dict]:
    ...
```

and is really a matter of taste. The problem with
Yeah, I agree, this discussion is quite useful. All in all, we don't really need a data structure for that; the problem can be solved with lists of dicts and auxiliary functions.
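The "lists of dicts plus auxiliary functions" idea could be sketched as follows (all names and runcard fields here are hypothetical, not the actual validphys schema): a free function that regroups a flat list of dataset_inputs into experiment-style dicts.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_experiment(dataset_inputs: List[Dict]) -> List[Dict]:
    """Group a flat list of dataset_input dicts into experiment-style dicts.
    Purely illustrative: the real runcard fields may differ."""
    grouped = defaultdict(list)
    for dsinput in dataset_inputs:
        grouped[dsinput.get('experiment', 'UNGROUPED')].append(dsinput['dataset'])
    return [{'experiment': name, 'datasets': ds} for name, ds in grouped.items()]


inputs = [
    {'dataset': 'NMC', 'experiment': 'DIS'},
    {'dataset': 'SLAC', 'experiment': 'DIS'},
    {'dataset': 'CMSZDIFF12', 'experiment': 'LHC'},
]
print(group_by_experiment(inputs))
# → [{'experiment': 'DIS', 'datasets': ['NMC', 'SLAC']}, {'experiment': 'LHC', 'datasets': ['CMSZDIFF12']}]
```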
Hi, I am just having a look at this for the first time; can I just check I understand what is going on? I am presuming this is part of the destroy-c++ effort, providing an alternative to the C++ loading of commondata. I was just wondering: (a) exactly why do the NaNs arise? I know this can happen when loading with pandas, but in this case is it due to the way the header is set out or something? And (b) what is the relevance of the API? As far as I can see it is used in the testing, but I'm not sure of the broader picture.
I am closing this PR because:
No description provided.