[WIP] Experiment API #476

Closed
scarrazza wants to merge 4 commits into master from experimentapi

@scarrazza
Member

No description provided.

@scarrazza
Member Author

Before going too deep/fast, @Zaharid @scarlehoff @wilsonmr could you please let me know if you agree with the current idea?

@Zaharid
Contributor

Zaharid commented Jun 3, 2019

I have a bunch of review comments, but first of all, could you please summarise what this is supposed to replace? Is this supposed to provide backwards compatibility, or rather be the new thing? (I can see it doing the first thing but not the second.)

@Zaharid
Contributor

Zaharid commented Jun 3, 2019

Also I think that parsing commondata files should go in a separate place (like a new file).

Contributor

@Zaharid Zaharid left a comment

I think most of the comments should be relevant either way.

Comment thread validphys2/src/validphys/core.py Outdated
"""
# read raw commondata file
dataset_file = commondata_folder / f'DATA_{dataset_name}.dat'
table = pd.read_csv(dataset_file, sep=r'\s+|\t', skiprows=1, header=None, engine='python')
Contributor

Can you try the same call as in fkparser? Should be faster than the python engine.

Member Author

Sure, thanks. I have just realized how inconsistent these files are...
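
For reference, a minimal sketch of a whitespace-separated read that stays on pandas' C parser. This is not the fkparser call mentioned above (that call is not reproduced in this thread); it only relies on the fact that `sep=r"\s+"` is the one multi-character separator the C engine accepts. The file contents below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a commondata file: one header line, then the rows.
raw = "HYPOTHETICAL_SET 2 3\n1 DIS 0.1 10.0\n2 DIS 0.2 20.0\n"

# engine="c" is passed explicitly so that pandas raises instead of silently
# falling back to the slower Python engine if the separator is not supported.
table = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    skiprows=1,
    header=None,
    engine="c",
)
print(table)
```

A separator such as `r'\s+|\t'` is a general regular expression and therefore needs `engine='python'`; since `\s` already matches tabs, the plain `\s+` form should cover the same well-formed rows.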

Comment thread validphys2/src/validphys/core.py Outdated

# remove NaNs
# TODO: replace commondata files with bad formatting
table.dropna(axis='columns', inplace=True)
Contributor

Are those the semantics we want? In any case this looks a bit dangerous, in that we are removing whole sets of systematics or central values or whatever... I think we should try to do better here (but yeah, having tabs at the end of the row sucks). I guess we are at least checking a bit when setting the columns below.

Member Author

Absolutely not, as the TODO says, I would prefer to fix the commondata files...
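
Until the files are fixed, one possible stopgap for the concern above is to drop only columns that are entirely NaN and then check the column count. This is a sketch on made-up data, not the code in this PR; `expected_ncols` is a hypothetical value that would in practice come from the file header.

```python
import numpy as np
import pandas as pd

# Made-up table: column 1 has one genuinely missing value, column 2 is the
# kind of all-NaN column that a trailing separator can produce.
table = pd.DataFrame(
    {
        0: [1, 2, 3],
        1: [0.5, np.nan, 0.7],
        2: [np.nan, np.nan, np.nan],
    }
)

# Default how="any": every column containing at least one NaN is removed,
# so the real data in column 1 is lost as well.
aggressive = table.dropna(axis="columns")

# how="all": only columns that are NaN in every row are removed.
conservative = table.dropna(axis="columns", how="all")

expected_ncols = 2  # hypothetical: would be derived from the file header
if conservative.shape[1] != expected_ncols:
    raise ValueError(
        f"Expected {expected_ncols} columns, found {conservative.shape[1]}"
    )
```

With `how="all"` a stray NaN inside real data survives and can be reported, rather than silently removing a whole systematic or the central values.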

Comment thread validphys2/src/validphys/core.py Outdated
table.columns = header

# replace datapoint column
table['dataset_point'] = f'{dataset_name}_' + table['dataset_point'].astype(str)
Contributor

The idiomatic way of doing this would be multiindexes. I am not saying we should do multiindexes because they are a pain. Just thinking out loud.

Member Author

Yeah, makes sense.
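
To make the multiindex option concrete, here is a small sketch under assumed names: `index_by_dataset`, the `dataset` level and the `data` column are hypothetical, and only `dataset_point` comes from the snippet under review.

```python
import pandas as pd


def index_by_dataset(table: pd.DataFrame, dataset_name: str) -> pd.DataFrame:
    """Index rows by (dataset, dataset_point) instead of a concatenated string."""
    return table.assign(dataset=dataset_name).set_index(["dataset", "dataset_point"])


# Made-up tables standing in for two loaded datasets.
first = pd.DataFrame({"dataset_point": [1, 2], "data": [10.0, 20.0]})
second = pd.DataFrame({"dataset_point": [1, 2, 3], "data": [1.0, 2.0, 3.0]})

combined = pd.concat(
    [index_by_dataset(first, "SET_A"), index_by_dataset(second, "SET_B")]
)

# All points of one dataset, or one specific point.
set_a = combined.loc["SET_A"]
one_point = combined.loc[("SET_B", 3), "data"]
```

Compared with a `f'{dataset_name}_{point}'` string, the point number stays an integer and selecting a whole dataset does not require string parsing.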

Comment thread validphys2/src/validphys/core.py Outdated
required when loading data.
"""
self.experiments = None
self.loader = loader.Loader()
Contributor

This will need to somehow mind the environment loader.

Contributor

Also, can this not take a list of dataset_inputs?

Member Author

Yes, absolutely, we should build an object detached from a specific initialization method.

Comment thread validphys2/src/validphys/core.py Outdated
"""
self.experiments = None
self.loader = loader.Loader()
if experiments_list:
Contributor

Do we want experiments_list to be possibly None or rather the empty list?

Contributor

It also shouldn't be a list but rather a tuple, because we want this thing to be immutable.

Contributor

In fact I wouldn't make experiments_list an attribute at all. Let it return something that is not bound.

Member Author

Yes, sounds good, but if we drop the list then we have to change the list(dict) extracted from the runcard (which is fine); see e.g. test_experimentapi.py.
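
The None-versus-empty and list-versus-tuple points in this thread could be handled at the boundary with a small helper. This is only a sketch; `normalise_dataset_inputs` is a hypothetical name, not something that exists in validphys.

```python
from typing import Iterable, Mapping, Optional, Tuple


def normalise_dataset_inputs(
    dataset_inputs: Optional[Iterable[Mapping]] = None,
) -> Tuple[Mapping, ...]:
    """Return the inputs as a tuple, treating None as 'no datasets'."""
    if dataset_inputs is None:
        return ()
    return tuple(dataset_inputs)


# A list(dict) as it might come out of a runcard (made-up entries).
from_runcard = [{"dataset": "SET_A"}, {"dataset": "SET_B"}]
frozen = normalise_dataset_inputs(from_runcard)
empty = normalise_dataset_inputs()
```

Note that the tuple only prevents adding or removing entries; the dictionaries inside it remain mutable.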

Comment thread validphys2/src/validphys/core.py Outdated
import pandas as pd


def load_dataset(dataset_name, commondata_folder):
Contributor

I think we have a commondataspec thing that this should be using.

Contributor

Actually, maybe not. Commondataspec also knows about systypes, which this doesn't care about...

@scarrazza
Member Author

Yes, sorry. I would like to check whether everybody agrees with this development direction. As it is now, this is not replacing anything; however, it:

  • loads all data in pandas DataFrames;
  • drops the calls to CommonData / DataSet;
  • drops the requirement of loading data and fktables together;
  • has loading functions that can be moved to a new file and converted into independent actions (easily callable outside the ExperimentAPI class);
  • and, I think, can be interfaced with DataSetSpecs and ExperimentSpecs.

I think this structure simplifies further steps like applying cuts, computing the covmat, making replicas, etc.

@Zaharid
Contributor

Zaharid commented Jun 3, 2019

> Yes, sorry. I would like to check whether everybody agrees with this development direction. As it is now, this is not replacing anything; however, it:
>
> * loads all data in pandas DataFrames;
> * drops the calls to CommonData / DataSet;
> * drops the requirement of loading data and fktables together;
> * has loading functions that can be moved to a new file and converted into independent actions (easily callable outside the ExperimentAPI class);
> * and, I think, can be interfaced with DataSetSpecs and ExperimentSpecs.
>
> I think this structure simplifies further steps like applying cuts, computing the covmat, making replicas, etc.

All these points look good to me. I guess I was confused because I was expecting this to be loading data as per #356. But sure enough, that can be changed.

@scarrazza
Member Author

Ok, great thanks, I will continue then in this direction, and polish all these points.

@Zaharid
Contributor

Zaharid commented Jun 3, 2019

Thinking a bit more (but just a bit more, so take it with a grain of salt), we have that ExperimentAPI == data == ~~list~~ tuple of dataset_input.

@scarlehoff
Member

> Yes, sorry. I would like to check whether everybody agrees with this development direction. As it is now, this is not replacing anything; however, it:
>
> * loads all data in pandas DataFrames;
> * drops the calls to CommonData / DataSet;
> * drops the requirement of loading data and fktables together;
> * has loading functions that can be moved to a new file and converted into independent actions (easily callable outside the ExperimentAPI class);
> * and, I think, can be interfaced with DataSetSpecs and ExperimentSpecs.
>
> I think this structure simplifies further steps like applying cuts, computing the covmat, making replicas, etc.

I haven't had enough experience with, or thought enough about, what the right way of doing this is. All I can say is that I don't see anything obviously wrong.

@Zaharid
Contributor

Zaharid commented Jun 3, 2019

Then in config.py there is,

    def produce_commondata(self,
                           *,
                           dataset_input,
                           use_fitcommondata=False,
                           fit=None):

And I guess @wilsonmr will want to argue that use_fitcommondata is very important, but really it would be great if closure test data was represented in a completely different way...

Other than that, it is trivial to get a commondata specification from a dataset_input.

@scarrazza
Member Author

Just to avoid misunderstandings, do we agree that the concept of experiment should disappear in favour of tuples of datasets == data == ExperimentAPI?

@scarrazza
Member Author

I am saying that because ExperimentAPI can still store the experiment list and deliver a method/attribute with the tuples of datasets.

@scarrazza
Member Author

Umm, thinking about the mechanism you are proposing, if I understand correctly, the ExperimentAPI class doesn't need to exist; instead there is just a tuple and several adjustments in CoreConfig where we replace the calls to CommonData with some variation of my load_dataset.

@Zaharid
Contributor

Zaharid commented Jun 4, 2019

> Umm, thinking about the mechanism you are proposing, if I understand correctly, the ExperimentAPI class doesn't need to exist; instead there is just a tuple and several adjustments in CoreConfig where we replace the calls to CommonData with some variation of my load_dataset.

I think that is what I was thinking.

It should be possible to generate a list of experiments, looking a bit (or a lot) like the current ones, from data, to keep backwards compatibility and for organizational purposes like grouping plots by experiment.

@scarrazza
Member Author

scarrazza commented Jun 4, 2019

We can use the ExperimentAPI as the data structure which stores tuples/dicts and applies conversions between different groupings.

@wilsonmr
Contributor

wilsonmr commented Jun 4, 2019

That would be pretty useful, I guess somehow it either would be used by, for example:

def produce_fit_data_groupby_experiment(self, fit):

or even somehow replace it?

@Zaharid
Contributor

Zaharid commented Jun 4, 2019

> We can use the ExperimentAPI as the data structure which stores tuples/dicts and applies conversions between different groupings.

I don't see a structure that "applies conversions between different groupings" because the groupings don't belong to the structure. That said, if you want to have a class that does something like:

    class Data:
        def to_experiments(self) -> List[Dict]:
            ...

then that in python is pretty much the same as:

    def data_to_experiments(data:Data) -> List[Dict]:
        ...

and is really a matter of taste. The problem with Data being a real class is that you have to either implement all the methods of collections.abc.Sequence or use data.actual_tuple_of_stuff pretty much everywhere. The advantage is that if we later were to decide that data should hold more information, we can do it more easily.
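
As a rough illustration of the trade-off described above: when a class inherits from `collections.abc.Sequence`, only `__getitem__` and `__len__` have to be written, and the ABC supplies `__iter__`, `__contains__`, `__reversed__`, `index` and `count` as mixin methods. The `Data` class below is a hypothetical sketch, not code from this PR.

```python
from collections.abc import Sequence


class Data(Sequence):
    """Hypothetical immutable container for dataset_input entries."""

    def __init__(self, dataset_inputs):
        self._dataset_inputs = tuple(dataset_inputs)

    def __getitem__(self, index):
        # Note: slicing returns a plain tuple here, not a Data instance.
        return self._dataset_inputs[index]

    def __len__(self):
        return len(self._dataset_inputs)


# Made-up entries for illustration.
data = Data([{"dataset": "SET_A"}, {"dataset": "SET_B"}])

# Provided by the Sequence mixins without any extra code.
names = [entry["dataset"] for entry in data]
position = data.index({"dataset": "SET_B"})
present = {"dataset": "SET_A"} in data
```

This keeps the class usable wherever a tuple is iterated or indexed, while leaving room to attach more information later.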

@Zaharid Zaharid closed this Jun 4, 2019
@Zaharid Zaharid reopened this Jun 4, 2019
@scarrazza
Member Author

Yeah, I agree, this discussion is quite useful. All in all, we don't really need a data structure for that; the problem can be solved with lists of dicts and auxiliary functions.
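
A sketch of what such an auxiliary function might look like, assuming each dataset input is a dict carrying an `experiment` key. That key, the `datasets` key in the output and the function name are assumptions made for illustration; the thread does not say where the grouping information would come from.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def data_to_experiments(data: Iterable[Mapping], key: str = "experiment") -> List[Dict]:
    """Group dataset inputs by `key`, keeping first-seen order of the groups."""
    groups = defaultdict(list)
    for dataset_input in data:
        groups[dataset_input[key]].append(dataset_input)
    return [
        {"experiment": name, "datasets": datasets}
        for name, datasets in groups.items()
    ]


# Made-up inputs for illustration.
data = (
    {"dataset": "SET_A", "experiment": "EXP_1"},
    {"dataset": "SET_B", "experiment": "EXP_2"},
    {"dataset": "SET_C", "experiment": "EXP_1"},
)
experiments = data_to_experiments(data)
```

Because the grouping key is a parameter, the same function could in principle group by something other than experiment without changing the underlying tuple.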

@wilsonmr wilsonmr mentioned this pull request Sep 27, 2019
@RosalynLP
Contributor

Hi, I am just having a look at this for the first time. Can I just check that I understand what is going on? I presume this is part of the "destroy C++" effort, providing an alternative to the C++ loading of commondata. I was wondering about (a) exactly why the NaNs arise: I know this can happen when loading with pandas, but in this case is it due to the way the header is set out or something? And (b) the relevance of the API: as far as I can see it is used in the testing, but I'm not sure of the broader picture.

@scarrazza
Member Author

I am closing this PR because:

  • it should be replaced with a newer version of the code;
  • it contains a trivial implementation that can be extended/improved in another PR;
  • it may depend indirectly on the outcome of [WIP] Data Keyword #515.


5 participants