[WIP] Experiment API #476
Conversation
Before going too deep/fast, @Zaharid @scarlehoff @wilsonmr could you please let me know if you agree with the current idea?
I have a bunch of review comments, but first of all, could you please summarise what this is supposed to replace? Is this supposed to provide backwards compatibility, or rather to be the new thing? (I can see it doing the first but not the second.)
Also I think that parsing commondata files should go in a separate place (like a new file). |
Zaharid left a comment
I think most of the comments should be relevant either way.
| """ | ||
| # read raw commondata file | ||
| dataset_file = commondata_folder / f'DATA_{dataset_name}.dat' | ||
| table = pd.read_csv(dataset_file, sep=r'\s+|\t', skiprows=1, header=None, engine='python') |
Can you try the same call as in fkparser? Should be faster than the python engine.
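A possible sketch of that suggestion (the file content here is made up for illustration): since the regex `\s+` already matches tabs, the `|\t` alternative is redundant, and pandas special-cases plain `sep=r'\s+'` so the fast C engine can be used instead of `engine='python'`.

```python
import io

import pandas as pd

# Hypothetical commondata-like content: one header row to skip, then
# whitespace/tab-separated columns (the layout is an assumption).
raw = "header line to skip\n1 0.5\t10.0\n2 1.5\t20.0\n"

# r'\s+' already covers tabs, so the '|\t' alternative (which forces the
# slow python engine) is unnecessary; this call uses the C engine.
table = pd.read_csv(io.StringIO(raw), sep=r'\s+', skiprows=1, header=None)
print(table.shape)  # (2, 3)
```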
Sure, thanks. I have just realized how inconsistent these files are...
```python
# remove NaNs
# TODO: replace commondata files with bad formatting
table.dropna(axis='columns', inplace=True)
```
Are those the semantics we want? In any case this looks a bit dangerous, in that we are removing whole sets of systematics or central values or whatever... I think we should try to do better here (but yeah, having tabs at the end of the row sucks). I guess we are at least checking a bit when setting the columns below.
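One way to make this less dangerous, sketched on a toy table (the column count and layout are assumptions, not the real commondata format): only drop columns that are *entirely* NaN, which is what a trailing separator produces, and then fail loudly if the shape still disagrees with expectations rather than silently discarding real systematics or central values.

```python
import numpy as np
import pandas as pd

# Toy table imitating a badly formatted commondata file where trailing
# tabs produced an all-NaN column (the layout here is an assumption).
table = pd.DataFrame([[1, 0.5, 10.0, np.nan],
                      [2, 1.5, 20.0, np.nan]])

expected_ncols = 3  # hypothetical: known from the header/systype information

# Drop only columns that are entirely NaN (the trailing-separator artifact);
# dropna's default how='any' could silently remove a real column that merely
# contains a missing value.
table = table.dropna(axis='columns', how='all')
if table.shape[1] != expected_ncols:
    raise ValueError(f'expected {expected_ncols} columns, got {table.shape[1]}')
print(table.shape)  # (2, 3)
```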
Absolutely not; as the TODO says, I would prefer to fix the commondata files...
```python
table.columns = header

# replace datapoint column
table['dataset_point'] = f'{dataset_name}_' + table['dataset_point'].astype(str)
```
The idiomatic way of doing this would be multiindexes. I am not saying we should do multiindexes because they are a pain. Just thinking loud.
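For reference, a minimal sketch of the MultiIndex alternative (the table contents and names are made up): instead of mangling the dataset name into a string label like `'DATASET_0'`, the dataset and the point number stay as separate, queryable index levels.

```python
import pandas as pd

# Toy table standing in for a parsed commondata file (names are made up).
table = pd.DataFrame({'data': [10.0, 20.0, 30.0]})
dataset_name = 'EXAMPLE_DATASET'

# A MultiIndex keeps (dataset, point) as separate levels, so selecting all
# points of one dataset is a single .loc lookup rather than string parsing.
table.index = pd.MultiIndex.from_product(
    [[dataset_name], range(len(table))], names=['dataset', 'dataset_point'])

print(table.loc[dataset_name].shape)  # (3, 1)
```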
```python
required when loading data.
"""
self.experiments = None
self.loader = loader.Loader()
```
This will need to somehow mind the environment loader.
Also, can this not take a list of dataset_inputs?
Yes, absolutely, we should build an object detached from a specific initialization method.
| """ | ||
| self.experiments = None | ||
| self.loader = loader.Loader() | ||
| if experiments_list: |
Do we want experiments_list to be possibly None or rather the empty list?
It also shouldn't be a list but rather a tuple, because we want this thing to be immutable.
In fact I wouldn't make experiments_list an attribute at all. Let it return something that is not bound.
Yes, sounds good, but if we drop the list then we have to change the list of dicts extracted from the runcard (which is fine); see e.g. test_experimentapi.py.
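A rough sketch of the "immutable, unbound" suggestion (the function name and dict fields are illustrative, not the actual validphys API): a free function returns a tuple of experiment specs instead of binding a mutable list to an object attribute.

```python
from typing import Dict, Tuple


def load_experiments(experiments_list) -> Tuple[Dict, ...]:
    """Return an immutable tuple of experiment specs rather than storing a
    mutable list on self. (Illustrative sketch, not the real validphys API.)"""
    return tuple(dict(exp) for exp in experiments_list)


# Usage: the list of dicts extracted from the runcard converts transparently.
experiments = load_experiments([{'experiment': 'NMC', 'datasets': ['NMC']}])
print(len(experiments))  # 1
```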
```python
import pandas as pd


def load_dataset(dataset_name, commondata_folder):
```
I think we have a commondataspec thing that this should be using.
Actually, maybe not. Commondataspec also knows about systypes, which this doesn't care about...
Yes, sorry, I would like to check if everybody agrees with this development direction. As it is now, this is not replacing anything; however, it:
All these points look good to me. I guess I was confused because I was expecting this to be loading
Ok, great, thanks. I will then continue in this direction and polish all these points.
Thinking a bit more (but just a bit more, so take it with a grain of salt), we have that
I haven't had enough experience / thought enough about what is the right way of doing this. All I can say is I don't see anything obviously wrong.
Then in config.py there is, And I guess @wilsonmr will want to argue that Other than that, it is trivial to get a commondata specification from a dataset_input.
Just to avoid misunderstandings, do we agree that the concept of
I am saying that because
Umm, thinking about the mechanism you are proposing: if I understand correctly, the ExperimentAPI class doesn't need to exist, but instead there is just a tuple and several adjustments in
I think that is what I was thinking. It should be possible to generate a list of experiments, looking a bit (or a lot) like the current ones, from
We can use the
That would be pretty useful. I guess it either would be used by, for example, nnpdf/validphys2/src/validphys/config.py (line 768 in cd2c521), or even somehow replace it?
I don't see a structure that "applies conversions between different groupings" because the groupings don't belong to the structure. That said, if you want to have a class that does something like:

```python
class Data:
    def to_experiments(self) -> List[Dict]:
        ...
```

then that in python is pretty much the same as:

```python
def data_to_experiments(data: Data) -> List[Dict]:
    ...
```

and is really a matter of taste. The problem with
Yeah, I agree, this discussion is quite useful. All in all, we don't really need a data structure for that; the problem can be solved with lists of dicts and auxiliary functions.
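The "lists of dicts plus auxiliary functions" idea could be sketched as follows (all names and runcard fields here are hypothetical, not the actual validphys schema): a free function that regroups a flat list of dataset_inputs into experiment-style dicts.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_experiment(dataset_inputs: List[Dict]) -> List[Dict]:
    """Group a flat list of dataset_input dicts into experiment-style dicts.
    Purely illustrative: the real runcard fields may differ."""
    grouped = defaultdict(list)
    for dsinput in dataset_inputs:
        grouped[dsinput.get('experiment', 'UNGROUPED')].append(dsinput['dataset'])
    return [{'experiment': name, 'datasets': ds} for name, ds in grouped.items()]


inputs = [
    {'dataset': 'NMC', 'experiment': 'DIS'},
    {'dataset': 'SLAC', 'experiment': 'DIS'},
    {'dataset': 'CMSZDIFF12', 'experiment': 'LHC'},
]
print(group_by_experiment(inputs))
# → [{'experiment': 'DIS', 'datasets': ['NMC', 'SLAC']}, {'experiment': 'LHC', 'datasets': ['CMSZDIFF12']}]
```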
Hi, I am just having a look at this for the first time; can I just check I understand what is going on? I am presuming this is part of the destroy-c++ effort, providing an alternative to the C++ loading of commondata. I was just wondering: (a) exactly why do the NaNs arise? I know this can happen when loading with pandas, but in this case is it due to the way the header is set out or something? And (b) what is the relevance of the API? As far as I can see it is used in the testing, but I'm not sure of the broader picture.
I am closing this PR because:
No description provided.