Fkparser#404

Merged
Zaharid merged 13 commits into master from fkparser
Apr 3, 2020

Conversation

@Zaharid
Contributor

@Zaharid Zaharid commented Mar 20, 2019

This adds a pure python parser for fktables and is work in progress. At the moment it should give sensible results for both compressed and uncompressed hadronic fktables. The most frequent usage should be something like:

from validphys.fkparser import load_fktable
from validphys.loader import Loader
l = Loader()
fk = l.check_fktable(setname="ATLASTTBARTOT", theoryID=52, cfac=())
f = load_fktable(fk)

Some thoughts and considerations:

  • I am trying to follow the spec https://github.com/NNPDF/nnpdf/blob/master/doc/data/data_layout.pdf and have not looked at the C++ code in a long time. Let's see how that goes.

  • The fktable format is very irregular. For example, the fktable in the examples uses both (!) tabs and spaces to delimit numbers in different places of the fastkernel grid. This makes parsing slower than it has to be and means that about the only non-manual solution with any hope of working is pandas.read_csv. This shows why an opaque binary format that forces you to specify everything precisely is better in the long term.
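A minimal illustration of why pandas.read_csv copes with this (the sample data is made up, not taken from a real fktable):

```python
import io

import pandas as pd

# Hypothetical fragment mixing tabs and spaces as delimiters, like the
# fastkernel grids do
raw = "0\t2 21\t1.5e-3\n0\t2 22\t2.5e-3\n"

# sep=r'\s+' treats any run of whitespace as a single delimiter, which is
# about the only robust way to read such a file without a manual parser
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
print(df.shape)  # (2, 4)
```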

  • At the moment I am prioritizing good error messages and clear code over performance. I suspect that this will not matter much as most of the time will go into parsing the sigma grids, which is done with an external function.

  • The theory info field is a bit redundant unless somebody is doing something very wrong. I can't really be bothered to write the types for all the keys. Does anybody think those might be needed? More generally, I think for the fit we will only need the tensor and the boolean mask, possibly also with a flag to specify whether it is hadronic or not. The rest of the data might be useful for analysis purposes. There is also some redundancy (e.g. ndata and nx), which will be used to check the consistency of the fktable. In any case there will be some higher-level objects than a dictionary at some point, which will check for things like required fields.

  • I would very much like this to serve to make the parallel mode of validphys faster. For that it would be best to allocate the sigma grid on some shared memory store. I know how to do this manually on Linux (/dev/shm), but not so much in a cross-platform way. For that I am looking at https://arrow.apache.org/ and in particular https://arrow.apache.org/docs/python/plasma.html. There is the question of whether we could load an Arrow table directly on a shared store and avoid copying, but that would mean patching their csv parser (which is more limited but probably faster than the pandas one because it is multithreaded). At the moment I am leaning towards implementing all that as a layer on top of the parser, even though it is not the most efficient approach.

  • The sigma is effectively encoded as a sparse tensor in the x combinations and a dense tensor in the flavour indexes. So we have all possible 14 flavour combinations, many of which are typically zero, but only some indexed for the xgrids. For example the first few entries of the index of the table above look like (0, 2, 21), (0, 2, 22), (0, 2, 23), (0, 2, 24), (0, 3, 15), (0, 3, 16), (0, 3, 17), where the first index is the data point and the other two are indexes in the xgrid. @scarrazza @scarlehoff do you want this, or would you prefer a full grid of zeros?
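To make the trade-off concrete, here is a toy sketch (all shapes and values made up) of expanding such a sparse (datapoint, x1, x2) index into the alternative grid full of zeros:

```python
import numpy as np
import pandas as pd

# Toy sigma: two stored rows, two nonzero flavour columns
# (all numbers are made up for illustration)
index = pd.MultiIndex.from_tuples(
    [(0, 2, 3), (0, 3, 1)], names=["data", "x1", "x2"]
)
sigma = pd.DataFrame([[1.0, 0.5], [2.0, 0.0]], index=index, columns=[0, 7])

# Expanding into the dense alternative: everything not indexed stays zero
ndata, nx, nflav = 1, 4, 196
dense = np.zeros((ndata, nx, nx, nflav))
for (d, i, j), row in sigma.iterrows():
    dense[d, i, j, list(row.index)] = row.values

print(dense[0, 2, 3, 0], dense[0, 2, 3, 7])  # 1.0 0.5
```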

  • Overall I am too old for writing parsers. I think that now that we can process these files into pure python, we should just dump them to parquet (https://parquet.apache.org/), because it seems to be the format with the most support (and there are direct interfaces for both Arrow and pandas, so we really would never have to look inside the file). This would make things easier and faster.

Additionally this needs better docs and tests.

Will close #377.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

Also, is it fine if load_fktable takes care of applying the cfactors? Seems one less thing to worry about down the line.
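In case it helps the discussion: applying cfactors is, at its core, a per-datapoint rescaling of the table. A minimal sketch, with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy fktable rows indexed by (datapoint, x1, x2); cfactor values made up
index = pd.MultiIndex.from_tuples(
    [(0, 1, 2), (0, 2, 2), (1, 0, 3)], names=["data", "x1", "x2"]
)
sigma = pd.DataFrame([[2.0], [6.0], [4.0]], index=index, columns=[0])

cfac = np.array([1.1, 0.9])  # one central correction per data point

# Broadcast each datapoint's cfactor over all of its rows
corrected = sigma.mul(cfac[sigma.index.get_level_values("data")], axis=0)
print(corrected[0].tolist())  # ≈ [2.2, 6.6, 3.6]
```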

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

Unfortunately as it stands now, it is quite a lot slower (I think due to pandas.read_csv doing more work than what we do in C++).

%timeit f = load_fktable(fk)
2.53 s ± 49.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit fk.load()
1.56 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@scarrazza
Member

What about using cffi?

@scarlehoff
Member

wrt the sigma tensor, my personal experience tells me the "grid full of 0s" is better.
Otherwise you have to refer to some well-explained documentation or ask someone who knows (which I did) and even then it required a bit of trial and error to understand why the size of my tensors didn't match.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

@scarlehoff Well, now you get a multi-indexed object with the index explicitly attached to the points. Keeping all the zeros would make it quite a bit bigger. And anyhow this is not meant to be used in high-level interfaces.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

What about using cffi?

Well we would have to write or find something that is fast enough to be worth it, and produces something equivalent (and also supports '\s+' as separator because we seem to need that).

My preference would be to use this parser to just dump everything to parquet (or whatever) and be done with it. And just upload all the theories again in whatever format we choose.

Btw, could you have a look why apfelcomb has this behaviour of using spaces for some values and tabs for others?

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

@scarlehoff e.g for ATLAS1JET11 you get 99285 rows instead of 140*45*45, so you save two thirds of the space.
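For reference, the arithmetic behind that saving:

```python
# Dense grid size for ATLAS1JET11: ndata * nx * nx rows
full = 140 * 45 * 45
sparse = 99285  # rows actually stored in the fktable

print(full)                         # 283500
print(round(1 - sparse / full, 2))  # 0.65, i.e. roughly two thirds saved
```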

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

...not to mention that only 31 of the 196 flavour combinations we write for this thing have some nonzero entry. So we really could make it a lot smaller in memory.

@scarlehoff
Member

Actually the flavour combinations I'm happy with, because you can define your dimensions as (xgrid, flavour combinations, experimental points) and it is still a dense block, consistent between different datasets.

@scarlehoff e.g for ATLAS1JET11 you get 99285 instead of 140*45*45, so you save two thirds of the space.

But I didn't realise there were so many empty values in the "xgrid" dimension. It's hard to argue with that.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

@scarrazza e.g. instead of having to play games with csv one could just do:

%timeit pd.read_parquet('/tmp/delthis.pk')
111 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pd.read_csv('/tmp/delthis', sep='\t', index_col=(0,1,2))
2.02 s ± 61.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@scarrazza
Member

Ok, the parquet solution looks good to me.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

Ok, so I am leaning towards proposing that the output for fit purposes is going to be a table with the same index as now and the columns are going to be the indexes in the FlavourMap that are nonzero. You just have to process the columns and the index to construct the appropriate pdf combinations.

It would probably make sense that the data index is the last rather than the first.

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

Here are some stub functions that produce convolutions. AFAICT from a quick test these give results comparable to the C++ code. A lot could be done to optimize them in various ways, which is rather easy to do by matching the pandas indices (and some numpy indexing magic).

import pandas as pd
import numpy as np

from validphys.pdfbases import evolution
from validphys.pdfgrids import xplotting_grid

FK_FLAVOURS = evolution.to_known_elements(
    [
        "photon",
        "singlet",
        "g",
        "V",
        "V3",
        "V8",
        "V15",
        "V24",
        "V35",
        "T3",
        "T8",
        "T15",
        "T24",
        "T35",
    ]
)


def hadron_predictions(pdf, loaded_fk):
    xgrid = loaded_fk["xGrid"]
    Q = float(loaded_fk["TheoryInfo"]["Q0"])
    sigma = loaded_fk["FastKernel"]
    sigma = sigma.loc[:, loaded_fk["FlavourMap"].ravel()]

    gv = xplotting_grid(
        pdf=pdf, Q=Q, flavours=FK_FLAVOURS, xgrid=xgrid, basis=evolution
    ).grid_values
    fm = sigma.columns
    fl1, fl2 = np.indices((14, 14))
    fl1 = fl1.ravel()[fm]
    fl2 = fl2.ravel()[fm]
    gv1fl = gv[:, fl1, :]
    gv2fl = gv[:, fl2, :]

    def appl(df):
        xx1 = df.index.get_level_values(1)
        xx2 = df.index.get_level_values(2)
        gv1 = gv1fl[:, :, xx1]
        gv2 = gv2fl[:, :, xx2]
        return pd.Series(np.einsum("ijk,ijk,kj->i", gv1, gv2, df.values))

    return sigma.groupby(level=0).apply(appl)


def dis_predictions(pdf, loaded_fk):
    xgrid = loaded_fk["xGrid"]
    Q = float(loaded_fk["TheoryInfo"]["Q0"])
    sigma = loaded_fk["FastKernel"]
    sigma = sigma.loc[:, loaded_fk["FlavourMap"].ravel()]

    fm = sigma.columns
    gv = xplotting_grid(
        pdf=pdf, Q=Q, flavours=FK_FLAVOURS[fm], xgrid=xgrid, basis=evolution
    ).grid_values

    def appl(df):
        xind = df.index.get_level_values(1)
        return pd.Series(np.einsum("ijk,kj->i", gv[:, :, xind], df.values))

    return sigma.groupby(level=0).apply(appl)

Just to emphasise that these functions are not optimized at all. They are there more to test for correctness.
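For anyone wanting to check the contraction pattern in isolation, here is a self-contained comparison of the hadronic einsum against explicit loops (all shapes and values are random stand-ins, much smaller than real grids):

```python
import numpy as np

rng = np.random.default_rng(0)

# i: PDF members, j: flavour pairs, k: sigma rows
nmem, nflav, nrows = 3, 5, 7
gv1 = rng.random((nmem, nflav, nrows))
gv2 = rng.random((nmem, nflav, nrows))
sig = rng.random((nrows, nflav))

fast = np.einsum("ijk,ijk,kj->i", gv1, gv2, sig)

# The same contraction written out, to make the index pattern explicit
slow = np.zeros(nmem)
for i in range(nmem):
    for j in range(nflav):
        for k in range(nrows):
            slow[i] += gv1[i, j, k] * gv2[i, j, k] * sig[k, j]

assert np.allclose(fast, slow)
```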

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

@scarrazza it would be great if somebody from Milan could start playing with this to see how fast it can be made. Note that we don't only care about the fit, but would also like to have primitives for LHAPDF grids that are more generally useful for vp.

@scarlehoff
Member

scarlehoff commented Mar 21, 2019

Just to be sure, the idea is to use this in order to access the fktables directly, right?
If that's the case I think we might be able to use this already. Is there a way to access it directly from the DataSetSpec objects?

Right now we basically* do:

dataset_c = dataset_spec.load()
raw_fktable = dataset_c.GetFK(i) 
parsed_fktable = fk_parser(raw_fktable)

where "fk_parser" is some more or less complicated function we wrote and parsed_fktable is a dictionary with the grid in x, the flavour map, etc.
*the .load() we actually do at the level of the experiment, but I understand internally it is the same.

In conclusion, if there were a way to simplify that to:
parsed_fktable = dataset_spec.fktable()
I would be very happy and I could quickly implement it and give you some feedback.

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

Just to be sure, the idea is to use this in order to access the fktables directly, right?

Yes, the idea is to bypass the C++ code, so we can make use of fancier allocation strategies and computation methods. You get a tensor (which at the moment is a pandas dataframe, but doesn't have to be) with the required slices, and then do something similar to the functions I showed above to compute the predictions.

If that's the case I think we might be able to use this already, is there a way to access it directly from the DataSetSpec objects?

It would be a little rough at the moment in that there is no support for COMPOUND or cfactors, but you can load the table itself:

In [2]: from validphys.loader import Loader                                                                           

In [3]: l = Loader()                                                                                                  

In [4]: l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)                                                      
Out[4]: DataSetSpec(name='ATLASTTBARTOT', commondata=CommonDataSpec(datafile=PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/DATA_ATLASTTBARTOT.dat'), sysfile=PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/systypes/SYSTYPE_ATLASTTBARTOT_DEFAULT.dat'), plotfiles=(PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/PLOTTINGTYPE_INC.yaml'), PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/PLOTTING_ATLASTTBARTOT.yaml'))), fkspecs=(FKTableSpec(fkpath=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52/fastkernel/FK_ATLASTTBARTOT.dat'), cfactors=()),), thspec=TheoryIDSpec(id=52, path=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52')), cuts=None, frac=1, op='NULL', weight=1)

In [5]: ds = l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)                                                 

In [6]: ds.fkspecs                                                                                                    
Out[6]: (FKTableSpec(fkpath=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52/fastkernel/FK_ATLASTTBARTOT.dat'), cfactors=()),)

In [7]: from validphys.fkparser import load_fktable                                                                                                                                                                                           


In [9]: load_fktable(ds.fkspecs[0])   

Right now we basically* do:

dataset_c = dataset_spec.load()
raw_fktable = dataset_c.GetFK(i) 
parsed_fktable = fk_parser(raw_fktable)

where "fk_parser" is some more or less complicated function we wrote and parsed_fktable is a dictionary with the grid in x, the flavour map, etc.
*the .load() we actually do at the level of the experiment, but I understand internally it is the same.

Yeah, see above. The final interface will change. E.g. you will never see the flavour map, but only the indexes on the dataframe, and there will be something with more structure than a dictionary.

In conclusion, if there were a way to simplify that to:
parsed_fktable = dataset_spec.fktable()
I would be very happy and I could quickly implement it and give you some feedback.

@scarlehoff
Member

It would be a little rough at the moment in that there is no support for COMPOUND or cfactors, but you can load the table itself:

As long as it works for some datasets we can actually check and compare

Yeah, see above. The final interface will change. E.g. you will never see the flavour map, but only the indexes on the dataframe, and there will be something with more structure than a dictionary.

Also fine, I just want to have a feeling for how it works in its current state, so I don't look at this in one month's time and find one billion lines of code I cannot follow.

@scarlehoff
Member

ps: my last point still stands, being able to do something like:

for experiment in experiments:
    for dataset in experiment.datasets:
        fktable = dataset.fktable()

would be lovely

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

ps: my last point still stands, being able to do something like:

for experiment in experiments:
    for dataset in experiment.datasets:
        fktable = dataset.fktable()

would be lovely

It would have to be more complicated, in that in general you have several fktables and an operator to put them together. So at least [load_fktable(fs) for fs in ds.fkspecs], which you can do now (although again this does nothing about cfactors yet). Some version of this can easily be made a method of DataSetSpec.
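A hedged sketch of what such a method might eventually do. The operator names follow the usual NNPDF conventions (NULL, ADD, RATIO), but this mapping is illustrative, not the actual validphys implementation:

```python
import numpy as np

# Illustrative operator table: combine per-fktable predictions into the
# dataset prediction (names assumed, not taken from validphys)
OPS = {
    "NULL": lambda parts: parts[0],
    "ADD": lambda parts: np.sum(parts, axis=0),
    "RATIO": lambda parts: parts[0] / parts[1],
}


def combine(predictions, op="NULL"):
    """predictions: list of per-fktable prediction arrays."""
    return OPS[op](predictions)


print(combine([np.array([2.0, 4.0]), np.array([1.0, 2.0])], op="RATIO"))  # [2. 2.]
```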

@scarrazza
Member

@Zaharid I think we should decide if we merge or close this PR because maintaining a prototype code alive for too long seems a bad idea (i.e. accumulating conflicts, and people forgetting about its benefits).

@Zaharid
Contributor Author

Zaharid commented Mar 31, 2020

Well, I have been saying for over a year that this is pretty useless on its own and should be picked up. I am fine with merging it, but then again the documentation is full of typos, so it has very clearly not been reviewed.

@scarlehoff
Member

It does say WIP, if it is to be merged I can review it.

@Zaharid
Contributor Author

Zaharid commented Mar 31, 2020

@scarrazza please note the comments starting from
#404 (comment)

is there anything else?

@scarrazza
Member

Ok, I did not realize the current implementation is done with read_csv instead of read_parquet.
Fine by me then, this PR implements a useful pure python functionality.

Member

@scarlehoff scarlehoff left a comment


Even if it is just a list with the options I think we want any option we add to the code to have its place at https://docs.nnpdf.science/vp/index.html so that people know that the load_fktable option exists without having to guess.
This is a good place to start such a list.

Other than that I'm fine with it.

Comment thread validphys2/src/validphys/fkparser.py Outdated
"NX": int
})
gi = GridInfo(**{k.lower(): v for k, v in d.items()})
return gi, l, h
Member


It would be nice to have something more descriptive than d, l, h even if it seems obvious (it is not obvious to me)

@scarlehoff
Member

Mm, this is so old the checks are failing because they try to find packages at https://zigzah.com :_

Zaharid and others added 9 commits March 31, 2020 18:04
Add functionality to read fktables in pure python. This supports
transparently both compressed and plain tables. We have functionality to
process both types of sections defined in the spec, namely blobs and
option lists. We support type casting (and more general parsing) for
the option values and define various specialized parsing for various
blobs (at the moment based on numpy). These still need some work. At the
moment the code relies on the untested assumption that the FastKernel
section comes last. For the moment, return a dictionary with all the
fields.
This is the easiest way to correctly process the indexes. Moreover we
need something more complicated than a one character separator because
there are fktables that use '\t' and ' ' at the same time. For the
moment this only works for hadronic tables, as DIS have different
indexing.
Return a higher level object, with the required information to do
a convolution, and a bunch of metadata. Parse the TheoryInfo (although we
might not use much of it in practice) and refactor a bit the error
checking so as to not clutter the main code path.
Add some lower level functionality to parse the files, which could be
useful if we e.g. did something with the Monte Carlo uncertainty. At
high level, all the cfactors simply multiply the fktable.
@scarrazza scarrazza changed the title [WIP] Fkparser Fkparser Mar 31, 2020
Zaharid added 3 commits April 2, 2020 17:11
These could potentially be obtained in various other ways, so it makes
sense to separate them and make it clear that they are not tied to the
file format.
Make minor cosmetic changes and extend and improve docs.
In fairness one should also test the error paths but these are
unimportant and difficult enough that I can't be bothered.

However more quantitative tests are in fact needed, but those require
some convolution code.
@Zaharid
Contributor Author

Zaharid commented Apr 2, 2020

Ok, please have a look now. I'll make another PR adding some convolution code and enhance docs and tests accordingly.


Member

@scarlehoff scarlehoff left a comment


I took the liberty to add a reference to FKTableData in the docs description (got also bitten by the -f thing :_ ) and fix a few typos. This can be merged imo.

Comment thread doc/sphinx/source/vp/pydataobjs.rst Outdated
----------------

Currently only FKTables can be directly without C++ code. This is implemented
in the :py:mod:`validphys.fkarser` module. For example::
Contributor


Lol, this was unfortunate

@Zaharid Zaharid merged commit baa885a into master Apr 3, 2020
@Zaharid Zaharid deleted the fkparser branch April 22, 2020 11:00

Successfully merging this pull request may close these issues.

Write a validphys function to process fktables

4 participants