Fkparser#404

Merged
Zaharid merged 13 commits into master from fkparser
Apr 3, 2020

Conversation

@Zaharid
Contributor

@Zaharid Zaharid commented Mar 20, 2019

This adds a pure python parser for fktables and is work in progress. At the moment it should give sensible results for both compressed and uncompressed hadronic fktables. The most frequent usage should be something like:

from validphys.fkparser import load_fktable
from validphys.loader import Loader
l = Loader()
fk = l.check_fktable(setname="ATLASTTBARTOT", theoryID=52, cfac=())
f = load_fktable(fk)

Some thoughts and considerations:

  • I am trying to follow the spec https://github.com/NNPDF/nnpdf/blob/master/doc/data/data_layout.pdf and have not looked at the C++ code in a long time. Let's see how that goes.

  • The fktable format is very irregular. For example, the fktable in the examples uses both (!) tabs and spaces to delimit numbers in different places of the fastkernel grid. This makes parsing slower than it has to be and means that about the only non-manual solution with any hope of working is pandas.read_csv. This shows why an opaque binary format that forces you to specify everything precisely is better in the long term.
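A minimal illustration of why pandas.read_csv copes with this (the sample data is made up, not taken from a real fktable):

```python
import io

import pandas as pd

# Hypothetical fragment mixing tabs and spaces as delimiters, like the
# fastkernel grids do
raw = "0\t2 21\t1.5e-3\n0\t2 22\t2.5e-3\n"

# sep=r'\s+' treats any run of whitespace as a single delimiter, which is
# about the only robust way to read such a file without a manual parser
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
print(df.shape)  # (2, 4)
```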

  • At the moment I am prioritizing good error messages and clear code over performance. I suspect that this will not matter much as most of the time will go into parsing the sigma grids, which is done with an external function.

  • The theory info field is a bit redundant unless somebody is doing something very wrong. I can't really be bothered to write the types for all the keys. Does anybody think those might be needed? More generally, I think for the fit we will only need the tensor and the boolean mask, possibly also with a flag to specify whether it is hadronic or not. The rest of the data might be useful for analysis purposes. There is also some redundancy (e.g. ndata and nx), which will be used to check the consistency of the fktable. In any case there will be some higher-level objects than a dictionary at some point, which will check for things like required fields.

  • I would very much like this to serve to make the parallel mode of validphys faster. For that it would be best to allocate the sigma grid on some shared memory store. I know how to do this manually on Linux (/dev/shm), but not so much in a cross-platform way. For that I am looking at https://arrow.apache.org/ and in particular https://arrow.apache.org/docs/python/plasma.html. There is the question of whether we could load an Arrow table directly on a shared store and avoid copying, but that would mean patching their csv parser (which is more limited but probably faster than the pandas one because it is multithreaded). At the moment I am leaning towards implementing all that as a layer on top of the parser, even though it is not the most efficient approach.

  • The sigma is effectively encoded as a sparse tensor in the x combinations and a dense tensor in the flavour indexes. So we have all possible 14 flavour combinations, many of which are typically zero, but only some indexed for the xgrids. For example the first few entries of the index of the table above look like (0, 2, 21), (0, 2, 22), (0, 2, 23), (0, 2, 24), (0, 3, 15), (0, 3, 16), (0, 3, 17), where the first index is the data point and the other two are indexes in the xgrid. @scarrazza @scarlehoff do you want this, or would you prefer a full grid of zeros?
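To make the trade-off concrete, here is a toy sketch (all shapes and values made up) of expanding such a sparse (datapoint, x1, x2) index into the alternative grid full of zeros:

```python
import numpy as np
import pandas as pd

# Toy sigma: two stored rows, two nonzero flavour columns
# (all numbers are made up for illustration)
index = pd.MultiIndex.from_tuples(
    [(0, 2, 3), (0, 3, 1)], names=["data", "x1", "x2"]
)
sigma = pd.DataFrame([[1.0, 0.5], [2.0, 0.0]], index=index, columns=[0, 7])

# Expanding into the dense alternative: everything not indexed stays zero
ndata, nx, nflav = 1, 4, 196
dense = np.zeros((ndata, nx, nx, nflav))
for (d, i, j), row in sigma.iterrows():
    dense[d, i, j, list(row.index)] = row.values

print(dense[0, 2, 3, 0], dense[0, 2, 3, 7])  # 1.0 0.5
```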

  • Overall I am too old for writing parsers. I think that now that we can process these files into pure python, we should just dump them to parquet (https://parquet.apache.org/), because it seems to be the format with the most support (and there are direct interfaces for both Arrow and pandas, so we really would never have to look inside the file). This would make things easier and faster.

Additionally this needs better docs and tests.

Will close #377.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

Also, is it fine if load_fktable takes care of applying the cfactors? Seems one less thing to worry about down the line.
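In case it helps the discussion: applying cfactors is, at its core, a per-datapoint rescaling of the table. A minimal sketch, with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy fktable rows indexed by (datapoint, x1, x2); cfactor values made up
index = pd.MultiIndex.from_tuples(
    [(0, 1, 2), (0, 2, 2), (1, 0, 3)], names=["data", "x1", "x2"]
)
sigma = pd.DataFrame([[2.0], [6.0], [4.0]], index=index, columns=[0])

cfac = np.array([1.1, 0.9])  # one central correction per data point

# Broadcast each datapoint's cfactor over all of its rows
corrected = sigma.mul(cfac[sigma.index.get_level_values("data")], axis=0)
print(corrected[0].tolist())  # ≈ [2.2, 6.6, 3.6]
```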

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

Unfortunately as it stands now, it is quite a lot slower (I think due to pandas.read_csv doing more work than what we do in C++).

%timeit f = load_fktable(fk)
2.53 s ± 49.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit fk.load()
1.56 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@scarrazza
Member

What about using cffi?

@scarlehoff
Member

wrt the sigma tensor, my personal experience tells me the "grid full of 0s" is better.
Otherwise you have to refer to some well-explained documentation or ask someone who knows (which I did) and even then it required a bit of trial and error to understand why the size of my tensors didn't match.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

@scarlehoff Well, now you get a multi-indexed object with the index explicitly attached to the points. Keeping all the zeros would make it quite a bit bigger. And anyhow this is not meant to be used in high-level interfaces.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

What about using cffi?

Well we would have to write or find something that is fast enough to be worth it, and produces something equivalent (and also supports '\s+' as separator because we seem to need that).

My preference would be to use this parser to just dump everything to parquet (or whatever) and be done with it. And just upload all the theories again in whatever format we choose.

Btw, could you have a look why apfelcomb has this behaviour of using spaces for some values and tabs for others?

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

@scarlehoff e.g for ATLAS1JET11 you get 99285 rows instead of 140*45*45, so you save two thirds of the space.
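For reference, the arithmetic behind that saving:

```python
# Dense grid size for ATLAS1JET11: ndata * nx * nx rows
full = 140 * 45 * 45
sparse = 99285  # rows actually stored in the fktable

print(full)                         # 283500
print(round(1 - sparse / full, 2))  # 0.65, i.e. roughly two thirds saved
```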

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

...not to mention that only 31 of the 196 flavour combinations we write for this thing have some nonzero entry. So we really could make it a lot smaller in memory.

@scarlehoff
Member

Actually the flavour combinations I'm happy with, because you can define your dimensions as (xgrid, flavour combinations, experimental points) and it is still a dense block, consistent between different datasets.

@scarlehoff e.g for ATLAS1JET11 you get 99285 instead of 140*45*45, so you save two thirds of the space.

But I didn't realise there were so many empty values in the "xgrid" dimension. It's hard to argue with that.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

@scarrazza e.g. instead of having to play games with csv one could just do:

%timeit pd.read_parquet('/tmp/delthis.pk')
111 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pd.read_csv('/tmp/delthis', sep='\t', index_col=(0,1,2))
2.02 s ± 61.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@scarrazza
Member

Ok, the parquet solution looks good to me.

@Zaharid
Contributor Author

Zaharid commented Mar 20, 2019

Ok, so I am leaning towards proposing that the output for fit purposes is going to be a table with the same index as now and the columns are going to be the indexes in the FlavourMap that are nonzero. You just have to process the columns and the index to construct the appropriate pdf combinations.

It would probably make sense that the data index is the last rather than the first.

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

Here are some stub functions that produce convolutions. AFAICT from a quick test these give results comparable to the C++ code. A lot could be done to optimize them in various ways, which is rather easy to do by matching the pandas indices (and some numpy indexing magic).

import pandas as pd
import numpy as np

from validphys.pdfbases import evolution
from validphys.pdfgrids import xplotting_grid

FK_FLAVOURS = evolution.to_known_elements(
    [
        "photon",
        "singlet",
        "g",
        "V",
        "V3",
        "V8",
        "V15",
        "V24",
        "V35",
        "T3",
        "T8",
        "T15",
        "T24",
        "T35",
    ]
)


def hadron_predictions(pdf, loaded_fk):
    xgrid = loaded_fk["xGrid"]
    Q = float(loaded_fk["TheoryInfo"]["Q0"])
    sigma = loaded_fk["FastKernel"]
    sigma = sigma.loc[:, loaded_fk["FlavourMap"].ravel()]

    gv = xplotting_grid(
        pdf=pdf, Q=Q, flavours=FK_FLAVOURS, xgrid=xgrid, basis=evolution
    ).grid_values
    fm = sigma.columns
    fl1, fl2 = np.indices((14, 14))
    fl1 = fl1.ravel()[fm]
    fl2 = fl2.ravel()[fm]
    gv1fl = gv[:, fl1, :]
    gv2fl = gv[:, fl2, :]

    def appl(df):
        xx1 = df.index.get_level_values(1)
        xx2 = df.index.get_level_values(2)
        gv1 = gv1fl[:, :, xx1]
        gv2 = gv2fl[:, :, xx2]
        return pd.Series(np.einsum("ijk,ijk,kj->i", gv1, gv2, df.values))

    return sigma.groupby(level=0).apply(appl)


def dis_predictions(pdf, loaded_fk):
    xgrid = loaded_fk["xGrid"]
    Q = float(loaded_fk["TheoryInfo"]["Q0"])
    sigma = loaded_fk["FastKernel"]
    sigma = sigma.loc[:, loaded_fk["FlavourMap"].ravel()]

    fm = sigma.columns
    gv = xplotting_grid(
        pdf=pdf, Q=Q, flavours=FK_FLAVOURS[fm], xgrid=xgrid, basis=evolution
    ).grid_values

    def appl(df):
        xind = df.index.get_level_values(1)
        return pd.Series(np.einsum("ijk,kj->i", gv[:, :, xind], df.values))

    return sigma.groupby(level=0).apply(appl)

Just to emphasise that these functions are not optimized at all. They are there more to test for correctness.
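For anyone wanting to check the contraction pattern in isolation, here is a self-contained comparison of the hadronic einsum against explicit loops (all shapes and values are random stand-ins, much smaller than real grids):

```python
import numpy as np

rng = np.random.default_rng(0)

# i: PDF members, j: flavour pairs, k: sigma rows
nmem, nflav, nrows = 3, 5, 7
gv1 = rng.random((nmem, nflav, nrows))
gv2 = rng.random((nmem, nflav, nrows))
sig = rng.random((nrows, nflav))

fast = np.einsum("ijk,ijk,kj->i", gv1, gv2, sig)

# The same contraction written out, to make the index pattern explicit
slow = np.zeros(nmem)
for i in range(nmem):
    for j in range(nflav):
        for k in range(nrows):
            slow[i] += gv1[i, j, k] * gv2[i, j, k] * sig[k, j]

assert np.allclose(fast, slow)
```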

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

@scarrazza it would be great if somebody from Milan could start playing with this to see how fast it can be made. Note that we don't only care about the fit, but would also like to have primitives for LHAPDF grids that are more generally useful for vp.

@scarlehoff
Member

scarlehoff commented Mar 21, 2019

Just to be sure, the idea is to use this in order to access the fktables directly, right?
If that's the case I think we might be able to use this already. Is there a way to access it directly from the DataSetSpec objects?

Right now we basically* do:

dataset_c = dataset_spec.load()
raw_fktable = dataset_c.GetFK(i) 
parsed_fktable = fk_parser(raw_fktable)

where "fk_parser" is some more or less complicated function we wrote and parsed_fktable is a dictionary with the grid in x, the flavour map, etc.
*the .load() we actually do at the level of the experiment, but I understand internally it is the same.

In conclusion, if there were a way to simplify that to:
parsed_fktable = dataset_spec.fktable()
I would be very happy and I could quickly implement it and give you some feedback.

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

Just to be sure, the idea is to use this in order to access the fktables directly, right?

Yes, the idea is to bypass the C++ code, so we can make use of fancier allocation strategies and computation methods. You get a tensor (which at the moment is a pandas dataframe, but doesn't have to be) with the required slices, and then do something similar to the functions I showed above to compute the predictions.

If that's the case I think we might be able to use this already, is there a way to access it directly from the DataSetSpec objects?

It would be a little rough at the moment in that there is no support for COMPOUND or cfactors, but you can load the table itself:

In [2]: from validphys.loader import Loader                                                                           

In [3]: l = Loader()                                                                                                  

In [4]: l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)                                                      
Out[4]: DataSetSpec(name='ATLASTTBARTOT', commondata=CommonDataSpec(datafile=PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/DATA_ATLASTTBARTOT.dat'), sysfile=PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/systypes/SYSTYPE_ATLASTTBARTOT_DEFAULT.dat'), plotfiles=(PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/PLOTTINGTYPE_INC.yaml'), PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/PLOTTING_ATLASTTBARTOT.yaml'))), fkspecs=(FKTableSpec(fkpath=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52/fastkernel/FK_ATLASTTBARTOT.dat'), cfactors=()),), thspec=TheoryIDSpec(id=52, path=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52')), cuts=None, frac=1, op='NULL', weight=1)

In [5]: ds = l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)                                                 

In [6]: ds.fkspecs                                                                                                    
Out[6]: (FKTableSpec(fkpath=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52/fastkernel/FK_ATLASTTBARTOT.dat'), cfactors=()),)

In [7]: from validphys.fkparser import load_fktable                                                                                                                                                                                           


In [9]: load_fktable(ds.fkspecs[0])   

Right now we basically* do:

dataset_c = dataset_spec.load()
raw_fktable = dataset_c.GetFK(i) 
parsed_fktable = fk_parser(raw_fktable)

where "fk_parser" is some more or less complicated function we wrote and parsed_fktable is a dictionary with the grid in x, the flavour map, etc.
*the .load() we actually do at the level of the experiment, but I understand internally it is the same.

Yeah, see above. The final interface will change. E.g. you will never see the flavour map, but only the indexes on the dataframe, and there will be something with more structure than a dictionary.

In conclusion, if there were a way to simplify that to:
parsed_fktable = dataset_spec.fktable()
I would be very happy and I could quickly implement it and give you some feedback.

@scarlehoff
Member

It would be a little rough at the moment in that there is no support for COMPOUND or cfactors, but you can load the table itself:

As long as it works for some datasets we can actually check and compare

Yeah, see above. The final interface will change. E.g. you will never see the flavour map, but only the indexes on the dataframe, and there will be something with more structure than a dictionary.

Also fine, I just want to have a feeling for how it works in its current state, so I don't look at this in one month's time and find one billion lines of code I cannot follow.

@scarlehoff
Member

ps: my last point still stands, being able to do something like:

for experiment in experiments:
    for dataset in experiment.datasets:
        fktable = dataset.fktable()

would be lovely

@Zaharid
Contributor Author

Zaharid commented Mar 21, 2019

ps: my last point still stands, being able to do something like:

for experiment in experiments:
    for dataset in experiment.datasets:
        fktable = dataset.fktable()

would be lovely

It would have to be more complicated, in that in general you have several fktables and an operator to put them together. So at least [load_fktable(fs) for fs in ds.fkspecs], which you can do now (although again this does nothing about cfactors yet). Some version of this can easily be made a method of DataSetSpec.
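A hedged sketch of what such a method might eventually do. The operator names follow the usual NNPDF conventions (NULL, ADD, RATIO), but this mapping is illustrative, not the actual validphys implementation:

```python
import numpy as np

# Illustrative operator table: combine per-fktable predictions into the
# dataset prediction (names assumed, not taken from validphys)
OPS = {
    "NULL": lambda parts: parts[0],
    "ADD": lambda parts: np.sum(parts, axis=0),
    "RATIO": lambda parts: parts[0] / parts[1],
}


def combine(predictions, op="NULL"):
    """predictions: list of per-fktable prediction arrays."""
    return OPS[op](predictions)


print(combine([np.array([2.0, 4.0]), np.array([1.0, 2.0])], op="RATIO"))  # [2. 2.]
```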

@scarrazza
Member

@Zaharid I think we should decide if we merge or close this PR because maintaining a prototype code alive for too long seems a bad idea (i.e. accumulating conflicts, and people forgetting about its benefits).

@Zaharid
Contributor Author

Zaharid commented Mar 31, 2020

Well, I have been saying for over a year that this is pretty useless on its own and should be picked up. I am fine with merging it, but then again the documentation is full of typos, so it has very clearly not been reviewed.

@scarlehoff
Member

It does say WIP, if it is to be merged I can review it.

@Zaharid
Contributor Author

Zaharid commented Mar 31, 2020

@scarrazza please note the comments starting from
#404 (comment)

is there anything else?

@scarrazza
Member

Ok, I did not realize the current implementation is done with read_csv instead of read_parquet.
Fine by me then, this PR implements a useful pure python functionality.

Member

@scarlehoff scarlehoff left a comment


Even if it is just a list with the options I think we want any option we add to the code to have its place at https://docs.nnpdf.science/vp/index.html so that people know that the load_fktable option exists without having to guess.
This is a good place to start such a list.

Other than that I'm fine with it.

Comment thread validphys2/src/validphys/fkparser.py Outdated
"NX": int
})
gi = GridInfo(**{k.lower(): v for k, v in d.items()})
return gi, l, h
Member


It would be nice to have something more descriptive than d, l, h even if it seems obvious (it is not obvious to me)

@scarlehoff
Member

Mm, this is so old the checks are failing because they try to find packages at https://zigzah.com :_

Zaharid and others added 9 commits March 31, 2020 18:04
Add functionality to read fktables in pure python. This supports
transparently both compressed and plain tables. We have functionality to
process both types of sections defined in the spec, namely blobs and
option lists. We support type casting (and more general parsing) for
the option values and define various specialized parsing for various
blobs (at the moment based on numpy). These still need some work. At the
moment the code relies on the untested assumption that the FastKernel
section comes last. For the moment, return a dictionary with all the
fields.
This is the easiest way to correctly process the indexes. Moreover we
need something more complicated than a one character separator because
there are fktables that use '\t' and ' ' at the same time. For the
moment this only works for hadronic tables, as DIS have different
indexing.
Return a higher level object, with the required information to do
a convolution, and a bunch of metadata. Parse the TheoryInfo (although we
might not use much of it in practice) and refactor a bit the error
checking so as to not clutter the main code path.
Add some lower level functionality to parse the files, which could be
useful if we e.g. did something with the Monte Carlo uncertainty. At
high level, all the cfactors simply multiply the fktable.
@scarrazza scarrazza changed the title [WIP] Fkparser Fkparser Mar 31, 2020
Zaharid added 3 commits April 2, 2020 17:11
These could potentially be obtained in various other ways, so it makes
sense to separate them and make it clear that they are not tied to the
file format.
Make minor cosmetic changes and extend and improve docs.
In fairness one should also test the error paths but these are
unimportant and difficult enough that I can't be bothered.

However more quantitative tests are in fact needed, but those require
some convolution code.
@Zaharid
Contributor Author

Zaharid commented Apr 2, 2020

Ok, please have a look now. I'll make another PR adding some convolution code and enhance docs and tests accordingly.


Member

@scarlehoff scarlehoff left a comment


I took the liberty to add a reference to FKTableData in the docs description (got also bitten by the -f thing :_ ) and fix a few typos. This can be merged imo.

Comment thread doc/sphinx/source/vp/pydataobjs.rst Outdated
----------------

Currently only FKTables can be directly without C++ code. This is implemented
in the :py:mod:`validphys.fkarser` module. For example::
Contributor


Lol, this was unfortunate

@Zaharid Zaharid merged commit baa885a into master Apr 3, 2020
@Zaharid Zaharid deleted the fkparser branch April 22, 2020 11:00

Successfully merging this pull request may close these issues.

Write a validphys function to process fktables

4 participants