Conversation
|
Also, it is fine if |
|
Unfortunately, as it stands now it is quite a lot slower (I think because pandas.read_csv does more work than what we do in C++). |
|
What about using cffi? |
|
wrt the sigma tensor, my personal experience tells me the "grid full of 0s" is better. |
|
@scarlehoff Well, now you get a multiindexed object with the index explicitly attached to the points. Keeping all the zeros would make it quite a bit bigger. And anyhow this is not meant to be used in high level interfaces. |
Well, we would have to write or find something that is fast enough to be worth it and produces something equivalent (and also supports '\s+' as a separator, because we seem to need that). My preference would be to use this parser to just dump everything to parquet (or whatever) and be done with it (see the sketch below), and then upload all the theories again in whatever format we choose. Btw, could you have a look at why apfelcomb has this behaviour of using spaces for some values and tabs for others? |
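For what it's worth, the parquet round trip itself is trivial once the table is parsed; a minimal sketch, assuming the FastKernel grid already lives in a pandas DataFrame (needs pyarrow or fastparquet installed; the helper name is hypothetical):

import pandas as pd

def roundtrip_fktable(sigma: pd.DataFrame, path: str) -> pd.DataFrame:
    # Dump the parsed grid; the MultiIndex and dtypes survive the trip.
    sigma.to_parquet(path)
    # Reading it back needs no custom parsing (or separator games) at all.
    return pd.read_parquet(path)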
|
@scarlehoff e.g. for ATLAS1JET11 you get 99285 rows instead of |
|
...not to mention that only 31 of the 196 flavour combinations we write for this thing have some nonzero entry. So we really could make it a lot smaller in memory. |
|
Actually, the flavour combinations I'm happy with, because you can define your dimensions as (xgrid, flavour combinations, experimental points) and it is still a dense block, consistent between different datasets.
But I didn't realise there were so many empty values in the "xgrid" dimension. It's hard to argue with that. |
|
@scarrazza e.g. instead of having to play games with csv one could just do: |
|
Ok, the parquet solution looks good to me. |
|
Ok, so I am leaning towards proposing that the output for fit purposes is going to be a table with the same index as now, where the columns are going to be the indexes in the FlavourMap that are nonzero. You just have to process the columns and the index to construct the appropriate PDF combinations (see the sketch below). It would probably make sense for the data index to be the last rather than the first. |
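For concreteness, a minimal sketch of decoding such columns, assuming they are positions in the flattened 14x14 flavour-combination grid (the same row-major convention as the np.indices trick in the stubs below; the function name is illustrative):

import numpy as np

def decode_flavour_columns(columns):
    # Each column is a position in the raveled 14x14 flavour grid;
    # recover the two flavour indices of the combination.
    cols = np.asarray(columns)
    fl1, fl2 = np.divmod(cols, 14)
    return fl1, fl2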
|
Here are some stub functions that produce convolutions. AFAICT from a quick test these give results comparable to the C++ code. A lot could be done to optimize them in various ways, but it is rather easy to do by matching the pandas indices (and some numpy indexing magic).

import pandas as pd
import numpy as np

from validphys.pdfbases import evolution
from validphys.pdfgrids import xplotting_grid

# The 14 flavours of the evolution basis used in FK tables
FK_FLAVOURS = evolution.to_known_elements(
    [
        "photon",
        "singlet",
        "g",
        "V",
        "V3",
        "V8",
        "V15",
        "V24",
        "V35",
        "T3",
        "T8",
        "T15",
        "T24",
        "T35",
    ]
)


def hadron_predictions(pdf, loaded_fk):
    xgrid = loaded_fk["xGrid"]
    Q = float(loaded_fk["TheoryInfo"]["Q0"])
    sigma = loaded_fk["FastKernel"]
    # Keep only the nonzero flavour combinations recorded in the flavour map
    sigma = sigma.loc[:, loaded_fk["FlavourMap"].ravel()]
    gv = xplotting_grid(
        pdf=pdf, Q=Q, flavours=FK_FLAVOURS, xgrid=xgrid, basis=evolution
    ).grid_values
    fm = sigma.columns
    # Recover the two flavour indices from the flattened 14x14 combination grid
    fl1, fl2 = np.indices((14, 14))
    fl1 = fl1.ravel()[fm]
    fl2 = fl2.ravel()[fm]
    gv1fl = gv[:, fl1, :]
    gv2fl = gv[:, fl2, :]

    def appl(df):
        # The second and third index levels are positions in the xgrid
        xx1 = df.index.get_level_values(1)
        xx2 = df.index.get_level_values(2)
        gv1 = gv1fl[:, :, xx1]
        gv2 = gv2fl[:, :, xx2]
        # Sum over flavour combinations and grid points for each PDF member
        return pd.Series(np.einsum("ijk,ijk,kj->i", gv1, gv2, df.values))

    return sigma.groupby(level=0).apply(appl)


def dis_predictions(pdf, loaded_fk):
    xgrid = loaded_fk["xGrid"]
    Q = float(loaded_fk["TheoryInfo"]["Q0"])
    sigma = loaded_fk["FastKernel"]
    sigma = sigma.loc[:, loaded_fk["FlavourMap"].ravel()]
    fm = sigma.columns
    # Only one PDF enters for DIS, so restrict the grid to the active flavours
    gv = xplotting_grid(
        pdf=pdf, Q=Q, flavours=FK_FLAVOURS[fm], xgrid=xgrid, basis=evolution
    ).grid_values

    def appl(df):
        xind = df.index.get_level_values(1)
        # Sum over flavours and grid points for each PDF member
        return pd.Series(np.einsum("ijk,kj->i", gv[:, :, xind], df.values))

    return sigma.groupby(level=0).apply(appl)

Just to emphasise that these functions are not optimized at all. They are more to test for correctness. |
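For reference, calling these would look like the following (hypothetical usage; `pdf` is a validphys PDF object and `loaded_fk` the dictionary produced by the parser):

preds = hadron_predictions(pdf, loaded_fk)   # hadronic tables
# preds = dis_predictions(pdf, loaded_fk)    # DIS tables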
|
@scarrazza it would be great if somebody from Milan could start playing with this to see how fast it can be made. Note that we don't only care about the fit, but would also like to have primitives for LHAPDF grids that are more generally useful for vp. |
|
Just to be sure, the idea is to use this in order to access the fktables directly, right? Right now we basically do the following, where "fk_parser" is some more or less complicated function we wrote and parsed_fktable is a dictionary with the grid in x, the flavour map, etc. In conclusion, it would be great if there were a way to simplify that to: |
Yes, the idea is to bypass the C++ code, so we can make use of fancier allocation strategies and computation methods. You get a tensor (which at the moment is a pandas dataframe, but doesn't have to be) with the required slices, and then do something similar to the functions I showed above to compute the predictions.
It would be a little rough at the moment in that there is no support for COMPOUND or cfactors, but you can load the table itself:

In [2]: from validphys.loader import Loader
In [3]: l = Loader()
In [4]: l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)
Out[4]: DataSetSpec(name='ATLASTTBARTOT', commondata=CommonDataSpec(datafile=PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/DATA_ATLASTTBARTOT.dat'), sysfile=PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/systypes/SYSTYPE_ATLASTTBARTOT_DEFAULT.dat'), plotfiles=(PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/PLOTTINGTYPE_INC.yaml'), PosixPath('/home/zah/anaconda3/share/NNPDF/data/commondata/PLOTTING_ATLASTTBARTOT.yaml'))), fkspecs=(FKTableSpec(fkpath=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52/fastkernel/FK_ATLASTTBARTOT.dat'), cfactors=()),), thspec=TheoryIDSpec(id=52, path=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52')), cuts=None, frac=1, op='NULL', weight=1)
In [5]: ds = l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)
In [6]: ds.fkspecs
Out[6]: (FKTableSpec(fkpath=PosixPath('/home/zah/anaconda3/share/NNPDF/data/theory_52/fastkernel/FK_ATLASTTBARTOT.dat'), cfactors=()),)
In [7]: from validphys.fkparser import load_fktable
In [9]: load_fktable(ds.fkspecs[0])
Yeah, see above. The final interface will change. E.g. you will never see the flavour map, but only the indexes on the dataframe, and there will be something with more structure than a dictionary.
|
As long as it works for some datasets we can actually check and compare.
Also fine, I just want to have a feeling for how it works in its current state, so I don't look at this in one month's time and find one billion lines of code I cannot follow. |
|
ps: my last point still stands, being able to do something like: would be lovely |
It would have to be more complicated, in that in general you have several fktables and an operator to put them together (see the sketch below). So at least |
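For illustration, a minimal sketch of what combining several fktables with the dataset operation could look like. `predictions` stands for a hypothetical dispatcher to convolution code like the stubs above; only 'NULL' appears in this thread, while 'ADD' and 'RATIO' are assumed names for typical compound operations:

from validphys.fkparser import load_fktable

OPERATIONS = {
    "NULL": lambda x: x,                  # single table, no operation
    "ADD": lambda *xs: sum(xs),           # assumed: sum of tables
    "RATIO": lambda num, den: num / den,  # assumed: ratio of two tables
}

def dataset_predictions(pdf, dataset, predictions):
    # One set of predictions per fktable, combined with the dataset operator.
    parts = [predictions(pdf, load_fktable(spec)) for spec in dataset.fkspecs]
    return OPERATIONS[dataset.op](*parts)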
|
@Zaharid I think we should decide whether to merge or close this PR, because keeping prototype code alive for too long seems a bad idea (i.e. accumulating conflicts, and people forgetting about its benefits). |
|
Well, I have been saying for over a year that this is pretty useless on its own and should be picked up. I am fine with merging it, but then again the documentation is full of typos, so it has very clearly not been reviewed. |
|
It does say WIP, if it is to be merged I can review it. |
|
@scarrazza please note the comments starting from the one above. Is there anything else? |
|
Ok, I did not realize the current implementation is done with |
Even if it is just a list with the options, I think we want any option we add to the code to have its place at https://docs.nnpdf.science/vp/index.html, so that people know that the load_fktable option exists without having to guess.
This is a good place to start such a list.
Other than that I'm fine with it.
| "NX": int | ||
| }) | ||
| gi = GridInfo(**{k.lower(): v for k, v in d.items()}) | ||
| return gi, l, h |
It would be nice to have something more descriptive than d, l, h, even if it seems obvious (it is not obvious to me).
|
Mm, this is so old the checks are failing because they try to find packages at |
Add functionality to read fktables in pure python. This supports transparently both compressed and plain tables. We have functionality to process both types of sections defined in the spec, namely blobs and option lists. We support type casting (and more general parsing) for the option values and define various specialized parsers for various blobs (at the moment based on numpy). These still need some work. At the moment the code relies on the untested assumption that the FastKernel section comes last. For the moment, return a dictionary with all the fields.
This is the easiest way to correctly process the indexes. Moreover, we need something more complicated than a one-character separator, because there are fktables that use '\t' and ' ' at the same time. For the moment this only works for hadronic tables, as DIS tables have different indexing.
Return a higher level object with the required information to do a convolution, and a bunch of metadata. Parse the TheoryInfo (although we might not use much of it in practice) and refactor the error checking a bit so as not to clutter the main code path.
Add some lower level functionality to parse the cfactor files, which could be useful if we e.g. did something with the Monte Carlo uncertainty. At high level, all the cfactors simply multiply the fktable (see the sketch after these notes).
These could potentially be obtained in various other ways, so it makes sense to separate them and make it clear that they are not tied to the file format.
Make minor cosmetic changes and extend and improve docs.
In fairness one should also test the error paths, but these are unimportant and difficult enough that I can't be bothered. However, more quantitative tests are in fact needed, but those require some convolution code.
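Regarding the note above that cfactors simply multiply the fktable: a minimal sketch, assuming each cfactor gives one central value per data point and that the data point is level 0 of the sigma MultiIndex (the helper name is hypothetical):

import pandas as pd

def apply_cfactors(sigma, central_values_list):
    # Each cfactor contributes one multiplicative central value per data
    # point; apply them all along level 0 of the MultiIndex.
    result = sigma
    for central_values in central_values_list:
        factor = pd.Series(central_values)
        result = result.multiply(factor, axis=0, level=0)
    return result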
|
Ok, please have a look now. I'll make another PR adding some convolution code and enhance docs and tests accordingly. |
|
scarlehoff left a comment:
I took the liberty of adding a reference to FKTableData in the docs description (also got bitten by the -f thing :_ ) and fixing a few typos. This can be merged imo.
----------------

Currently only FKTables can be read directly without C++ code. This is implemented
in the :py:mod:`validphys.fkarser` module. For example::
Lol, this was unfortunate
This adds a pure python parser for fktables and is work in progress. At the moment it should give sensible results for both compressed and uncompressed hadronic fktables. The most frequent usage should be something like:
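# A sketch reconstructed from the session shown earlier in this thread
# (dataset name and theory id are just examples):
from validphys.loader import Loader
from validphys.fkparser import load_fktable

l = Loader()
ds = l.check_dataset('ATLASTTBARTOT', theoryid=52, cuts=None)
table = load_fktable(ds.fkspecs[0])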
Some thoughts and considerations:
I am trying to follow the spec https://github.com/NNPDF/nnpdf/blob/master/doc/data/data_layout.pdf and have not looked at the C++ code in a long time. Let's see how that goes.
The fktable format is very irregular. For example, the fktable in the examples uses both (!) tabs and spaces to delimit numbers in different places of the fastkernel grid. This makes parsing slower than it has to be, and means that about the only non-manual solution that has any hope of working is pandas.read_csv (see the sketch below). This shows you why an opaque binary format that forces you to specify everything precisely is better in the long term. At the moment I am prioritizing good error messages and clear code over performance. I suspect that this will not matter much, as most of the time will go into parsing the sigma grids, which is done with an external function.
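As an aside, a minimal illustration of the separator issue (made-up sample rows; the real index handling lives in the parser):

import io
import pandas as pd

# Rows mixing '\t' and ' ' as delimiters, as some fktables do; the
# r"\s+" separator copes with both at once.
text = "0 2\t21 1.5\n0 2\t22 2.5\n0 3\t15 0.5\n"
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, index_col=[0, 1, 2])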
The theory info field is a bit redundant unless somebody is doing something very wrong. I can't really be bothered to write the types for all the keys. Does anybody think those might be needed? More generally, I think for the fit we will only need the tensor and the boolean mask, possibly also with a flag to specify whether it is hadronic or not. The rest of the data might be useful for analysis purposes. There is also some redundancy (e.g. ndata and nx) which will be used to check the consistency of the fktable. In any case there will be some higher level objects than a dictionary at some point, which will check for things like required fields.
I would like very much for this to serve to make the parallel mode of validphys faster. For that it would be best to allocate the sigma grid on some shared memory store. I know how to do this manually on linux (/dev/shm), but not so much in a cross platform way. For that I am looking at https://arrow.apache.org/ and in particular https://arrow.apache.org/docs/python/plasma.html. There is the question of whether we could load an arrow table directly on a shared store and avoid copying, but that would mean patching their csv parser (which is more limited but probably faster than the pandas one because it is multithreaded). At the moment I am leaning towards implementing all that as a layer on top of the parser, even though it is not the most efficient approach.

The sigma is effectively encoded as a sparse tensor in the x combinations and a dense tensor in the flavour indexes. So we have all 196 possible combinations of the 14 flavours, many of which are typically zero, but only the x combinations that actually appear are indexed. For example, the first few entries of the index of the table above look like (0, 2, 21), (0, 2, 22), (0, 2, 23), (0, 2, 24), (0, 3, 15), (0, 3, 16), (0, 3, 17), where the first index is the data point and the other two are indexes in the xgrid (a sketch of expanding this back to a dense grid follows below). @scarrazza @scarlehoff do you want this, or would you prefer a full grid full of zeros?

Overall I am too old for writing parsers. I think that now that we can process these files into pure python we should just dump them to parquet (https://parquet.apache.org/), because it seems to be the format with most support (and there are direct interfaces for both Arrow and pandas, so we really would never have to look inside the file). This would make things easier and faster.
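To make the trade-off concrete, a minimal sketch of expanding the sparse table back into the full grid of zeros (function name and axis ordering are illustrative; nx is the number of points in the xgrid):

import numpy as np
import pandas as pd

def to_dense(sigma: pd.DataFrame, nx: int) -> np.ndarray:
    # Expand the (datapoint, x1, x2) MultiIndex into a dense array,
    # keeping the (already dense) flavour-combination axis last.
    ndata = sigma.index.get_level_values(0).max() + 1
    dense = np.zeros((ndata, nx, nx, sigma.shape[1]))
    d = sigma.index.get_level_values(0)
    x1 = sigma.index.get_level_values(1)
    x2 = sigma.index.get_level_values(2)
    dense[d, x1, x2, :] = sigma.values
    return dense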
Additionally this needs better docs and tests.
Will close #377.