Transform ekore in header-only C++ library #246

Closed

felixhekhorn wants to merge 12 commits into master from cffi


Conversation

@felixhekhorn (Collaborator) commented Apr 14, 2023

Alternative to #189 by @alecandido

For the recent status see this post


  • it is possible to use cffi under numba - as demonstrated by this PR (currently only for the cern_polygamma function)
  • to run this, add CFFI to the environment and compile the .c file with the given command (because of this, the tests are currently failing)
  • however, this is currently much slower than master because we lost caching:
NumbaWarning: Cannot cache compiled function "quad_ker" as it uses dynamic globals (such as ctypes pointers and large global arrays)
    @nb.njit(cache=True)

which results in

Evolution: computing operators - 1/60 took: 99.568683 s
Evolution: computing operators - 2/60 took: 1.401214 s
  • since I ported cern_polygamma, the non-cacheability bubbles up everywhere (as $\psi_k$ is the core of ekore) ...
  • anything that is not a direct dependency (such as interpolation) should still be cached
  • CFFI is still lacking full support for complex numbers at the moment, so I had to work around that a bit
  • and of course we're entering the minefield of pointers
  • note also that numba needs void* in the "header", while the actual functions use double* - I believe this is needed to convince numba to resolve the actual address for the call
  • this PR conserves the LHA LO benchmark

cc @scarrazza as we were discussing this a bit when you were around 🙃

@felixhekhorn felixhekhorn added the refactor Refactor code label Apr 14, 2023
@scarrazza (Member)

@felixhekhorn thanks for this, finally a sensible choice, before moving to messy languages and compilation tools 😆.

@felixhekhorn (Collaborator, Author) commented Apr 17, 2023

  • this is currently much slower than master

I think we should drop this PR right away ... we're not gaining any speed. If we assume that evaluating $\psi_0$ is the most time-consuming part (which should be fairly reasonable), then evaluating $\psi_0$ 1e4 times at 100 points gives: t(c) ~ 2s, but t(numba) ~ 1.7s - so the C interface is in fact slower! (No idea how this can happen though ... I assume numba has some special treatment for cffi calls.) In fact, just calling a CFFI function that does nothing (i.e. it contains nothing but return 0;) already takes 0.5s (same settings), which is crazy

Note that caching plays no role here, since within a single script the (generated) code is always cached anyway

When I double-checked the last statement, I realized there was some caching going on after all ... so I removed all caching (I hope) and now we have a small gain here: t(c) ~ 2s (as before) vs. t(numba) ~ 2.6s. Just calling CFFI, of course, remains at 0.5s
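
The overhead claim above can be checked in miniature with a stdlib-only sketch (hypothetical stand-in: libm's fabs plays the role of the C function; absolute numbers are machine-dependent):

```python
import ctypes
import ctypes.util
import timeit

# Stdlib-only sketch of the per-call FFI overhead discussed above.
# Assumption: a Unix-like system; libm's fabs stands in for the C function.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.fabs.argtypes = [ctypes.c_double]
libm.fabs.restype = ctypes.c_double

def py_fabs(x):
    # pure-Python reference implementation
    return x if x >= 0 else -x

n = 100_000
t_ffi = timeit.timeit(lambda: libm.fabs(-1.0), number=n)
t_py = timeit.timeit(lambda: py_fabs(-1.0), number=n)
print(f"per call: ctypes {t_ffi / n * 1e9:.0f} ns vs pure Python {t_py / n * 1e9:.0f} ns")
```

The point is only that each foreign call carries a fixed marshalling cost, which can dominate when the callee itself is cheap.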

@felixhekhorn (Collaborator, Author)

so really the non-cacheability is killing this PR

@alecandido (Collaborator) commented Apr 17, 2023

Is CFFI recompiling at each call? It seems silly and unbelievable...

@felixhekhorn (Collaborator, Author)

Is CFFI recompiling at each call? It seems silly and unbelievable...

the .so file is only compiled once, of course (by me - I hope); the problem is the numba cache, which conflicts with it

@alecandido (Collaborator)

the .so file is only compiled once, of course (by me - I hope); the problem is the numba cache, which conflicts with it

Okay, you're right - I forgot you started from CERNlib (I was actually considering it completely external to EKO).

Is CFFI usage then killing the Numba cache?
It is a bit hard to tell from your partially "deleted" comment.

@felixhekhorn (Collaborator, Author)

Is CFFI usage then killing the Numba cache?

yes! see the warning in the description:

NumbaWarning: Cannot cache compiled function "quad_ker" as it uses dynamic globals (such as ctypes pointers and large global arrays)
@nb.njit(cache=True)

@alecandido (Collaborator) commented Apr 18, 2023

Okay, I see - maybe I got their point: a Numba-cached function should be more or less the analog of a .so, which can refer to other symbols.
However, the ctypes interface is maybe not exposing actual discoverable symbols, so they might be using a different mechanism: getting a pointer to something that is loaded by someone else.
But in that case it would be difficult to cache, because you only have a pointer, so you would lose the connection to the pointed-to object once reloaded...

There are a lot of maybes in this explanation - I have no idea about Numba internals (other than that they use a stripped-down version of LLVM). But yes, if they have no mechanism to link to other .so files, the incremental replacement is impossible, unless we drop the quad_ker decoration.

What about replacing quad_ker in the first place?
If we can't make Numba accept callables from somewhere else, maybe we can make the somewhere else accept callables from Numba.

@felixhekhorn (Collaborator, Author)

What about replacing quad_ker in the first place?
If we can't make Numba accept callables from somewhere else, maybe we can make the somewhere else accept callables from Numba.

oookay, playing a bit more - this is possible: I attach the relevant files below.

However, it is not clear to me how this would help us ... fine, we would not need to port everything at once, but we still could not tackle the main obstacle (the conjecture being that it is $\psi_0$), because now we would need to work top-down ...

cfunc.c
// gcc -shared -o cfunc.so -fPIC cfunc.c
#include<stdio.h>

typedef double (*fnc)(double, double);

double do_add(fnc f, double a, double b) {
    return f(a,b);
}

double my_add(double a, double b) { return a + b; }
cfunc.py
from cffi import FFI
import numba as nb
import numpy as np
import ctypes

from cernlibpy import cern_polygamma

# step 1: we create a numba c-function
# ------------------------------------
@nb.cfunc("float64(float64, float64)")
def add(x, y):
    return x + y + np.real(cern_polygamma(4.,0))


print("calling add.ctypes", add.ctypes(4.0, 5.0))  # prints 4 + 5 + Re(psi_0(4)) ~ 10.256
# print(add.address)

# step 2a: we call add in c via cffi
# ----------------------------------------

ffi = FFI()
cernlibc = ffi.dlopen('./cfunc.so')

ffi.cdef("""
double do_add(void* f, double a, double b);
""")

# we need to "activate" the actual function first
c_do_add = cernlibc.do_add

@nb.njit() # we need this here, else FFI complains
def do_add(addr,a,b):
    res = c_do_add(addr, a, b)
    return res

print("calling do_add via numba",do_add(add.address, 4.,6.))

# step 2b: or we call add in c via ctypes
# --------------------------------------- 

libc = ctypes.CDLL("cfunc.so")
my_add_py = libc.my_add
my_add_py.argtypes = [ctypes.c_double, ctypes.c_double]
my_add_py.restype = ctypes.c_double
print("calling my_add via ctypes ",my_add_py(4.,7.))

do_add_py = libc.do_add
CMPFUNC_t = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double)
do_add_py.argtypes = [CMPFUNC_t, ctypes.c_double, ctypes.c_double]
do_add_py.restype = ctypes.c_double

print("calling do_add via ctypes", do_add_py(ctypes.cast(add.address,CMPFUNC_t),4.,8.))
cernlibpy.py
import numpy as np
import numba as nb


@nb.njit()
def cern_polygamma(Z, K):
    # fmt: off
    DELTA = 5e-13
    R1 = 1
    HF = R1/2
    C1 = np.pi**2
    C2 = 2*np.pi**3
    C3 = 2*np.pi**4
    C4 = 8*np.pi**5

    # SGN is originally indexed 0:4 -> no shift
    SGN = [-1,+1,-1,+1,-1]
    # FCT is originally indexed -1:4 -> shift +1
    FCT = [0,1,1,2,6,24]

    # C is originally indexed 1:6 x 0:4 -> swap indices and shift new last -1
    C = nb.typed.List()
    C.append([
            8.33333333333333333e-2,
           -8.33333333333333333e-3,
            3.96825396825396825e-3,
           -4.16666666666666667e-3,
            7.57575757575757576e-3,
           -2.10927960927960928e-2])
    C.append([
            1.66666666666666667e-1,
           -3.33333333333333333e-2,
            2.38095238095238095e-2,
           -3.33333333333333333e-2,
            7.57575757575757576e-2,
           -2.53113553113553114e-1])
    C.append([
            5.00000000000000000e-1,
           -1.66666666666666667e-1,
            1.66666666666666667e-1,
           -3.00000000000000000e-1,
            8.33333333333333333e-1,
           -3.29047619047619048e+0])
    C.append([
            2.00000000000000000e+0,
           -1.00000000000000000e+0,
            1.33333333333333333e+0,
           -3.00000000000000000e+0,
            1.00000000000000000e+1,
           -4.60666666666666667e+1])
    C.append([10., -7., 12., -33., 130., -691.])
    U=Z
    X=np.real(U)
    A=np.abs(X)
    if K < 0 or K > 4:
        raise NotImplementedError("Order K has to be in [0:4]")
    if np.abs(np.imag(U)) < DELTA and np.abs(X+int(A)) < DELTA:
        raise ValueError("Argument Z equals non-positive integer")
    K1=K+1
    if X < 0:
        U=-U
    V=U
    H=0
    if A < 15:
        H=1/V**K1
        for I in range(1,14-int(A)+1):
            V=V+1
            H=H+1/V**K1
        V=V+1
    R=1/V**2
    P=R*C[K][6-1]
    for I in range(5,1-1,-1):
        P=R*(C[K][I-1]+P)
    H=SGN[K]*(FCT[K+1]*H+(V*(FCT[K-1+1]+P)+HF*FCT[K+1])/V**K1)
    if K == 0:
        H=H+np.log(V)
    if X < 0:
        V=np.pi*U
        X=np.real(V)
        Y=np.imag(V)
        A=np.sin(X)
        B=np.cos(X)
        T=np.tanh(Y)
        P=complex(B,-A*T)/complex(A,B*T)
        if K == 0:
            H=H+1/U+np.pi*P
        elif K == 1:
            H=-H+1/U**2+C1*(P**2+1)
        elif K == 2:
            H=H+2/U**3+C2*P*(P**2+1)
        elif K == 3:
            R=P**2
            H=-H+6/U**4+C3*((3*R+4)*R+1)
        elif K == 4:
            R=P**2
            H=H+24/U**5+C4*P*((3*R+5)*R+2)
    return H
    # fmt: on

Comment thread src/ekorepp/anomalous_dimensions/unpolarized/space_like/as1.hpp
@felixhekhorn felixhekhorn changed the title Use CFFI as numba alternative Transform ekore in header-only C++ library Jul 14, 2023
@felixhekhorn (Collaborator, Author)

Okay, this is now in a presentable state, so I'm asking everybody for some feedback on how to proceed @alecandido @giacomomagni @niclaurenti @t7phy (cc @scarrazza ) ... let me summarize some points:

The goals I consider are:

  1. make ekore accessible to other people
  2. if possible, improve the compilation strategy
  3. if possible, speed things up

I'm only considering the simplest setup here: FFNS, LO QCD, LHA

I believe I have addressed some of these goals:
a) The idea of this PR is to make ekore a header-only C++ library, i.e. in practice it will be sufficient to do something like

#include "ekorepp/types.hpp"
#include "ekorepp/harmonics/cache.hpp"
#include "ekorepp/anomalous_dimensions/unpolarized/space_like.hpp"

int main() {
  // ...
  ekorepp::harmonics::Cache c(n);
  const ekorepp::tnsr res = ekorepp::anomalous_dimensions::unpolarized::spacelike::gamma_singlet(order_qcd, c, nf);
  // ...
}

this should be sufficient for most languages to interface with. I opted for C++ because 1) I don't speak C and 2) I don't want to deal with pointers if I can avoid it, and we need some lists/tensors.
b) The idea, however, is to lift only the complex-analysis part (ekore) out, and keep the rest within numpy, because we like the algebra there (such as matrix multiplication and matrix inversion). Moreover, the conjecture (yet to be proven) is that the computationally heavy part is in ekore (and not in the matrix manipulation). This PR achieves that by porting only ekore to C++ and returning after the computation back to Numba, where we have access to numpy:

return args.py(re.data(), im.data(),
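
Schematically, this hand-off back to Python can be sketched with a ctypes callback (names and the exact signature here are hypothetical, not the PR's actual interface):

```python
import ctypes
import numpy as np

# Hypothetical callback shape for the hand-off described above: the C++ side
# fills separate real/imaginary double buffers and calls back into Python,
# which reassembles a complex numpy array for further (numpy) algebra.
CALLBACK = ctypes.CFUNCTYPE(
    None,
    ctypes.POINTER(ctypes.c_double),  # real parts
    ctypes.POINTER(ctypes.c_double),  # imaginary parts
    ctypes.c_int,                     # number of entries
)

received = {}

@CALLBACK
def assemble(re, im, n):
    # view the raw buffers as numpy arrays and rebuild the complex result
    received["gamma"] = (
        np.ctypeslib.as_array(re, shape=(n,))
        + 1j * np.ctypeslib.as_array(im, shape=(n,))
    ).copy()

# Stand-in for the C++ caller (in the PR this call would happen inside the
# compiled shared library instead):
re_buf = (ctypes.c_double * 2)(1.0, 2.0)
im_buf = (ctypes.c_double * 2)(-0.5, 0.5)
assemble(re_buf, im_buf, 2)
print(received["gamma"])  # complex array with entries 1-0.5j and 2+0.5j
```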

c) as said, only ekore is ported, nothing more. This means we additionally need the actual interface, which lives under eko/ (and not under ekorepp/). I.e. to run this branch you need to execute this line
// g++ -shared -o c_quad_ker.so -fPIC src/eko/evolution_operator/c_quad_ker.cpp -O2
to compile the interface. In practice this command has to be rerun every time something inside ekore changes.
d) the caching issue discussed earlier in this PR is completely gone - we're caching everything. We lost JIT compilation, because I need a function pointer address to pass around. This function would still need to be compiled on the client side, but (I hope) that should be quick, because it is not thousands of lines, just a few. The major compilation (of the C++) should instead happen at wheel-creation time.
e) we get a sizeable speed up: running $ EKO_LOG_STDOUT=1 poe lha -m "ffns and lo and not pol" takes ~50s on master vs. ~12s here (without C++ optimization flags) and ~9s with -O2 (no further improvement with -O3). See also the PS below.
f) despite the branch name, there is no trace of CFFI left; we only use ctypes (which is part of the standard library)

What do you think? Do we want to go down this road?

There are of course still some problems left:

  1. deployment: create the correct libraries for the different OS/arch combinations and deploy them to PyPI
  2. unit tests: find a good C++ unit test framework and port the unit tests
  3. documentation: find a way to export doxygen to sphinx and to display the ekorepp docs only
  4. find a way to translate the N3LO stuff (up to N2LO, I think, I could do by hand, but not beyond ...)

(each of the above points of course also applies to any other strategy)


PS: I believe that in master we're not running full numba under quad, because there is this functools.partial here

return functools.partial(

in the middle, which very likely kills us. In fact, to run hard code under quad you need a LowLevelCallable, as is done here

func = LowLevelCallable(c_quad_ker_qcd, user_data)

and in master we're not doing that.
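
For reference, the LowLevelCallable route can be sketched as follows (here the callback is a ctypes function defined in Python purely for illustration; in this PR the pointer would instead come from the compiled shared library):

```python
import ctypes
from scipy import LowLevelCallable, integrate

# The callback must have one of the signatures quad understands, e.g.
# double f(int n, double *xx).
SIG = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_int, ctypes.POINTER(ctypes.c_double))

@SIG
def integrand(n, xx):
    # xx[0] is the integration variable
    return xx[0] * xx[0]

func = LowLevelCallable(integrand)
val, err = integrate.quad(func, 0.0, 1.0)
print(val)  # ~ 1/3
```

(Of course wrapping a Python function this way gains nothing performance-wise; the point is only the plumbing.)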

Comment thread src/eko/evolution_operator/c_quad_ker.cpp
@giacomomagni (Collaborator)

LGTM, if we improve by a factor of 5 it's quite good!

  3. documentation: find a way to export doxygen to sphinx and to display ekorepp docs only

what about this https://breathe.readthedocs.io/en/latest/ ?
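
If we go the breathe route, a minimal conf.py fragment might look like this (project name and XML path are assumptions, not taken from this PR):

```python
# Hypothetical Sphinx conf.py fragment for the breathe route suggested above.
extensions = [
    "breathe",
]
# point breathe at the XML that `doxygen` generates
breathe_projects = {"ekorepp": "../doxygen/xml"}
breathe_default_project = "ekorepp"
```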

  4. find a way to translate N3LO stuff (up to N2LO, I think, I could do by hand, but not beyond ...)

we can write something to automatise the process or import everything again from Mathematica depending on the case.

@alecandido (Collaborator) commented Jul 17, 2023

  4. find a way to translate N3LO stuff (up to N2LO, I think, I could do by hand, but not beyond ...)

we can write something to automatise the process or import everything again from Mathematica depending on the case.

This is just (not too complicated) string manipulation, we can do it :)
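
A toy sketch of the kind of mechanical translation meant here (hypothetical rewrite rules, only two patterns and no nesting support, far from complete):

```python
import re

# Toy sketch of translating a Python/numba expression to C++.
def py_to_cpp(expr: str) -> str:
    # a**b -> pow(a, b), for simple (non-nested) operands only
    expr = re.sub(r"(\w+(?:\.\w+)*)\s*\*\*\s*(\w+)", r"pow(\1, \2)", expr)
    # a few numpy intrinsics -> std:: equivalents
    for np_name, cpp_name in [("np.log", "std::log"), ("np.sin", "std::sin"), ("np.cos", "std::cos")]:
        expr = expr.replace(np_name, cpp_name)
    return expr

print(py_to_cpp("np.log(V) + 1/V**K1"))  # -> std::log(V) + 1/pow(V, K1)
```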

@felixhekhorn (Collaborator, Author)

  1. deployment: create the correct libraries for the different OS/arch and deploy them to PyPI

https://cibuildwheel.readthedocs.io/en/stable/

@felixhekhorn (Collaborator, Author)

2. unit test: find a good C++ unit test framework and port unit tests

@felixhekhorn (Collaborator, Author)

closed in favor of #189

@felixhekhorn felixhekhorn deleted the cffi branch January 12, 2024 12:19