Transform ekore in header-only C++ library #246

Closed

felixhekhorn wants to merge 12 commits into master from cffi


Conversation

@felixhekhorn (Collaborator) commented Apr 14, 2023

Alternative to #189 by @alecandido

For the recent status see this post


  • it is possible to use cffi under numba - as demonstrated by this PR (currently only for the cern_polygamma function)
  • to run this, add CFFI to the environment and compile the .c file with the given command (because of this, the tests are currently failing)
  • however, this is currently much slower than master because we lost caching:
NumbaWarning: Cannot cache compiled function "quad_ker" as it uses dynamic globals (such as ctypes pointers and large global arrays)
    @nb.njit(cache=True)

which results in

Evolution: computing operators - 1/60 took: 99.568683 s
Evolution: computing operators - 2/60 took: 1.401214 s
  • since I ported cern_polygamma, the non-cacheability bubbles up everywhere (as $\psi_k$ is the core of ekore) ...
  • anything that is not a direct dependency (such as interpolation) should still be cached
  • CFFI is still lacking full support for complex numbers at the moment, so I had to work around that a bit
  • and of course we're entering the minefield of pointers
  • note also that numba needs void* in the "header", while the actual functions use double* - I believe this is needed to convince numba to resolve the actual address for the call
  • this PR conserves the LHA LO benchmark

cc @scarrazza as we were discussing this a bit when you were around 🙃

@felixhekhorn felixhekhorn added the refactor Refactor code label Apr 14, 2023
@scarrazza (Member)

@felixhekhorn thanks for this, finally a sensible choice, before moving to messy languages and compilation tools 😆.

@felixhekhorn (Collaborator, Author) commented Apr 17, 2023

  • this is currently much slower than master

I think we should drop this PR right away ... we're not gaining any speed. If we assume that evaluating $\psi_0$ is the most time-consuming part (which should be fairly reasonable), then evaluating $\psi_0$ 1e4 times at 100 points gives: t(c) ~ 2s, but t(numba) ~ 1.7s - so the C interface is in fact slower! (No idea how this can happen though ... I assume numba has some special treatment for cffi calls.) In fact, just calling a CFFI function that does nothing (i.e. it contains nothing but return 0;) already takes 0.5s (same settings), which is crazy

Note that caching plays no role here, since within a single script the (generated) code is always cached anyway

When I double-checked the last statement, I realized there was some caching going on after all ... so I removed all caching (I hope) and now we have a small gain here: t(c) ~ 2s (as before) vs. t(numba) ~ 2.6s. Just calling CFFI, of course, remains at 0.5s
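
The overhead claim above can be checked in miniature with a stdlib-only sketch (hypothetical stand-in: libm's fabs plays the role of the C function; absolute numbers are machine-dependent):

```python
import ctypes
import ctypes.util
import timeit

# Stdlib-only sketch of the per-call FFI overhead discussed above.
# Assumption: a Unix-like system; libm's fabs stands in for the C function.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.fabs.argtypes = [ctypes.c_double]
libm.fabs.restype = ctypes.c_double

def py_fabs(x):
    # pure-Python reference implementation
    return x if x >= 0 else -x

n = 100_000
t_ffi = timeit.timeit(lambda: libm.fabs(-1.0), number=n)
t_py = timeit.timeit(lambda: py_fabs(-1.0), number=n)
print(f"per call: ctypes {t_ffi / n * 1e9:.0f} ns vs pure Python {t_py / n * 1e9:.0f} ns")
```

The point is only that each foreign call carries a fixed marshalling cost, which can dominate when the callee itself is cheap.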

@felixhekhorn (Collaborator, Author)

so really the non-cacheability is killing this PR

@alecandido (Collaborator) commented Apr 17, 2023

Is CFFI recompiling at each call? It seems silly and unbelievable...

@felixhekhorn (Collaborator, Author)

Is CFFI recompiling at each call? It seems silly and unbelievable...

the .so file is only compiled once, of course (by me - I hope); the problem is the numba cache, which conflicts with it

@alecandido (Collaborator)

the .so file is only compiled once, of course (by me - I hope); the problem is the numba cache, which conflicts with it

Okay, you're right - I forgot you started from CERNlib (I was actually considering it completely external to EKO).

Is CFFI usage then killing the Numba cache?
It is a bit hard to tell from your partially "deleted" comment.

@felixhekhorn (Collaborator, Author)

Is CFFI usage then killing the Numba cache?

yes! see the warning in the description:

NumbaWarning: Cannot cache compiled function "quad_ker" as it uses dynamic globals (such as ctypes pointers and large global arrays)
@nb.njit(cache=True)

@alecandido (Collaborator) commented Apr 18, 2023

Okay, I see - maybe I got their point: a Numba-cached function should be more or less the analog of a .so, which can refer to other symbols.
However, the ctypes interface is maybe not exposing actual discoverable symbols, so they might be using a different mechanism: getting a pointer to something that is loaded by someone else.
But in that case it would be difficult to cache, because you only have a pointer, so you would lose the connection to the pointed-to object once reloaded...

There are a lot of maybes in this explanation - I have no idea about Numba internals (other than that they use a stripped-down version of LLVM). But yes, if they have no mechanism to link to other .so files, the incremental replacement is impossible, unless we drop the quad_ker decoration.

What about replacing quad_ker in the first place?
If we can't make Numba accept callables from somewhere else, maybe we can make the somewhere else accept callables from Numba.

@felixhekhorn (Collaborator, Author)

What about replacing quad_ker in the first place?
If we can't make Numba accept callables from somewhere else, maybe we can make the somewhere else accept callables from Numba.

oookay, playing a bit more - this is possible: I attach the relevant files below.

However, it is not clear to me how this would help us ... fine, we would not need to port everything at once, but we still could not tackle the main obstacle (the conjecture being that it is $\psi_0$), because now we would need to work top-down ...

cfunc.c
// gcc -shared -o cfunc.so -fPIC cfunc.c
#include<stdio.h>

typedef double (*fnc)(double, double);

double do_add(fnc f, double a, double b) {
    return f(a,b);
}

double my_add(double a, double b) { return a + b; }
cfunc.py
from cffi import FFI
import numba as nb
import numpy as np
import ctypes

from cernlibpy import cern_polygamma

# step 1: we create a numba c-function
# ------------------------------------
@nb.cfunc("float64(float64, float64)")
def add(x, y):
    return x + y + np.real(cern_polygamma(4.,0))


print("calling add.ctypes", add.ctypes(4.0, 5.0))  # prints 4 + 5 + Re(psi_0(4)) ~ 10.256
# print(add.address)

# step 2a: we call add in c via cffi
# ----------------------------------------

ffi = FFI()
cernlibc = ffi.dlopen('./cfunc.so')

ffi.cdef("""
double do_add(void* f, double a, double b);
""")

# we need to "activate" the actual function first
c_do_add = cernlibc.do_add

@nb.njit() # we need this here, else FFI complains
def do_add(addr,a,b):
    res = c_do_add(addr, a, b)
    return res

print("calling do_add via numba",do_add(add.address, 4.,6.))

# step 2b: or we call add in c via ctypes
# --------------------------------------- 

libc = ctypes.CDLL("cfunc.so")
my_add_py = libc.my_add
my_add_py.argtypes = [ctypes.c_double, ctypes.c_double]
my_add_py.restype = ctypes.c_double
print("calling my_add via ctypes ",my_add_py(4.,7.))

do_add_py = libc.do_add
CMPFUNC_t = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double)
do_add_py.argtypes = [CMPFUNC_t, ctypes.c_double, ctypes.c_double]
do_add_py.restype = ctypes.c_double

print("calling do_add via ctypes", do_add_py(ctypes.cast(add.address,CMPFUNC_t),4.,8.))
cernlibpy.py
import numpy as np
import numba as nb


@nb.njit()
def cern_polygamma(Z, K):
    # fmt: off
    DELTA = 5e-13
    R1 = 1
    HF = R1/2
    C1 = np.pi**2
    C2 = 2*np.pi**3
    C3 = 2*np.pi**4
    C4 = 8*np.pi**5

    # SGN is originally indexed 0:4 -> no shift
    SGN = [-1,+1,-1,+1,-1]
    # FCT is originally indexed -1:4 -> shift +1
    FCT = [0,1,1,2,6,24]

    # C is originally indexed 1:6 x 0:4 -> swap indices and shift new last -1
    C = nb.typed.List()
    C.append([
            8.33333333333333333e-2,
           -8.33333333333333333e-3,
            3.96825396825396825e-3,
           -4.16666666666666667e-3,
            7.57575757575757576e-3,
           -2.10927960927960928e-2])
    C.append([
            1.66666666666666667e-1,
           -3.33333333333333333e-2,
            2.38095238095238095e-2,
           -3.33333333333333333e-2,
            7.57575757575757576e-2,
           -2.53113553113553114e-1])
    C.append([
            5.00000000000000000e-1,
           -1.66666666666666667e-1,
            1.66666666666666667e-1,
           -3.00000000000000000e-1,
            8.33333333333333333e-1,
           -3.29047619047619048e+0])
    C.append([
            2.00000000000000000e+0,
           -1.00000000000000000e+0,
            1.33333333333333333e+0,
           -3.00000000000000000e+0,
            1.00000000000000000e+1,
           -4.60666666666666667e+1])
    C.append([10., -7., 12., -33., 130., -691.])
    U=Z
    X=np.real(U)
    A=np.abs(X)
    if K < 0 or K > 4:
        raise NotImplementedError("Order K has to be in [0:4]")
    if np.abs(np.imag(U)) < DELTA and np.abs(X+int(A)) < DELTA:
        raise ValueError("Argument Z equals non-positive integer")
    K1=K+1
    if X < 0:
        U=-U
    V=U
    H=0
    if A < 15:
        H=1/V**K1
        for I in range(1,14-int(A)+1):
            V=V+1
            H=H+1/V**K1
        V=V+1
    R=1/V**2
    P=R*C[K][6-1]
    for I in range(5,1-1,-1):
        P=R*(C[K][I-1]+P)
    H=SGN[K]*(FCT[K+1]*H+(V*(FCT[K-1+1]+P)+HF*FCT[K+1])/V**K1)
    if K == 0:
        H=H+np.log(V)
    if X < 0:
        V=np.pi*U
        X=np.real(V)
        Y=np.imag(V)
        A=np.sin(X)
        B=np.cos(X)
        T=np.tanh(Y)
        P=complex(B,-A*T)/complex(A,B*T)
        if K == 0:
            H=H+1/U+np.pi*P
        elif K == 1:
            H=-H+1/U**2+C1*(P**2+1)
        elif K == 2:
            H=H+2/U**3+C2*P*(P**2+1)
        elif K == 3:
            R=P**2
            H=-H+6/U**4+C3*((3*R+4)*R+1)
        elif K == 4:
            R=P**2
            H=H+24/U**5+C4*P*((3*R+5)*R+2)
    return H
    # fmt: on

Comment thread src/ekorepp/anomalous_dimensions/unpolarized/space_like/as1.hpp
@felixhekhorn felixhekhorn changed the title Use CFFI as numba alternative Transform ekore in header-only C++ library Jul 14, 2023
@felixhekhorn (Collaborator, Author)

Okay, this is now in a presentable state, so I'm asking everybody for some feedback on how to proceed @alecandido @giacomomagni @niclaurenti @t7phy (cc @scarrazza ) ... let me summarize some points:

The goals I consider are:

  1. make ekore accessible to other people
  2. if possible, improve the compilation strategy
  3. if possible, speed things up

I'm only considering the simplest setup here: FFNS, LO QCD, LHA

I believe I have addressed some of these goals:
a) The idea of this PR is to make ekore a header-only C++ library, i.e. in practice it will be sufficient to do something like

#include "ekorepp/types.hpp"
#include "ekorepp/harmonics/cache.hpp"
#include "ekorepp/anomalous_dimensions/unpolarized/space_like.hpp"

int main() {
  // ...
  ekorepp::harmonics::Cache c(n);
  const ekorepp::tnsr res = ekorepp::anomalous_dimensions::unpolarized::spacelike::gamma_singlet(order_qcd, c, nf);
  // ...
}

this should be sufficient for most languages to interface with. I opted for C++ because 1) I don't speak C and 2) I don't want to deal with pointers if I can avoid it, and we need some lists/tensors.
b) The idea, however, is to lift only the complex-analysis part (ekore) out, and keep the rest within numpy, because we like the algebra there (such as matrix multiplication and matrix inversion). Moreover, the conjecture (yet to be proven) is that the computationally heavy part is in ekore (and not in the matrix manipulation). This PR achieves that by porting only ekore to C++ and returning after the computation back to Numba, where we have access to numpy:

return args.py(re.data(), im.data(),
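
Schematically, this hand-off back to Python can be sketched with a ctypes callback (names and the exact signature here are hypothetical, not the PR's actual interface):

```python
import ctypes
import numpy as np

# Hypothetical callback shape for the hand-off described above: the C++ side
# fills separate real/imaginary double buffers and calls back into Python,
# which reassembles a complex numpy array for further (numpy) algebra.
CALLBACK = ctypes.CFUNCTYPE(
    None,
    ctypes.POINTER(ctypes.c_double),  # real parts
    ctypes.POINTER(ctypes.c_double),  # imaginary parts
    ctypes.c_int,                     # number of entries
)

received = {}

@CALLBACK
def assemble(re, im, n):
    # view the raw buffers as numpy arrays and rebuild the complex result
    received["gamma"] = (
        np.ctypeslib.as_array(re, shape=(n,))
        + 1j * np.ctypeslib.as_array(im, shape=(n,))
    ).copy()

# Stand-in for the C++ caller (in the PR this call would happen inside the
# compiled shared library instead):
re_buf = (ctypes.c_double * 2)(1.0, 2.0)
im_buf = (ctypes.c_double * 2)(-0.5, 0.5)
assemble(re_buf, im_buf, 2)
print(received["gamma"])  # complex array with entries 1-0.5j and 2+0.5j
```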

c) as said, only ekore is ported, nothing more. This means we additionally need the actual interface, which lives under eko/ (and not under ekorepp/). I.e. to run this branch you need to execute this line
// g++ -shared -o c_quad_ker.so -fPIC src/eko/evolution_operator/c_quad_ker.cpp -O2
to compile the interface. In practice this command has to be rerun every time something inside ekore changes.
d) the caching issue discussed earlier in this PR is completely gone - we're caching everything. We lost JIT compilation, because I need a function pointer address to pass around. This function would still need to be compiled on the client side, but (I hope) that should be quick, because it is not thousands of lines, just a few. The major compilation (of the C++) should instead happen at wheel-creation time.
e) we get a sizeable speed up: running $ EKO_LOG_STDOUT=1 poe lha -m "ffns and lo and not pol" takes ~50s on master vs. ~12s here (without C++ optimization flags) and ~9s with -O2 (no further improvement with -O3). See also the PS below.
f) despite the branch name, there is no trace of CFFI left; we only use ctypes (which is part of the standard library)

What do you think? Do we want to go down this road?

There are of course still some problems left:

  1. deployment: create the correct libraries for the different OS/arch combinations and deploy them to PyPI
  2. unit tests: find a good C++ unit test framework and port the unit tests
  3. documentation: find a way to export doxygen to sphinx and to display the ekorepp docs only
  4. find a way to translate the N3LO stuff (up to N2LO, I think, I could do by hand, but not beyond ...)

(each of the above points of course also applies to any other strategy)


PS: I believe that in master we're not running full numba under quad, because there is this functools.partial here

return functools.partial(

in the middle, which very likely kills us. In fact, to run hard code under quad you need a LowLevelCallable, as is done here

func = LowLevelCallable(c_quad_ker_qcd, user_data)

and in master we're not doing that.
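
For reference, the LowLevelCallable route can be sketched as follows (here the callback is a ctypes function defined in Python purely for illustration; in this PR the pointer would instead come from the compiled shared library):

```python
import ctypes
from scipy import LowLevelCallable, integrate

# The callback must have one of the signatures quad understands, e.g.
# double f(int n, double *xx).
SIG = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_int, ctypes.POINTER(ctypes.c_double))

@SIG
def integrand(n, xx):
    # xx[0] is the integration variable
    return xx[0] * xx[0]

func = LowLevelCallable(integrand)
val, err = integrate.quad(func, 0.0, 1.0)
print(val)  # ~ 1/3
```

(Of course wrapping a Python function this way gains nothing performance-wise; the point is only the plumbing.)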

Comment thread src/eko/evolution_operator/c_quad_ker.cpp
@giacomomagni (Collaborator)

LGTM, if we improve by a factor of 5 it's quite good!

  3. documentation: find a way to export doxygen to sphinx and to display ekorepp docs only

what about this https://breathe.readthedocs.io/en/latest/ ?
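
If we go the breathe route, a minimal conf.py fragment might look like this (project name and XML path are assumptions, not taken from this PR):

```python
# Hypothetical Sphinx conf.py fragment for the breathe route suggested above.
extensions = [
    "breathe",
]
# point breathe at the XML that `doxygen` generates
breathe_projects = {"ekorepp": "../doxygen/xml"}
breathe_default_project = "ekorepp"
```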

  4. find a way to translate N3LO stuff (up to N2LO, I think, I could do by hand, but not beyond ...)

we can write something to automatise the process or import everything again from Mathematica depending on the case.

@alecandido (Collaborator) commented Jul 17, 2023

  4. find a way to translate N3LO stuff (up to N2LO, I think, I could do by hand, but not beyond ...)

we can write something to automatise the process or import everything again from Mathematica depending on the case.

This is just (not too complicated) string manipulation, we can do it :)
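
A toy sketch of the kind of mechanical translation meant here (hypothetical rewrite rules, only two patterns and no nesting support, far from complete):

```python
import re

# Toy sketch of translating a Python/numba expression to C++.
def py_to_cpp(expr: str) -> str:
    # a**b -> pow(a, b), for simple (non-nested) operands only
    expr = re.sub(r"(\w+(?:\.\w+)*)\s*\*\*\s*(\w+)", r"pow(\1, \2)", expr)
    # a few numpy intrinsics -> std:: equivalents
    for np_name, cpp_name in [("np.log", "std::log"), ("np.sin", "std::sin"), ("np.cos", "std::cos")]:
        expr = expr.replace(np_name, cpp_name)
    return expr

print(py_to_cpp("np.log(V) + 1/V**K1"))  # -> std::log(V) + 1/pow(V, K1)
```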

@felixhekhorn (Collaborator, Author)

  1. deployment: create the correct libraries for the different OS/arch and deploy them to PyPI

https://cibuildwheel.readthedocs.io/en/stable/

@felixhekhorn (Collaborator, Author)

2. unit test: find a good C++ unit test framework and port unit tests

@felixhekhorn (Collaborator, Author)

closed in favor of #189

@felixhekhorn felixhekhorn deleted the cffi branch January 12, 2024 12:19