-
Notifications
You must be signed in to change notification settings - Fork 22
Description
Describe the issue:
Hello, after a few hours of hair pulling trying to profile my code and thinking I was spending too much time in a recursive call, which I apparently wasn't, I think I found the problem, and it has to do with using profiling in recursive calls.
When you initiate recursive calls (like Loop<IDIR>(t,dt)) you can then have several idfx::pushRegion called consecutively by subfunctions like calcflux and calcrhs (if you have pushed/popped them in their code) and then idfx:popRegion() is called several times within that recursion without higher-level pushRegion in the recursion being closed first by their own popRegion().
I suspect this is messing up time accounting in the profiler... I have a code with
EvolveStage()
idfx::pushRegion("Evolvestage")
Do main stuff
Loop<IDIR>
idfx::popRegion()
but in Loop I have calls to calcflux and calcrhs which themselves look like
calcflux()
idfx::pushRegion("Calcflux")
Do stuff
idfx::popRegion()
so the pushRegion("Evolvestage") is not closed by a pop before popRegion() is called in Calcflux (and it gets worse because then loop is called recursively on direction and so is calcflux, before the popRegion of EvolveStage is called).
In this example I noticed that in the profiling output a great deal of the CPU time actually spent in the main part of EvolveStage was attributed to the function (CalcFlux) called in Loop<IDIR> . So I think the popRegion from that CalcFlux increments the time counter for the time spent in this function not with the time really spent in this function but with the time elapsed since the main pushRegion in EvolveStage.
Or something like that...it's very clear in any case that the profiling time accounting in this case is doing weird
things.
when I do a pushRegion in the main of EvolveStage followed by a popRegion() BEFORE Loop is called
and then a sequential push/popRegion enclosing Loop<IDIR> , also removing the push/pop in calcflux, I do get the right timings. So it seems like it is the nesting of pushRegion and popRegion is causing trouble here...
N.B.: I have been careful to pass the right function names to the pushRegion calls, so it's not a problem of incrementing one counter with the timing of several functions.
I haven't looked at the entrails of the mechanics of profilings and of pushRegion/popRegion but I think what I see kind of makes senses considering popRegion does not take as an argument the function that it times, so is maybe blind to the specific pushRegion time information it needs to take into account (and apparently it takes that from the very first push in the nesting).
not sure if it's expected behaviour, and if one is supposed to always call the profiling pushRegion/popRegion functions sequentially ? This is not how we coded things for the PATOU fork in the Ecology::EvolveStage at least,
and a quick look suggests it it also how things are done inside Fluid::EvolveStage with LoopDir and the inner call to calcflux of the Riemann solver
Does this make sense ?
Error message:
runtime information:
Patou merged to recent master of idefix
Idefix version 2.2.01-d5257c4
Built against Kokkos 40301
Compiled on May 1 2025 at 21:14:47
N.B.: all observations for a GPU run, haven't tested on CPUs