5.7. 2024-02-23 Performance analysis with Python wrapper for C callback (C extension)

This technical note records the performance analysis done after the following code changes.

We implemented the Python wrapper for a C callback completely as a C extension. Previously, it was implemented partially in Python: a class Callback implemented __call__ and used the function conversion.call_c_func_from_python to do the actual type conversion from Python to C and to invoke the C callback.

Implementing this class completely as a C extension allows allocating the memory for all required arguments once, during initialization, instead of performing memory allocations and deallocations inside the above-mentioned conversion.call_c_func_from_python function on every call.

The amount of memory required for holding values as C types is completely determined by the argument types: for example, it is enough to know that a C callback accepts OIF_F64 and OIF_ARRAY_F64 to allocate the memory for these variables once; this memory is then reused at each callback invocation.
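To make the idea concrete, below is a minimal C sketch of such type-driven pre-allocation. Every name in it (OIFArgType, OIFArrayF64, CallbackState, and the functions) is hypothetical and does not reflect the actual implementation; error handling is omitted.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical type tags for callback arguments. */
    typedef enum { OIF_F64, OIF_ARRAY_F64 } OIFArgType;

    /* Hypothetical descriptor for an OIF_ARRAY_F64 argument. */
    typedef struct {
        long n;        /* number of elements */
        double *data;  /* set at call time; not owned by the slot */
    } OIFArrayF64;

    /* Hypothetical per-callback state with pre-allocated storage. */
    typedef struct {
        size_t num_args;
        OIFArgType *types;
        void **slots;  /* one pre-allocated slot per argument */
    } CallbackState;

    /* Initialization: allocate a slot for every argument exactly
     * once, based only on the type tags, e.g.,
     * {OIF_F64, OIF_ARRAY_F64}. */
    static CallbackState *
    callback_state_new(size_t num_args, const OIFArgType *types)
    {
        CallbackState *s = malloc(sizeof *s);
        s->num_args = num_args;
        s->types = malloc(num_args * sizeof *s->types);
        s->slots = malloc(num_args * sizeof *s->slots);
        for (size_t i = 0; i < num_args; i++) {
            s->types[i] = types[i];
            s->slots[i] = types[i] == OIF_F64
                              ? malloc(sizeof(double))
                              : malloc(sizeof(OIFArrayF64));
        }
        return s;
    }

    /* Hot path: each invocation only writes converted values into
     * the existing slots; no malloc/free happens here, unlike in
     * the old conversion.call_c_func_from_python-based code. */
    static void
    set_f64_arg(CallbackState *s, size_t i, double value)
    {
        *(double *)s->slots[i] = value;
    }

With this layout, converting a Python float at call time amounts to a single store into its slot, and an array argument only needs its descriptor fields updated to point at existing data.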

IMPORTANT As before, the code is set up such that Python implementations still use the C callback instead of Python callables.

5.7.1. Procedure

We analyze performance using the command

python -m memray run -o memray-`dtiso8601`.bin \
    examples/compare_performance_ivp_burgers_eq.py \
    all --n_runs 3

where memray is a memory profiler and dtiso8601 is a shell helper that prints the current date-time in ISO 8601 format, so that each run writes to a uniquely named capture file.
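The resulting capture file can then be rendered into a report, for example as a flamegraph:

python -m memray flamegraph memray-<timestamp>.bin

where <timestamp> stands for the concrete date-time produced by dtiso8601 during the run.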

5.7.2. Normalized performance results

Fig. 5.7.1 shows the normalized runtimes relative to the “native” results, that is, direct invocation of scipy.integrate.ode objects.

../_images/2024-02-23-ivp_burgers_perf_normalized.png

Fig. 5.7.1 Normalized runtime relative to the “native” code execution of directly calling scipy.integrate.ode.dopri5 from Python, for different grid resolutions. Values less than unity are due to differences in numerical methods and implementations.

5.7.3. Quantitative data

Runtimes for each grid resolution N (mean ± standard deviation over the three runs):

N        scipy_ode_dopri5    sundials_cvode     native_scipy_ode_dopri5
101        0.84 ±  0.02        0.45 ± 0.02         0.56 ± 0.01
201        1.54 ±  0.02        0.91 ± 0.01         1.08 ± 0.01
401        2.61 ±  0.04        1.70 ± 0.02         1.79 ± 0.01
801        4.40 ±  0.07        3.46 ± 0.01         3.23 ± 0.05
1001       5.32 ±  0.07        4.37 ± 0.03         4.03 ± 0.03
2001      10.68 ±  0.05        9.03 ± 0.04         8.76 ± 0.13
4001      25.21 ±  0.23       32.76 ± 0.34        21.27 ± 0.23
8001      65.20 ±  0.42       86.71 ± 1.60        56.60 ± 0.36
10001     96.45 ±  1.13      127.89 ± 0.57        83.31 ± 0.72
20001    315.75 ±  6.11      469.90 ± 1.90       295.05 ± 3.80

These are the normalized performance-penalty data for scipy_ode_dopri5 versus the native call:

Resolution    Normalized runtime
101           1.50
1001          1.32
10001         1.16
20001         1.07
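These normalized values are simply ratios of the corresponding runtimes from the table above; for example, at N = 101 the penalty is 0.84 / 0.56 ≈ 1.50.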

5.7.4. Memory profiling

Because the runtime tests at different resolutions were run under the memray memory profiler, I was also able to produce a memory usage plot, which demonstrates that there are no large memory leaks (I am not sure that there are none at all, but at least there are no large ones):

../_images/2024-02-26-memray-profiling.png

Fig. 5.7.2 Memory usage of the script compare_performance_ivp_burgers_eq.py for different resolutions. We can see that large resolutions such as 10001 and 20001 points require a significant amount of memory, but it seems that all of it is properly released when a particular implementation is removed.