# 2024-02-23 Performance analysis with Python wrapper for C callback (C extension)

This technical notes records the performance analysis done after the following
code changes.

We implemented a Python wrapper for a C callback completely as a C extension.
Previously, it was done partially in Python as a class `Callback` that
implements `__call__` and used `conversion.call_c_func_from_python` function
to do actual type conversion from Python to C and invoke a C callback.

Implementing this class completely as a C extension allows to allocated
memory for all required arguments once, in the initialization, instead
of doing memory allocations and deallocations in the above-mentioned
`conversion.call_c_func_from_python` function.

The required amount of memory for holding values as C types is completely
determined by the types: that is, we only need to know that, for example,
a C callback accepts `OIF_F64` and `OIF_ARRAY_F64`, to allocate the memory
for these variables that is then used at each callback invocation.

**IMPORTANT** As before, the code is modified such that Python implementations
still use C callback instead of Python callables.

## Procedure

We analyze performance using command
```shell
python -m memray run -o memray-`dtiso8601`.bin \
    examples/compare_performance_ivp_burgers_eq.py \
    all --n_runs 3
```

where `memray` is a memory profiler.

## Normalized performance results

Figure shows the normalized runtimes (with respect to the "native" results,
that is, direct invocation of `scipy.integrate.ode` objects).

```{figure} img/2024-02-23-ivp_burgers_perf_normalized.pdf

Normalized runtime relative to the "native" code executation of directly
calling `scipy.integrate.ode.dopri5` from Python
for different grid resolutions.
Values less than unity are due to the difference in numerical methods and
implementations.
```

## Quantitative data

```
           N                  101               201               401               801              1001              2001              4001              8001              10001
        20001
scipy_ode_dopri5           0.84   0.02       1.54   0.02       2.61   0.04       4.40   0.07       5.32   0.07      10.68   0.05      25.21   0.23      65.20   0.42      96.45   1.13
    315.75   6.11
sundials_cvode             0.45   0.02       0.91   0.01       1.70   0.02       3.46   0.01       4.37   0.03       9.03   0.04      32.76   0.34      86.71   1.60     127.89   0.57
    469.90   1.90
native_scipy_ode_dopri5    0.56   0.01       1.08   0.01       1.79   0.01       3.23   0.05       4.03   0.03       8.76   0.13      21.27   0.23      56.60   0.36      83.31   0.72
    295.05   3.80
```

These are the performance penalty normalized data for `scipy_ode_dopri5` versus
native call:
| Resolution  | Normalized run time |
|------------:|--------------------:|
| 101         | 1,50                |
| 1001        | 1,32                |
|10'001       | 1,16                |
| 20'001      | 1,07                |


## Memory profiling

As I ran tests for runtime at different resolutions under the `memray` memory
profiler, I was able also to plot a memory usage plot to demonstrate that there
are no huge memory leaks (I am not sure that there no, but it least there are
no huge):

```{figure} img/2024-02-26-memray-profiling.png

Memory usage for the script `compare_performance_burgers_eqn.py` for different
resolutions. We can see that large resolutions like 10'001 and 20'001 points
required significant amount of memory but it seems that all the memory is
properly released, when a particular implementation is removed.
```