Acceleration with Numba

We explore how the computation of cost functions can be dramatically accelerated with numba’s JIT compiler.

The run-time of iminuit is usually dominated by the execution time of the cost function. To get good performance, it is recommended to use array arithmetic and scipy and numpy functions in the body of the cost function. Python loops should be avoided, but if they are unavoidable, Numba can help. Numba can also parallelize numerical calculations to make full use of multi-core CPUs and even do computations on the GPU.
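As a rough illustration of why array arithmetic is preferred over Python loops, the following sketch computes the same negative log-likelihood of an exponential pdf in both styles. The function names `nll_loop` and `nll_vectorized`, the sample size, and the scale value are assumptions made for this example only.

```python
import math

import numpy as np

rng = np.random.default_rng(1)
lambd = 4.0  # hypothetical scale parameter, chosen for illustration
x = rng.exponential(scale=lambd, size=10_000)


def nll_loop(x, lambd):
    # One Python-level iteration per data point: slow for large arrays
    total = 0.0
    for xi in x:
        total -= math.log(math.exp(-xi / lambd) / lambd)
    return total


def nll_vectorized(x, lambd):
    # Same computation with array arithmetic: the loop runs inside numpy
    return -np.sum(np.log(np.exp(-x / lambd) / lambd))


print(nll_loop(x, lambd), nll_vectorized(x, lambd))
```

Both functions return the same value up to floating-point rounding; timing them with `%timeit` should show the difference in speed on your machine.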

Note: This tutorial shows how one can generate faster pdfs with Numba. Before you start to write your own pdf, please check whether one is already implemented in the numba_stats library. If you have a pdf that is not included there, please consider contributing it to numba_stats.

[1]:
# !pip install matplotlib numpy numba scipy iminuit
from iminuit import Minuit
import numpy as np
import numba as nb
import math
from scipy.stats import expon, norm
from matplotlib import pyplot as plt
from argparse import Namespace

The standard fit in particle physics is the fit of a peak over some smooth background. We generate a Gaussian peak over exponential background, using scipy.

[2]:
np.random.seed(1)  # fix seed

# true parameters for signal and background
truth = Namespace(n_sig=2000, f_bkg=10, sig=(5.0, 0.5), bkg=(0.0, 4.0))
n_bkg = truth.n_sig * truth.f_bkg

# make a data set
x = np.empty(truth.n_sig + n_bkg)

# fill m variables
x[: truth.n_sig] = norm(*truth.sig).rvs(truth.n_sig)
x[truth.n_sig :] = expon(*truth.bkg).rvs(n_bkg)

# cut a range in x
xrange = np.array((1.0, 9.0))
ma = (xrange[0] < x) & (x < xrange[1])
x = x[ma]

plt.hist(
    (x[truth.n_sig :], x[: truth.n_sig]),
    bins=50,
    stacked=True,
    label=("background", "signal"),
)
plt.xlabel("x")
plt.legend();
[3]:
# ideal starting values for iminuit
start = np.array((truth.n_sig, n_bkg, truth.sig[0], truth.sig[1], truth.bkg[1]))


# iminuit instance factory, will be called a lot in the benchmarks below
def m_init(fcn):
    m = Minuit(fcn, start, name=("ns", "nb", "mu", "sigma", "lambd"))
    m.limits = ((0, None), (0, None), None, (0, None), (0, None))
    m.errordef = Minuit.LIKELIHOOD
    return m
[4]:
# extended likelihood (https://doi.org/10.1016/0168-9002(90)91334-8)
# this version uses numpy and scipy and array arithmetic
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    s = norm(mu, sigma)
    b = expon(0, lambd)
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = s.cdf(xrange)
    bn = b.cdf(xrange)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    return (n_sig + n_bkg) - np.sum(
        np.log(s.pdf(x) / sn * n_sig + b.pdf(x) / bn * n_bkg)
    )


nll(start)
[5]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();

Let’s see whether we can beat that. The code above is already pretty fast, because numpy and scipy routines are fast, and we spend most of the time in those. But these implementations do not parallelize the execution and are not optimised for this particular CPU, unlike numba-jitted functions.

To use numba, in theory we just need to put the njit decorator on top of the function, but often that doesn’t work out of the box. numba understands many numpy functions, but none from scipy. We must evaluate the code that uses scipy in ‘object mode’, which is numba-speak for calling into the Python interpreter.

[6]:
# first attempt to use numba
@nb.njit(parallel=True)
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    with nb.objmode(spdf="float64[:]", bpdf="float64[:]", sn="float64", bn="float64"):
        s = norm(mu, sigma)
        b = expon(0, lambd)
        # normalisation factors are needed for pdfs, since x range is restricted
        sn = np.diff(s.cdf(xrange))[0]
        bn = np.diff(b.cdf(xrange))[0]
        spdf = s.pdf(x)
        bpdf = b.pdf(x)
    no = n_sig + n_bkg
    return no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))


nll(start)  # test and warm-up JIT
[7]:
%%timeit -r 3 -n 1 m = m_init(nll)
m.migrad()

It is even a bit slower. :( Let’s break the original function down by parts to see why.
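For reference, the quantity that nll computes can be written out as follows. The symbols are notation introduced here, not taken from the code: f_s and F_s are the pdf and cdf of the Gaussian signal, f_b and F_b those of the exponential background, and x_lo, x_hi are the two edges stored in xrange.

```latex
-\ln L(n_s, n_b, \mu, \sigma, \lambda)
  = (n_s + n_b)
  - \sum_i \ln\!\left[
      n_s \, \frac{f_s(x_i; \mu, \sigma)}{F_s(x_\mathrm{hi}) - F_s(x_\mathrm{lo})}
    + n_b \, \frac{f_b(x_i; \lambda)}{F_b(x_\mathrm{hi}) - F_b(x_\mathrm{lo})}
    \right]
```

The cdf differences are evaluated at only two points and are cheap. The expensive pieces are the two pdfs, which are evaluated for every data point, and the sum of logarithms.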

[8]:
# let's time the body of the function
n_sig, n_bkg, mu, sigma, lambd = start
s = norm(mu, sigma)
b = expon(0, lambd)
# normalisation factors are needed for pdfs, since x range is restricted
sn = np.diff(s.cdf(xrange))[0]
bn = np.diff(b.cdf(xrange))[0]
spdf = s.pdf(x)
bpdf = b.pdf(x)

%timeit -r 3 -n 100 norm(*start[2:4]).pdf(x)
%timeit -r 3 -n 500 expon(0, start[4]).pdf(x)
%timeit -r 3 -n 1000 n_sig + n_bkg - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))

Most of the time is spent in norm and expon, which numba could not accelerate, and the total time is dominated by the slowest part.

This, unfortunately, means we have to do much more manual work to make the function faster, since we have to replace the scipy routines with Python code that numba can accelerate and run in parallel.
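The scipy routines to be replaced have simple closed forms. Written with the same parameters as in the code (mu, sigma for the Gaussian, lambd as the scale of the exponential with its location fixed at zero), they are:

```latex
f_s(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\tfrac{1}{2} z^2\right),
\qquad z = \frac{x - \mu}{\sigma}

F_s(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]

f_b(x) = \frac{1}{\lambda} \exp\!\left(-\frac{x}{\lambda}\right),
\qquad
F_b(x) = 1 - \exp\!\left(-\frac{x}{\lambda}\right)
\quad (x \ge 0)
```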

[9]:
# when parallel is enabled, also enable associative math
kwd = {"parallel": True, "fastmath": {"reassoc", "contract", "arcp"}}


@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
    return np.sum(np.log(fs * spdf + fb * bpdf))


@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm


@nb.njit(**kwd)
def nb_erf(x):
    y = np.empty_like(x)
    for i in nb.prange(len(x)):
        y[i] = math.erf(x[i])
    return y


@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
    invs = 1.0 / (sigma * np.sqrt(2))
    z = (x - mu) * invs
    return 0.5 * (1 + nb_erf(z))


@nb.njit(**kwd)
def expon_pdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return inv_lambd * np.exp(-inv_lambd * x)


@nb.njit(**kwd)
def expon_cdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return 1.0 - np.exp(-inv_lambd * x)


def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = norm_cdf(xrange, mu, sigma)
    bn = expon_cdf(xrange, lambd)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    spdf = norm_pdf(x, mu, sigma)
    bpdf = expon_pdf(x, lambd)
    no = n_sig + n_bkg
    return no - sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)


nll(start)  # test and warm-up JIT

Let’s see how well these versions do:

[10]:
%timeit -r 5 -n 100 norm_pdf(x, *start[2:4])
%timeit -r 5 -n 500 expon_pdf(x, start[4])
%timeit -r 5 -n 1000 sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)

Only a minor improvement for sum_log, but the pdf calculation was drastically accelerated. Since this was the bottleneck before, we also expect Migrad to finish faster now.

[11]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();

Success! We managed to get a big speed improvement over the initial code. This is impressive, but it cost us a lot of developer time. This is not always a good trade-off, especially if you consider that library routines are heavily tested, while you always need to test your own code in addition to writing it.

By putting these faster functions into a library, however, we would only have to pay the developer cost once. You can find those in the numba_stats library.

Try to compile the functions again with parallel=False to see how much of the speed increase came from the parallelization and how much from the generally optimized code that numba generated for our specific CPU. On my machine, the gain was entirely due to numba.

In general, it is good advice to not automatically add parallel=True, because this comes with the overhead of breaking data into chunks, copying the chunks to the individual CPUs, and finally merging everything back together. For large arrays, this overhead is negligible, but for small arrays, it can be a net loss.

So why is numba so fast even without parallelization? We can look at the assembly code generated.

[12]:
for signature, code in norm_pdf.inspect_asm().items():
    print(f"signature: {signature}\n{'-'*(len(str(signature)) + 11)}\n{code[:1000]}\n[...]")

This code section is very long, but the assembly grammar is very simple. Constants start with . and SOMETHING: is a jump label for the assembly equivalent of goto. Everything else is an instruction, with its name on the left and its arguments on the right.

The interesting commands are those that end with pd and ps (packed double and packed single precision). These are SIMD instructions that operate on up to eight doubles at once. This is where the speed comes from.

[13]:
import re
from collections import Counter

for signature, code in norm_pdf.inspect_asm().items():
    print(f"signature: {signature}\n{'-'*(len(str(signature)) + 11)}")
    simd_instructions = re.findall(" *([a-z]+p[ds])\t*%", code)
    c = Counter(simd_instructions)
    print("SIMD instructions")
    for k, v in c.items():
        print(f"{k:10}: {v:5}")
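To see what the regular expression in the cell above picks up, here is a sketch that applies it to a few hand-written assembly lines. The lines in `sample_asm` are made up for illustration and are not actual numba output.

```python
import re
from collections import Counter

# Hand-written AT&T-style assembly lines, for illustration only
sample_asm = "\n".join(
    [
        "\tvmovapd\t%ymm5, %ymm0",
        "\tvsubpd\t%ymm1, %ymm0, %ymm0",
        "\tvmulpd\t%ymm0, %ymm0, %ymm2",
        "\tvmulpd\t%ymm3, %ymm2, %ymm2",
        "\tmovq\t%rax, %rcx",  # scalar move, does not end in pd or ps
        ".LBB0_1:",  # jump label
        "\tvxorps\t%xmm4, %xmm4, %xmm4",
    ]
)

# Same pattern as above: a mnemonic ending in pd or ps, followed by a register
simd = re.findall(" *([a-z]+p[ds])\t*%", sample_asm)
counts = Counter(simd)
for name, n in counts.items():
    print(f"{name:10}: {n:5}")
```

Note that the pattern requires a % directly after the mnemonic, so it only counts instructions whose first operand is a register.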

  • mov: copy values from memory to CPU registers and back

  • sub: subtract numbers

  • mul: multiply numbers

  • xor: binary xor, the compiler often inserts these to zero out memory

You can google all the other commands.

There is a lot of repetition, because the optimizer partially unrolled some loops to make them faster. Using unrolled loops only works if the remaining chunk of data is large enough. Since the compiler does not know the length of the incoming array, it also generates sections which handle shorter chunks and all the code to select which section to use. Finally, there is some code which does the translation from and to Python objects with corresponding error handling.

We don’t need to write SIMD instructions by hand. The optimizer does it for us, and in a very sophisticated way.