Numba

Scientific Computing

Prof. Calvin

Why Numba?

What is Numba?

  • numba addresses a long-standing problem in scientific computing.
    • Writing code in Python, instead of C or Fortran, is fast.
    • Running code in Python, instead of C or Fortran, is slow.

C/Fortran

  • C and Fortran have been swimming around the periphery for some time.
    • E.g. NumPy is written in C, uses C numeric types.
    • E.g. SymPy can print C/Fortran code for expressions.
  • C is a general-purpose language and among the highest-performance languages in existence.
  • Fortran, for “formula translation”, is the foremost numerical computing platform (for performance).

Scripting

  • Python is a scripting language.
    • Write code in a .py file or directly in the python interpreter.
    • The python program “runs” the code.
    • No separate program is created.

Vs. C/Fortran

  • C and Fortran are compiled languages.
    • Write code.
    • Call a program, a compiler, on the code.
    • The compiler creates a program.
    • Run the program.

Numba

  • Numba is a compiler for Python.
  • More than that - a JIT (just-in-time) compiler.
  • Numba looks and feels like Python, but under the hood it compiles Python code to fast machine code.
  • It is a good halfway step to avoid having to learn C/Fortran.

In its own words:

Numba is a compiler for Python array and numerical functions that gives you the power to speed up your applications with high performance functions written directly in Python.

With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time optimized to performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.

Why not Numba?

  • Just learn C or Fortran
  • Use Cython, an alternative Python compiler.
  • Use SymPy print-to-C/Fortran
  • Use GPU acceleration via PyTorch or TensorFlow for certain applications.
    • The Meta and Google GPU-accelerated frameworks, respectively.
    • numba-cuda (NVIDIA GPU Numba) exists, but I haven’t seen it used much.

Install

Pip again!

  • Nothing special here.
python3 -m pip install numba
  • Let’s try it out.

Decorators

  • We need to introduce something called a decorator.
  • It’s a little note before a function definition that starts with @
    • This means we’ll mostly write functions to use Numba.
  • We’ll do an example; first, a minimal sketch of what a decorator is.
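  • Decorators themselves are plain Python: a decorator is just a function that takes a function and returns a (usually wrapped) function. A minimal sketch, with a toy decorator name (shout) of my own:
def shout(func):                  # takes a function...
    def wrapper(*args, **kwargs):
        print("Calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper                # ...returns a wrapped version

@shout                            # equivalent to: add = shout(add)
def add(x, y):
    return x + y

add(1, 2)                         # prints "Calling add", returns 3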

Import

  • Numba is essentially a NumPy optimizer, so we’ll import both.
from numba import jit
import numpy as np

Test it

  • @jit is the much-ballyhooed decorator.
x = np.arange(100).reshape(10, 10)

@jit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

_ = go_fast(x)

Aside: pandas

  • Numba does not accelerate pandas
  • It is the “numerical” not “data” compiler.
  • Don’t do this:
import pandas as pd

x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

@jit(forceobj=True, looplift=True) # Need to use object mode, try and compile loops!
def use_pandas(a): # Function will not benefit from Numba jit
    df = pd.DataFrame.from_dict(a) # Numba doesn't know about pd.DataFrame
    df += 1                        # Numba doesn't understand what this is
    return df.cov()                # or this!

_ = use_pandas(x)

Modes

  • Numba basically has two modes
    • object mode: slow Python mode; works for pandas
    • nopython mode: fast compiled mode; works for NumPy
  • I wouldn’t bother with Numba unless you can use nopython mode (a sketch of the difference follows).
    • In object mode, Numba does essentially nothing for you.
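  • In nopython mode, Numba refuses to compile what it cannot type instead of silently falling back. A minimal sketch (bad is a toy function of mine; TypingError is Numba’s real error class):
import pandas as pd
from numba import njit
from numba.core.errors import TypingError

@njit # nopython mode
def bad(n):
    return pd.DataFrame({'a': [n]}) # Numba cannot type pandas calls

try:
    bad(1)
except TypingError:
    print("nopython mode refused to compile pandas code")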

Timing

time

  • It is hard to motivate Numba without seeing how fast it is.
  • We will introduce the time module.
import time

time.time()
1749072667.539349

perf_counter()

  • time supports time.perf_counter()

Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration.

  • Let’s try it out.
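  • Typical usage, as a minimal sketch: read the counter before and after the work, then subtract.
t0 = time.perf_counter()
sum(range(1000000))                # some work to time
elapsed = time.perf_counter() - t0
elapsed                            # seconds, as a float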

NumPy

  • I have claimed NumPy vector operations are faster than base Python.
  • Let’s test it.
py_lst = list(range(1000000))
np_arr =  np.arange(1000000)

len(py_lst), len(np_arr)
(1000000, 1000000)

Polynomial

  • We will write a simple polynomial.

The Legendre polynomials were first introduced in 1782 by Adrien-Marie Legendre as the coefficients in the expansion of the Newtonian potential…

…served as the fundamental gravitational potential in Newton’s law of universal gravitation.

\(P_{10}(x)\)

  • We take \(P_{10}(x)\), the 10th Legendre polynomial.

\[ \tfrac{46189x^{10}-109395x^8+90090x^6-30030x^4+3465x^2-63}{256} \]

  • This uses only NumPy vectorizable operations:
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np.arange(5))
array([       -1,         1,     96060,   8097453, 162596650])
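  • As a sanity check on the coefficients (a sketch, assuming SciPy is installed): scipy.special.eval_legendre evaluates Legendre polynomials directly, and the float form of our expression should match it.
from scipy.special import eval_legendre

xs = np.linspace(-1, 1, 5)
ours = (46189*xs**10 - 109395*xs**8 + 90090*xs**6
        - 30030*xs**4 + 3465*xs**2 - 63) / 256 # true division, no floor
np.allclose(ours, eval_legendre(10, xs))       # True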

Timing

  • It is a straightforward matter to time Python vs NumPy
t = time.perf_counter()
[p_ten(i) for i in py_lst] # Pythonic vector
py_t = time.perf_counter() - t
t = time.perf_counter()
p_ten(np_arr) # NumPy vectorization
np_t = time.perf_counter() - t
  • Compare the times: NumPy is about 24x faster.
py_t, np_t, py_t/np_t, np_t/py_t
(0.9327891999855638,
 0.0382024000864476,
 24.417031335067172,
 0.04095501972689953)
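  • One-shot perf_counter timings are noisy; the standard library’s timeit averages repeated runs. A sketch:
import timeit

np_mean = timeit.timeit(lambda: p_ten(np_arr), number=10) / 10
np_mean # mean seconds per call, over 10 runs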

Compilation

Compiling

  • To compile a Numba function, we have to run it at least once.
  • We will:
    • Write a function.
    • Time it with NumPy
    • Add Numba decorators
    • Run once
    • Time it with Numba
  • Helpfully, we timed p_ten() with NumPy

Numba time.

  • Declare with @jit
@jit
def p_ten_nb(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256
  • Compile by running once…
_ = p_ten_nb(np_arr)
  • Time on the same array
t = time.perf_counter()
p_ten_nb(np_arr) # NumPy vectorization
nb_t = time.perf_counter() - t
nb_t
0.00377179984934628

Maybe 10x?

  • Not bad at all: roughly 10x faster than NumPy out of the box.
  • Let’s optimize further.
nb_t, np_t, nb_t/np_t, np_t/nb_t
(0.00377179984934628,
 0.0382024000864476,
 0.09873201266965256,
 10.12842717332118)

Using jit

Numba provides several utilities for code generation, but its central feature is the numba.jit() decorator. Using this decorator, you can mark a function for optimization by Numba’s JIT compiler. Various invocation modes trigger differing compilation options and behaviours.
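One such invocation mode: give @jit an explicit signature and the function is compiled eagerly, at definition time, rather than lazily on the first call. A sketch (square is a toy function of mine):
from numba import jit, int64

@jit(int64(int64), nopython=True) # compiled immediately, for int64 input only
def square(x):
    return x * x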

Parallel

  • Declare
@jit(nopython=True, parallel=True)
def p_ten_par(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

_ = p_ten_par(np_arr)  
  • Time
t = time.perf_counter()
p_ten_par(np_arr) # NumPy vectorization
pt_t = time.perf_counter() - t
pt_t
0.0016263998113572598

For me

  • My device is about twice as fast when parallelized.
  • I suspect it would be much faster if I weren’t developing these slides while running servers, etc. in the background.
# pt = parallel=true time
pt_t, nb_t, pt_t/nb_t, nb_t/pt_t
(0.0016263998113572598,
 0.00377179984934628,
 0.43119992478899527,
 2.319109866471668)
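  • With parallel=True Numba auto-parallelizes whole-array expressions; numba.prange marks a loop for parallel execution explicitly. A sketch of the same polynomial as an explicit parallel loop (p_ten_loop is my name):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def p_ten_loop(x):
    out = np.empty_like(x)
    for i in prange(x.shape[0]): # iterations are split across threads
        v = x[i]
        out[i] = (46189*v**10 - 109395*v**8 + 90090*v**6
                  - 30030*v**4 + 3465*v**2 - 63) // 256
    return out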

Or…

  • Or maybe I just need more numbers.
np_arr = np.arange(100000000, dtype=np.int64) # 100 mil
@jit(nopython=True, parallel=False)
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
pf = time.perf_counter() - t
pf
0.3658204998355359
@jit(nopython=True, parallel=True)
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
pt = time.perf_counter() - t
pt
0.1383495000191033

GIL

  • Python has a “Global Interpreter Lock” (GIL), which lets only one thread execute Python bytecode at a time to keep the interpreter’s internal state consistent.
  • Our array operations are independent and don’t touch interpreter state, so we should be safe releasing the lock.
  • We use nogil

nogil=True

@jit(nopython=True, nogil=True, parallel=True)
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
ng = time.perf_counter() - t
ng, pt # no gil, parallel true
(0.13921239995397627, 0.1383495000191033)
  • Doesn’t do much here on its own; the payoff comes when we drive the function from multiple threads, as sketched below.
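  • Where nogil pays off, as a sketch: calling the compiled function from several Python threads at once, since the threads no longer queue on the lock.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

chunks = np.array_split(np_arr, 4)      # four independent pieces
with ThreadPoolExecutor(max_workers=4) as ex:
    parts = list(ex.map(p_ten, chunks)) # threads run concurrently, GIL released
result = np.concatenate(parts)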

Signatures

  • Numba can usually infer types, but it also benefits from being told what kind of integer we are working with.
  • These are like the NumPy types.
  • First, the biggest value we will produce…
biggest = p_ten(max(np_arr))
biggest, np.iinfo(np.int32), np.iinfo(np.int64)

(22357606058205098,
 iinfo(min=-2147483648, max=2147483647, dtype=int32),
 iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64))
  • Fits in np.int64 (though note: for large x the intermediate x**10 has already wrapped around int64, so the big entries are modular, not exact).
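  • To spell the types out, Numba signatures write “1-D array of int64” as int64[:]. A sketch of an explicitly typed version (p_ten_sig is my name):
from numba import njit, int64

@njit(int64[:](int64[:]), parallel=True) # int64 array in, int64 array out
def p_ten_sig(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256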

cfunc

  • If we know the types of the values, we can use cfunc instead of jit.
  • It compiles to a C-callable function with a fixed signature, meant to be passed as a callback to C code; it may be faster!
  • Signatures are out(in), so / is float(int, int)
x = 5
y = 2
z = x / y
type(z), type(x), type(y)
(float, int, int)

Run it

from numba import cfunc, int64

@cfunc(int64(int64))
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
time.perf_counter() - t
3.9947370998561382
  • Meh. A cfunc is really meant to be handed to C code as a callback; calling it from Python like this carries overhead. Works better with formulas than polynomials (exponentials and the like).

Aside: Intel

  • I tried a few Intel packages that Numba recommends.
  • I got no noticeable change from either, but I mention them for completeness.

Intel SVML

  • These slides were compiled on an Intel device.
  • Numba recommends SVML for Intel devices.

Intel provides a short vector math library (SVML) that contains a large number of optimised transcendental functions available for use as compiler intrinsics

python3 -m pip install intel-cmplr-lib-rt

Threading

  • For parallelism, Numba recommends tbb (Intel) or OpenMP (otherwise).
  • They are not available on all devices.
  • I could use tbb
python3 -m pip install tbb
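  • Once a parallel function has run, you can ask Numba which threading layer it actually picked (and optionally force one via numba.config). A sketch:
import numba

# numba.config.THREADING_LAYER = "tbb" # optionally request tbb explicitly
p_ten(np_arr)                          # run any parallel function first
numba.threading_layer()                # e.g. 'tbb', 'omp', or 'workqueue'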

Floats

  • Numba works just fine with floats!
float_arr = np_arr / 7 # integer array -> float array
# change // to / for floats
# `njit` means `jit(nopython=True)`
from numba import njit

@njit(parallel=True)
def p_ten(x):
    return (46189.*x**10.-109395.*x**8.+90090.*x**6.-30030.*x**4.+3465.*x**2.-63.)/256.

p_ten(float_arr) # Compile
t = time.perf_counter()
p_ten(float_arr)
float_t = time.perf_counter() - t
float_t
0.9699231998529285

Floats

  • I used integers above, but if we use floats we should use the fastmath option.
  • It allows greater inaccuracy (remember float rounding) in exchange for faster operations.
@njit(fastmath=True, parallel=True)
def p_ten(x):
    return (46189.*x**10.-109395.*x**8.+90090.*x**6.-30030.*x**4.+3465.*x**2.-63.)/256.

p_ten(float_arr) # Compile
t = time.perf_counter()
p_ten(float_arr)
fast_t = time.perf_counter() - t
fast_t, float_t, fast_t/float_t, float_t/fast_t
(0.1389130000025034, 0.9699231998529285, 0.1432206179041465, 6.982234922832631)

“I’m so Julia.” - Charli XCX

Julia Sets

  • The Julia set is a topic in complex dynamics.

Examples include the mathematical models that describe the swinging of a clock pendulum, the flow of water in a pipe, the random motion of particles in the air, and the number of fish each springtime in a lake.

Complex Quadratic Polynomials

  • While not required, a common class of dynamical systems is complex-valued polynomial functions of the second degree, so we take, e.g.
def f(z):
    return z**2 -.4 + .6j

Complex values

  • In Python, complex values are denoted as multiples of j (i is too frequently used for other things).
z = 0 + 1j
z ** 2
(-1+0j)
  • In NumPy, they have dtype=np.complex128
arr = np.array([0 + 1j])
arr.dtype
dtype('complex128')

Common Form

  • Often we refer to these polynomials as follows:

\[ f_c(z) = z^2 + c \]

  • We refer to the Julia set related to such a polynomial as:

\[ J(f_c) \]

Sets and Functions

  • The Julia set is the set of points in the complex plane on which some function remains bounded under repeated iteration.
  • For \(J(f_c)\) it is the elements:

\[ J(f_c) = \{ z \in \mathbb{C} : \forall n \in \mathbb{N},\ |f_c^n(z)| \leq R \} \]

On \(R\)

  • We define \(R\) as \[ R > 0 \land R^2 - R > |c| \]
  • It is simple to restrict \(|c| < 2\) and use \(R = 2\), since then \(R^2 - R = 2 > |c|\)

Exercise

  • Set a maximum number of iterations, say 1000
  • Set a complex value \(|c| < 2\)
  • Create a 2D array of complex values ranging from -2 to 2 (\(|a| < 2\))
    • I used reshape and linspace together.
  • Iterate \(f_c\) on each value until either:
    • Maximum iterations are reached, or
    • The output exceeds \(R = 2\) - look at np.absolute()

Result

  • Use Numba to achieve higher numbers of iterations (which create sharper images).
  • Use Matplotlib imshow to plot the result.
    • Plot real component as x, imaginary as y, and iterations as color.
  • I recommend using \(c = -0.4 + 0.6i\) which I think looks nice.
    • But you may choose your own \(c\)

\(f_c^n(z)\)

  • Apply f_c
    • Which, recall, is \(z^2+c\)
  • To z
    • The elements of the array.
  • \(n\) times
    • So \(f_c^2(z) = f_c(f_c(z))\)
def f_c_n(z, c, n):
    for i in range(n): # "do thing n times"
        z = z**2 + c
    return z

Solution

Code
import numpy as np
import matplotlib.pyplot as plt
from numba import njit

@njit
def iterator(m, c, z):
    for i in range(m):
        z = z**2 + c
        if np.absolute(z) > 2: # escaped: |z| exceeded R = 2
            return i           # iterations taken to escape
    return i                   # never escaped: max iterations reached

# m: max iterations, c: c
def j_f(m, c):
    real = np.linspace(-2, 2, 1000)                 # 1000 real parts in [-2, 2]
    imag = (0 + 1j) * real                          # the same values on the imaginary axis
    comp = real.reshape(-1, 1) + imag               # 1000x1000 complex grid via broadcasting
    f_vec = np.vectorize(iterator, excluded={0, 1}) # vectorize over z only
    return f_vec(m, c, comp)

# m: max iterations, c: c
def visualize(m, c):
    arr = j_f(m,c)
    plt.imshow(arr)
    plt.axis('off')
    plt.savefig("julia.png", bbox_inches="tight", pad_inches=0)

visualize(1000, -0.4 + 0.6j)
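  • One possible speedup, as a sketch: np.vectorize still loops in Python, calling the jitted iterator once per element, while numba.vectorize compiles the elementwise loop too (iterator_v is my name).
from numba import vectorize, int64, complex128

@vectorize([int64(int64, complex128, complex128)], target="parallel")
def iterator_v(m, c, z):
    for i in range(m):
        z = z**2 + c
        if abs(z) > 2:
            return i
    return i

# drop-in replacement inside j_f:
# return iterator_v(m, c, comp)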