Numba

Scientific Computing

Prof. Calvin

Why Numba?

What is Numba?

  • numba addresses a long-standing problem in scientific computing.
    • Writing code in Python, instead of C or Fortran, is fast.
    • Running code in Python, instead of C or Fortran, is slow.

C/Fortran

  • C and Fortran have been swimming around the periphery for some time.
    • E.g. NumPy is written in C, uses C numeric types.
    • E.g. SymPy can print C/Fortran code for expressions.
  • C is a general-purpose language and among the highest-performance languages in existence.
  • Fortran, for “formula translation”, is the foremost numerical computing platform (for performance).

Scripting

  • Python is a scripting language.
    • Write code in a .py file or directly in the python interpreter.
    • The python program “runs” the code.
    • No separate program is created.

Vs. C/Fortran

  • C and Fortran are compiled languages.
    • Write code.
    • Call a program, a compiler, on the code.
    • The compiler creates a program.
    • Run the program.

Numba

  • Numba is a compiler for Python.
  • More than that - a JIT (just-in-time) compiler.
  • Numba looks and feels like Python, but under the hood it compiles Python code to fast machine code.
  • It is a good halfway step to avoid having to learn C/Fortran.

In its own words:

Numba is a compiler for Python array and numerical functions that gives you the power to speed up your applications with high performance functions written directly in Python.

With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time optimized to performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.

Why not Numba?

  • Just learn C or Fortran
  • Use Cython, an alternative Python compiler.
  • Use SymPy print-to-C/Fortran
  • Use GPU acceleration via PyTorch or TensorFlow for certain applications.
    • The Meta and Google GPU-accelerated frameworks, respectively.
    • numba-cuda (NVIDIA GPU Numba) exists, but I haven’t seen it used much.

Install

Pip again!

  • Nothing special here.
python3 -m pip install numba
  • Let’s try it out.

Decorators

  • We need to introduce something called a decorator.
  • It’s a little note before a function definition that starts with @
    • This means we’ll mostly write functions to use Numba.
  • We’ll do an example; first, a minimal sketch of what a decorator is.
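  • Decorators themselves are plain Python: a decorator is just a function that takes a function and returns a (usually wrapped) function. A minimal sketch, with a toy decorator name (shout) of my own:
def shout(func):                  # takes a function...
    def wrapper(*args, **kwargs):
        print("Calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper                # ...returns a wrapped version

@shout                            # equivalent to: add = shout(add)
def add(x, y):
    return x + y

add(1, 2)                         # prints "Calling add", returns 3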

Import

  • Numba is essentially a NumPy optimizer, so we’ll import both.
from numba import jit
import numpy as np

Test it

  • @jit is the much-ballyhooed decorator.
x = np.arange(100).reshape(10, 10)

@jit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

_ = go_fast(x)

Aside: pandas

  • Numba does not accelerate pandas
  • It is the “numerical” not “data” compiler.
  • Don’t do this:
import pandas as pd

x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

@jit(forceobj=True, looplift=True) # Need to use object mode, try and compile loops!
def use_pandas(a): # Function will not benefit from Numba jit
    df = pd.DataFrame.from_dict(a) # Numba doesn't know about pd.DataFrame
    df += 1                        # Numba doesn't understand what this is
    return df.cov()                # or this!

_ = use_pandas(x)

Modes

  • Numba basically has two modes
    • object mode: slow Python mode; works for pandas
    • nopython mode: fast compiled mode; works for NumPy
  • I wouldn’t bother with Numba unless you can use nopython mode (a sketch of the difference follows).
    • In object mode, Numba does essentially nothing for you.
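  • In nopython mode, Numba refuses to compile what it cannot type instead of silently falling back. A minimal sketch (bad is a toy function of mine; TypingError is Numba’s real error class):
import pandas as pd
from numba import njit
from numba.core.errors import TypingError

@njit # nopython mode
def bad(n):
    return pd.DataFrame({'a': [n]}) # Numba cannot type pandas calls

try:
    bad(1)
except TypingError:
    print("nopython mode refused to compile pandas code")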

Timing

time

  • It is hard to motivate Numba without seeing how fast it is.
  • We will introduce the time module.
import time

time.time()
1749072667.539349

perf_counter()

  • time supports time.perf_counter()

Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration.

  • Let’s try it out.
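  • Typical usage, as a minimal sketch: read the counter before and after the work, then subtract.
t0 = time.perf_counter()
sum(range(1000000))                # some work to time
elapsed = time.perf_counter() - t0
elapsed                            # seconds, as a float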

NumPy

  • I have claimed NumPy vector operations are faster than base Python.
  • Let’s test it.
py_lst = list(range(1000000))
np_arr =  np.arange(1000000)

len(py_lst), len(np_arr)
(1000000, 1000000)

Polynomial

  • We will write a simple polynomial.

The Legendre polynomials were first introduced in 1782 by Adrien-Marie Legendre as the coefficients in the expansion of the Newtonian potential…

…served as the fundamental gravitational potential in Newton’s law of universal gravitation.

\(P_{10}(x)\)

  • We take \(P_{10}(x)\), the 10th Legendre polynomial.

\[ \tfrac{46189x^{10}-109395x^8+90090x^6-30030x^4+3465x^2-63}{256} \]

  • This uses only NumPy vectorizable operations:
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np.arange(5))
array([       -1,         1,     96060,   8097453, 162596650])
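  • As a sanity check on the coefficients (a sketch, assuming SciPy is installed): scipy.special.eval_legendre evaluates Legendre polynomials directly, and the float form of our expression should match it.
from scipy.special import eval_legendre

xs = np.linspace(-1, 1, 5)
ours = (46189*xs**10 - 109395*xs**8 + 90090*xs**6
        - 30030*xs**4 + 3465*xs**2 - 63) / 256 # true division, no floor
np.allclose(ours, eval_legendre(10, xs))       # True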

Timing

  • It is a straightforward matter to time Python vs NumPy
t = time.perf_counter()
[p_ten(i) for i in py_lst] # Pythonic vector
py_t = time.perf_counter() - t
t = time.perf_counter()
p_ten(np_arr) # NumPy vectorization
np_t = time.perf_counter() - t
  • Compare the times: NumPy is about 24x faster.
py_t, np_t, py_t/np_t, np_t/py_t
(0.9327891999855638,
 0.0382024000864476,
 24.417031335067172,
 0.04095501972689953)
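  • One-shot perf_counter timings are noisy; the standard library’s timeit averages repeated runs. A sketch:
import timeit

np_mean = timeit.timeit(lambda: p_ten(np_arr), number=10) / 10
np_mean # mean seconds per call, over 10 runs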

Compilation

Compiling

  • To compile a Numba function, we have to run it at least once.
  • We will:
    • Write a function.
    • Time it with NumPy
    • Add Numba decorators
    • Run once
    • Time it with Numba
  • Helpfully, we timed p_ten() with NumPy

Numba time.

  • Declare with @jit
@jit
def p_ten_nb(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256
  • Compile by running once…
_ = p_ten_nb(np_arr)
  • Time on the same array
t = time.perf_counter()
p_ten_nb(np_arr) # NumPy vectorization
nb_t = time.perf_counter() - t
nb_t
0.00377179984934628

Maybe 10x?

  • Not bad at all: roughly 10x faster than NumPy out of the box.
  • Let’s optimize further.
nb_t, np_t, nb_t/np_t, np_t/nb_t
(0.00377179984934628,
 0.0382024000864476,
 0.09873201266965256,
 10.12842717332118)

Using jit

Numba provides several utilities for code generation, but its central feature is the numba.jit() decorator. Using this decorator, you can mark a function for optimization by Numba’s JIT compiler. Various invocation modes trigger differing compilation options and behaviours.
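One such invocation mode: give @jit an explicit signature and the function is compiled eagerly, at definition time, rather than lazily on the first call. A sketch (square is a toy function of mine):
from numba import jit, int64

@jit(int64(int64), nopython=True) # compiled immediately, for int64 input only
def square(x):
    return x * x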

Parallel

  • Declare
@jit(nopython=True, parallel=True)
def p_ten_par(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

_ = p_ten_par(np_arr)  
  • Time
t = time.perf_counter()
p_ten_par(np_arr) # NumPy vectorization
pt_t = time.perf_counter() - t
pt_t
0.0016263998113572598

For me

  • My device is about twice as fast when parallelized.
  • I suspect it would be much faster if I weren’t developing these slides while running servers, etc. in the background.
# pt = parallel=true time
pt_t, nb_t, pt_t/nb_t, nb_t/pt_t
(0.0016263998113572598,
 0.00377179984934628,
 0.43119992478899527,
 2.319109866471668)
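  • With parallel=True Numba auto-parallelizes whole-array expressions; numba.prange marks a loop for parallel execution explicitly. A sketch of the same polynomial as an explicit parallel loop (p_ten_loop is my name):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def p_ten_loop(x):
    out = np.empty_like(x)
    for i in prange(x.shape[0]): # iterations are split across threads
        v = x[i]
        out[i] = (46189*v**10 - 109395*v**8 + 90090*v**6
                  - 30030*v**4 + 3465*v**2 - 63) // 256
    return out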

Or…

  • Or maybe I just need more numbers.
np_arr = np.arange(100000000, dtype=np.int64) # 100 mil
@jit(nopython=True, parallel=False)
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
pf = time.perf_counter() - t
pf
0.3658204998355359
@jit(nopython=True, parallel=True)
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
pt = time.perf_counter() - t
pt
0.1383495000191033

GIL

  • Python has a “Global Interpreter Lock” (GIL), which lets only one thread execute Python bytecode at a time to keep the interpreter’s internal state consistent.
  • Our array operations are independent and don’t touch interpreter state, so we should be safe releasing the lock.
  • We use nogil

nogil=True

@jit(nopython=True, nogil=True, parallel=True)
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
ng = time.perf_counter() - t
ng, pt # no gil, parallel true
(0.13921239995397627, 0.1383495000191033)
  • Doesn’t do much here on its own; the payoff comes when we drive the function from multiple threads, as sketched below.
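  • Where nogil pays off, as a sketch: calling the compiled function from several Python threads at once, since the threads no longer queue on the lock.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

chunks = np.array_split(np_arr, 4)      # four independent pieces
with ThreadPoolExecutor(max_workers=4) as ex:
    parts = list(ex.map(p_ten, chunks)) # threads run concurrently, GIL released
result = np.concatenate(parts)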

Signatures

  • Numba can usually infer types, but it also benefits from being told what kind of integer we are working with.
  • These are like the NumPy types.
  • First, the biggest value we will produce…
biggest = p_ten(max(np_arr))
biggest, np.iinfo(np.int32), np.iinfo(np.int64)

(22357606058205098,
 iinfo(min=-2147483648, max=2147483647, dtype=int32),
 iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64))
  • Fits in np.int64 (though note: for large x the intermediate x**10 has already wrapped around int64, so the big entries are modular, not exact).
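  • To spell the types out, Numba signatures write “1-D array of int64” as int64[:]. A sketch of an explicitly typed version (p_ten_sig is my name):
from numba import njit, int64

@njit(int64[:](int64[:]), parallel=True) # int64 array in, int64 array out
def p_ten_sig(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256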

cfunc

  • If we know the types of the values, we can use cfunc instead of jit.
  • It compiles to a C-callable function with a fixed signature, meant to be passed as a callback to C code; it may be faster!
  • Signatures are out(in), so / is float(int, int)
x = 5
y = 2
z = x / y
type(z), type(x), type(y)
(float, int, int)

Run it

from numba import cfunc, int64

@cfunc(int64(int64))
def p_ten(x):
    return (46189*x**10-109395*x**8+90090*x**6-30030*x**4+3465*x**2-63)//256

p_ten(np_arr) # Compile
t = time.perf_counter()
p_ten(np_arr)
time.perf_counter() - t
3.9947370998561382
  • Meh. A cfunc is really meant to be handed to C code as a callback; calling it from Python like this carries overhead. Works better with formulas than polynomials (exponentials and the like).

Aside: Intel

  • I tried a few Intel packages that Numba recommends.
  • I got no noticeable change from either, but I mention them for completeness.

Intel SVML

  • These slides were compiled on an Intel device.
  • Numba recommends SVML for Intel devices.

Intel provides a short vector math library (SVML) that contains a large number of optimised transcendental functions available for use as compiler intrinsics

python3 -m pip install intel-cmplr-lib-rt

Threading

  • For parallelism, Numba recommends tbb (Intel) or OpenMP (otherwise).
  • They are not available on all devices.
  • I could use tbb
python3 -m pip install tbb
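  • Once a parallel function has run, you can ask Numba which threading layer it actually picked (and optionally force one via numba.config). A sketch:
import numba

# numba.config.THREADING_LAYER = "tbb" # optionally request tbb explicitly
p_ten(np_arr)                          # run any parallel function first
numba.threading_layer()                # e.g. 'tbb', 'omp', or 'workqueue'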

Floats

  • Numba works just fine with floats!
float_arr = np_arr / 7 # integer array -> float array
# change // to / for floats
# `njit` means `jit(nopython=True)`
from numba import njit

@njit(parallel=True)
def p_ten(x):
    return (46189.*x**10.-109395.*x**8.+90090.*x**6.-30030.*x**4.+3465.*x**2.-63.)/256.

p_ten(float_arr) # Compile
t = time.perf_counter()
p_ten(float_arr)
float_t = time.perf_counter() - t
float_t
0.9699231998529285

Floats

  • I used integers above, but if we use floats we should use the fastmath option.
  • It allows greater inaccuracy (remember float rounding) in exchange for faster operations.
@njit(fastmath=True, parallel=True)
def p_ten(x):
    return (46189.*x**10.-109395.*x**8.+90090.*x**6.-30030.*x**4.+3465.*x**2.-63.)/256.

p_ten(float_arr) # Compile
t = time.perf_counter()
p_ten(float_arr)
fast_t = time.perf_counter() - t
fast_t, float_t, fast_t/float_t, float_t/fast_t
(0.1389130000025034, 0.9699231998529285, 0.1432206179041465, 6.982234922832631)

“I’m so Julia.” - Charli XCX

Julia Sets

  • The Julia set is a topic in complex dynamics.

Examples include the mathematical models that describe the swinging of a clock pendulum, the flow of water in a pipe, the random motion of particles in the air, and the number of fish each springtime in a lake.

Complex Quadratic Polynomials

  • While not required, a common class of dynamical systems is complex-valued polynomial functions of the second degree, so we take, e.g.
def f(z):
    return z**2 -.4 + .6j

Complex values

  • In Python, complex values are denoted as multiples of j (i is too frequently used for other things).
z = 0 + 1j
z ** 2
(-1+0j)
  • In NumPy, they have dtype=np.complex128
arr = np.array([0 + 1j])
arr.dtype
dtype('complex128')

Common Form

  • Often we refer to these polynomials as follows:

\[ f_c(z) = z^2 + c \]

  • We refer to the Julia set related to such a polynomial as:

\[ J(f_c) \]

Sets and Functions

  • The Julia set is the set of points in the complex plane on which some function remains bounded under repeated iteration.
  • For \(J(f_c)\) it is the elements:

\[ J(f_c) = \{ z \in \mathbb{C} : \forall n \in \mathbb{N},\ |f_c^n(z)| \leq R \} \]

On \(R\)

  • We define \(R\) as \[ R > 0 \land R^2 - R > |c| \]
  • It is simple to restrict \(|c| < 2\) and use \(R = 2\), since then \(R^2 - R = 2 > |c|\)

Exercise

  • Set a maximum number of iterations, say 1000
  • Set a complex value \(|c| < 2\)
  • Create a 2D array of complex values ranging from -2 to 2 (\(|a| < 2\))
    • I used reshape and linspace together.
  • Iterate \(f_c\) on each value until either:
    • Maximum iterations are reached, or
    • The output exceeds \(R = 2\) - look at np.absolute()

Result

  • Use Numba to achieve higher numbers of iterations (which create sharper images).
  • Use Matplotlib imshow to plot the result.
    • Plot real component as x, imaginary as y, and iterations as color.
  • I recommend using \(c = -0.4 + 0.6i\) which I think looks nice.
    • But you may choose your own \(c\)

\(f_c^n(z)\)

  • Apply f_c
    • Which, recall, is \(z^2+c\)
  • To z
    • The elements of the array.
  • \(n\) times
    • So \(f_c^2(z) = f_c(f_c(z))\)
def f_c_n(z, c, n):
    for i in range(n): # "do thing n times"
        z = z**2 + c
    return z

Solution

Code
import numpy as np
import matplotlib.pyplot as plt
from numba import njit

@njit
def iterator(m, c, z):
    for i in range(m):
        z = z**2 + c
        if np.absolute(z) > 2: # escaped: |z| exceeded R = 2
            return i           # iterations taken to escape
    return i                   # never escaped: max iterations reached

# m: max iterations, c: c
def j_f(m, c):
    real = np.linspace(-2, 2, 1000)                 # 1000 real parts in [-2, 2]
    imag = (0 + 1j) * real                          # the same values on the imaginary axis
    comp = real.reshape(-1, 1) + imag               # 1000x1000 complex grid via broadcasting
    f_vec = np.vectorize(iterator, excluded={0, 1}) # vectorize over z only
    return f_vec(m, c, comp)

# m: max iterations, c: c
def visualize(m, c):
    arr = j_f(m,c)
    plt.imshow(arr)
    plt.axis('off')
    plt.savefig("julia.png", bbox_inches="tight", pad_inches=0)

visualize(1000, -0.4 + 0.6j)
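  • One possible speedup, as a sketch: np.vectorize still loops in Python, calling the jitted iterator once per element, while numba.vectorize compiles the elementwise loop too (iterator_v is my name).
from numba import vectorize, int64, complex128

@vectorize([int64(int64, complex128, complex128)], target="parallel")
def iterator_v(m, c, z):
    for i in range(m):
        z = z**2 + c
        if abs(z) > 2:
            return i
    return i

# drop-in replacement inside j_f:
# return iterator_v(m, c, comp)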