Following on from the previous blog post here comparing the relative merits of Cython Vs Numba I thought I'd illustrate this with implementations of a relatively simple model: a vanilla 2d Ising Model. This is a prime target for a performance boost since it is a very "loopy" code. I will not cover what the Ising Model is/how it works (you can check that here!). It is a very interesting model and I may return to it at a later date to look at some of its properties and explore it more deeply (and maybe spin glass models more generally). For now it will just be a test subject to explore performant python type code.

In this blog I will present a few different versions, I won't throw the kitchen sink at them to get the absolute best performance but will adopt an 80/20 principle. I will not try any parallelisation or clever memory management outside of what comes pre-canned.

Python

This is a basic python implementation using a lot of looping. Even for a relatively small models this code will likely be fairl slow due to the looping.

import numpy as np

# Grid size: N (int) NxN square latice
N = 1000

# Fix Temp
kT = 2 / np.log(1 + np.sqrt(2))

# Fix seed
np.random.seed(123)

# Random Initialize
spins = 2*np.random.randint(2, size=(N, N))-1

# Get sum of neighbours
def neighbour_sum(i, j, spin_array):
    north, south, east, west = 0, 0, 0, 0
    max_height = spin_array.shape[0]
    max_width = spin_array.shape[1]
    if i > 0:
        north = spin_array[i-1, j]
    if i < max_height-1:
        south = spin_array[i+1, j]
    if j > 0:
        west = spin_array[i, j-1]
    if j < max_width-1:
        east = spin_array[i, j+1]
    res = north + south + east + west
    return res

def dE(i, j, spin_array):
    return 2*spin_array[i, j]*neighbour_sum(i, j, spin_array)

def update(spin_array):
    height = spin_array.shape[0]
    width = spin_array.shape[1]
    
    for y_offset in range(2):
        for x_offset in range(2):
            for i in range(y_offset, height, 2):
                for j in range(x_offset, width, 2):
                    dEtmp = dE(i, j, spin_array)
                    if dEtmp <= 0 or np.exp(-dEtmp / kT) > np.random.random():
                        spin_array[i, j] *= -1

    return spin_array

def _main_code(M, spin_array):
    spin_tmp = spin_array
    for x in range(M):
        spin_tmp = update(spin_tmp)
    return spin_tmp

Numba

For this code we will just take the above code and use the njit decorator (forcing the use of LLVM - a jit decorator will fall back to Python object mode if it cannot work out how to use LLVM). This is literally a few seconds of coding updates.

import numpy as np
from numba import njit

# Grid size: N (int) NxN square latice
N = 1000

# Fix Temp
kT = 2 / np.log(1 + np.sqrt(2))

# Fix seed
np.random.seed(123)

# Random Initialize
spins = 2*np.random.randint(2, size=(N, N))-1

# Get sum of neighbours
@njit
def nb_neighbour_sum(i, j, spin_array):
    north, south, east, west = 0, 0, 0, 0
    max_height = spin_array.shape[0]
    max_width = spin_array.shape[1]
    if i > 0:
        north = spin_array[i-1, j]
    if i < max_height-1:
        south = spin_array[i+1, j]
    if j > 0:
        west = spin_array[i, j-1]
    if j < max_width-1:
        east = spin_array[i, j+1]
    res = north + south + east + west
    return res

@njit
def nb_dE(i, j, spin_array):
    return 2*spin_array[i, j]*nb_neighbour_sum(i, j, spin_array)

@njit
def nb_update(spin_array):
    height = spin_array.shape[0]
    width = spin_array.shape[1]
    
    for y_offset in range(2):
        for x_offset in range(2):
            for i in range(y_offset, height, 2):
                for j in range(x_offset, width, 2):
                    dEtmp = nb_dE(i, j, spin_array)
                    if dEtmp <= 0 or np.exp(-dEtmp / kT) > np.random.random():
                        spin_array[i, j] *= -1
    
    return spin_array

@njit
def nb_main_code(M, spin_array):
    spin_tmp = spin_array
    for x in range(M):
        spin_tmp = nb_update(spin_tmp)
    return spin_tmp

Cython

Again we will modify the python code. We will see that while not too onerous it does require a little more work than the Numba example. The code is largely boilerplate but requires a little more thinking than the Numba example (e.g. in implementing this I initially forgot a static type definition of 1 variable which was causing a 300% increase in runtime).

%load_ext Cython

%%cython

cimport cython

import numpy as np
cimport numpy as cnp

from libc.math cimport exp
from libc.stdlib cimport rand
cdef extern from "limits.h":
    int RAND_MAX

# Grid size: N (int) NxN square latice
cdef int N = 1000

# Fix Temp
cdef float kT = 2 / np.log(1 + np.sqrt(2))
cdef float kTinv = 1 / kT

# Fix seed
np.random.seed(123)

# Random Initialize
spins = 2*np.random.randint(2, size=(N, N))-1

# Get sum of neighbours
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_neighbour_sum(int i, int j, cnp.int32_t[:, :] spin_array):
    cdef int north = 0
    cdef int south = 0
    cdef int east = 0
    cdef int west = 0
    cdef int max_height = spin_array.shape[0]
    cdef int max_width = spin_array.shape[1]
    if i > 0:
        north = spin_array[i-1, j]
    if i < max_height-1:
        south = spin_array[i+1, j]
    if j > 0:
        west = spin_array[i, j-1]
    if j < max_width-1:
        east = spin_array[i, j+1]
    cdef int res = north + south + east + west
    return res

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_dE(int i, int j, cnp.int32_t[:, :] spin_array):
    return 2*spin_array[i, j]*cy_neighbour_sum(i, j, spin_array)

@cython.boundscheck(False)
@cython.wraparound(False)
cdef cnp.int32_t[:, :] cy_update(cnp.int32_t[:, :] spin_array):
    cdef int height = spin_array.shape[0]
    cdef int width = spin_array.shape[1]
    cdef int y_offset, x_offset
    cdef int i, j
    cdef int dEtmp
    
    for y_offset in range(2):
        for x_offset in range(2):
            for i in range(y_offset, height, 2):
                for j in range(x_offset, width, 2):
                    dEtmp = cy_dE(i, j, spin_array)
                    if dEtmp <= 0:  
                        spin_array[i, j] *= -1
                    elif exp(-dEtmp * kTinv) * RAND_MAX > rand():
                        spin_array[i, j] *= -1

    return spin_array

@cython.boundscheck(False)
@cython.wraparound(False)
cdef cnp.int32_t[:, :] cy_main_code(int M, cnp.int32_t[:, :] spin_array):
    cdef int x
    for x in range(M):
        spin_array = cy_update(spin_array)
    return spin_array

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cnp.int32_t[:, :] cpy_main_code(int M, cnp.int32_t[:, :] spin_array):
    return cy_main_code(M, spin_array)

Timing Results

# Python
%timeit _main_code(10, spins)

# Numba
%timeit nb_main_code(10, spins)

# Cython
%timeit cpy_main_code(10, spins)

52 s ± 2.53 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
244 ms ± 4.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
395 ms ± 6.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For our purposes the %timeout magic will be good enough a proxy for performance. We can see that the intial python implementation is very slow. For a 1000x1000 lattice and looping over 10 times, for my machine the python interation takes 50-55s. This would be fairly problematic if we wanted to sweep over the parameter space, find critical temperatures, perform random seed analyses etc.

In contrast Numba only takes 240-250ms, an impressive 2000% speed up. Running the code multiple times now seems much less onerous.

Cython is not quite as quick as the Numba implementation taking 390-400ms but still represents a significant speedup compared to Python. For practical applications the difference between Numba and Cython in this case may be insignificant.

It is worth noting that these times are from my machine on at a specific time. The times achieved on your machine might be slightly different. Similarly changing the size of the lattice may change the ordering of which option is "quickest" - as always it's worth checking the code how it will be used rather than performing a benchmark like this for determining which option to use.

Conclusion

From the results above it may be tempting to claim Numba is the obvious choice given it is not only easier to implement than cython but also offers faster speeds. However I selected the 2d Ising model as an example since I knew the code would work well in Numba (in a sense I have been p-value hacking the experiment!) In certain situations (e.g. a code relying very heavily on class structures) Numba is either unusable or requires a complete code overhaul whereas cython can require only a few lines of boilerplate code.

In other examples you can also see that Cython can severely outperform Numba, I am not sure why this is and the only real way to determine which will perform better is to perform testing (if somebody has an explanation/heuristic I'd love to hear it). It is also possible to interface numba and cython which has been useful to me in the past. For a quick example suppose we want to perform an inverse transform of a Beta(2,0.5) distribution:

from scipy.stats import beta

x = beta.ppf(0.1, 2, 0.5)
x

0.4681225665264196

This cannot be optimised in Numba as it is (beta.ppf is currently not supported functionality - this may change by the time you read this). However we can take the address of the cython special function that this calls. We can then build the function in such a way as it can be seen by Numba:

import ctypes
from numba import types, njit
from numba.extending import get_cython_function_address


betaaddr = get_cython_function_address("scipy.special.cython_special", "btdtri")
functype3d = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double, ctypes.c_double)
beta_fn = functype3d(betaaddr)

@njit
def nb_beta_ppf(p, a, b):
    return beta_fn(a, b, p)

x = nb_beta_ppf(0.1, 2.0, 0.5)
x

0.4681225665264196

As we can see this is a little ugly but it works.

As a result of everything covered in these blog posts I do not believe that one option offers a significant advantage over the other and both offer valid tools for improving runtime. The choise of which to use in a given situation will depend on many factors and requires careful thought as to what is needed in a given setting.