This is the fifth blog post in a series - you can find the previous blog post here


If you have read my previous few blog posts you will have some appreciation for spin-glasses and some models used to study and simulate them. In this blog post we will look at a few applications and extensions of spin-glasses.

Potts Spin-Glass Model

The Potts spin-glass model is a generalization of the Ising model we looked at previously. Whereas the Ising model allows only "up" or "down" spins, the Potts model is a "q-state" model where each spin can take any of $q$ possible values. The Hamiltonian is typically specified as: $$H = - \sum_{x,y} J_{xy} \delta(\sigma_x, \sigma_y) - \sum_x h_x \sigma_x $$ where $\delta(\cdot,\cdot)$ is the Kronecker delta. For $q=2$ this is equivalent to the Ising model.

Another representation is the vector Potts model (also known as the clock model), where spins take values defined as angles: $$ \theta_i = \frac{2 \pi i}{q} $$ for $i$ between 1 and $q$. The Hamiltonian can then be defined as: $$H = - \sum_{x,y} J_{xy} \cos(\theta_x - \theta_y) - \sum_x h_x \cos(\theta_x) $$ The limit $q \to \infty$ is known as the XY model.

On a 2d square lattice the model undergoes a first-order phase transition for $q>4$, while for $q\leq 4$ the transition is second-order (as with the Ising model, $q=2$). The techniques and simulation procedures discussed previously can be adapted to the Potts model with minimal change.
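
To make this concrete, here is a minimal sketch (not taken from the earlier posts) of a Metropolis sweep for the standard $q$-state Potts model on a square lattice, assuming nearest-neighbour couplings $J_{xy}=1$ and no external field; the function and parameter names are my own choices.

# A sketch: one Metropolis sweep for the q-state Potts model on an L x L
# square lattice with periodic boundaries, J = 1 and h = 0.
# Names and parameters here are illustrative.

import numpy as np

def metropolis_sweep(spins, q, beta, rng):
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        old = spins[i, j]
        new = rng.integers(0, q)
        if new == old:
            continue
        # Nearest neighbours with periodic boundary conditions
        nbrs = [spins[(i + 1) % L, j], spins[(i - 1) % L, j],
                spins[i, (j + 1) % L], spins[i, (j - 1) % L]]
        # H = -sum_<xy> delta(sigma_x, sigma_y), so the energy change is the
        # number of neighbours matching the old value minus those matching the new
        dE = sum(n == old for n in nbrs) - sum(n == new for n in nbrs)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] = new
    return spins

rng = np.random.default_rng(0)
L, q, beta = 32, 3, 1.2
spins = rng.integers(0, q, size=(L, L))
for _ in range(100):
    spins = metropolis_sweep(spins, q, beta, rng)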

Other generalizations of this model have been used in various physical and biological applications. The Potts model has also been used in image processing.

Markov-Random-Field

The Ising model can also be thought of as the simplest example of a Markov-Random-Field (MRF). The theory and applications of these could easily fill a blog post (or more) by themselves. Instead we will just look at some basic definitions and the major theorems.

A Markov-Random-Field is an example of a probabilistic graphical model (PGM). We define an MRF via a set of random variables and an undirected graph: $(X, G)$. Unlike some other types of PGM, an MRF graph is undirected and there is no requirement for the graph to be acyclic. For a given graph $G = (V,E)$ with specified vertices and edges, a set of random variables $(X_v)_{v \in V}$ forms a Markov-Random-Field over $G$ if they satisfy the following properties:

  1. Pairwise Markov Property: any two non-adjacent variables are conditionally independent given all other variables: $$X_u \perp X_v \mid X_{V \setminus \{u,v\}}$$
  2. Local Markov Property: a variable is conditionally independent of all other variables given its neighbours: $$X_u \perp X_{V \setminus (\{u\} \cup N(u))} \mid X_{N(u)} $$ where $N(u)$ represents the set of neighbours of vertex $u$ in $G=(V,E)$
  3. Global Markov Property: any two subsets of variables are conditionally independent given a separating subset: $$X_A \perp X_B \mid X_S $$ where $S$ is a set such that every path from a node in $A$ to a node in $B$ passes through $S$.

One useful way of specifying an MRF is through the concept of clique potentials. A subgraph $C \subset G$ is a clique if it is fully connected; the set of all cliques of $G$ is denoted $Cl(G)$. We can then specify a joint distribution of random variables $\mathbf{X} = (X_v)_{v \in V}$ via clique potential functions $\phi_C(X_C)$ such that: $$P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z}\prod_{C \in Cl(G)} \phi_C (x_C) $$ where $Z$ is a normalizing constant. We note that every joint distribution specified in this way is an MRF. The reverse is not necessarily true in general, but a joint distribution satisfying the Markov properties can be factorised this way if:

  • The distribution is strictly positive (known as the Hammersley-Clifford theorem)
  • The graph is chordal

A convenient form of clique potential is the exponential form, which we can write as: $$ P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z} \exp\left(\sum_{C \in Cl(G)} w_C f_C(x_C)\right)$$ with: $$ Z = \sum_{\mathbf{x}} \exp\left(\sum_{C \in Cl(G)} w_C f_C(x_C)\right) $$ This is nothing more than a general form of the Gibbs distribution we looked at in previous blog posts. To see this for the Ising model, the cliques are adjacent pairs of spins $C_{xy} = (\sigma_x, \sigma_y)$ with $f_{C_{xy}} = \sigma_x\sigma_y$ and $w_{C_{xy}} = \beta J_{xy}$, so that the exponent is $-\beta H$ - this is the simplest form of an MRF. We can then extend some of the ideas and results we have looked at previously to a much larger class of probabilistic models.
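
As a small concrete example (my own sketch, with an illustrative toy graph and couplings), the snippet below writes a three-spin Ising chain in the clique-potential form and checks, by enumerating all states, that the product of potentials with $w_{C_{xy}} = \beta J_{xy}$ reproduces the Gibbs distribution.

# A sketch: a three-spin Ising chain written as a pairwise MRF.
# Each edge (clique) carries an exponential potential phi_C = exp(w_C f_C),
# and the product of potentials recovers the Gibbs distribution.
# The graph, couplings and names here are illustrative.

import itertools
import numpy as np

edges = [(0, 1), (1, 2)]          # graph G: a chain of 3 spins
J = {(0, 1): 1.0, (1, 2): -0.5}   # couplings J_xy
beta = 1.0

def clique_potential(edge, sigma):
    # phi_C(x_C) = exp(w_C * f_C(x_C)) with f_C = sigma_x sigma_y, w_C = beta * J_xy
    x, y = edge
    return np.exp(beta * J[edge] * sigma[x] * sigma[y])

def hamiltonian(sigma):
    return -sum(J[(x, y)] * sigma[x] * sigma[y] for (x, y) in edges)

states = list(itertools.product([-1, +1], repeat=3))
Z = sum(np.prod([clique_potential(e, s) for e in edges]) for s in states)

for s in states:
    p_mrf = np.prod([clique_potential(e, s) for e in edges]) / Z
    p_gibbs = np.exp(-beta * hamiltonian(s)) / Z
    assert np.isclose(p_mrf, p_gibbs)
    print(s, round(p_mrf, 4))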

Markov random fields have been used in a variety of applications relating to computer vision and image/video processing including texture synthesis, compression, enhancement, etc. They have also been used in machine learning and computational biology. The ideas are also used in combinatorial optimization problems owing to their similarity to spin glasses.

Hopfield Network

We now look at an artificial neural-network model developed by John Hopfield in 1982. This model is no longer particularly in favour, as its performance is not especially strong, but it is interesting nonetheless. We will look at Hopfield networks as a mechanism for pattern storage and recall, although they can be used in other applications. The model took spin-glass models as a direct inspiration.

A Hopfield network is a single-layer recurrent network with a fully connected geometry: each neuron receives input from every other neuron and also gives output to every other neuron. We can specify the network through a weight matrix $(W_{ij})_{ij}$ which has the following properties:

  • Symmetry: $W_{ij} = W_{ji} \; \; \; \forall i,j$
  • No self-connectivity: $W_{ii} = 0 \; \; \; \forall i$

The neurons themselves take discrete values $\sigma_i = \pm 1$ denoting "on" or "off" states. At each update step we define the quantity: $$ \theta_i = \sum_j W_{ij} \sigma_j + b_i$$ where $b_i$ is a bias term (we can take $b_i = 0$). If $\theta_i > 0$ then we set $\sigma_i = 1$, otherwise we set $\sigma_i = -1$. We can also note that this induces an energy function: $$ E = - \frac{1}{2}\sum_{ij} W_{ij} \sigma_i \sigma_j $$ which is exactly the Hamiltonian we investigated in our spin-glass blog posts. The Hopfield network is in fact nothing more than a Sherrington-Kirkpatrick spin glass viewed in a different way. From fairly basic arguments we can show that each asynchronous update step of the dynamics described above never increases the energy.
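
As a quick numerical sanity check of this energy-decrease property (a sketch of my own, separate from the full implementation further below), we can draw a random symmetric weight matrix with zero diagonal and confirm that asynchronous updates never increase the energy.

# A sketch: verify that asynchronous Hopfield updates never increase the
# energy E = -0.5 * sigma^T W sigma for a random symmetric W with zero diagonal.

import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.normal(size=(n, n))
W = (A + A.T) / 2.0          # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-connections

def energy(sigma, W):
    return -0.5 * sigma @ W @ sigma

sigma = rng.choice([-1.0, 1.0], size=n)
prev = energy(sigma, W)

for _ in range(10 * n):
    i = rng.integers(n)                              # pick a neuron at random
    sigma[i] = 1.0 if W[i] @ sigma > 0 else -1.0     # asynchronous update
    e = energy(sigma, W)
    assert e <= prev + 1e-9                          # energy never increases
    prev = e
print("Final energy:", round(energy(sigma, W), 3))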

Now suppose we want the network to "remember" patterns, that is, we want it to recall a certain "on-off" configuration of neurons. To do this we have to select the correct weight matrix. Since the dynamics decrease the energy at each step, we want our stored patterns to be local energy minima of the network. Suppose we want to store an $N$-bit pattern $\hat{x} = \left( \hat{x}_1, \hat{x}_2, ... , \hat{x}_N \right)$. If we take $W_{ij} = c \hat{x}_i\hat{x}_j$ (for $i \neq j$), for some constant $c>0$ representing the learning rate, then evaluating $\theta_i$ at $\hat{x}$ gives: $$ \theta_i = \sum_j W_{ij} \hat{x}_j = \sum_{j \neq i} c \hat{x}_i\hat{x}_j \hat{x}_j = c\sum_{j \neq i} \hat{x}_i = c(N-1)\hat{x}_i$$ And so $\hat{x}$ forms an attractive fixed point of the dynamics. Starting from a noisy input configuration we should eventually end up at the stored pattern.

We can naturally extend this further if we want to store $P$ different patterns $\left(\hat{x}^p\right)_{p=1}^P$: $$W_{ij} = c \sum_p \hat{x}_i^p\hat{x}_j^p$$ Typically we would take $c$ to scale inversely with the size of the network, such as $c=\frac{1}{N}$; note that any positive constant gives the same dynamics, since only the sign of $\theta_i$ matters (the implementation below divides by the number of patterns instead).

From basic arguments and intuition we can see that recall will deteriorate as we add more and more patterns. Can we construct an argument to find a "capacity" of the network? Yes! And it is quite simple. We start by re-writing the sum for $\theta_i(\hat{x}^p)$ (that is, $\theta_i$ evaluated at one of the stored patterns $p$), taking $c = \frac{1}{N}$: $$\theta_i(\hat{x}^p) = \sum_j W_{ij} \hat{x}^p_j = \frac{1}{N} \sum_j\sum_q \hat{x}_i^q\hat{x}_j^q\hat{x}^p_j \approx \hat{x}_i^p + \frac{1}{N} \sum_j\sum_{q \neq p} \hat{x}_i^q\hat{x}_j^q\hat{x}^p_j $$ where the second term (the double summation) is referred to as the crosstalk term; if its magnitude is less than 1 then the sign of $\theta_i$ agrees with $\hat{x}_i^p$ and bit $i$ of the pattern is stable. We can then define: $$ C_i^p = - \hat{x}_i^p \frac{1}{N} \sum_j\sum_{q \neq p} \hat{x}_i^q\hat{x}_j^q\hat{x}^p_j $$ If $C_i^p<0$ the crosstalk reinforces the stored bit; if $C_i^p>1$ the crosstalk overwhelms it and the bit flips. We therefore want to find $P(C_i^p > 1)$. In the limit of a large number of neurons and patterns we can treat the terms $\hat{x}_i^q\hat{x}_j^q\hat{x}_j^p$ as independent random signs, and applying the central limit theorem gives $C_i^p \sim N(0,\sigma^2)$ with $\sigma^2 =\frac{P}{N}$ - the ratio of the number of stored patterns to the number of neurons. For example, if $\frac{P}{N} = 0.37$ then the probability of a single bit being in error is around $5\%$. From further analysis (not shown here) we find that these errors can "add up"; for stable recall we require roughly $\frac{P}{N} < 0.138$. So for a $10 \times 10$ neuron array we would only want to store 13 patterns or fewer.
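
The crosstalk argument is easy to check numerically. The sketch below (parameters are my own choices) stores random patterns with the Hebbian rule and measures the single-bit error rate after one update, which should roughly track the Gaussian tail prediction as the load $P/N$ grows.

# A sketch estimating the single-bit error probability of Hebbian storage
# at different loads P/N, compared with the Gaussian (central limit)
# prediction used above. Parameters are illustrative.

import math
import numpy as np

rng = np.random.default_rng(42)
N = 200

for load in [0.05, 0.138, 0.37]:
    P = max(1, int(load * N))
    patterns = rng.choice([-1, 1], size=(P, N))

    # Hebbian weights with the 1/N normalization, zero diagonal
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)

    # theta_i for each stored pattern; a bit is in error if its sign flips
    fields = patterns @ W
    simulated = np.mean(np.sign(fields) != patterns)

    # P(crosstalk > 1) for a N(0, P/N) Gaussian
    predicted = 0.5 * math.erfc((1.0 / math.sqrt(P / N)) / math.sqrt(2))
    print("P/N = %.3f: simulated error %.4f, predicted %.4f"
          % (P / N, simulated, predicted))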

So far we have assumed that each pattern is treated equally by the network. We can remember some patterns "more strongly" by assigning them a multiplicity (essentially the number of times the pattern is "remembered"). With higher multiplicity the pattern will have a larger basin of attraction. However, the capacity limit now applies to the sum of multiplicities rather than the number of unique patterns. For example, in the $10 \times 10$ neuron case, if we have a pattern with multiplicity $10$ then we can only store a maximum of 3 extra patterns (of multiplicity 1).

So far the picture looks rosy for our Hopfield network; unfortunately there are a few issues that prevent them being widely used. Firstly we have the occurrence of "spurious states" - attractors of the network dynamics that do not correspond to any stored pattern. If we start near one of these spurious states the dynamics will converge towards it, which is obviously not ideal. These arise from the complex energy landscape of the Sherrington-Kirkpatrick spin-glass. One simple example of a spurious state is the inverse of a stored pattern (i.e. flipping +1 neuron activations to -1 and vice versa), but there can be others.

We now present a basic implementation of a Hopfield network as an illustration. We will consider a $7 \times 7$ neuron grid and aim to store $4$ patterns. We will "flatten" the neuron grid to a $49$-element one-dimensional array for convenience.

# A code implementing a Hopfield network
# Storing a number of 2d images for recall

import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
np.random.seed(123)

num_patterns = 4
grid_width = 7
grid_height = 7
grid_size = grid_width*grid_height

# Fix stored patterns
stored = np.zeros((num_patterns, grid_size))

stored[0] = [-1, -1, +1, +1, -1, -1, -1, \
             -1, +1, -1, +1, -1, -1, -1, \
             -1, -1, -1, +1, -1, -1, -1, \
             -1, -1, -1, +1, -1, -1, -1, \
             -1, -1, -1, +1, -1, -1, -1, \
             -1, -1, -1, +1, -1, -1, -1, \
             -1, +1, +1, +1, +1, +1, -1 ]
stored[1] = [-1, +1, +1, +1, +1, +1, +1, \
             +1, -1, -1, -1, -1, -1, +1, \
             -1, -1, -1, -1, -1, +1, +1, \
             -1, +1, +1, +1, +1, +1, -1, \
             +1, +1, -1, -1, -1, -1, -1, \
             +1, -1, -1, -1, -1, -1, -1, \
             +1, +1, +1, +1, +1, +1, +1 ]
stored[2] = [-1, +1, +1, +1, +1, +1, +1, \
             +1, -1, -1, -1, -1, -1, +1, \
             -1, -1, -1, -1, -1, -1, +1, \
             -1, -1, +1, +1, +1, +1, -1, \
             -1, -1, -1, -1, -1, -1, +1, \
             +1, -1, -1, -1, -1, -1, +1, \
             -1, +1, +1, +1, +1, +1, +1 ]
stored[3] = [-1, -1, -1, +1, +1, -1, -1, \
             -1, -1, +1, -1, +1, -1, -1, \
             -1, +1, -1, -1, +1, -1, -1, \
             +1, -1, -1, -1, +1, -1, -1, \
             +1, +1, +1, +1, +1, +1, +1, \
             -1, -1, -1, -1, +1, -1, -1, \
             -1, -1, -1, -1, +1, -1, -1 ]

# Display the patterns
fig, ax = plt.subplots(1, num_patterns, figsize=(10, 5))

for i in range(num_patterns):
    ax[i].imshow(stored[i].reshape((grid_height, grid_width)), cmap='binary')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
    ax[i].set_title("Pattern %i" %(i+1))
fig.suptitle("Remembered Patterns", x=0.5, y=0.8, size='xx-large')
plt.show()

# Create network weight matrix via the Hebbian (outer-product) rule:
# W_ij = (1/P) * sum_p x_i^p x_j^p, with zero diagonal (no self-connections)
W = np.zeros((grid_size, grid_size))

for i in range(grid_size):
    for j in range(i + 1, grid_size):
        w = 0.0
        for n in range(num_patterns):
            w += stored[n, i] * stored[n, j]
        W[i, j] = w / num_patterns
        W[j, i] = W[i, j]

# Test noisy inputs
noise_rate = 0.25
num_cycles = 5

fig, ax = plt.subplots(num_patterns, 3, figsize=(10, 15))

for x in range(num_patterns):
    original_image = stored[x]
    noisy_image = original_image.copy()

    for i in range(grid_size):
        if np.random.random() < noise_rate:
            noisy_image[i] *= -1

    test_image = noisy_image.copy()

    for _ in range(num_cycles):
        for i in range(grid_size):
            test_image[i] = 1.0 if np.dot(W[i], test_image) > 0 else -1.0

    # Display results
    ax[x,0].imshow(noisy_image.reshape((grid_height, grid_width)), cmap='binary')
    ax[x,0].set_xticks([])
    ax[x,0].set_yticks([])
    ax[x,0].set_title("Noisy Input Pattern %i" %(x+1))
    ax[x,1].imshow(test_image.reshape((grid_height, grid_width)), cmap='binary')
    ax[x,1].set_xticks([])
    ax[x,1].set_yticks([])
    ax[x,1].set_title("Hopfield Network Prediction")
    ax[x,2].imshow(original_image.reshape((grid_height, grid_width)), cmap='binary')
    ax[x,2].set_xticks([])
    ax[x,2].set_yticks([])
    ax[x,2].set_title("Target Pattern %i" %(x+1))
fig.suptitle("Noisy Image Retrieval (Noise Rate=%d" %(noise_rate*100) +"$\%$)", x=0.5, y=0.91, size='xx-large')
plt.show()

# Calculate energies of stored patterns
energy = np.zeros(num_patterns)

for x in range(num_patterns):
    e = 0
    for i in range(grid_size):
        for j in range(grid_size):
            e += W[i,j] * stored[x][i]* stored[x][j]
    e *= -0.5
    print("Energy of pattern ",x+1,":",e)
Energy of pattern  1 : -294.0
Energy of pattern  2 : -447.0
Energy of pattern  3 : -454.0
Energy of pattern  4 : -301.0

In this toy example we have seen results that are not unreasonable; however, even in this relatively simple example we see that patterns 2 and 3 are not recalled exactly - most likely because there is a high degree of overlap between them, so in each case the network dynamics converged to a spurious pattern.

Hopfield networks were, however, the basis for the development of Boltzmann machines, which are stochastic versions of the Hopfield network. We will not touch on these here.

NK Model

We now focus our attention on another model with a similarity to spin-glasses: the NK model proposed by Stuart Kauffman. Described by Kauffman as a model with "tunable ruggedness", the NK model was originally proposed as a model of evolution (not to be confused with an "evolutionary algorithm"). The model controls the size of a landscape and the number of local optima via the two parameters $N$ and $K$. The parameter $N$ denotes the degrees of freedom, or the "size", of the system; the parameter $K$ represents the level of interactivity between those degrees of freedom, varying from $K=0$ (a smooth landscape) through to $K=N-1$ (the most rugged/peaked).

In the context of spin-glass models we can express the Hamiltonians corresponding to the NK model as: \begin{align} K=0 \implies H &= - \sum_{i} J_{i} \sigma_{i} \\ K=1 \implies H &= - \sum_{i} J_{i} \sigma_{i} - \sum_{ij} J_{ij} \sigma_{i}\sigma_{j} \\ K=2 \implies H &= - \sum_{i} J_{i} \sigma_{i} - \sum_{ij} J_{ij} \sigma_{i}\sigma_{j} - \sum_{ijk} J_{ijk} \sigma_{i}\sigma_{j}\sigma_{k} \end{align} and so on. Through this interpretation we can see that the NK model is nothing more than a generalization of the spin-glass models we have looked at previously. However, the model did not use spin-glasses as a basis; it was developed as a way of studying evolutionary systems and how one can navigate a fitness landscape.

As one would expect given the similarity, the NK model shares some of the properties of spin-glasses; most notably it has a very complicated landscape with many local optima, which makes finding the global optimum hard and leaves simple search stuck in whichever local optimum happens to be nearby. If the model sufficiently captures the properties of evolutionary systems this has some interesting implications.

We'll now present an NK model outside of the context of spin-glasses - namely as an evolution model, as originally intended. We start by considering a sequence of genes $(s_i)_{i=1}^N$ and want to observe what happens to these genes over time. Within this model we assume an organism ($\mathbf{s}$) is completely defined by its gene sequence. The fitness of an organism is defined as the average of the fitness contributions of its genes: $$ F(\mathbf{s}) = \frac{1}{N} \sum_i f_i(s_i; s_{i_1}, \dots, s_{i_K})$$ where each contribution $f_i$ depends on the value of gene $i$ itself but also on $K$ other genes - in biology this is known as "epistasis". We can use a matrix $\left( A_{ij} \right)_{i,j=1}^N$ to denote this dependence: $A_{ij} = 1$ if gene $i$ depends on gene $j$ and zero otherwise. Note that this matrix describes a directed network and need not be symmetric. We can then compute a fitness landscape over gene sequences according to this equation.

In the NK model we assume that initially the entire population consists of one organism (one gene sequence). At random a mutation occurs in one of the genes; the mutant either dies out quickly (if it has a lower fitness) or it reproduces faster than the original organism (which dies out) and becomes the sole organism in the system. The assumption here, of course, is that genetic mutations are far enough apart in time that the system can "settle" at each step.

Given a fitness landscape we can specify the dynamics of the NK model as:

  1. From $\mathbf{s} = (s_1, ..., s_N)$ pick a gene $i \in \{1, ... , N\}$ at random and mutate its value to create a new organism $\mathbf{\hat{s}}$
  2. If $F(\mathbf{\hat{s}}) > F(\mathbf{s})$ use $\mathbf{\hat{s}}$ as the new state, otherwise keep $\mathbf{s}$ and repeat

Clearly this is simply a greedy hill-climbing algorithm and the dynamics will end up at a nearby local maximum. The main interest of the NK model lies in how the fitness landscape itself is constructed.
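
The sketch below is a minimal illustration of this (all names and parameters are my own), assuming binary genes and the common construction in which each gene's fitness contribution is a random lookup table over its own value and $K$ randomly chosen other genes; a greedy adaptive walk then climbs to a local optimum.

# A sketch of an NK fitness landscape with binary genes and a greedy
# adaptive walk (random one-gene mutations, accepted only if fitness
# improves). The construction and names here are illustrative.

import numpy as np

rng = np.random.default_rng(7)
N, K = 15, 4

# For each gene i, choose K other genes it depends on (epistasis)
neighbours = [rng.choice([j for j in range(N) if j != i], size=K, replace=False)
              for i in range(N)]

# Random fitness contribution f_i: a lookup table over the 2^(K+1)
# possible configurations of gene i and its K neighbours
tables = rng.random(size=(N, 2 ** (K + 1)))

def fitness(genome):
    total = 0.0
    for i in range(N):
        bits = np.concatenate(([genome[i]], genome[neighbours[i]]))
        index = int("".join(str(b) for b in bits), 2)
        total += tables[i, index]
    return total / N

# Greedy adaptive walk: mutate one random gene, keep it only if fitter
genome = rng.integers(0, 2, size=N)
current = fitness(genome)
for _ in range(2000):
    candidate = genome.copy()
    candidate[rng.integers(N)] ^= 1   # flip one gene
    f = fitness(candidate)
    if f > current:
        genome, current = candidate, f

print("Fitness of local optimum reached:", round(current, 3))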

The NK model has been developed further into the NKCS model. The NK model essentially looks at genetic variation within one species; the NKCS model extends this idea to multiple species. To do this we assume each species follows its own NK dynamics, but also interacts with $S$ other species: each gene in a species is coupled to $C$ randomly chosen genes in each of the other $S$ species. The dynamics can then be described as:

  1. Sequentially select each species during a time step
  2. With the selected species mutate one of the genes
  3. Calculate the updated fitness and either accept or reject this mutation and repeat

Whereas the NK model always ends up in a local fitness maximum, this is not necessarily true of the NKCS model, since a fixed point would require every species to be at a fitness maximum at the same time. This allows the model to display more sophisticated dynamics such as criticality, self-organised criticality and co-evolution (among others).

In addition to evolution, these models have also been used as models of immunity. NKCS models have likewise been used to model technological evolution. The model is general enough that it can be applied to essentially any evolutionary system.

These models can get fairly complicated; in the future I may write a full-length post on models of evolution.

Conclusion

We have now seen that we can extend our spin-glass models to more sophisticated, and more physically plausible, models through the Potts model. We also saw that spin-glass models were a major influence on some of the first artificial neural networks. Lastly, by looking at spin-glasses from a different point of view, we have seen how they can inform our understanding of evolutionary systems.

We did not cover it here, but spin-glasses have also influenced combinatorial optimization; we touched upon this in a previous post when we introduced the concept of simulated annealing. These techniques have been used in computer chip design (where one needs to optimize the size/geometry of a chip against its performance). They have also been used in studying protein folding, where there is a large universe of possible conformations that must be searched subject to a number of constraints.