In this blog post we look at the concept of "entropy", which appears under a number of guises. We will visit several areas in which entropy arises and compare their similarities and differences. In the interests of brevity this will be a whistle-stop tour of each area and we will not cover all the details: each area has enough results and theorems to fill a series of blog posts by itself. Here we are concerned only with the concept of entropy.

(Note: This blog post follows the research by Rudolf Hanel and Stefan Thurner and in particular the first half of the 6th chapter of the book: "Introduction to the Theory of Complex Systems" by Stefan Thurner, Rudolf Hanel and Peter Klimek.)

## Entropy in Thermodynamics and Statistical Mechanics

When we mention "entropy" what many people first think of is a measure of the "disorder" or "uncertainty" of a system. We imagine a system in which an individual configuration is called a microstate. A microstate can be associated with a global behaviour called a macrostate. For each macrostate, if we take (the logarithm of) the number of microstates that lead to the macrostate then we have the "entropy" quantity (Boltzmann entropy). A macrostate that has more generating microstates is considered to be more "disordered". To take a simple example, in water ($H_2 O$) in solid (ice) form the individual molecules are (somewhat) forced into position in the crystalline structure. As such there are relatively few different orientations of molecules and hence a low entropy. As we move into the liquid phase the molecules move around more freely and so there are many more positions which the molecules can take, leading to a higher entropy. In the gaseous phase the molecules are whizzing around with higher energies, even more configurations are possible and so the entropy increases yet further.
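The microstate-counting idea above can be sketched with a toy system. The coin model here is an illustrative assumption and not part of the original discussion: a microstate is one particular sequence of heads and tails, and a macrostate is just the total number of heads.

```python
from math import comb, log

def boltzmann_entropy(n_coins: int, n_heads: int) -> float:
    """ln(number of microstates) for the macrostate 'n_heads heads out of n_coins'."""
    omega = comb(n_coins, n_heads)  # distinct head/tail sequences giving this count
    return log(omega)

n = 100  # hypothetical system size
entropies = {k: boltzmann_entropy(n, k) for k in range(n + 1)}

# The macrostate with the most generating microstates is the most "disordered".
most_disordered = max(entropies, key=entropies.get)
print(most_disordered)   # 50
print(entropies[0])      # 0.0 (only one way to get all tails)
```

The all-tails macrostate has a single microstate and zero entropy, while the half-heads macrostate has the largest count, which mirrors the ice versus liquid comparison in spirit.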

The first definition of entropy comes from Rudolf Clausius in the mid 1800s through work in thermodynamics. He asserted that internal energy ($U$), temperature ($T$), pressure ($P$) and volume ($V$) are related via entropy ($S$) as: $$dU = T dS - P dV$$ Through this definition entropy becomes a way of relating the macroscopic behaviours of a system. This holds for reversible processes that are quasi-stationary (essentially: there can be an absorbing state that we reach almost-surely but initial conditions are such that the system will take a long time to reach it). We sometimes re-arrange this formulation slightly to give: $$\frac{dQ}{T} = dS$$ where $Q$ is the heat added to (or removed from) the system. There are a couple of extra properties we require of entropy. Firstly it must be additive: if for example we take 2 identical systems and combine them the entropy must double. In other words, for systems $A$ and $B$: $$S_{A+B} = S_A + S_B$$ Secondly the second law of thermodynamics must hold, that is, for a closed system (strictly, an isolated one) entropy must never decrease: $$dS \geq 0$$ This means that if we know the macrostate at a given time then the state at a later time must be of equal or higher entropy. Moreover if the closed system is in equilibrium: $dS = 0$.

Another definition of entropy via statistical mechanics is: $$S_B = k_B \ln \Omega$$ where $k_B$ is a unit-correcting constant and $\Omega$ is the number of microstates consistent with a macrostate. If the microstates are not equally probable we can use the Gibbs-Boltzmann form of entropy: $$S_B = -k_B \sum_i p_i \ln p_i$$ This can be shown to be equivalent to Clausius entropy by invoking the Boltzmann distribution, that is: $$p_i = \frac{1}{Z} \exp\left( \frac{-E_i}{k_B T} \right)$$ where $E_i$ is the energy corresponding to the microstate, $T$ is the temperature and $Z$ is the partition function (normalizing constant).
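As a minimal sketch of the two formulas above, the snippet below builds a Boltzmann distribution over a handful of energy levels and evaluates the Gibbs-Boltzmann entropy. The energy levels, the temperatures and the choice $k_B = 1$ are assumptions made purely for illustration.

```python
from math import exp, log

def boltzmann_distribution(energies, temperature, k_b=1.0):
    """p_i = exp(-E_i / (k_B T)) / Z for a list of microstate energies."""
    weights = [exp(-e / (k_b * temperature)) for e in energies]
    z = sum(weights)  # partition function (normalizing constant)
    return [w / z for w in weights]

def gibbs_entropy(probs, k_b=1.0):
    """S = -k_B * sum_i p_i ln p_i, skipping zero-probability states."""
    return -k_b * sum(p * log(p) for p in probs if p > 0.0)

energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical levels, arbitrary units
cold = boltzmann_distribution(energies, temperature=0.1)
hot = boltzmann_distribution(energies, temperature=1000.0)

s_cold = gibbs_entropy(cold)  # nearly all weight on the ground state
s_hot = gibbs_entropy(hot)    # nearly uniform over the 4 states
s_uniform = log(len(energies))  # k_B ln(Omega) with Omega = 4
```

At high temperature the states become almost equally probable and the Gibbs-Boltzmann value approaches $k_B \ln \Omega$, so the first formula appears as the equal-probability special case of the second.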

To show the equivalence between the Gibbs-Boltzmann and Clausius formulations of entropy: \begin{align} dS_B &= -k_B \sum_i dp_i \ln p_i \\ &= -k_B \sum_i dp_i \left(\frac{-E_i}{k_B T} - \ln Z\right) \\ &= \sum_i \frac{E_i dp_i}{T} \\ &= \sum_i \frac{d(E_i p_i) - (dE_i)p_i}{T} \\ &= \frac{dE - dW}{T} \\ &= \frac{dQ}{T} \end{align} which matches the Clausius form above. The first and third lines use $\sum_i dp_i = 0$ (the probabilities always sum to one), which removes the $\sum_i p_i \, d\ln p_i$ and $\ln Z$ terms. In the last line we used the first law of thermodynamics, $dE = dW + dQ$: the change in energy is the work done plus the heat added. We also used $\sum_i d(E_i p_i) = dE$ (change in average energy) and $\sum_i (dE_i) p_i = dW$ (work done). For this to hold we require that the system is in thermal equilibrium.
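The derivation above can be sanity-checked numerically. The sketch below assumes fixed, made-up energy levels (so $dE_i = 0$ and hence $dW = 0$), sets $k_B = 1$, nudges the temperature slightly and compares the change in Gibbs-Boltzmann entropy with $dQ/T = dE/T$.

```python
from math import exp, log

K_B = 1.0
ENERGIES = [0.0, 1.0, 2.0, 3.0]  # hypothetical fixed levels, so no work is done

def state(temperature):
    """Return (mean energy, Gibbs-Boltzmann entropy) at the given temperature."""
    weights = [exp(-e / (K_B * temperature)) for e in ENERGIES]
    z = sum(weights)
    probs = [w / z for w in weights]
    mean_energy = sum(p * e for p, e in zip(probs, ENERGIES))
    entropy = -K_B * sum(p * log(p) for p in probs)
    return mean_energy, entropy

T, dT = 1.5, 1e-4
e_lo, s_lo = state(T - dT / 2)
e_hi, s_hi = state(T + dT / 2)

dS = s_hi - s_lo                 # change in Gibbs-Boltzmann entropy
dQ_over_T = (e_hi - e_lo) / T    # with dW = 0 the first law gives dQ = dE

print(dS, dQ_over_T)  # the two values should agree to many digits
```

This only checks one small step at one temperature, so it is a consistency check under the equilibrium assumption rather than a proof.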

This equivalence is quite an impressive and powerful result: we now have a mechanism by which we can relate the macrostates of a system to the microstates. That is, we have a link between the microscopic behaviour (particle behaviour) and the macroscopic behaviour (phase of matter). The key assumption however is that the system is in equilibrium.

## Entropy in Information Theory

Another area where entropy appears is in information theory. Put very simply, information theory is concerned with the problem of encoding messages and how much information is required to encode a message so that it can be decoded again with a certain level of accuracy. If we know what messages we want to send and a noise-rate then we can work out what capacity of channel we would require.

In the context of information theory entropy is the term used to quantify "uncertainty" in a message/signal. High entropy represents a large uncertainty and vice versa. Typically this is taken to be the Shannon entropy: $$S_S = - \sum_i p_i \ln p_i$$ This looks somewhat familiar now we have looked at Boltzmann entropy ($S_B$), however we should note these represent fundamentally different things and the shared form does not make them the same quantity.

Why is Shannon entropy a good measure in this setting? Let's consider an example where we have $N$ signals with associated probabilities: $(p_1, p_2, ... , p_N)$. If we have a deterministic case: $(p_1=1, p_2=0,...,p_N=0)$ then the Shannon entropy is zero, noting: $$\lim_{x \to 0} x \ln x = 0$$ and $$1 \ln 1=0$$ We can also see that Shannon entropy is maximised for $p_i = \frac{1}{N}$, the uniform distribution. This fits with our notion of "maximum uncertainty". For two systems $A=(p_1, p_2, ... , p_N)$ and $B=(q_1, q_2, ..., q_M)$ we also have: $$S_S(AB) = S_S(A) + S_S(B | A)$$ where $(B|A)$ represents the distribution of $B$ conditional on $A$. If $A$ and $B$ are fully independent then $(B|A)=B$ (for example flipping two coins, the probability of heads on one coin will not impact the probability of heads on the other, unless you taped the coins together or something equally silly).
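The three properties just described (zero entropy for a deterministic source, a maximum at the uniform distribution, and the composition rule) can be checked on small examples. The specific distributions below are made up for illustration.

```python
from math import log, isclose

def shannon_entropy(probs):
    """S = -sum_i p_i ln p_i, using the convention 0 ln 0 = 0."""
    return -sum(p * log(p) for p in probs if p > 0.0)

n = 8
deterministic = [1.0] + [0.0] * (n - 1)
uniform = [1.0 / n] * n
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01]  # hypothetical source

s_det = shannon_entropy(deterministic)
s_uni = shannon_entropy(uniform)
s_skew = shannon_entropy(skewed)

# Composition rule S(AB) = S(A) + S(B|A) for a dependent pair of systems.
p_a = [0.3, 0.7]                          # distribution of A (assumed)
p_b_given_a = [[0.9, 0.1], [0.2, 0.8]]    # rows: distribution of B for each state of A
joint = [pa * pb for pa, row in zip(p_a, p_b_given_a) for pb in row]

s_joint = shannon_entropy(joint)
s_conditional = sum(pa * shannon_entropy(row) for pa, row in zip(p_a, p_b_given_a))
chain_rule_holds = isclose(s_joint, shannon_entropy(p_a) + s_conditional)
```

Here `s_conditional` is the average, over the states of $A$, of the entropy of $B$ given that state, which is what $S_S(B|A)$ denotes in the formula above.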

By defining entropy in this way we can derive Shannon's noisy channel theorem (we will not prove this here). In layman's terms this states that: "we can create a coding scheme for a message that allows it to be transmitted (essentially) error free if the information capacity of the channel is greater than the source entropy." We have been a little sloppy in not defining all the terms here but we only need the gist for our purposes: entropy is the quantity that sets the limit.

However, we have not applied any restriction to our "source" of the signals/messages. Since we are dealing with uncertainty we typically rely on a probabilistic description. Further for the statement above to hold for $S_S$ we require the information source process to be:

1. Markov
2. Ergodic

The first criterion states that the next signal to be sent depends only on the current signal, not the entire history of sent signals. The second criterion ensures that the process exhibits stationarity (rates for individual signals do not vary in time) and moreover that there is no state we can start a trajectory in that makes it impossible to reach any other state. This assumption underlies most of the classic information theory results.
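To make the two criteria concrete, here is a minimal sketch of a two-signal source. The transition probabilities, run length and seeds are assumptions chosen for illustration. Each step depends only on the current signal (Markov), and because every state can reach every other, the long-run signal rates come out the same regardless of the starting state (one practical consequence of ergodicity).

```python
import random

# Hypothetical two-signal source: TRANSITIONS[i][j] = P(next = j | current = i)
TRANSITIONS = [[0.9, 0.1],
               [0.3, 0.7]]

def simulate(n_steps, start, seed=0):
    """Return the observed frequency of each signal over one long trajectory."""
    rng = random.Random(seed)
    state, counts = start, [0, 0]
    for _ in range(n_steps):
        counts[state] += 1
        # Markov property: the next signal depends only on the current one
        state = 0 if rng.random() < TRANSITIONS[state][0] else 1
    return [c / n_steps for c in counts]

# Stationary distribution solving pi = pi P for the matrix above
stationary = [0.75, 0.25]

freq_from_0 = simulate(200_000, start=0, seed=0)
freq_from_1 = simulate(200_000, start=1, seed=1)
print(freq_from_0, freq_from_1)  # both should sit close to the stationary rates
```

A path-dependent process, where early signals permanently change later rates, would fail this check because different trajectories would settle on different frequencies.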

However Shannon also proposed 3 axioms (SK1, SK2, SK4) representing fundamental requirements that an entropy measure $S(p)$ should have for dealing with Markov-Ergodic ("simple") information sources. Later Khinchin added a 4th (SK3).

• SK1. - Entropy $S(p)$ must be continuous and depend only on arguments $p_i$ - probability of individual states and nothing else
• SK2. - $S(p)$ must take its maximum for $p_i = \frac{1}{N}$
• SK3. - Adding an additional state with zero probability does not change the entropy value
• SK4. - The composition of 2 systems should result in: $S(AB) = S(A) + S(B|A)$ - if the systems are fully independent this becomes $S(AB) = S(A) + S(B)$

Using these axioms we find that any functional $S(p)$ satisfying them must be of the form: $$S(p) = - k \sum_i p_i \ln p_i$$ which is nothing more than Shannon entropy up to a multiplicative constant.

## Statistical Inference

In statistical inference we are trying to answer the question: what is the best model that captures some data? Or given some model what is the most likely outcome? There are a number of methods we can use to do this, one of which is the "maximum entropy principle" which loosely states that the distribution with the highest entropy given our current data is our "best guess" for the distribution that generated it. This was originally popularized by Jaynes in the late 1950s.

To illustrate the method let's imagine that we have $W$ possible states of a system (we write $W$ here to keep the number of states distinct from the number of observations). We have observations $k=(k_1, k_2, ... , k_W)$ for how often each state has occurred in our data, with $N = \sum_i k_i$ observations in total. The process generating the data has "true" distribution $q=(q_1, q_2, ... , q_W)$ with $\sum_i q_i = 1$. Assuming independence, the probability of our observation follows a multinomial distribution: $$P(k|q) = \frac{N!}{k_1! k_2!...k_W!} q_1^{k_1} q_2^{k_2} ... q_W^{k_W}$$ which we can factorize as: $$P(k|q) = M(k) G(k|q)$$ with: \begin{align} M(k) &= \frac{N!}{k_1! k_2!...k_W!} \\ G(k|q) &= q_1^{k_1} q_2^{k_2} ... q_W^{k_W} \end{align} We call the function $M(k)$ the multiplicity and $G(k|q)$ the probability. If we take logarithms and divide by $N$ this becomes: $$\frac{1}{N} \ln P(k|q) = \frac{1}{N} \ln M(k) + \frac{1}{N} \ln G(k|q)$$ The left hand side is (minus) the relative entropy (or Kullback-Leibler divergence), the term containing $M(k)$ is the entropy and the final term is (minus) the cross entropy.

Via Stirling's approximation we have: $$M(k) \approx \frac{N^N}{k_1^{k_1}k_2^{k_2}...k_W^{k_W}}$$ If we denote the observed relative frequencies by: $$p_i = \frac{k_i}{N}$$ then: $$S_J = \frac{1}{N} \ln M(k) \approx - \sum_i p_i \ln p_i$$ which is the (now familiar) entropy functional. Substituting $p_i$ into the other two terms in the same way relates the relative entropy, entropy and cross entropy via: $$- \sum_i p_i \ln \frac{p_i}{q_i} = - \sum_i p_i \ln p_i + \sum_i p_i \ln q_i$$ We can make the assumption $q_i = \exp(-\alpha - \beta \epsilon_i)$.
And then: $$\frac{1}{N} \ln P(k|q) = - \sum_i p_i \ln p_i - \alpha \sum_i p_i - \beta \sum_i p_i \epsilon_i$$ By differentiating with respect to each $p_i$ and setting to zero we end up with a set of equations we can solve for $p^*_i$, the maximum entropy distribution: $$0 = \frac{\partial}{\partial p_i} \frac{1}{N} \ln P(k|q) = \frac{\partial}{\partial p_i} \left(- \sum_j p_j \ln p_j - \alpha \sum_j p_j - \beta \sum_j p_j \epsilon_j \right) = - \ln p_i - 1 - \alpha - \beta \epsilon_i$$ so that $p^*_i = \exp(-1 - \alpha - \beta \epsilon_i)$. The constants $\alpha$ and $\beta$ act as Lagrange multipliers here: $\alpha$ ensures normalization of $p_i$ and $\beta$ ensures the first moment (mean) is correct. If we have additional information (e.g. we know higher order moments) we can build this in via additional Lagrange multipliers.
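As a rough sketch of how the multipliers are fixed in practice, the snippet below takes the exponential form $p_i \propto e^{-\beta \epsilon_i}$, absorbs $\alpha$ into the normalization, and finds $\beta$ by bisection so that the mean matches a target. The six-valued example and the target mean of 4.5 are illustrative assumptions, and bisection is just one simple way to solve for $\beta$.

```python
from math import exp

def maxent_distribution(values, beta):
    """p_i proportional to exp(-beta * eps_i); normalization plays the role of alpha."""
    weights = [exp(-beta * v) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def solve_beta(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on beta; the mean is monotonically decreasing in beta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(values, maxent_distribution(values, mid)) > target_mean:
            lo = mid  # mean too high, so beta must increase
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

values = [1, 2, 3, 4, 5, 6]   # hypothetical state values eps_i (a six-sided die)
beta = solve_beta(values, target_mean=4.5)
p_star = maxent_distribution(values, beta)

beta_fair = solve_beta(values, target_mean=3.5)  # no bias needed: beta near 0
```

With a target mean above 3.5 the solved $\beta$ is negative and the probabilities increase with the face value. With a target of exactly 3.5 the constraint carries no extra information and the uniform distribution is recovered.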

We can see that entropy in this sense has the same functional as before. The key assumption however is that the process is multinomial.
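The Stirling step used above, relating the multiplicity to the entropy functional, can be checked numerically. The sketch below computes $\frac{1}{N} \ln M(k)$ exactly via log-gamma functions and compares it with $-\sum_i p_i \ln p_i$ for growing sample sizes. The 50/30/20 split of observations is an arbitrary illustrative choice.

```python
from math import lgamma, log

def log_multiplicity(counts):
    """ln of N! / (k_1! k_2! ...), using ln(n!) = lgamma(n + 1)."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def entropy_of_frequencies(counts):
    """-sum_i p_i ln p_i with p_i = k_i / N."""
    n = sum(counts)
    return -sum((k / n) * log(k / n) for k in counts if k > 0)

gaps = {}
for n in (10, 1_000, 100_000):
    counts = [5 * n // 10, 3 * n // 10, 2 * n // 10]  # hypothetical 50/30/20 split
    exact = log_multiplicity(counts) / n
    approx = entropy_of_frequencies(counts)
    gaps[n] = approx - exact
    print(n, exact, approx)
```

The gap shrinks as $N$ grows, which is why the identification of $\frac{1}{N} \ln M(k)$ with the entropy functional is a large-sample statement.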

## Equivalence of Entropy

We have now seen 3 classic examples of where the term entropy is used: the Boltzmann, Shannon and Jaynes entropy functionals. Since the functionals are identical (up to a multiplicative constant) it is natural for us to consider them equivalent. This is a mistake: although they are all called "entropy" and have the same form, they represent different things.

It is worth noting that for each entropy functional to make sense we make some pretty cavalier assumptions about the underlying systems:

1. Boltzmann entropy assumes a process is in equilibrium
2. Shannon entropy assumes an ergodic process
3. Jaynes entropy assumes a multinomial process

But what happens if we remove these assumptions?

## Entropy and Complexity

Complex systems by their nature do not follow the assumptions above. For example most complex systems are non-ergodic. In particular they may be "evolutionary" systems that show path dependent behaviour which violates ergodicity. They may also display long term "memory" of previous states, which again is non-ergodic. So what can we say in these cases?

Using the Shannon-Khinchin axioms above we can see the most restrictive is SK4, a requirement built off the properties of an ergodic system. What if we simply drop this axiom? Hanel, Thurner and co-authors have shown we can do a lot with the following axioms:

• SK1. - Entropy $S(p)$ must be continuous and depend only on arguments $p_i$ - probability of individual states and nothing else
• SK2. - $S(p)$ must take its maximum for $p_i = \frac{1}{N}$
• SK3. - Adding an additional state with zero probability does not change the entropy value
• $S(p)$ takes a trace form: $S(p) = \sum_i g(p_i)$

We can re-write these axioms further by taking the form: $S(p) = \sum_i g(p_i)$ and noting that:

• $g(.)$ must be continuous (by SK1)
• $g(.)$ must be concave (by SK2)
• $g(0) = 0$ (by SK3)

If we define: $$S_g(N) = \sum_i g\left( \frac{1}{N} \right) = N g\left( \frac{1}{N} \right)$$

We can then elicit scaling behaviour: \begin{align} \lim_{N \to \infty} \frac{S_g(N \lambda)}{S_g(N)} &= \lambda^{1-c} \\ \lim_{N \to \infty} \frac{S_g(N^{1+a})}{S_g(N)N^{a(1-c)}} &= (1+a)^d \end{align} There are therefore 2 constants that define scaling laws for an entropy of this trace form. We have $0 \leq c \leq 1$ and $d$ real valued. If $c>1$ then SK2 is violated and if $c<0$ then SK3 is violated. Moreover it has been shown that an entropy functional can be of the form: \begin{align} S_{c,d}(p) &= \frac{r}{c} A^{-d} e^A \left( \sum_i \Gamma(1+d, A - c \ln p_i) - rp_i \right) \\ A &= \frac{cdr}{1 - (1-c)r} \\ \Gamma(a,b) &= \int_b^{\infty} t^{a-1} e^{-t} dt \end{align} where $c$ and $d$ are the same as in the scaling properties above. There is an additional constant $r$, taking a range of values dependent on $c$ and $d$ to ensure the expression makes sense, which does not affect the scaling behaviour. We can thus use the constants $c$ and $d$ to create equivalence classes of complex systems.

(Note: the expression of $S_{c,d}$ is not unique and simpler expressions exist, however this is a general form that allows for the widest range of $c$ and $d$ values available).
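The two scaling limits above can be probed numerically by evaluating $S_g(N) = N g(1/N)$ at a very large but finite $N$. The sketch below does this for the Shannon form $g(x) = -x \ln x$ and for a Tsallis form with $q = 0.7$. The choices of $N$, $\lambda$, $a$ and $q$ are arbitrary, and the exponent $d$ is estimated using the known value of $c$ rather than the estimated one.

```python
from math import log

def s_g(g, n):
    """Entropy of the uniform distribution on n states: S_g(n) = n * g(1/n)."""
    return n * g(1.0 / n)

def shannon_g(x):
    return -x * log(x)

def tsallis_g(x, q=0.7):
    # trace form of the Tsallis entropy: sum_i (p_i - p_i**q) / (q - 1)
    return (x - x**q) / (q - 1.0)

def estimate_c(g, n, lam=2.0):
    """Invert S_g(lam * n) / S_g(n) = lam**(1 - c) at finite n."""
    ratio = s_g(g, lam * n) / s_g(g, n)
    return 1.0 - log(ratio) / log(lam)

def estimate_d(g, n, c, a=1.0):
    """Invert S_g(n**(1+a)) / (S_g(n) * n**(a*(1-c))) = (1 + a)**d at finite n."""
    ratio = s_g(g, n ** (1.0 + a)) / (s_g(g, n) * n ** (a * (1.0 - c)))
    return log(ratio) / log(1.0 + a)

N = 1e100  # a large stand-in for the N -> infinity limit

c_shannon = estimate_c(shannon_g, N)          # slowly approaches 1
d_shannon = estimate_d(shannon_g, N, c=1.0)   # close to 1
c_tsallis = estimate_c(tsallis_g, N)          # close to q = 0.7
d_tsallis = estimate_d(tsallis_g, N, c=0.7)   # close to 0
```

The estimate of $c$ for the Shannon form converges only logarithmically in $N$, so even at $N = 10^{100}$ it is slightly below 1. The other three estimates settle much faster.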

What happens if we take $c=1$ and $d=1$ as an example? Then $A=r$ and via some work (and the fact $\Gamma(2,x) = e^{-x}(x+1)$) we get: $$S_{1,1}(p) = (r+1) - re^r - \sum_i p_i \ln p_i$$ which we see is equivalent to the Boltzmann/Shannon/Jaynes entropy above (the additive constant notwithstanding, which is a consequence of allowing SK4 to be relaxed; careful selection of $r$ can remove this constant).
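The $c = d = 1$ reduction can be verified numerically with nothing more than the closed form for $\Gamma(2, x)$ quoted above. The test distribution and the value of $r$ below are arbitrary illustrative choices.

```python
from math import exp, log

def upper_gamma_2(x):
    """Gamma(2, x) = exp(-x) * (x + 1), the closed form quoted in the text."""
    return exp(-x) * (x + 1.0)

def s_11(probs, r):
    """S_{c,d}(p) evaluated at c = d = 1, where A = r."""
    a = r
    prefactor = (r / 1.0) * a**-1.0 * exp(a)  # (r/c) * A**(-d) * e**A
    return prefactor * sum(upper_gamma_2(a - log(p)) - r * p for p in probs)

def shannon(probs):
    return -sum(p * log(p) for p in probs)

probs = [0.5, 0.3, 0.2]  # hypothetical distribution with all p_i > 0
r = 0.5                  # arbitrary choice of the free constant

direct = s_11(probs, r)
expected = (r + 1.0) - r * exp(r) + shannon(probs)
print(direct, expected)  # the two values should coincide
```

The check treats the $-r p_i$ term as part of the sum over $i$, which is the reading that reproduces the additive constant $(r+1) - re^r$.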

So far we have limited ourselves to a trace form function: $S_{c,d}(p) = \sum_i g_{c,d}(p_i)$ - however this is not a requirement. We can use entropy functionals that are not of trace form and still use the same $(c,d)$ criteria for equivalence classes. The table below shows some entropy functionals proposed and their equivalence class:

| Name | Entropy functional form | $c$ | $d$ |
|---|---|---|---|
| Boltzmann | $-\sum_i p_i \ln p_i$ | 1 | 1 |
| Rényi | $\frac{1}{1-\alpha} \ln \sum_i p_i^{\alpha}$ | 1 | 1 |
| Tsallis | $\frac{1 - \sum_i p_i^q}{q-1}$ | $q$ | 0 |
| Abe | $- \frac{\sum_i p_i^q - p_i^{1/q}}{q - 1/q}$ | $q$ | 0 |
| Sharma-Mittal | $- \sum_i p_i^r (p_i^{\kappa} - p_i^{-\kappa}) / 2\kappa$ | $r-\kappa$ | 0 |
| Landsberg-Vedral | $\left( \left( \sum_i p_i^q \right)^{-1} - 1 \right) / (q-1)$ | $2-q$ | 0 |
| Exponential | $\sum_i p_i \left( 1 - e^{ \frac{p_i-1}{p_i}} \right)$ | 1 | 0 |
| Anteneodo-Plastino | $\sum_i \left( \Gamma \left( \frac{\eta+1}{\eta}, -\ln p_i \right) -p_i \Gamma \left( \frac{\eta+1}{\eta} \right) \right)$ | 1 | $\frac{1}{\eta}$ |
| Shafee | $-\sum_i p_i^{\beta} \ln p_i$ | $\beta$ | 1 |
| Hanel-Thurner | $\frac{r}{c} A^{-d} e^A \left( \sum_i \Gamma(1+d, A - c \ln p_i) - rp_i \right)$ | $c$ | $d$ |

From this we can see that equivalence classes can contain functionals that appear very different from each other. For example with the right choice of parameters the Landsberg-Vedral entropy displays the same scaling as the Sharma-Mittal entropy.

For any trace form of entropy we can denote the generalized logarithm of that entropy as: $$\Lambda(p_i) = - \frac{d}{dp_i} g(p_i)$$

We can apply scaling to $g$ to ensure $\Lambda(1) = 0$ and $\Lambda ' (1) = 1$ (this does not change the equivalence class of the entropy functional). This determines the probability distribution function of the system via: $$p(x) = \Lambda^{-1}(-x)$$

Given: $$S_{c,d}(p) = \frac{r}{c} A^{-d} e^A \left( \sum_i \Gamma(1+d, A - c \ln p_i) - rp_i \right)$$ Then: $$\Lambda_{c,d,r}(x) = r \left( 1 - x^{c-1} \left(1 - \frac{1-(1-c)r}{dr} \ln x \right)^d \right)$$ which we can invert to give: $$p_{c,d,r}(x) \sim \exp \left( -\frac{d}{1-c} W_k \left( \frac{(1-c)r}{1-(1-c)r} \exp \left(\frac{(1-c)r}{1-(1-c)r}\right) \left(1 + \frac{x}{r}\right)^{\frac{1}{d}} \right) \right)$$ where $W_k$ is the Lambert-W function, which only has real solutions for $k=0$ ($d\geq 0$) or $k=-1$ ($d < 0$).

By taking $c=d=r=1$ we regain the Boltzmann distribution: $$p_{1,1,1}(x) = e^{-x}$$

Similarly for $c=1$ we have a stretched exponential distribution: $$p_{1,d,r}(x) = e^{-dr\left(\left(1+ \frac{x}{r}\right)^{\frac{1}{d}}-1\right)}$$

For $d=0$ and $r=\frac{1}{1-c}$ we have a power-law: $$p_{c,0,r}(x) = (1+(1-c)x)^{\frac{-1}{1-c}}$$
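The three special cases above can be cross-checked by inverting the generalized logarithm numerically instead of going through the Lambert-W expression. The sketch below solves $\Lambda_{c,d,r}(p) = -x$ for $p$ by bisection and compares the result with each closed form. It assumes $d \geq 0$, treats $d = 0$ as the limiting case in which the bracketed factor equals 1, and uses arbitrary parameter values.

```python
from math import exp, log

def generalized_log(p, c, d, r):
    """Lambda_{c,d,r}(p) as given in the text; d = 0 is handled as the limit."""
    if d == 0:
        return r * (1.0 - p ** (c - 1.0))
    b = (1.0 - (1.0 - c) * r) / (d * r)
    return r * (1.0 - p ** (c - 1.0) * (1.0 - b * log(p)) ** d)

def distribution(x, c, d, r, iterations=200):
    """Solve Lambda(p) = -x for p in (0, 1] by bisection on t = ln p."""
    lo, hi = -700.0, 0.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if generalized_log(exp(mid), c, d, r) < -x:
            lo = mid  # p too small, move up
        else:
            hi = mid
    return exp(0.5 * (lo + hi))

x = 3.0

# c = d = r = 1: Boltzmann factor
boltzmann_numeric = distribution(x, 1.0, 1.0, 1.0)
boltzmann_closed = exp(-x)

# c = 1: stretched exponential (d = 2, r = 1.5 chosen arbitrarily)
d, r = 2.0, 1.5
stretched_numeric = distribution(x, 1.0, d, r)
stretched_closed = exp(-d * r * ((1.0 + x / r) ** (1.0 / d) - 1.0))

# d = 0, r = 1/(1-c): power law (c = 0.5 chosen arbitrarily)
c = 0.5
power_numeric = distribution(x, c, 0.0, 1.0 / (1.0 - c))
power_closed = (1.0 + (1.0 - c) * x) ** (-1.0 / (1.0 - c))
```

Bisection works here because $\Lambda$ is increasing in $p$ on $(0, 1]$ for these parameter choices; other parameter ranges may need a different bracket or solver.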

In general we find that the distribution functions will look very similar to the power-law case. This demonstrates the ubiquity of power-law relations in non-ergodic systems.

## Conclusion

In this blog post we have seen that the 3 "classic" uses of entropy (Boltzmann, Shannon and Jaynes) are in fact distinct despite having the same functional form. This leads to much confusion. Moreover we have seen that the degeneracy in functional form arises out of some pretty strong assumptions on the underlying process namely: equilibrium, Markov-Ergodicity and multinomial distributions. For some systems these assumptions are either true or reasonable approximations of reality, however in some cases these aren't good assumptions and using a Boltzmann entropy functional is asking for trouble.

We also looked at the Shannon-Khinchin axioms for defining a "good" entropy function. We saw that by dropping SK4 we were no longer bound to ergodicity. As such we considered the scaling behaviour of an entropy functional and found 2 key scaling exponents, $c$ and $d$, which we could use to create equivalence classes of entropy functionals and non-ergodic systems.

By further assuming a trace form ($S(p) = \sum_i g(p_i)$) of entropy functional we found a general form for the entropy functional. By taking the generalized logarithm of this function we find that the distribution function for the system must follow a Lambert-W exponential form. For all intents and purposes this leads to power-law behaviour which acts as justification for the ubiquity of power-laws in non-Ergodic systems.

In this blog we talk about "modelling" a lot. The result that non-Ergodic systems follow a distribution that is (at least somewhat like) a power-law suggests that our default modelling assumption for statistical systems should be a power-law type distribution unless we can specifically prove otherwise.

## References

Most of the work in this area (that I know about) is from Stefan Thurner and Rudolf Hanel. Some suggested reading is:

1. Introduction to the Theory of Complex Systems - Thurner, Hanel, Klimek (This blog post follows the first half of chapter 6 of this book)
2. https://arxiv.org/abs/1310.5959 - Generalized (c,d)-entropy and aging random walks - Hanel, Thurner
3. https://arxiv.org/abs/1211.2257 - Generalized entropies and logarithms and their duality relations - Hanel, Thurner, Gell-Mann
4. https://arxiv.org/abs/1005.0138 - A comprehensive classification of complex statistical systems - Hanel, Thurner
5. https://arxiv.org/pdf/1705.07714.pdf - The three faces of entropy for complex systems - Thurner, Corominas-Murtra, Hanel