## Algorithm

A Boltzmann machine is structured in a very similar way to a Hopfield network. It is comprised of a number of 'units' or 'neurons', which are essentially the nodes or vertices of a graph model. Each unit is binary; therefore, it can only take on one of two values. Typically, these are represented as 0 and 1, though sometimes it is more convenient to represent them as —1 and +1. A Hopfield network contains exactly one unit for each component of the data vectors being stored by the network. A Boltzmann machine will likewise contain one 'visible' unit for each component of the data vectors on which it is operating. However, this model also provides for the existence of 'hidden' units, which do not correspond to any part of the data being represented. Figure 1 illustrates a small Boltzmann machine with five visible and two hidden units.

Every unit has a connection to every other unit, and the strength of each connection is known as its weight. These connections are symmetric, so that wj — ; Vi, j where wij is the connection from unit i to unit j. The effect of these weights is to influence the state of the unit by an amount equal to the weight times the state of the unit. That is, unit i would exert an influence of the

Figure 1 Simple Boltzmann machine.

amount wij if unit i is in the 'on' state, 1, and no influence at all if it is in the 'off' state 0 (though unit j may still receive influences from other units).

No unit contains a connection to itself wu — 0, Vi

However, each unit does possess a distinct bias, or threshold Oj, which is a constant influence on the unit, above and beyond the influences from all other units. This is equivalent to proposing that every unit has a connection, with a weight Oj, to some other unit which is always in the 'on' state. If a network contains Nunits, and we consider the weights to be fixed, then the network can be in any one of 2N distinct states. The Boltzmann machine uses the same energy function as the Hopfield network to assign a level of 'energy' to each of these possible states

where si and sj are the activities of ith and jth units, respectively.

It is easy to see from this that the energy of the network will be lowest when units with positive weights are on together, and units with negative weights are not on together. The difference in the energy of the network by changing the state of just the kth unit is

A Hopfield network uses a deterministic rule, the binary threshold rule, to update the states of the units so as to monotonically decrease the energy of the network. This rule, however, ensures convergence only with some local minima, not the global minima of the energy function. In order to help escape from such local minima and achieve a global minimum energy, the Boltzmann machine uses the following probabilistic rule:

where pk is the probability that the kth unit is in the on state, and T is a global 'temperature' parameter.

The plot below (Figure 2) shows this probability distribution of pk over values of AEk and T. Since this probability distribution controls the change of state of the units, it will also determine whether the overall system energy, Ek, rises or falls. As one can see, as T gets larger, the probability distribution becomes nearly 0.5 for all energies. This means that there is an equal chance, at high temperatures, for the network to move 'uphill' toward higher energies as there is for it to move 'downhill' to lower energies. As the temperature lowers, however, the distribution becomes increasingly sigmoidal, with the probability being almost 1 for large positive values of AEk, and almost 0 for large negative values, while staying at 0.5 for AEk of 0. This means that at low temperatures, the network will almost certainly move only in the 'downhill' direction, settling towards lower energies. Once the network reaches a state in which it is very unlikely for the state of any unit to switch, it is said to have reached 'thermal equilibrium'.

Allowing the network to occasionally make uphill jumps in energy allows for the possibility that the network can overcome the energy barriers which separate local minima from deeper global minima. These uphill jumps in energy can be made more probable by adjusting the temperature parameter. If one follows a regimen of starting with a high temperature and gradually lowering it while the units are updating, it will eventually bring the network through the optimal distribution for overcoming each energy barrier, toward a globally minimal energy. This technique is known as 'simulated annealing', as it is similar to the annealing technique for obtaining the most stable configuration in glass or metal by heating the material evenly and then allowing it to cool slowly. The resulting material is very strong, because the internal configuration of molecules has reached some point nearing a globally optimal low energy, and so internal stresses are minimized (see Simulated Annealing). This method for using simulated annealing is very effective in Boltzmann machines, because the dimensionality of the space of the energy landscape is so large. In such a high-dimensional space, there are many more paths available to overcome an energy barrier, and so the probability that one of them will be followed increases dramatically.

The connection between the Boltzmann machine algorithm and the physical theory of thermodynamics is quite intentional. It can be shown that the probability of the network being in one of two global states at thermal equilibrium follows a Boltzmann distribution:

Pa Pb

-eb)/t where PA (PB) is the probability of being in global state A (B) and EA (EB) is the energy of that state. The ideal state of the network has the same distribution over the visible units as the distribution over observed data vectors sampled from the environment. In order to find such a configuration of weights, the learning rule must compare how the network behaves, in response to data vectors, to how it responds to random vectors. By reducing the distance between these two distributions, the network's natural distribution moves toward vectors which are highly probable, given the data.

Figure 2 Joint probability distribution of energy differences and temperature: probability (z-axis), energy differences (y-axis) and temperature (x-axis).

In order to sample these two distributions, the network must be run many times in two phases, the positive phase (p+), and the negative (p_) phase. In the positive phase, the network will be allowed to settle to thermal equilibrium a number of times with the visible units 'clamped', or fixed, to some data vector taken from the training data set. In the negative phase, the network settles to thermal equilibrium with the visible units allowed to change freely. An information theoretic measure of the distance, measured in bits, between these two sampled distributions can be used to derive a learning rule:

This learning rule is essentially a descent along a gradient representing the distance between the probability distributions of the positive and negative phases.

This learning algorithm is similar to the estimation-maximization (EM) rule used in many statistical models, wherein the system alternates between an estimation phase where it measures how well each data point fits to a number of competing sets of model parameters, and a maximization phase where the model parameters are updated to maximize their probabilities for their assigned data vectors. The result is a set of models, each of which specializes in describing only some subset of the data, but which, together, give a good description of the probability distribution over the entire set of observations. The Boltzmann machine acts in a similar manner where each hidden unit is like an independent 'model' of the distribution over the visible units, and as the learning algorithm goes through the positive and negative phases, these units are adjusted in a way similar to a model going through the estimation and maximization phases in EM.

### Restricted Boltzmann Machine

One serious drawback to the Boltzmann machine architecture is that it often can take quite a long time for the network to reach thermal equilibrium, and this must be done numerous times for the training procedure to be effective. Ultimately, a great deal of computational power must be expended to properly train such a network, as compared to other ANN models. This is largely due to the nonlinear way in which influences propagate through the network, because every unit is connected to every other unit. By restricting some of the connectivity of the network, it is possible to vastly increase the time to reach thermal equilibrium without sacrificing the generality of the model. A restricted Boltzmann machine, also known as a harmonium, does not allow connections between the hidden units, or between the visible units. The only connections, therefore, are between hidden and visible units. The updating of the unit states can be performed much

Figure 3 Restricted Boltzmann machine.

more quickly, as all of the hidden units can be updated simultaneously when the visible units are fixed, and vice versa. It can also reach thermal equilibrium very quickly -in just one step for the p+. It has been shown that a restricted Boltzmann machine can model any probability distribution over the visible units to an arbitrary degree of precision. Figure 3 illustrates the design of a restricted Boltzmann machine.

### Applications of the Boltzmann Machine

As with the back-propagation and Hopfield networks, the Boltzmann machine can be used to classify data in a manner similar to nonlinear regression. It has found limited applications in genetics and bioinformatics in the prediction of splicing sites from nucleotide sequences, protein-DNA interactions, sequencing protein structure, and classifying patients or specimens for applications in medicine. A Boltzmann machine might find application in any context where a nonlinear model describes the relationship between two data sets, especially when these data can be easily transformed into a binary vector. Since neural networks are in general robust to noisy data, it may also be a useful technique when the data quality is too poor for other forms of analysis to yield reasonable results.

One new area of application is the use of the Boltzmann machine to simulate learning in an agent-based simulation where multiple agents are adapting to environmental change. Agent-based simulation models represent a system from the bottom-up. Instead of using a series of equations to model how the large components of the system change over time, an agent-based model represents each individual organism in the environment and can even be used to represent organisms, such as bacteria, in other organisms, such as insect vectors. Because the Boltzmann machine can create a model of the underlying structure of a dataset, in principle, it can learn the underlying structure of its environment. Creating these types of simulation models requires a deeper background of object-oriented programming

(see Computer Languages) and the reinforcement learning algorithms that reward the agent for choosing the right action. Although a complete discussion on this topic is beyond the scope of this article, this is a potentially fruitful research direction in ecology.