1. Initialize the weights w to small random values.
2. Select an instance t, a pair of input and output patterns, from the training set.
3. Apply the network input vector to the network.
4. Calculate the network output vector z.
5. Calculate the error for each of the outputs k, that is, the difference ($\delta$) between the desired output and the network output.
6. Calculate the necessary weight updates $\Delta w$ in a way that minimizes this error.
7. Add the calculated weight updates $\Delta w$ to the accumulated total updates $\Delta W$.
8. Repeat steps 2-7 for several instances comprising an epoch.
9. Adjust the weights w of the network by the updates $\Delta W$.
10. Repeat steps 2-9 until all instances in the training set are processed. This constitutes one iteration.
11. Repeat the iteration of steps 2-10 until the error for the entire system (the error $\delta$ defined above, or the error on a cross-validation set) is acceptably low, or the predefined number of iterations is reached.
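The eleven steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single sigmoid output unit with a bias weight stands in for a full network, the squared error is used as the stopping criterion, and the names `train`, `epoch_size`, `eta`, and `tol` are hypothetical. As in the list, an "epoch" here is a group of instances whose updates are accumulated, and an "iteration" is one pass over the whole training set.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(training_set, epoch_size=2, eta=0.5, max_iterations=5000,
          tol=0.01, seed=0):
    """Train one sigmoid unit with accumulated updates (illustrative only)."""
    rng = random.Random(seed)
    n_inputs = len(training_set[0][0])
    # Step 1: small random weights; the last entry is a bias weight (assumption).
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
    total_error = float("inf")
    for _ in range(max_iterations):                      # step 11
        total_error = 0.0
        for start in range(0, len(training_set), epoch_size):   # step 10
            accumulated = [0.0] * len(w)                 # accumulated updates
            for x, t in training_set[start:start + epoch_size]:  # steps 2 and 8
                xb = list(x) + [1.0]
                z = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))  # steps 3-4
                delta = (t - z) * z * (1.0 - z)          # step 5
                total_error += 0.5 * (t - z) ** 2
                for i, xi in enumerate(xb):              # steps 6-7
                    accumulated[i] += eta * delta * xi
            w = [wi + d for wi, d in zip(w, accumulated)]  # step 9
        if total_error < tol:                            # stopping test of step 11
            break
    return w, total_error
```

For example, `train([([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)])` fits the logical OR of two inputs and returns the final weights together with the last measured error.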
Figure 3 Three types of transfer functions commonly used in ANN models (one panel shows the sigmoid function).
successive layer, every neuron sums its inputs and then applies a transfer function to compute its output. The output layer of the network then produces the final response, that is, the estimate of target value.
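The forward-propagating step described above can be sketched as follows. This is a minimal illustration, assuming a logistic sigmoid transfer function and no bias terms; the function name `forward` and the nested-list weight layout (one row of weights per neuron) are hypothetical choices.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    """Propagate the input vector x through successive layers.

    layers is a list of weight matrices; each matrix is a list of rows,
    one row per neuron, holding one weight per incoming value.
    """
    activations = list(x)
    for weights in layers:
        # Every neuron sums its weighted inputs, then applies the transfer function.
        activations = [sigmoid(sum(w * v for w, v in zip(row, activations)))
                       for row in weights]
    return activations  # output-layer response, the estimate of the target
```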
The backward-propagating step begins with the comparison of the network's output pattern to the target value, from which the difference (or error $\delta$) is calculated. This parameter is used during the weight-correction procedure.
If the output layer is designated by k, then its error value is

$$\delta_k = (t_k - x_k)\, f'(a_k)$$

where $t_k$ is the target value of unit k, $x_k$ is the output value for unit k, $f'$ is the derivative of the sigmoid function, $a_k$ is the weighted sum of inputs to k, and the quantity $(t_k - x_k)$ reflects the amount of error. The $f'$ part of the term forces a stronger correction when the sum $a_k$ is near the rapid rise in the sigmoid curve.
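A small sketch of the output-layer error value, assuming the logistic sigmoid, for which the derivative can be written from the output itself as $f'(a_k) = x_k (1 - x_k)$. The function name `output_delta` is hypothetical.

```python
def output_delta(t_k, x_k):
    """Error value of output unit k: (t_k - x_k) * f'(a_k).

    Assumes a logistic sigmoid, so f'(a_k) = x_k * (1 - x_k).
    """
    return (t_k - x_k) * x_k * (1.0 - x_k)

# Output in the steep part of the sigmoid (x_k = 0.5): strong correction.
mid_slope = output_delta(1.0, 0.5)
# Saturated output (x_k = 0.99) with a larger raw error: weak correction.
saturated = output_delta(0.0, 0.99)
```

Comparing `mid_slope` with `saturated` shows the effect of the derivative factor: the correction is larger in magnitude where the sigmoid rises quickly, even though the raw error is smaller there.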
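For the hidden layer (j), the error value is computed as

$$\delta_j = f'(a_j) \sum_k \delta_k\, w_{kj}$$

where the sum runs over the units k that receive input from unit j.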
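A minimal sketch of the hidden-layer error value, assuming the standard backpropagation form $\delta_j = f'(a_j) \sum_k \delta_k w_{kj}$ with a logistic sigmoid, so that $f'(a_j) = x_j (1 - x_j)$. The function name `hidden_deltas` and the layout `w[k][j]` (weight to unit k from unit j) are illustrative assumptions.

```python
def hidden_deltas(hidden_outputs, output_deltas, w):
    """Error values of the hidden units.

    hidden_outputs[j] is x_j, output_deltas[k] is delta_k, and
    w[k][j] is the weight going to output unit k from hidden unit j.
    """
    deltas = []
    for j, x_j in enumerate(hidden_outputs):
        # Error flowing back to unit j from every unit k it feeds.
        back_error = sum(output_deltas[k] * w[k][j]
                         for k in range(len(output_deltas)))
        deltas.append(x_j * (1.0 - x_j) * back_error)
    return deltas
```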
The adjustment of the connection weights is done using the $\delta$ values of the processing units. Each weight is adjusted by taking into account the $\delta$ value of the unit that receives input from that interconnection. The connection weight adjustment is done as follows:

$$\Delta w_{kj} = \eta\, \delta_k\, x_j$$
The adjustment of weight $w_{kj}$, which goes to unit k from unit j, depends on three factors: $\delta_k$ (error value of the target unit), $x_j$ (output value of the originating unit), and $\eta$. This weight-adjustment equation is known as the generalized $\delta$ rule. $\eta$ is the learning rate, commonly between 0 and 1 and chosen by the user; it reflects the rate of learning of the network. A very large value of $\eta$ can lead to instability in the network and unsatisfactory learning, whereas too small a value leads to excessively slow learning. Sometimes the learning rate is varied during training to make learning more efficient; for example, $\eta$ may be set high at the beginning and decreased as the learning session proceeds.
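The weight adjustment just described can be sketched as below, assuming the update is the product of the learning rate, the error value of the receiving unit, and the output of the originating unit. The helper `decayed_eta` shows one possible way to start with a high learning rate and lower it over time; that particular schedule is an assumption for illustration, not a rule given in the text.

```python
def update_weights(w, deltas, inputs, eta):
    """Return new weights w[k][j] + eta * delta_k * x_j."""
    return [[w[k][j] + eta * deltas[k] * inputs[j]
             for j in range(len(inputs))]
            for k in range(len(deltas))]

def decayed_eta(eta0, step, decay=0.5):
    """Hypothetical schedule: high learning rate at first, lower later."""
    return eta0 / (1.0 + decay * step)

new_w = update_weights([[0.0, 1.0]], deltas=[0.5], inputs=[1.0, 2.0], eta=0.5)
```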
The backpropagation algorithm performs gradient descent on the error surface by modifying each weight in proportion to the gradient of the surface at its location (Figure 4). Gradient descent can sometimes cause networks to get stuck in a depression in the error surface, should such a depression exist. These depressions are called 'local minima', and they correspond to a partial solution for the network in response to the training data. Ideally, we seek a global minimum (the lowest error value possible); however, a local minimum is surrounded by higher error values, and the standard algorithm usually cannot leave it. Special techniques should be used to get out of a local minimum: changing the learning parameter or the number of hidden units, but most notably using a momentum term ($\alpha$) in the algorithm. The momentum term is generally chosen between 0 and 1. Taking this last term into account, the modification of the weights at epoch t + 1 is given by

$$\Delta w_{kj}(t+1) = \eta\, \delta_k\, x_j + \alpha\, \Delta w_{kj}(t)$$
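A minimal sketch of a weight update with a momentum term, assuming the new modification is the usual learning-rate step plus $\alpha$ times the previous modification of the same weight. The function name `momentum_step` and its arguments are hypothetical.

```python
def momentum_step(w, prev_dw, delta_k, x_j, eta, alpha):
    """Apply one update to a single weight w_kj.

    prev_dw is the modification applied at the previous epoch;
    returns the new weight and the modification just applied.
    """
    dw = eta * delta_k * x_j + alpha * prev_dw
    return w + dw, dw

w_new, dw_new = momentum_step(w=1.0, prev_dw=0.5, delta_k=0.5, x_j=1.0,
                              eta=0.5, alpha=0.5)
```

Because part of the previous step is carried over, the weight keeps moving in its earlier direction even where the local gradient is small, which is what can help it cross a shallow depression in the error surface.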
The learning rate ($\eta$) and the momentum term ($\alpha$) play an important role in the learning process of BPN. If these parameters are poorly chosen, the network can oscillate or, more seriously, get stuck in a local minimum. In most of our studies, we obtained good convergence of the networks by initially setting $\alpha$ = 0.7 and $\eta$ = 0.01; the values are then modified according to the size of the error by the following algorithm:
if present_error > previous_error * 1.04 then η = η * 0.75
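The rule above can be written as a small function. Only the decrease branch appears in the text, so this sketch leaves the learning rate unchanged in every other case; that choice, the function name `adapt_learning_rate`, and the parameter names are assumptions.

```python
def adapt_learning_rate(eta, present_error, previous_error,
                        growth_limit=1.04, shrink=0.75):
    """Reduce eta when the error grew by more than 4% since the last check."""
    if present_error > previous_error * growth_limit:
        return eta * shrink
    # Assumption: no rule is given for other cases, so eta is kept as is.
    return eta
```

For instance, starting from a learning rate of 0.01, an error that rises from 1.0 to 1.1 triggers the reduction, while a rise from 1.0 to 1.02 does not.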
Figure 4 Error surface as function of a weight showing gradient and local and global minima.
Testing the network
Typically, an application of BPN requires both a training set and a test set. The first is used to train the network, and the second is used to assess the performance of the network after training is complete. In the testing phase, the input patterns are fed to the network and the desired output patterns are compared with those given by the neural network. The agreement or disagreement of these two sets gives an indication of the performance of the neural network model. The trained network should then be validated with a third data matrix that is completely independent of the first two.
If enough examples are available, the data may be divided randomly into two parts, the training and test sets. The proportion between these two sets may be 1:1, 2:1, 3:1, etc. However, the training set still has to be large enough to be representative of the problem, and the test set has to be large enough to allow correct validation of the network. This way of partitioning the data is usually called the holdout procedure; it is sometimes also referred to as k-fold cross-validation.
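A random split into training and test sets can be sketched as follows. The function name `holdout_split`, the default 2:1 proportion, and the fixed seed are illustrative assumptions.

```python
import random

def holdout_split(data, train_parts=2, test_parts=1, seed=0):
    """Randomly divide data into training and test sets.

    train_parts:test_parts gives the proportion, e.g. 2:1 or 3:1.
    """
    items = list(data)
    random.Random(seed).shuffle(items)  # random assignment, reproducible via seed
    n_train = len(items) * train_parts // (train_parts + test_parts)
    return items[:n_train], items[n_train:]

train_set, test_set = holdout_split(range(30))  # 20 training, 10 test examples
```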
If there are not enough examples available to permit splitting the data set into representative training and test sets, other strategies may be used, such as cross-validation. In this case, the data set is divided into n parts, each usually small, that is, containing few examples. The MLP may then be trained with n - 1 parts and tested with the remaining part. The same network structure is trained repeatedly, so that every part serves once as the test set in one of the n procedures. Taken together, the results of these tests determine the performance of the model. In the extreme case, the test set has only one example; this is called the leave-one-out or jackknife procedure. It is often used in ecology when only a small database is available or when each observation carries unique information different from the others.
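The partition into n parts can be sketched with index lists. This is an illustration only: the generator name `n_fold_splits` is hypothetical, and the examples are assigned to parts in order without shuffling, so data with a meaningful ordering would need to be shuffled first.

```python
def n_fold_splits(n_examples, n_parts):
    """Yield (train_indices, test_indices), one pair per part.

    Each part is used exactly once as the test set; the other
    n_parts - 1 parts form the training set.
    """
    indices = list(range(n_examples))
    parts = [indices[i::n_parts] for i in range(n_parts)]
    for i, test_idx in enumerate(parts):
        train_idx = [j for k, part in enumerate(parts) if k != i for j in part]
        yield train_idx, test_idx

three_fold = list(n_fold_splits(6, 3))
leave_one_out = list(n_fold_splits(4, 4))  # one example per test set
```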
If a network is overfitted (or overtrained), it has memorized the details of the training data. In such a case, the network does not learn the general features inherently present in the training set; instead, it learns more and more of the specific details of the training data, and so loses its capacity to generalize. Many researchers have proposed rules for approximately determining the network parameters required to avoid overfitting. Two factors control this phenomenon: the number of epochs, and the number of hidden layers together with the number of neurons in each of them. Determining appropriate values for these parameters is the most crucial matter in MLP modeling. Previously, the optimum number of epochs, hidden layers, or hidden nodes was determined by trial and error using training and test sets of data. A typical graph of training and generalization errors versus number of parameters is shown in Figure 5. The errors at first decrease rapidly as the parameter complexity increases. While the error on the training set keeps decreasing, the error on the test set can start to increase after reaching a minimum, that is, the model is no longer able to generalize. The training procedure must be stopped when the error on the test set is lowest, that is, in the zone corresponding to the best compromise between bias and variance.
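The stopping criterion described above can be sketched as a scan over the test-set errors recorded after each epoch. The `patience` parameter, which ends the scan once the error has failed to improve for a given number of epochs, is an added assumption, as is the function name `early_stopping_epoch`.

```python
def early_stopping_epoch(test_errors, patience=3):
    """Return (epoch, error) of the lowest test error seen before stopping.

    Scanning stops once the test error has not improved for
    `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(test_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # test error keeps rising: the model is starting to overfit
    return best_epoch, best_error

# Illustrative error curve: falls to a minimum at epoch 3, then rises.
errors = [0.9, 0.5, 0.3, 0.25, 0.28, 0.33, 0.4, 0.45]
stop_epoch, stop_error = early_stopping_epoch(errors)
```

In practice, the weights saved at `stop_epoch` would be the ones kept as the final model.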