Box 1 Backpropagation learning algorithm in the MLP
1. Initialize the weights w to small random values.
2. Select an instance, a pair of input and output patterns, from the training set.
3. Apply the network input vector to the network.
4. Calculate the network output vector z.
5. Calculate the error for each of the outputs k, that is, the difference ($\delta$) between the desired output and the network output.
6. Calculate the necessary updates for the weights $\Delta w$ in a way that minimizes this error.
7. Add the calculated weight updates $\Delta w$ to the accumulated total updates $\Delta W$.
8. Repeat steps 2-7 for several instances comprising an epoch.
9. Adjust the weights w of the network by the updates $\Delta W$.
10. Repeat steps 2-9 until all instances in the training set are processed. This constitutes one iteration.
11. Repeat the iteration of steps 2-10 until the error for the entire system (the error $\delta$ defined above or the error on the cross-validation set) is acceptably low, or the predefined number of iterations is reached.

Figure 3 shows the general appearance of a neuron with its connections. Each connection from the ith to the jth neuron is associated with a quantity called the weight or connection strength ($w_{ji}$). The net input (called the activation) of each neuron is the sum of all its input values multiplied by their corresponding connection weights, expressed as

$$a_j = \sum_i x_i w_{ji} + \theta_j$$

where the sum over i runs over all neurons in the previous layer, and $\theta_j$ is a bias term which influences the horizontal offset of the function (fixed value of 1). Once the activation of the neuron is calculated, we can determine the output value (i.e., the response) by applying a transfer function:

$$x_j = f(a_j)$$
Many transfer functions may be used, for example, a linear function, a threshold function, or a sigmoid function. The sigmoid function is often used because it is nonlinear; it is given by

$$f(a_j) = \frac{1}{1 + e^{-a_j}}$$
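As a minimal sketch of the two steps just described (net input, then transfer function), the following Python fragment computes the activation and output of a single neuron. The function names and the numeric values are illustrative assumptions, not part of the original text.

```python
import math

def sigmoid(a):
    """Logistic sigmoid transfer function f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(inputs, weights, bias):
    """Return (activation, output) for one neuron.

    activation: a_j = sum_i x_i * w_ji + theta_j
    output:     x_j = f(a_j)
    """
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation, sigmoid(activation)

# Hypothetical example values, chosen only for illustration.
a_j, x_j = neuron_output(inputs=[1.0, 0.5], weights=[0.4, -0.8], bias=0.0)
print(a_j, x_j)
```

With these values the two weighted inputs cancel, so the activation is 0 and the sigmoid returns its midpoint value of 0.5.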
The weights play an important role in the propagation of the signal through the network. They establish a link between an input pattern and its associated output pattern; that is, they contain the knowledge of the neural network about the problem-solution relation.
The forward-propagation step begins with the presentation of an input pattern to the input layer, and continues as activation-level calculations propagate forward through the hidden layer(s) to the output layer. In each successive layer, every neuron sums its inputs and then applies a transfer function to compute its output. The output layer of the network then produces the final response, that is, the estimate of the target value.
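The forward-propagation step can be sketched as a loop over layers. The layer representation used here (a pair of weight rows and biases) and the 2-2-1 network with its weights are assumptions made for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    """Propagate the input vector x through a list of layers.

    Each layer is a (weights, biases) pair; weights[j][i] is the
    connection strength from unit i of the previous layer to unit j.
    Returns the outputs of every layer, starting with the input itself.
    """
    outputs = [list(x)]
    for weights, biases in layers:
        prev = outputs[-1]
        outputs.append([
            sigmoid(sum(w * v for w, v in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return outputs

# Hypothetical 2-2-1 network with arbitrary illustrative weights.
hidden = ([[0.5, -0.5], [0.3, 0.8]], [0.0, -0.1])
output = ([[1.0, -1.0]], [0.2])
activations = forward([1.0, 0.0], [hidden, output])
response = activations[-1]  # final response of the output layer
```

Keeping the outputs of every layer, rather than only the final response, is a design choice that makes the values available later for the backward-propagation step.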
The backward-propagation step begins with the comparison of the network's output pattern to the target value, at which point the difference (or error $\delta$) is calculated. This parameter is used during the weight-correction procedure.
If the output layer is denoted by k, then its error value is

$$\delta_k = (t_k - x_k)\, f'(a_k)$$

where $t_k$ is the target value of unit k, $x_k$ is the output value for unit k, $f'$ is the derivative of the sigmoid function, $a_k$ is the weighted sum of the inputs to k, and the quantity $(t_k - x_k)$ reflects the amount of error. The $f'$ factor forces a stronger correction when the sum $a_k$ is near the rapid rise in the sigmoid curve.

For the hidden layer (j), the error value is computed as

$$\delta_j = f'(a_j) \sum_k \delta_k w_{kj}$$

where the sum runs over the units k that receive input from unit j.
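The error values for the output and hidden layers can be sketched as follows. The sketch assumes the logistic sigmoid, whose derivative can be written in terms of its own output as $f'(a) = x(1 - x)$, and uses the standard backpropagation form for the hidden layer; all function names and numbers are illustrative.

```python
def sigmoid_prime_from_output(x):
    """Derivative of the logistic sigmoid via its output: f'(a) = x * (1 - x)."""
    return x * (1.0 - x)

def output_deltas(targets, outputs):
    """delta_k = (t_k - x_k) * f'(a_k) for each output unit k."""
    return [(t - x) * sigmoid_prime_from_output(x)
            for t, x in zip(targets, outputs)]

def hidden_deltas(hidden_outputs, next_weights, next_deltas):
    """delta_j = f'(a_j) * sum_k delta_k * w_kj for each hidden unit j.

    next_weights[k][j] is the weight from hidden unit j to unit k.
    """
    deltas = []
    for j, x_j in enumerate(hidden_outputs):
        back = sum(d_k * next_weights[k][j] for k, d_k in enumerate(next_deltas))
        deltas.append(sigmoid_prime_from_output(x_j) * back)
    return deltas

# Hypothetical values: one output unit fed by two hidden units.
d_out = output_deltas(targets=[1.0], outputs=[0.5])
d_hid = hidden_deltas([0.5, 0.5], next_weights=[[2.0, -4.0]], next_deltas=d_out)
print(d_out, d_hid)
```

In this example the hidden unit connected through the negative weight receives an error value of the opposite sign, which shows how the output error is distributed backward in proportion to the connection weights.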
The connection weights are adjusted using the $\delta$ values of the processing units. Each weight is adjusted by taking into account the $\delta$ value of the unit that receives input from that interconnection. The connection weight adjustment is done as follows:

$$\Delta w_{kj} = \eta\, \delta_k x_j$$

The adjustment of the weight $w_{kj}$, which goes to unit k from unit j, depends on three factors: $\delta_k$ (the error value of the target unit), $x_j$ (the output value of the originating unit), and $\eta$. This weight-adjustment equation is known as the generalized $\delta$ rule. $\eta$ is the learning rate, commonly between 0 and 1, which is chosen by the user and reflects the rate of learning of the network. A very large value of $\eta$ can lead to instability in the network and unsatisfactory learning, whereas a very small value can lead to excessively slow learning.
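A minimal sketch of this weight adjustment for one layer is given below. The helper names and the numeric values are hypothetical and serve only to show how the three factors combine.

```python
def weight_updates(deltas, inputs, eta):
    """Update eta * delta_k * x_j for every connection from unit j to unit k."""
    return [[eta * d_k * x_j for x_j in inputs] for d_k in deltas]

def apply_updates(weights, updates):
    """Add each update to the corresponding weight and return the new weights."""
    return [[w + dw for w, dw in zip(w_row, dw_row)]
            for w_row, dw_row in zip(weights, updates)]

# Hypothetical values: one receiving unit k, two originating units j.
eta = 0.5  # learning rate chosen by the user
dw = weight_updates(deltas=[0.125], inputs=[1.0, 0.5], eta=eta)
new_w = apply_updates([[0.25, -0.5]], dw)
print(dw, new_w)
```

The connection from the originating unit with the larger output receives the larger correction, since the update is proportional to the output of the originating unit.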
The backpropagation algorithm performs gradient descent on the error surface by modifying each weight in proportion to the gradient of the surface at its location. Gradient descent can sometimes cause a network to get stuck in a depression in the error surface, should such a depression exist. These depressions are called 'local minima', and each corresponds to a partial solution for the network in response to the training data. Ideally, we seek the global minimum (the lowest error value possible); however, a local minimum is surrounded by regions of higher error, and the standard backpropagation algorithm usually cannot leave it. Special techniques should be used to get out of a local minimum: changing the learning parameter or the number of hidden units, but most notably using a momentum term ($\alpha$) in the algorithm. The momentum term is generally chosen between 0 and 1. Taking this term into account, the formula for the modification of the weights at epoch t + 1 is given by

$$\Delta w_{kj}(t+1) = \eta\, \delta_k x_j + \alpha\, \Delta w_{kj}(t)$$

where $\eta$ is the learning rate and $\Delta w_{kj}(t)$ is the update applied at epoch t.
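The effect of the momentum term can be sketched for a single weight. In the fragment below, `gradient_term` stands for the product of the error value of the receiving unit and the output of the originating unit; the function name and the constant gradient term are assumptions for illustration.

```python
def momentum_step(weight, gradient_term, previous_update, eta, alpha):
    """One weight modification with momentum.

    update(t+1) = eta * gradient_term + alpha * update(t)
    Returns the new weight and the update that was applied.
    """
    update = eta * gradient_term + alpha * previous_update
    return weight + update, update

# Hypothetical run: the same gradient term is seen three epochs in a row.
weight, update, updates = 0.0, 0.0, []
for _ in range(3):
    weight, update = momentum_step(weight, gradient_term=1.0,
                                   previous_update=update, eta=0.5, alpha=0.5)
    updates.append(update)
print(updates, weight)
```

When successive gradient terms point in the same direction, the updates grow from one epoch to the next, which is what can help the weights move through shallow depressions in the error surface.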
The learning rate ($\eta$) and the momentum term ($\alpha$) play an important role in the backpropagation algorithm. If these parameters are poorly chosen, the network can oscillate or, more seriously, get stuck in a local minimum.
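Putting the pieces together, the procedure in Box 1 might be sketched as below. This is a simplified illustration under stated assumptions, not a reference implementation: it uses one hidden layer, treats the whole training set as a single epoch (so steps 8 and 10 coincide), stores each bias as an extra weight fed by a unit fixed at 1, and trains on a hypothetical toy data set (the logical OR function). The default values of `eta`, `alpha`, `n_hidden`, `max_iter` and `tol` are arbitrary choices.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def init_layer(n_out, n_in, rng):
    # Step 1: small random weights; the last entry of each row is the bias.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def layer_out(weights, inputs):
    ext = inputs + [1.0]  # bias unit with fixed value 1
    return [sigmoid(sum(w * v for w, v in zip(row, ext))) for row in weights]

def total_error(w_h, w_o, data):
    err = 0.0
    for x, t in data:
        z = layer_out(w_o, layer_out(w_h, x))
        err += 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))
    return err

def train(data, n_hidden=3, eta=0.5, alpha=0.5, max_iter=2000, tol=0.01, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    w_h = init_layer(n_hidden, n_in, rng)
    w_o = init_layer(n_out, n_hidden, rng)
    prev_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    prev_o = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
    history = [total_error(w_h, w_o, data)]
    for _ in range(max_iter):
        acc_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # accumulated updates
        acc_o = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
        for x, t in data:                       # steps 2-8
            h = layer_out(w_h, x)               # steps 3-4: forward propagation
            z = layer_out(w_o, h)
            # Step 5: error values of the output and hidden units.
            d_o = [(tk - zk) * zk * (1 - zk) for tk, zk in zip(t, z)]
            d_h = [hj * (1 - hj) * sum(d_o[k] * w_o[k][j] for k in range(n_out))
                   for j, hj in enumerate(h)]
            for k in range(n_out):              # steps 6-7: accumulate updates
                for j, v in enumerate(h + [1.0]):
                    acc_o[k][j] += eta * d_o[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    acc_h[j][i] += eta * d_h[j] * v
        # Step 9: adjust the weights, adding the momentum term.
        for w, acc, prev in ((w_o, acc_o, prev_o), (w_h, acc_h, prev_h)):
            for r in range(len(w)):
                for c in range(len(w[r])):
                    prev[r][c] = acc[r][c] + alpha * prev[r][c]
                    w[r][c] += prev[r][c]
        history.append(total_error(w_h, w_o, data))
        if history[-1] < tol:                   # step 11: stopping criterion
            break
    return w_h, w_o, history

# Hypothetical toy data set: the logical OR function.
or_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
w_hidden, w_output, history = train(or_data)
print(history[0], history[-1])
```

The returned `history` list holds the total error before training and after each iteration, so it can be inspected to see whether a given pair of learning rate and momentum values leads to steady learning or to oscillation.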
Testing the network
Typically, an application of backpropagation requires both a training set and a test set. The first is used to train the network, and the second is used to assess the performance of the network after training is complete. In the testing phase, the input patterns are fed to the network and the desired output patterns are compared with those given by the neural network. The agreement or disagreement between these two sets of outputs gives an indication of the performance of the neural network model. The trained network should also be validated with a third, completely independent data set.
If enough examples are available, the data may be divided randomly into two parts, the training and test sets. The proportion between these two sets may be 1:1, 2:1, 3:1, etc. However, the training set still has to be large enough to be representative of the problem, and the test set has to be large enough to allow correct validation of the network. This procedure of partitioning the data is known as the holdout procedure. If there are not enough examples available to permit splitting the data set into representative training and test sets, other strategies may be used, such as k-fold cross-validation.
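Both partitioning strategies can be sketched in a few lines. The function names, the default 2:1 proportion, and the integer "samples" are assumptions for illustration; in k-fold cross-validation as sketched here, the data are cut into k parts and each part serves once as the test set.

```python
import random

def holdout_split(samples, train_fraction=2 / 3, seed=0):
    """Randomly divide samples into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def k_fold_splits(samples, k):
    """Yield k (train, test) pairs; every sample is tested exactly once."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

# Hypothetical data set of 30 examples, represented by their indices.
samples = list(range(30))
train_set, test_set = holdout_split(samples)      # 2:1 proportion
folds = list(k_fold_splits(samples, k=5))
print(len(train_set), len(test_set), len(folds))
```

Fixing the random seed makes the partition reproducible, which is convenient when several network configurations are to be compared on the same split.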
If a network is overfitted (or overtrained), it memorizes the details of the data. In such a case, the network does not learn the general features inherently present in the training data, but instead learns more and more of the specific details of the training data set. Thus the network loses its capacity to generalize. Many researchers have developed rules for approximately determining the network parameters required to avoid overtraining. Two parameters are responsible for this phenomenon: the number of epochs, and the number of hidden layers together with the number of neurons in each of them. Determining appropriate values for these parameters is the most crucial matter in MLP modeling. Previously, the optimum number of epochs, hidden layers, or hidden nodes was determined by trial and error using training and test sets of data. A typical graph of training and generalization errors versus the number of parameters is shown in Figure 4. Both errors initially decrease rapidly as the parameter complexity increases. While the error on the training set keeps decreasing, the error on the test set can start to increase after reaching a minimum; that is, the model is no longer able to generalize. The training procedure must be stopped when the error on the test set is lowest, that is, in the zone corresponding to the best compromise between bias and variance.
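The stopping criterion described above can be sketched as a search for the minimum of the test-error curve. The error values below are hypothetical numbers shaped like the curves described for Figure 4, not measured results, and the function name is an assumption.

```python
def best_stopping_point(test_errors):
    """Return the 0-based index at which the test error is lowest."""
    return min(range(len(test_errors)), key=test_errors.__getitem__)

# Hypothetical error curves: the training error keeps falling, while the
# test error falls, reaches a minimum, and then rises again.
train_errors = [0.90, 0.60, 0.40, 0.30, 0.22, 0.17, 0.13, 0.10]
test_errors = [0.95, 0.70, 0.52, 0.45, 0.43, 0.47, 0.55, 0.66]

stop = best_stopping_point(test_errors)
print(stop, train_errors[stop], test_errors[stop])
```

The same selection rule applies whether the horizontal axis is the number of epochs or the number of hidden units: the configuration retained is the one with the lowest test error, even though the training error continues to fall beyond it.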