Backpropagation algorithm principle and simple implementation

Posted May 29, 2020 · 8 min read

Contents: BP neural network

BP network principle and examples

The main idea:

  • Solve problems that the Adaline network cannot handle: linearly inseparable problems such as XOR.

Nature:

  • By using multiple layers and multiple nodes, the network maps a problem that is linearly inseparable in a low-dimensional space into a higher-dimensional space, where it is very likely to become linearly separable.

Main difficulties:

  • In the Adaline network, adjusting the connection weights by gradient descent is straightforward: the error is simply the difference between the computed output and the desired output of the sample, and the gradient follows directly from that error. In a BP network, however, there are multiple hidden layers, and the error of an intermediate layer is not directly observable, so its local gradient cannot be computed directly.

Solution:

  • If the weights are adjusted by ordinary gradient descent, the update rule is

    $$\Delta\omega = \eta\,\delta_j(n)\,y_j(n) \\ \Delta\omega: \text{weight adjustment} \\ \eta: \text{learning rate} \\ y_j(n): \text{output signal of neuron } j \\ \delta_j(n): \text{local gradient of neuron } j$$

  • The key to the problem lies in computing the local gradient.

  • At the output layer

    • $$\delta_j(n) = e_j(n)\,\varphi'(v_j(n)) \\ e: \text{error} \\ \varphi: \text{activation function} \\ v_j(n): \text{induced local field of neuron } j \text{ (the pre-activation value, not the output signal)}$$
  • For a hidden layer, it can be computed via the chain rule of differentiation:

    • $$\delta_j(n) = \varphi'(v_j(n))\sum_k \delta_k(n)\,w_{kj}(n)$$
    • Note that $k$ ranges over the nodes in the layer immediately following node $j$'s layer.
  • With the weight-adjustment problem solved, training a BP network splits into two stages (a small NumPy sketch of both follows this list):

    • Forward stage: compute the network's output for the sample.
    • Backward stage: adjust each node's weights according to the output-layer error.
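
To make the update rule concrete, here is a minimal NumPy sketch of one forward and one backward pass for a tiny 2-2-1 network. It assumes a sigmoid activation φ(v) = 1/(1+e^{-v}), so that φ'(v) = φ(v)(1-φ(v)); all variable names and values here are illustrative, not taken from the article.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2])        # input sample
d = 1.0                          # desired output
W_h = rng.normal(size=(2, 2))    # input -> hidden weights
W_o = rng.normal(size=(2, 1))    # hidden -> output weights
eta = 0.3                        # learning rate

# Forward pass
v_h = x @ W_h                    # hidden pre-activations
y_h = sigmoid(v_h)               # hidden outputs
v_o = y_h @ W_o                  # output pre-activation
y_o = sigmoid(v_o)               # network output

# Backward pass
e = d - y_o                                  # output error
delta_o = e * y_o * (1 - y_o)                # δ_j = e_j φ'(v_j) at the output layer
delta_h = y_h * (1 - y_h) * (W_o @ delta_o)  # δ_j = φ'(v_j) Σ_k δ_k w_kj in the hidden layer

# Weight updates Δω = η δ_j(n) y_j(n)
W_o += eta * np.outer(y_h, delta_o)
W_h += eta * np.outer(x, delta_h)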

Examples

  • Suppose we are working with the following network:

Network Signal.png

    • Briefly: the network has three input nodes (one of them the bias), two hidden layers of three nodes each (one node in each being the bias), and one output node.
  • The parameters below describe the inputs and the weights of each layer:

    $$X = \left[\begin{matrix} x_{0}\\ x_{1}\\ x_{2} \end{matrix}\right] \quad W^0 = \left[\begin{matrix} w^0_{10} & w^0_{20}\\ w^0_{11} & w^0_{21}\\ w^0_{12} & w^0_{22} \end{matrix}\right] \quad W^1 = \left[\begin{matrix} w^1_{10} & w^1_{20}\\ w^1_{11} & w^1_{21}\\ w^1_{12} & w^1_{22} \end{matrix}\right] \quad W^2 = \left[\begin{matrix} w^2_{10}\\ w^2_{11}\\ w^2_{12} \end{matrix}\right]$$

  • Forward process

    $$X^T W^0 = V^0\big|_{2\times1} \quad V^0 = \left[Y^0_0, V^0_1, V^0_2\right]^T \quad Y^0\big|_{3\times1} = \varphi(V^0) \quad (Y^0)^T W^1 = V^1\big|_{2\times1}$$

    $$V^1 = \left[Y^1_0, V^1_1, V^1_2\right]^T \quad Y^1\big|_{3\times1} = \varphi(V^1) \quad (Y^1)^T W^2 = V^2\big|_{1\times1} \quad Y^2\big|_{1\times1} = \varphi(V^2)$$

    • For $V^0$ (and likewise $V^1$), a bias term $Y^0_0$ is prepended so that the vector forms the input of the purple layer.
    • The computed output for the sample is $Y^2$, which is the overall output of the network.
  • Backward process

    • Output layer (the weights feeding the orange layer)

$$\delta^2\big|_{1\times1} = e\,\varphi'(v^2) \quad \Delta\omega^2\big|_{3\times1} = \eta\,\delta^2\,y^1(n) \quad w^2 = w^2 + \Delta w^2$$

  • The second hidden layer (purple layer)

    $$\delta^1\big|_{2\times1} = \varphi'(v^1)\sum_k \delta^2_k(n)\,w^2_{kj}(n) = \left[\begin{matrix} \varphi'(v^1_1)\,\delta^2\,w^2_{11}\\ \varphi'(v^1_2)\,\delta^2\,w^2_{12} \end{matrix}\right]$$

    $$\Delta\omega^1\big|_{3\times2} = \eta\,y^0(n)\big|_{3\times1}\,(\delta^1)^T\big|_{1\times2} \quad w^1 = w^1 + \Delta w^1$$

    • Because the bias node receives no error signal of its own, the weight attached to the bias is left out when computing this local gradient.
  • Input layer (green layer)

    $$\delta^0\big|_{2\times1} = \varphi'(v^0)\sum_k \delta^1_k(n)\,w^1_{kj}(n) = \left[\begin{matrix} \varphi'(v^0_1)\left(\delta^1_1 w^1_{11} + \delta^1_2 w^1_{21}\right)\\ \varphi'(v^0_2)\left(\delta^1_1 w^1_{12} + \delta^1_2 w^1_{22}\right) \end{matrix}\right]$$

    $$\Delta\omega^0\big|_{3\times2} = \eta\,X\big|_{3\times1}\,(\delta^0)^T\big|_{1\times2} \quad w^0 = w^0 + \Delta w^0$$

  • For a single group of inputs, this completes one round of weight adjustment. A small shape check of this example network follows below.
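
As a quick sanity check of the matrix dimensions used above, here is a small NumPy sketch of the forward pass through this 3-input, two-hidden-layer, one-output network. The weights are random placeholders and the bias is simply prepended to the activations; this is only a shape check, not the article's code.

import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
rng = np.random.default_rng(1)

X = np.array([1.0, 0.7, -0.3])   # x_0 is the bias input, fixed at 1
W0 = rng.normal(size=(3, 2))     # input -> first hidden layer (green),    3x2
W1 = rng.normal(size=(3, 2))     # first -> second hidden layer (purple),  3x2
W2 = rng.normal(size=(3, 1))     # second hidden layer -> output (orange), 3x1

V0 = X @ W0                                 # (2,)  pre-activations of the green layer
Y0 = np.concatenate(([1.0], sigmoid(V0)))   # (3,)  bias prepended -> input to the purple layer
V1 = Y0 @ W1                                # (2,)
Y1 = np.concatenate(([1.0], sigmoid(V1)))   # (3,)
V2 = Y1 @ W2                                # (1,)
Y2 = sigmoid(V2)                            # (1,)  overall network output

print(V0.shape, Y0.shape, V1.shape, Y1.shape, V2.shape, Y2.shape)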

BP network programming implementation

Network structure and data

  • The data set is the classic double-moon (half-moon) data; the input layer has three nodes, one of which is a bias fixed at 1.
  • The first hidden layer (the layer right after the input layer) has 20 nodes, and the second hidden layer has 10 nodes.
  • There is one output node.

Workflow:

from graphviz import Digraph
dot = Digraph(comment='The Round Table')
dot.node('A', 'Generate data')
dot.node('B', 'Parameter determination / weight determination')
dot.node('C', 'Network construction / parameter adjustment')
dot.node('D', 'Decision plane drawing')

dot.edges(['AB', 'BC', 'CD'])
dot

Snipaste_2020-05-29_16-49-39.png (workflow diagram)

Experimental steps

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random
import math

# Data generation
"""
Finally generates the train_data variable that stores the data
"""


def halfmoon(rad, width, d, n_samp):
    '''Generate double-moon (half-moon) data
    @param rad: radius
    @param width: width
    @param d: vertical offset between the two moons
    @param n_samp: number of samples
    '''
    if n_samp % 2 != 0:  # make sure the number of samples is even
        n_samp += 1

    data = np.zeros((3, n_samp))
    # 3 rows (x, y, label) and n_samp columns
    aa = np.random.random((2, int(n_samp/2)))
    radius = (rad - width/2) + width * aa[0, :]
    theta = np.pi * aa[1, :]

    x = radius * np.cos(theta)
    y = radius * np.sin(theta)
    label = np.ones(len(x))    # label for Class 1

    x1 = radius * np.cos(-theta) + rad  # shift rad units to the right
    y1 = radius * np.sin(-theta) + d    # take the negative of y and shift by d
    label1 = np.zeros(len(x1))          # label for Class 2

    data[0, :] = np.concatenate([x, x1])
    data[1, :] = np.concatenate([y, y1])
    data[2, :] = np.concatenate([label, label1])
    # merge the two half-moons into one data matrix
    return data


dataNum = 1000
data = halfmoon(10, 5, 5, dataNum)
pos_data = data[:, 0:int(dataNum/2)]
neg_data = data[:, int(dataNum/2):dataNum]
plt.figure()
plt.scatter(pos_data[0, :], pos_data[1, :], c="b", s=10)
plt.scatter(neg_data[0, :], neg_data[1, :], c="r", s=10)
plt.show()

train_data = []
test_data = []                 # reserved, not used below
tmp = list(range(1000))        # sample indices 0..999
random.shuffle(tmp)            # shuffle the sample order
for i in range(len(tmp)):
    train_data.append(data[:, tmp[i]])
train_data = np.array(train_data).T   # shape (3, 1000)
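
The manual index shuffle above can also be written more compactly with NumPy; the snippet below is an equivalent alternative, not the original code.

perm = np.random.permutation(dataNum)   # a random permutation of 0..999
train_data_alt = data[:, perm]          # same shape (3, 1000) as train_data above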

Determine the parameters that should be used

  • Including the number of nodes in each of the two hidden layers, the number of nodes in the input layer, and the learning rate.

  • Generate random weights for each layer

    Parameter determination

    inp_num = 3 # Number of input layer nodes
    out_num = 1 # Number of nodes in the output layer
    hid_num_1 = 20 # The first hidden layer has 20 nodes
    hid_num_2 = 10 # The second hidden layer has 10 nodes
    w1 = 0.2 * np.random.random((inp_num, hid_num_1))    # weights: input layer -> first hidden layer
    w2 = 0.2 * np.random.random((hid_num_1, hid_num_2))  # weights: first hidden layer -> second hidden layer
    w3 = 0.2 * np.random.random((hid_num_2, out_num))    # weights: second hidden layer -> output layer
    inp_lrate = 0.3  # learning rate for the input-layer weights
    hid_lrate = 0.3  # learning rate for the hidden-layer weights
    out_lrate = 0.3  # learning rate for the output-layer weights
    a = 1  # sigmoid slope parameter

Definition of activation function and error function

def get_act(a, x):  # sigmoid activation function, applied element by element
    act_vec = []
    for i in x:
        act_vec.append(1 / (1 + math.exp(-1 * a * i)))
    act_vec = np.array(act_vec)
    return act_vec


def get_err(e):  # error function: half the squared error
    return 0.5 * np.dot(e, e)
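
For reference, the same sigmoid can be written in a vectorized form without the explicit Python loop; this is an optional alternative I am sketching, not part of the original experiment.

def get_act_vec(a, x):  # vectorized sigmoid: accepts scalars or NumPy arrays
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-a * x))


def get_act_deriv(a, y):  # sigmoid derivative via its output y = φ(v): φ'(v) = a·y·(1 - y)
    return a * y * (1.0 - y)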

Learning process

'''
Loop over the 1000 generated samples until the error function falls below the set threshold
'''
err_store = []
have = True
while have:
    e_store = []
    for count in range(0, 1000):
        # Forward process
        tmp = []
        tmp.append(train_data[0][count])
        tmp.append(train_data[1][count])
        tmp.append(1)
        # three inputs: x, y and the bias
        hid_value_1 = np.dot(np.array(tmp), w1)   # hid_value_1 = v1
        hid_act_1 = get_act(a, hid_value_1)       # hid_act_1 = y1
        hid_value_2 = np.dot(hid_act_1, w2)       # hid_value_2 = v2
        hid_act_2 = get_act(a, hid_value_2)       # hid_act_2 = y2
        hid_value_3 = np.dot(hid_act_2, w3)       # pre-activation of the output layer
        out_act = get_act(a, hid_value_3)         # output layer activation / final result

        # Backward process
        e = train_data[2][count] - out_act        # error calculation
        e_store.append(get_err(e))                # store the error function value
        out_delta = a * e * out_act * (1 - out_act)    # local gradient of the output layer
        hid_delta_2 = a * hid_act_2 * \
            (1 - hid_act_2) * np.dot(w3, out_delta)    # local gradient of the second hidden layer
        hid_delta_1 = a * hid_act_1 * \
            (1 - hid_act_1) * np.dot(w2, hid_delta_2)  # local gradient of the first hidden layer
        for i in range(0, out_num):
            w3[:, i] += out_lrate * out_delta * hid_act_2
        for i in range(0, hid_num_2):
            w2[:, i] += hid_lrate * hid_delta_2[i] * hid_act_1
        for i in range(0, hid_num_1):
            w1[:, i] += inp_lrate * hid_delta_1[i] * np.array(tmp)
    err_sum = 0
    for item in e_store:
        err_sum += item
    err_store.append(err_sum)
    if err_sum < 0.1:  # 0.1 is the chosen error threshold
        have = False

Error curve drawing

x = []
for item in range(len(err_store)):
    x.append(item)
plt.figure()
plt.scatter(x, err_store)


output_14_1.png

  • In the first rounds the error is very large. It falls fastest at around 10 rounds, although it rebounds briefly at around rounds 80 and 120.

Decision surface drawing

x = []
y = []
for i in range(-15, 25):
    x.append(i / 1)
for i in range(-8, 13):
    y.append(i / 1)

green_x = []
green_y = []
red_x = []
red_y = []
for i in x:
    for j in y:
        tmp = [i, j, 1]
        hid_value_1 = np.dot(np.array(tmp), w1)   # hid_value_1 = v1
        hid_act_1 = get_act(a, hid_value_1)       # hid_act_1 = y1
        hid_value_2 = np.dot(hid_act_1, w2)       # hid_value_2 = v2
        hid_act_2 = get_act(a, hid_value_2)       # hid_act_2 = y2
        hid_value_3 = np.dot(hid_act_2, w3)       # pre-activation of the output layer
        out_act = get_act(a, hid_value_3)         # output layer activation / final result
        if out_act[0] > 0.5:
            green_x.append(i)
            green_y.append(j)
        else:
            red_x.append(i)
            red_y.append(j)


plt.figure()
plt.scatter(pos_data[0, :], pos_data[1, :], c="b", s=10)
plt.scatter(neg_data[0, :], neg_data[1, :], c="b", s=10)
plt.scatter(green_x, green_y, c="g", s=10)
plt.scatter(red_x, red_y, c="r", s=10)
plt.show()

output_17_0.png

  • As the decision surface shows, the classification is fairly clear-cut. (A vectorized way to sample the surface more densely is sketched below.)
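
As an optional refinement, the grid evaluation above can be vectorized with np.meshgrid and drawn as filled contours. The sketch below reuses the trained weights w1, w2, w3 and the slope parameter a from the code above; the grid bounds and resolution are chosen by eye rather than taken from the article.

xx, yy = np.meshgrid(np.linspace(-15, 25, 200), np.linspace(-8, 13, 200))
grid = np.column_stack([xx.ravel(), yy.ravel(), np.ones(xx.size)])  # append the bias input

h1 = 1.0 / (1.0 + np.exp(-a * (grid @ w1)))   # first hidden layer activations
h2 = 1.0 / (1.0 + np.exp(-a * (h1 @ w2)))     # second hidden layer activations
out = 1.0 / (1.0 + np.exp(-a * (h2 @ w3)))    # network output for every grid point

plt.figure()
plt.contourf(xx, yy, out.reshape(xx.shape), levels=[0, 0.5, 1], colors=["r", "g"], alpha=0.3)
plt.scatter(pos_data[0, :], pos_data[1, :], c="b", s=10)
plt.scatter(neg_data[0, :], neg_data[1, :], c="b", s=10)
plt.show()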

Experimental summary

Problems encountered and solutions

  • When implementing the backward pass, my unclear understanding of the algorithm showed up on the surface as matrix dimensions that did not match; the deeper cause was that I had not fully understood the formulas.

    • For this kind of network the foundation really is the mathematics: derive the formulas rigorously before programming, and you will make far fewer mistakes in the code.
  • My grasp of the programming language itself also needs strengthening: read more documentation and experiment more.

  • Before adding a stop condition for online learning, I tried training for a fixed number of rounds and observed some interesting behavior (see the figure below and the sketch after this list).

  • 1.png

    • I think this happens because no proper stopping rule was set, so learning was left incomplete.
    • After running into this, I added a stopping rule, which reduced the error.
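
A simple way to combine the two ideas, an error threshold and a cap on the number of rounds, is sketched below. Here train_one_epoch is a hypothetical helper standing in for one pass of the training loop above, and max_epochs is an illustrative value.

max_epochs = 500                   # illustrative cap on the number of rounds
for epoch in range(max_epochs):
    err_sum = train_one_epoch()    # hypothetical helper: one pass over all 1000 samples
    err_store.append(err_sum)
    if err_sum < 0.1:              # same error threshold as in the training loop above
        break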

Experience

  • A BP neural network can perform non-linear classification, but it needs many iterations to learn. Even with a small data set of only 1000 points it took nearly 150 rounds; a larger data set could require a much longer training cycle.
  • The stopping rules of neural network learning are still very important.

References

Code: the core steps refer to the following repository:
github