# Backpropagation algorithm principle and simple implementation

Posted May 29, 2020 · 8 min read

# Content: BP neural network

## BP network principle and examples

### The main idea:

• Solve problems that the Adaline network cannot handle because they are not linearly separable, such as XOR

### Nature:

• By passing the data through a multi-layer network with many nodes, the original low-dimensional, linearly inseparable problem is mapped into a higher-dimensional space, where it very likely becomes linearly separable

### Main difficulties:

• In the Adaline network, adjusting the connection weights by gradient descent is relatively simple: compare the computed output with the desired output of the sample to obtain the error, and computing the gradient from that error is also straightforward. In a BP network, however, there are multiple hidden layers, and the error of the intermediate layers is not directly observable, so the local gradient cannot be computed directly

### Solution:

• If the weights are adjusted by ordinary gradient descent, the formula is

$$\Delta \omega = \eta \, \delta_j(n) \, y_j(n)$$

$\Delta \omega$: the weight adjustment; $\eta$: the learning rate; $y_j(n)$: the output signal of neuron $j$; $\delta_j(n)$: the local gradient
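Plugging in made-up numbers shows how the update is just the product of the three factors (the values below are purely illustrative, not taken from the article):

```python
# Purely illustrative values for a single weight update
eta = 0.3      # learning rate
delta_j = 0.5  # local gradient of neuron j
y_j = 0.8      # output signal of the neuron feeding this weight
delta_w = eta * delta_j * y_j  # adjustment applied to the weight
```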

• The key to the problem lies in solving the local gradient

• At the output layer

$$\delta_j(n) = e_j(n) \, \varphi'(v_j(n))$$

$e$: the error; $\varphi$: the activation function; $v_j(n)$: the induced local field of neuron $j$ (its pre-activation value)
• In the hidden layers, it can be computed using the chain rule of differentiation

$$\delta_j(n) = \varphi'(v_j(n)) \sum_k \delta_k(n) \, w_{kj}(n)$$
• Note that $k$ ranges over the nodes of the layer after $j$'s layer.
• With the weight-adjustment problem solved, BP training can be divided into two phases:

• Forward phase: compute the output of the network for the sample
• Backward phase: adjust the weights of each node according to the error at the output layer.
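The two local-gradient formulas can be sketched directly in NumPy. This assumes the logistic sigmoid as $\varphi$ (so $\varphi'(v) = \varphi(v)(1-\varphi(v))$) and uses made-up values for the errors and weights:

```python
import numpy as np

def phi(v):        # logistic sigmoid, assumed as the activation function
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):  # its derivative: phi(v) * (1 - phi(v))
    s = phi(v)
    return s * (1.0 - s)

# Output layer: delta_j(n) = e_j(n) * phi'(v_j(n))
v_out = np.array([0.4])           # made-up induced field of the output node
e = np.array([0.25])              # made-up error
delta_out = e * phi_prime(v_out)

# Hidden layer: delta_j(n) = phi'(v_j(n)) * sum_k delta_k(n) * w_kj(n)
v_hid = np.array([0.1, -0.3])     # induced fields of two hidden nodes
w_next = np.array([[0.5, -0.2]])  # w[k, j]: weight from hidden node j to output node k
delta_hid = phi_prime(v_hid) * (delta_out @ w_next)  # one local gradient per hidden node
```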

### Examples

• Consider a network with three input nodes (one of them the bias), two three-node hidden layers (each including a bias node), and one output node.
• The parameters: the input and the weights of each layer are

$$X = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \qquad W^0 = \begin{bmatrix} w^0_{10} & w^0_{20} \\ w^0_{11} & w^0_{21} \\ w^0_{12} & w^0_{22} \end{bmatrix} \qquad W^1 = \begin{bmatrix} w^1_{10} & w^1_{20} \\ w^1_{11} & w^1_{21} \\ w^1_{12} & w^1_{22} \end{bmatrix} \qquad W^2 = \begin{bmatrix} w^2_{10} \\ w^2_{11} \\ w^2_{12} \end{bmatrix}$$

• Forward process

$$X^T W^0 = V^0\big|_{2\times 1} \qquad V^0 = \begin{bmatrix} y^0_0, v^0_1, v^0_2 \end{bmatrix}^T \qquad Y^0\big|_{3\times 1} = \varphi(V^0) \qquad (Y^0)^T W^1 = V^1\big|_{2\times 1}$$

$$V^1 = \begin{bmatrix} y^1_0, v^1_1, v^1_2 \end{bmatrix}^T \qquad Y^1\big|_{3\times 1} = \varphi(V^1) \qquad (Y^1)^T W^2 = V^2\big|_{1\times 1} \qquad Y^2\big|_{1\times 1} = \varphi(V^2)$$

• After computing $V^0$, the bias $y^0_0$ is prepended to form the input of the purple layer
• The computed output $Y^2$ is the total output of the network
• Backward process

• Output layer (the weights in front of the orange layer)

$$\delta^2\big|_{1\times 1} = e \, \varphi'(v^2) \qquad \Delta\omega^2\big|_{3\times 1} = \eta \, \delta^2 \, y^1(n) \qquad w^2 = w^2 + \Delta\omega^2$$

• The second hidden layer (the purple layer)

$$\delta^1\big|_{2\times 1} = \varphi'(v^1) \sum_k \delta^2_k(n) \, w^2_{kj}(n) = \varphi'(v^1) \circ \begin{bmatrix} \delta^2 \, w^2_{11} \\ \delta^2 \, w^2_{12} \end{bmatrix}$$

$$\Delta\omega^1\big|_{3\times 2} = \eta \, y^0(n)\big|_{3\times 1} \, \big(\delta^1\big)^T\big|_{1\times 2} \qquad w^1 = w^1 + \Delta\omega^1$$

• Because the bias term should not take part in the back-propagated sum, the weight attached to the bias is omitted from the formula above
• Input layer (the green layer)

$$\delta^0\big|_{2\times 1} = \varphi'(v^0) \sum_k \delta^1_k(n) \, w^1_{kj}(n) = \begin{bmatrix} \varphi'(v^0_1)\,(\delta^1_1 w^1_{11} + \delta^1_2 w^1_{21}) \\ \varphi'(v^0_2)\,(\delta^1_1 w^1_{12} + \delta^1_2 w^1_{22}) \end{bmatrix}\Bigg|_{2\times 1}$$

$$\Delta\omega^0\big|_{3\times 2} = \eta \, X\big|_{3\times 1} \, \big(\delta^0\big)^T\big|_{1\times 2} \qquad w^0 = w^0 + \Delta\omega^0$$

• This completes one weight adjustment for a single input sample
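The shapes in the walkthrough above can be verified with a short NumPy sketch; the weights and non-bias inputs are random placeholders, and $\varphi$ is assumed to be the logistic sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda v: 1.0 / (1.0 + np.exp(-v))  # assumed logistic activation

X = np.append(1.0, rng.random(2))  # bias x0 = 1 plus two inputs -> 3x1
W0 = rng.random((3, 2))            # placeholder weights, 3x2
W1 = rng.random((3, 2))            # placeholder weights, 3x2
W2 = rng.random((3, 1))            # placeholder weights, 3x1

V0 = X @ W0                        # 2x1
Y0 = np.append(1.0, phi(V0))       # prepend the bias y0_0 -> 3x1
V1 = Y0 @ W1                       # 2x1
Y1 = np.append(1.0, phi(V1))       # prepend the bias y1_0 -> 3x1
V2 = Y1 @ W2                       # 1x1
Y2 = phi(V2)                       # total output of the network
```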

## BP network programming implementation

### Network structure and data

• The data set is double-moon ("half-moon") data; there are three input nodes, one of which is a bias fixed to 1.
• The first hidden layer, the layer right after the input, has 20 nodes, and the second hidden layer has 10 nodes
• There is one output node

### Work process:

```python
from graphviz import Digraph

dot = Digraph(comment='The Round Table')
dot.node('A', 'Generate data')
dot.node('B', 'Parameter determination / weight determination')
dot.node('C', 'Learning process')  # this node was missing; label assumed from the section headings
dot.node('D', 'Decision plane drawing')

dot.edges(['AB', 'BC', 'CD'])
dot
```


### Experimental steps

```python
import numpy as np
import matplotlib.pyplot as plt
import random

# Data generation: the result is stored in the train_data variable

def halfmoon(rad, width, d, n_samp):
    '''Generate double-moon data
    @param rad:    radius of each moon
    @param width:  width of each moon
    @param d:      vertical distance between the two moons
    @param n_samp: number of samples
    '''
    if n_samp % 2 != 0:  # make sure the number of samples is even
        n_samp += 1

    data = np.zeros((3, n_samp))  # 3 rows (x, y, label), n_samp columns
    aa = np.random.random((2, int(n_samp / 2)))
    radius = (rad - width / 2) + width * aa[0, :]
    theta = np.pi * aa[1, :]

    x = radius * np.cos(theta)  # upper moon
    y = radius * np.sin(theta)
    label = np.ones(len(x))  # label for Class 1

    x1 = radius * np.cos(-theta) + rad  # moved rad units to the right
    y1 = radius * np.sin(-theta) - d    # lower moon, moved d units down
    label1 = np.zeros(len(x1))  # label for Class 2

    data[0, :] = np.concatenate([x, x1])  # merge the two classes
    data[1, :] = np.concatenate([y, y1])
    data[2, :] = np.concatenate([label, label1])
    return data

dataNum = 1000
data = halfmoon(10, 5, 5, dataNum)
pos_data = data[:, 0:int(dataNum / 2)]
neg_data = data[:, int(dataNum / 2):dataNum]
plt.figure()
plt.scatter(pos_data[0, :], pos_data[1, :], c="b", s=10)
plt.scatter(neg_data[0, :], neg_data[1, :], c="r", s=10)
plt.show()

# Shuffle the columns to build the training set
train_data = []
test_data = []  # declared but not used below
tmp = list(range(dataNum))
random.shuffle(tmp)
for i in tmp:
    train_data.append(data[:, i])
train_data = np.array(train_data).T
```
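As a side note, the index-shuffling loop at the end can be written more compactly with `np.random.permutation`. This is a standalone sketch on stand-in data, not the generated half-moon set:

```python
import numpy as np

rng = np.random.default_rng(0)
demo_data = rng.random((3, 1000))           # stand-in for the generated data matrix
perm = rng.permutation(demo_data.shape[1])  # random ordering of the 1000 column indices
demo_shuffled = demo_data[:, perm]          # shuffle the columns in one step
```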

#### Determine the parameters that should be used

• These include the number of nodes in each of the two hidden layers, the number of input-layer nodes, and the learning rates.

• Generate random initial weights for each layer

```python
# Parameter determination
inp_num = 3      # number of input layer nodes
out_num = 1      # number of output layer nodes
hid_num_1 = 20   # the first hidden layer has 20 nodes
hid_num_2 = 10   # the second hidden layer has 10 nodes
w1 = 0.2 * np.random.random((inp_num, hid_num_1))    # input -> first hidden layer weights
w2 = 0.2 * np.random.random((hid_num_1, hid_num_2))  # first -> second hidden layer weights
w3 = 0.2 * np.random.random((hid_num_2, out_num))    # second hidden layer -> output weights
inp_lrate = 0.3  # learning rate for the input layer weights
hid_lrate = 0.3  # learning rate for the hidden layer weights
out_lrate = 0.3  # learning rate for the output layer weights
a = 1            # sigmoid function parameter
```

#### Definition of activation function and error function

```python
import math

def get_act(a, x):  # activation function: logistic sigmoid applied element-wise
    act_vec = []
    for i in x:
        act_vec.append(1 / (1 + math.exp(-1 * a * i)))
    return np.array(act_vec)

def get_err(e):  # error function: half the squared error
    return 0.5 * np.dot(e, e)
```
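Since the inputs are NumPy arrays, the element-wise loop in `get_act` can also be written as one vectorized expression. Here `get_act_vec` is a hypothetical alternative helper, not part of the original code:

```python
import numpy as np

def get_act_vec(a, x):
    # Vectorized logistic activation, equivalent to looping math.exp over the elements
    return 1.0 / (1.0 + np.exp(-a * np.asarray(x, dtype=float)))
```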

#### learning process

```python
import math

'''
Cycle through the 1000 generated samples until the total error
of an epoch falls below the set threshold
'''
err_store = []
have = True
while have:
    e_store = []
    for count in range(0, 1000):
        # Forward process: three inputs (x, y, bias)
        tmp = [train_data[0][count], train_data[1][count], 1]
        hid_value_1 = np.dot(np.array(tmp), w1)  # hid_value_1 = v1
        hid_act_1 = get_act(a, hid_value_1)      # hid_act_1 = y1
        hid_value_2 = np.dot(hid_act_1, w2)      # hid_value_2 = v2
        hid_act_2 = get_act(a, hid_value_2)      # hid_act_2 = y2
        hid_value_3 = np.dot(hid_act_2, w3)      # induced field of the output layer
        out_act = get_act(a, hid_value_3)        # final output of the network

        # Backward process
        e = train_data[2][count] - out_act  # error against the sample label
        e_store.append(get_err(e))          # store the error function value
        out_delta = a * e * out_act * (1 - out_act)  # local gradient of the output layer
        hid_delta_2 = a * hid_act_2 * \
            (1 - hid_act_2) * np.dot(w3, out_delta)  # local gradient of the second hidden layer
        hid_delta_1 = a * hid_act_1 * \
            (1 - hid_act_1) * np.dot(w2, hid_delta_2)  # local gradient of the first hidden layer
        for i in range(0, out_num):
            w3[:, i] += out_lrate * out_delta * hid_act_2
        for i in range(0, hid_num_2):
            w2[:, i] += hid_lrate * hid_delta_2[i] * hid_act_1
        for i in range(0, hid_num_1):
            w1[:, i] += inp_lrate * hid_delta_1[i] * np.array(tmp)
    err_sum = sum(e_store)
    err_store.append(err_sum)
    if err_sum < 0.1:  # 0.1 is the chosen error threshold
        have = False
```
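A finite-difference gradient check is a handy way to catch the shape and formula mistakes that backward passes tend to produce. This is a standalone sketch on a tiny made-up 3-4-1 network, separate from the network trained above:

```python
import numpy as np

rng = np.random.default_rng(1)
sig = lambda v: 1.0 / (1.0 + np.exp(-v))

xs, target = rng.random(3), 1.0             # one input (with bias) and a made-up label
wa, wb = rng.random((3, 4)), rng.random(4)  # weights of a tiny 3-4-1 test network

def net_loss(wa, wb):
    y_hid = sig(xs @ wa)                # hidden layer output
    y_out = sig(y_hid @ wb)             # network output
    return 0.5 * (target - y_out) ** 2  # same half-squared-error function as get_err

# Analytic gradient of the loss w.r.t. wb, via the backprop formulas
y_hid = sig(xs @ wa)
y_out = sig(y_hid @ wb)
delta_out = -(target - y_out) * y_out * (1 - y_out)  # dE/dv at the output
grad_wb = delta_out * y_hid                          # dE/dwb

# Numeric gradient for the first component of wb (central difference)
eps = 1e-6
wb_plus, wb_minus = wb.copy(), wb.copy()
wb_plus[0] += eps
wb_minus[0] -= eps
num_grad = (net_loss(wa, wb_plus) - net_loss(wa, wb_minus)) / (2 * eps)
```

If the two gradients agree to roughly `eps**2`, the backward pass is consistent with the loss.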

#### Error surface drawing

```python
x = list(range(len(err_store)))
plt.figure()
plt.scatter(x, err_store)
plt.show()
```

• During the first epoch the error is very large. The decline is fastest around epoch 10, but there are rebounds around epochs 80 and 120.

#### Decision surface drawing

```python
x = [i for i in range(-15, 25)]
y = [j for j in range(-8, 13)]

green_x, green_y = [], []
red_x, red_y = [], []
for i in x:
    for j in y:
        # Forward pass over a grid point, using the trained weights
        tmp = [i, j, 1]
        hid_value_1 = np.dot(np.array(tmp), w1)  # hid_value_1 = v1
        hid_act_1 = get_act(a, hid_value_1)      # hid_act_1 = y1
        hid_value_2 = np.dot(hid_act_1, w2)      # hid_value_2 = v2
        hid_act_2 = get_act(a, hid_value_2)      # hid_act_2 = y2
        hid_value_3 = np.dot(hid_act_2, w3)      # induced field of the output layer
        out_act = get_act(a, hid_value_3)        # final output of the network
        if out_act > 0.5:
            green_x.append(i)
            green_y.append(j)
        else:
            red_x.append(i)
            red_y.append(j)

plt.figure()
plt.scatter(pos_data[0, :], pos_data[1, :], c="b", s=10)
plt.scatter(neg_data[0, :], neg_data[1, :], c="b", s=10)
plt.scatter(green_x, green_y, c="g", s=10)
plt.scatter(red_x, red_y, c="r", s=10)
plt.show()
```

• As the decision surface shows, the classification result is quite clear

## Experimental summary

### Problems encountered and solutions

• When implementing the backward pass, the algorithm was not clearly understood at first; the surface symptom was that the matrix shapes did not match, and the deeper cause was that the formulas were not yet fully understood

• For this type of network the foundation is really the mathematics: the formulas should be strictly derived before programming, so that mistakes are not made while coding
• Understanding of the programming language itself should also be strengthened; read more documentation and experiment more

• When online learning had no stop condition, I tried a fixed number of epochs and found some rather interesting behaviour.

• I think this happens because without proper stopping rules, learning remains incomplete.
• After running into this, I started thinking about adding a stopping rule, which reduced the error

### Experience

• A BP neural network can perform non-linear classification, but it needs many iterations to learn. The data set here is small, only 1000 points, yet it needs nearly 150 epochs; a larger data set could require a much longer training cycle.
• The stopping rules of neural network learning are still very important.

Code: the core steps refer to the following website ~
github