Monday, September 12, 2016

Artificial Neural Network

1. One Neuron




2. Multi-layer Neural Network Feedforward Prediction

2.1. One class neural network


2.2. Multi-class neural network


Prediction code in Octave/MATLAB:
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);                         % m is the number of examples
num_labels = size(Theta2, 1);

A1 = X';
A1 = [ones(1, m); A1];    % add a0=1

A2 = sigmoid(Theta1 * A1);
A2 = [ones(1, m); A2];    % add a0=1

A3 = sigmoid(Theta2 * A2);

[v, i] = max(A3, [], 1);    % Obtain the max value and index for each column

p = i';                             % p is a vector of predicted labels between 1 and num_labels

end

The matrix X contains the examples in rows. When you complete the code in predict.m, you will need to add a column of 1's to the matrix. The matrices Theta1 and Theta2 contain the parameters for each unit in rows. Specifically, the first row of Theta1 corresponds to the first hidden unit in the second layer.

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));
end
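As a quick sanity check, here is a usage sketch for predict. The layer sizes (400 inputs, 25 hidden units, 10 labels) and the random weights are assumptions for illustration only, not values fixed by the code above:
% Hypothetical sizes and weights, for illustration only
Theta1 = randn(25, 401);          % 25 hidden units, 400 inputs + bias
Theta2 = randn(10, 26);           % 10 labels, 25 hidden units + bias
X = randn(5000, 400);             % 5000 examples, one per row

p = predict(Theta1, Theta2, X);   % 5000x1 vector of labels in 1..10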

3. Multi-layer Neural Network Training Algorithm

3.1. Traditional BP algorithm without regularization

This part is translated from a Chinese reference; it gives a very good mathematical derivation of why the derivative of the error is \dfrac{\partial E}{\partial w_{ij}} = \delta_{j} o_{i}.

Overview

The backpropagation (BP) algorithm consists of two main phases: propagation and weight update.

Phase 1: Propagation

The propagation step in each iteration has two parts:
  1. (Forward pass) Feed a training input through the network to obtain the activations;
  2. (Backward pass) Compare the resulting activations with the target output for that training input to obtain the error terms of the output and hidden layers.

Phase 2: Weight update

For the weight on each synapse, update as follows:
  1. Multiply the incoming activation by the error term to obtain the gradient of the weight;
  2. Scale the gradient by a factor, negate it, and add it to the weight.
This factor (a percentage) affects both the speed and the quality of training, and is therefore called the learning rate. The gradient points in the direction in which the error grows, so it must be negated when updating the weight in order to reduce the error contributed by that weight.
Phases 1 and 2 are repeated until the network's response to the inputs reaches a satisfactory, predefined target range.

Algorithm

Algorithm for a three-layer network (only one hidden layer):
  initialize the network weights (usually small random values)
  do
     forEach training example ex
        prediction = neural-net-output(network, ex)  // forward pass
        actual = teacher-output(ex)
        compute the error at the output units (prediction - actual)
        compute Δw for all weights from the hidden layer to the output layer   // backward pass
        compute Δw for all weights from the input layer to the hidden layer    // backward pass, continued
        update the network weights   // the input layer is not modified by the error estimate
  until all examples are classified correctly or another stopping criterion is satisfied
  return the network
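Below is a minimal Octave sketch of the loop above, using per-example (stochastic) updates and the squared-error deltas that the derivation below arrives at. The network size, the XOR toy data, the weight names W1/W2, and the learning rate eta are made up for illustration and are not part of the original notes:
% Toy 2-3-1 network trained with per-example backpropagation updates
sigmoid = @(z) 1.0 ./ (1.0 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];   % 4 training inputs (rows)
y = [0; 1; 1; 0];           % target outputs (XOR)
eta = 0.5;                  % learning rate

W1 = 0.1 * randn(3, 3);     % hidden weights (3 units, 2 inputs + bias)
W2 = 0.1 * randn(1, 4);     % output weights (1 unit, 3 hidden + bias)

for epoch = 1:10000,
  for i = 1:size(X, 1),
    a1 = [1; X(i, :)'];                 % forward pass
    a2 = [1; sigmoid(W1 * a1)];
    a3 = sigmoid(W2 * a2);

    d3 = (a3 - y(i)) * a3 * (1 - a3);   % output delta (squared-error cost)
    d2 = (W2(:, 2:end)' * d3) .* a2(2:end) .* (1 - a2(2:end));   % hidden deltas

    W2 = W2 - eta * d3 * a2';           % gradient-step weight updates
    W1 = W1 - eta * d2 * a1';
  end
end
Note that this sketch uses the squared-error output delta from the derivation below; the implementation in section 4.1 uses the cross-entropy cost instead, so its output delta is simply a3 - y.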
The name of the algorithm reflects the fact that errors propagate backward from the output nodes to the input nodes. Strictly speaking, backpropagation computes the gradient of the network error with respect to the network's modifiable weights. This gradient is then typically used in simple stochastic gradient descent to find weights that minimize the error. The term "backpropagation" is, however, often used in a broader sense, referring to the entire procedure that covers both computing the gradient and using it in stochastic gradient descent. In networks where backpropagation is applicable, it usually converges quickly to a satisfactory local minimum.
BP networks are multilayer perceptrons (usually with one input layer, one hidden layer, and one output layer). For the hidden layer to be able to represent all useful functions, a multilayer network must use non-linear activation functions in its layers: a multilayer network with only linear activation functions is equivalent to a single-layer linear network. Commonly used non-linear activation functions include the logistic (sigmoid) function, the softmax function, and the Gaussian function.

Derivation

Since backpropagation uses gradient descent, we need to compute the derivative of the squared error function with respect to the network weights. For a single output neuron, the squared error function is
E = \dfrac{1}{2}(t - y)^2,
where
 E is the squared error,
 t is the target output for the training example, and
 y is the actual output of the output neuron.
The factor \dfrac{1}{2} is included to cancel the exponent when differentiating. The expression is later multiplied by an arbitrary learning rate, so a constant factor here does not matter. For each neuron j, its output o_j is defined as
o_j = \varphi(\mathrm{net}_j) = \varphi\left( \sum_{k=1}^{n} w_{kj} o_k \right).
The input \mathrm{net}_j to a neuron is the weighted sum of the outputs o_k of the preceding neurons. If the neuron is in the first layer after the input layer, the o_k of the input layer are simply the network inputs x_k. The number of inputs to the neuron is n. The variable w_{kj} denotes the weight between neuron k and neuron j.
The activation function \varphi is in general non-linear and differentiable. A common choice is the logistic function (sigmoid function):
\varphi(z) = \dfrac{1}{1 + e^{-z}},
whose derivative has a convenient form:
\dfrac{d\varphi}{dz}(z) = \varphi(z)\left(1 - \varphi(z)\right).

Finding the derivative of the error

The partial derivative of the error with respect to a weight w_{ij} is obtained by applying the chain rule twice:
\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial E}{\partial o_{j}} \dfrac{\partial o_{j}}{\partial \mathrm{net}_{j}} \dfrac{\partial \mathrm{net}_{j}}{\partial w_{ij}}.
In the last factor on the right-hand side, only the weighted sum \mathrm{net}_j depends on w_{ij}, so
\dfrac{\partial \mathrm{net}_{j}}{\partial w_{ij}} = \dfrac{\partial}{\partial w_{ij}} \left( \sum_{k=1}^{n} w_{kj} o_{k} \right) = o_{i}.
The derivative of the output of neuron j with respect to its input is simply the derivative of the activation function (assuming the logistic function here):
\dfrac{\partial o_{j}}{\partial \mathrm{net}_{j}} = \dfrac{\partial}{\partial \mathrm{net}_{j}} \varphi(\mathrm{net}_{j}) = \varphi(\mathrm{net}_{j}) \left( 1 - \varphi(\mathrm{net}_{j}) \right).
This is why backpropagation requires a differentiable activation function.
If the neuron is in the output layer, then o_j = y and
\dfrac{\partial E}{\partial o_{j}} = \dfrac{\partial E}{\partial y} = \dfrac{\partial}{\partial y} \dfrac{1}{2}(t - y)^2 = y - t,
so the first factor can be computed directly.
If, however, j is in an arbitrary inner layer of the network, the derivative of E with respect to o_j is less obvious.
Consider E as a function of the inputs of all the neurons L = \{u, v, \dots, w\} that receive input from neuron j,
and take the total derivative with respect to o_j to obtain a recursive expression for the derivative:
\dfrac{\partial E}{\partial o_{j}} = \sum_{l \in L} \left( \dfrac{\partial E}{\partial \mathrm{net}_{l}} \dfrac{\partial \mathrm{net}_{l}}{\partial o_{j}} \right) = \sum_{l \in L} \left( \dfrac{\partial E}{\partial o_{l}} \dfrac{\partial o_{l}}{\partial \mathrm{net}_{l}} w_{jl} \right).
Therefore, once the derivatives with respect to the outputs o_l of the next layer (the layer closer to the output neuron) are known, the derivative with respect to o_j can be computed.
Putting it all together:
\dfrac{\partial E}{\partial w_{ij}} = \delta_{j} o_{i},
where
\delta_{j} = \dfrac{\partial E}{\partial o_{j}} \dfrac{\partial o_{j}}{\partial \mathrm{net}_{j}} = \begin{cases} (o_{j} - t_{j}) \, o_{j} (1 - o_{j}) & \text{if } j \text{ is an output neuron,} \\ \left( \sum_{l \in L} \delta_{l} w_{jl} \right) o_{j} (1 - o_{j}) & \text{if } j \text{ is an inner neuron.} \end{cases}
To update w_{ij} using gradient descent, a learning rate \alpha must be chosen. The change in weight, which is added to the old weight, equals the product of the learning rate and the gradient, multiplied by -1:
\Delta w_{ij} = -\alpha \dfrac{\partial E}{\partial w_{ij}}.
The factor of -1 is needed because the weights must move toward a minimum of the error function, not a maximum.
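To connect the formula \dfrac{\partial E}{\partial w_{ij}} = \delta_{j} o_{i} to something runnable, here is a minimal Octave sketch that checks it numerically on a tiny 2-2-1 network with no bias terms; all weights, inputs and targets (W1, W2, x, t) are invented for illustration:
% Numeric check of dE/dw = delta_j * o_i for one weight
sigmoid = @(z) 1.0 ./ (1.0 + exp(-z));

W1 = [0.1 0.2; -0.3 0.4];     % weights input -> hidden (2 hidden units, 2 inputs)
W2 = [0.5 -0.6];              % weights hidden -> output (1 output unit)
x  = [1; 0.5];                % one training input
t  = 1;                       % target output

% forward pass
o_hid = sigmoid(W1 * x);      % hidden outputs o_i
y     = sigmoid(W2 * o_hid);  % network output
E     = 0.5 * (t - y)^2;      % squared error from the derivation

% output delta: (o_j - t_j) * o_j * (1 - o_j)
d_out = (y - t) * y * (1 - y);

% analytic gradient for the weight hidden unit 1 -> output: delta_j * o_i
grad_analytic = d_out * o_hid(1);

% central finite-difference estimate of the same gradient
h = 1e-6;
W2p = W2; W2p(1) = W2p(1) + h;
W2m = W2; W2m(1) = W2m(1) - h;
Ep = 0.5 * (t - sigmoid(W2p * sigmoid(W1 * x)))^2;
Em = 0.5 * (t - sigmoid(W2m * sigmoid(W1 * x)))^2;
grad_numeric = (Ep - Em) / (2 * h);

fprintf('analytic: %g   numeric: %g\n', grad_analytic, grad_numeric);
The two printed values should agree to several decimal places; the same kind of check is a good way to debug a full backpropagation implementation.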

3.2. Regularized Training

See the study note below for details.
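In short, and matching the implementation in section 4.1 below (the \Delta^{(l)} terms are the accumulated deltas from backpropagation, and the bias columns are excluded from the regularization sums):
J(\Theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_{k}^{(i)} \log \left( h_\Theta(x^{(i)}) \right)_{k} + \left(1 - y_{k}^{(i)}\right) \log \left( 1 - \left( h_\Theta(x^{(i)}) \right)_{k} \right) \right] + \dfrac{\lambda}{2m} \left[ \sum_{j,k} \left( \Theta^{(1)}_{j,k} \right)^2 + \sum_{j,k} \left( \Theta^{(2)}_{j,k} \right)^2 \right]

\dfrac{\partial J}{\partial \Theta^{(l)}_{ij}} = \dfrac{1}{m} \Delta^{(l)}_{ij} + \dfrac{\lambda}{m} \Theta^{(l)}_{ij} \quad (j \ge 1), \qquad \dfrac{\partial J}{\partial \Theta^{(l)}_{i0}} = \dfrac{1}{m} \Delta^{(l)}_{i0}.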

4. Practice

4.1. Code

Below is my implementation of a 3-layer fully connected regularized NN.
function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, input_layer_size, hidden_layer_size, ...
%   num_labels, X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices. 

%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);
J = 0;                                  % cost accumulator used in Part 1
         
% Part 1: Feedforward the neural network and return the cost in the variable J.
% recode the labels as vectors containing only values 0 or 1
labels = zeros(num_labels, m);
for i = 1:m,
  k = y(i);
  labels(k, i) = 1;
end

% feedforward to calculate J
for i = 1:m,
  a1 = X(i, :)';
  a1 = [1; a1];
  
  z2 = Theta1 * a1;
  a2 = sigmoid(z2);
  a2 = [1; a2];
  
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);
  
  J = J + sum((labels(:,i) .* log(a3)) + ((1 - labels(:,i)) .* log(1-a3)));
end
J = (-1/m) * J;

% Regularize J, exclude first column in Theta1 and Theta2
T1 = Theta1(:, 2:end);
T2 = Theta2(:, 2:end);
J = J + lambda/(2*m) *  (sum(sum(T1.*T1)) + sum(sum(T2.*T2)));


% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad.
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
for i = 1:m,
  %feedforward to compute a1, a2, a3
  a1 = X(i, :)';
  a1 = [1; a1];
  
  z2 = Theta1 * a1;
  a2 = sigmoid(z2);
  a2 = [1; a2];
  
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);
  
  %get errors of output nodes
  delta3 = a3 - labels(:,i);
  %back propagate errors to hidden layer nodes
  delta2 = (T2' * delta3) .* sigmoidGradient(z2);
  
  %accumulate weight gradients between layer 2 and layer 3
  Theta2_grad = Theta2_grad + delta3 * a2';
  %accumulate weight gradients between layer 1 and layer 2
  Theta1_grad = Theta1_grad + delta2 * a1';
end
%average weight gradients
Theta2_grad = Theta2_grad / m;
Theta1_grad = Theta1_grad / m;


% Part 3: Implement regularization with gradients. Exclude first column in Theta1 and Theta2
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + lambda/m * T2;
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + lambda/m * T1;

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));
end

function g = sigmoidGradient(z)
%SIGMOIDGRADIENT Compute the gradient of the sigmoid function evaluated at
%   each value of z (z can be a matrix, vector or scalar).

a = sigmoid(z);
g = a .* (1-a);
end

function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

m = size(X, 1);
num_labels = size(Theta2, 1);

h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);
end

4.2. Training Accuracy

Applying the code above to handwritten digit recognition: feed the NN 5000 20x20 images with lambda=1 and train for 500 iterations. The training accuracy reached 99.22%.

Here is the code snippet to kick off training:

options = optimset('MaxIter', 500);   % increase MaxIter to see how more training helps
lambda = 1;                           % you should also try different values of lambda
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

%  Use the neural network to predict the labels of the training set and compute the training set accuracy.
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

4.3. Visualizing hidden layer

Now look at the 400 hidden nodes. What on earth did the neural network learn to represent?!
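One minimal way to look at them in Octave, assuming the inputs are 20x20 images so that each hidden unit's incoming weights (excluding the bias weight) can be reshaped into a 20x20 patch; the plotting details below are my own sketch, not code from the original post:
% Visualize what each hidden unit responds to by reshaping its incoming weights
W = Theta1(:, 2:end);                 % one row of incoming weights per hidden unit
num_hidden = size(W, 1);
ncols = ceil(sqrt(num_hidden));
colormap(gray);
for i = 1:num_hidden,
  subplot(ncols, ncols, i);
  imagesc(reshape(W(i, :), 20, 20));  % a transpose may be needed, depending on how the pixels were unrolled
  axis off;
end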
