Friday, September 30, 2016

The ILSVRC Machine Learning Competition: ImageNet Large-Scale Visual Recognition Challenge

The 2014 ILSVRC Competition

The ILSVRC-2014 training set contained about 1.2 million ImageNet images drawn from 1,000 categories. The validation and test sets contained 50,000 and 150,000 images respectively, drawn from the same 1,000 categories.

The winning team, from Google, used a deep convolutional network with 22 layers. They called it GoogLeNet, in homage to LeNet-5. GoogLeNet achieved a top-5 accuracy of 93.33%, far ahead of the 2013 winner Clarifai (88.3%) and the 2012 winner KSH (84.7%).

So just how good is GoogLeNet's 93.33% accuracy? In 2014, a research team wrote a survey paper on the
ILSVRC competition. One of the questions they examined was how well humans perform on this task. To answer it, they built a system that let humans classify ILSVRC images. As one of the authors, Andrej Karpathy, explained in a blog post, getting human performance up to GoogLeNet's level turned out to be very hard.

In other words, an expert human, examining the images very carefully and with considerable effort, could only narrowly beat the deep neural network. In fact, Karpathy noted that a second human expert, trained on a smaller sample of images, only managed a 12.0% top-5 error rate, noticeably worse than GoogLeNet. Roughly half of the errors were cases where the expert "struggled to spot and identify what the correct category actually was".

http://image-net.org/challenges/LSVRC/2016/results#team

Wednesday, September 28, 2016

Adam Algorithm - Probably the Best Adaptive Learning Rate Method for Deep Learning

The name of "Adam" derives from the phrase "adaptive moment estimation".

I implemented it in C# by inheriting from the basic NeuralNetwork class I created earlier for the Stanford Machine Learning online course, and tested it in my Machine Learning Demo application. The cost function dropped surprisingly fast.

Here is the comparison:
Basic SGD: Cost drops from 0.812 to 0.797 after 3000 iterations
Adam: Cost drops from 0.812 to 0.375 after 3000 iterations

The cost function trend charts are in screenshots.
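The Adam subclass is a drop-in replacement for the base network, so a comparison like the one above can be reproduced roughly as follows (a sketch only; the layer sizes and the samples/labels arrays are hypothetical stand-ins for the demo app's data):

// Sketch: 'samples' is a hypothetical float[][] and 'labels' a hypothetical int[].
var sgdNet  = new NeuralNetwork(400, 25, 10) { LoopCount = 3000, CostThreshold = 0f };
var adamNet = new AdamAlgorithmNeuralNetwork(400, 25, 10) { LoopCount = 3000, CostThreshold = 0f };

adamNet.TrainingEvent += (sender, e) => Console.WriteLine(e.Message);  // prints "loop - cost" each iteration

sgdNet.Train(samples, labels);   // plain gradient descent
adamNet.Train(samples, labels);  // Adam: same interface, much faster cost drop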



The Adam algorithm was first published in the paper "Adam: A Method for Stochastic Optimization" by Diederik P. Kingma and Jimmy Ba: http://arxiv.org/pdf/1412.6980v8.pdf



My source code:

using System;
using System.Threading.Tasks;

namespace MachineLearning.ANN
{
    /// <summary>
    /// Adaptive learning rate neural network based on Adam algorithm: http://arxiv.org/pdf/1412.6980v8.pdf
    /// </summary>
    public class AdamAlgorithmNeuralNetwork : NeuralNetwork
    {
        private double[][][] _s;    //1st moment variables in Adam algorithm
        private double[][][] _r;    //2nd moment variables in Adam algorithm

        public AdamAlgorithmNeuralNetwork(params int[] nodeCountInEachLayer):base(nodeCountInEachLayer)
        {
            this.Alpha = 0.001f;
            this.Ro1 = 0.9f;
            this.Ro2 = 0.999f;
        }

        /// <summary>
        /// Exponential decay rate for the second moment estimates
        /// </summary>
        public float Ro2 { get; set; }
        /// <summary>
        /// Exponential decay rate for the first moment estimates
        /// </summary>
        public float Ro1 { get; set; }

        protected override void BeforeTraining()
        {
            base.BeforeTraining();
            _s = AllocateAllLayerThetas(false);
            _r = AllocateAllLayerThetas(false);
        }

        protected override void AfterTraining()
        {
            base.AfterTraining();
            _s = null;
            _r = null;
        }

        /// <summary>
        /// Descend thetas by Adam algorithm
        /// </summary>
        /// <param name="timeStep">time step "t"</param>
        protected override void DescendThetas(int timeStep)
        {
            int t = timeStep + 1;
            Parallel.For(0, _thetas.Length, (l) => //for (int l = 0; l < _thetas.Length; l++)
            {
                var thetas = _thetas[l];
                var thetaGrads = _thetaGradients[l];
                var sl = _s[l];
                var rl = _r[l];
                for (int j = 0; j < thetas.Length; j++)
                {
                    var thetaVector = thetas[j];
                    var thetaGradVector = thetaGrads[j];
                    var s = sl[j];
                    var r = rl[j];
                    for (int i = 0; i < thetaVector.Length; i++)
                    {
                        var g = thetaGradVector[i];
                        var sc =
                            (s[i] = Ro1*s[i] + (1 - Ro1)*g) //Update biased first moment estimate
                            / (1 - Math.Pow(Ro1, t));       //Correct bias in first moment
                        var rc =
                            (r[i] = Ro2*r[i] + (1 - Ro2)*g*g)   //Update biased second moment estimate
                            / (1 - Math.Pow(Ro2, t));           //Correct bias in second moment

                        thetaVector[i] -= Alpha*sc/(Math.Sqrt(rc) + 0.00000001);
                    }
                }
            });
        }
    }
}

Tuesday, September 27, 2016

Artificial Neural Network in C#

I implemented the ANN algorithm in C# after submitting Machine Learning exercise 5 in Octave/MATLAB, and tested it with a four-layer network of 2 x 100 x 800 x 1; see the demo below:

The ANN class implements IBinaryClassifier so that it can be plugged into the AdaBoost algorithm as a weak classifier. A short usage sketch is shown below, followed by the full code of the NeuralNetwork class; InternalTrainAsync(), CostFunction() and ComputeThetaGradients() are the most important parts of the ANN algorithm.
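For orientation, here is a minimal usage sketch of the class as a binary classifier (the toy samples below are hypothetical; the demo app feeds its own data):

// Two input features, two hidden layers, one output node (binary classifier).
var nn = new NeuralNetwork(2, 100, 800, 1)
{
    Alpha = 0.1f,
    Lambda = 1f,
    LoopCount = 1000,
    CostThreshold = 0.3f
};
nn.TrainingEvent += (sender, e) => Console.WriteLine(e.Message);  // training progress: "loop - cost"

// Hypothetical toy data; each sample is a float[] of feature values.
float[][] samples = { new[] { 0.1f, 0.9f }, new[] { 0.8f, 0.2f } };
bool[] responses = { true, false };

if (nn.Train(samples, responses))
{
    bool isPositive = nn.Predict(new[] { 0.15f, 0.85f });
    Console.WriteLine(isPositive);
}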

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace MachineLearning.ANN
{
    [Serializable]
    public partial class NeuralNetwork : IBinaryClassifier
    {
        private double[][][] _thetas;

        public NeuralNetwork(params int[] nodeCountInEachLayer)
        {
            this.NodeCounts = nodeCountInEachLayer;
            this.Alpha = 0.1f;
            this.Lambda = 1f;
            this.LoopCount = 1000;
            this.CostThreshold = 0.3f;
        }

        #region properties
        public int[] NodeCounts
        {
            get { return _nodeCounts; }
            private set
            {
                if (value == null)
                    throw new ArgumentNullException("NodeCounts cannot be null.");
                if (value.Length < 2)
                    throw new ArgumentOutOfRangeException("Neural network cannot have less than 2 layers.");
                for (int layer = 0; layer < value.Length; layer++)
                {
                    if (value[layer] <= 0)
                        throw new ArgumentException("Neural network must have at least one node in a layer.");
                }

                _nodeCounts = value;

                //allocate thetas for all layers
                _thetas = AllocateAllLayerThetas(true);
            }
        }

        private int[] _nodeCounts;

        public int Depth
        {
            get { return _nodeCounts?.Length ?? 0; }
        }

        public int FeatureCount
        {
            get { return NodeCountInLayer(0); }
        }

        public int OutputCount
        {
            get { return NodeCountInLayer(Depth - 1); }
        }

        /// <summary>
        /// Node count in a layer, not including the bias node.
        /// </summary>
        /// <param name="layer"></param>
        /// <returns></returns>
        public int NodeCountInLayer(int layer)
        {
            return _nodeCounts?[layer] ?? 0;
        }

        /// <summary>
        /// Learn rate.
        /// </summary>
        public float Alpha { get; set; }

        /// <summary>
        /// Regularization parameter.
        /// </summary>
        public float Lambda { get; set; }

        /// <summary>
        /// Maximum number of learning loops before training terminates
        /// </summary>
        public int LoopCount { get; set; }

        /// <summary>
        /// Cost function threshold for training termination.
        /// </summary>
        public float CostThreshold { get; set; }

        /// <summary>
        /// NN is binary classifier when there is only one output node.
        /// </summary>
        public bool IsBinaryClassifier
        {
            get { return OutputCount == 1; }
        }
        #endregion

        #region IBinaryClassifier features
        public event EventHandler<ClassifierTrainingEventArgs> TrainingEvent;

        public void CopySettingsFrom(IBinaryClassifier template)
        {
            var copy = template as NeuralNetwork;
            this.NodeCounts = copy.NodeCounts;
            this.Alpha = copy.Alpha;
            this.Lambda = copy.Lambda;
            this.LoopCount = copy.LoopCount;
            this.CostThreshold = copy.CostThreshold;
        }

        public bool Train(float[][] samples, bool[] responses)
        {
            using (var cts = new CancellationTokenSource())
            {
                return TrainAsync(samples, responses, cts.Token).Result;
            }
        }

        public bool Train(float[][] samples, bool[] responses, float[] weights)
        {
            using (var cts = new CancellationTokenSource())
            {
                return TrainAsync(samples, responses, weights, cts.Token).Result;
            }
        }

        public async Task<bool> TrainAsync(float[][] samples, bool[] responses, CancellationToken ct)
        {
            float[] weights = new float[samples.Length];
            for (int i = 0; i < weights.Length; i++)
            {
                weights[i] = 1.0f;
            }

            return await TrainAsync(samples, responses, weights, ct);
        }

        public async Task<bool> TrainAsync(float[][] samples, bool[] responses, float[] weights, CancellationToken ct)
        {
            if (!this.IsBinaryClassifier)
                throw new Exception("Neural network as binary classifier cannot have more than 1 output node.");

            if (samples.Length == 0)
                throw new ArgumentException("The number of samples is zero.");
            if (samples[0].Length != this.FeatureCount)
                throw new ArgumentException("The number of features in a sample is not the same as the number of input nodes.");
            if (samples.Length != responses.Length)
                throw new ArgumentException("The number of samples must be the same as the number of responses.");
            if (samples.Length != weights.Length)
                throw new ArgumentException("The number of samples must be the same as the number of weights.");
            if (weights.Any(x => x < 0))
                throw new ArgumentException("There exists negative weight of sample.");

            //resample by top 80% of weights
            var selectedIndics = new List<int>(samples.Length);
            //select all samples weighted as 1
            selectedIndics.AddRange(Enumerable.Range(0, weights.Length).Where(i => weights[i] >= 1));
            //select top 80% of the other samples by weights
            selectedIndics.AddRange(Enumerable.Range(0, weights.Length)
                .Select(i => new { i, weight = weights[i] })
                .Where(p => p.weight < 1)
                .OrderByDescending(p => p.weight)
                .Take((int)(weights.Length * 0.8))
                .Where(p => p.weight > 0)
                .Select(p => p.i));
            if (selectedIndics.Count == 0)
                return false;

            float[][] X = selectedIndics.Select(i => samples[i]).ToArray();
            int[][] Y = selectedIndics.Select(i => new int[1] { responses[i] ? 1 : 0 }).ToArray();

            return await InternalTrainAsync(X, Y, ct);
        }

        public bool Predict(float[] sample)
        {
            if (!this.IsBinaryClassifier)
                throw new Exception("Neural network as binary classifier cannot have more than 1 output node.");

            var h = ComputeInputLayerActivations(sample);
            for (int l = 1; l < Depth; l++)
            {
                h = ComputeNonInputLayerActivations(l, h);
            }

            return h[1] > 0.5;    //ignore a0
        }
        #endregion

        #region Multiple category classifier
        public bool Train(float[][] samples, int[] responses)
        {
            using (var cts = new CancellationTokenSource())
            {
                return TrainAsync(samples, responses, cts.Token).Result;
            }
        }

        /// <summary>
        /// Train NN with samples.
        /// </summary>
        /// <param name="samples">Samples</param>
        /// <param name="responses">The index of categories or labels, starting with 0</param>
        /// <param name="ct">Cancellation token</param>
        /// <returns></returns>
        public async Task<bool> TrainAsync(float[][] samples, int[] responses, CancellationToken ct)
        {
            if (samples.Length == 0)
                throw new ArgumentException("The number of samples is zero.");
            if (samples[0].Length != this.FeatureCount)
                throw new ArgumentException("The number of features in a sample is not the same as the number of input nodes.");
            if (samples.Length != responses.Length)
                throw new ArgumentException("The number of samples must be the same as the number of responses.");

            //recode the labels as vectors containing only values 0 or 1
            int[][] labels = new int[samples.Length][];
            for (int i = 0; i < labels.Length; i++)
            {
                labels[i] = new int[this.OutputCount];
                labels[i][responses[i]] = 1;
            }

            return await InternalTrainAsync(samples, labels, ct);
        }

        public int PredictMultiCategory(float[] sample)
        {
            var h = ComputeInputLayerActivations(sample);
            for (int l = 1; l < Depth; l++)
            {
                h = ComputeNonInputLayerActivations(l, h);
            }

            double max = h[1];    //ignore a0
            int index = 1;
            for (int i = 2; i < h.Length; i++)
            {
                if (h[i] > max)
                {
                    index = i;
                    max = h[i];
                }
            }
            return index - 1;
        }
        #endregion

        public override string ToString()
        {
            if (_nodeCounts == null)
                return null;

            var sb = new StringBuilder();
            sb.Append(_nodeCounts[0].ToString());
            for (int i = 1; i < _nodeCounts.Length; i++)
            {
                sb.Append(',');
                sb.Append(_nodeCounts[i].ToString());
            }

            return sb.ToString();
        }

        /// <summary>
        /// g(z) = 1/(1+exp(-z))
        /// </summary>
        /// <param name="z"></param>
        /// <returns></returns>
        public static double Sigmoid(double z)
        {
            return 1 / (1 + Math.Exp(-z));
        }

        #region Private methods
        private async Task<bool> InternalTrainAsync(float[][] X, int[][] Y, CancellationToken ct)
        {
            var eventArgs = new ClassifierTrainingEventArgs();

            var stopWatch = new Stopwatch();
            stopWatch.Start();

            //train NN many times by Gradient Descent algorithm
            int t = 0;
            double cost = 0.0;
            for (; t < LoopCount; t++)
            {
                if (ct.IsCancellationRequested)
                {
                    Debug.WriteLine("Training was cancelled.");
                    if (TrainingEvent != null)
                    {
                        eventArgs.Loop = t;
                        eventArgs.Cost = cost;
                        eventArgs.Message = "Training was cancelled.";
                        TrainingEvent(this, eventArgs);
                    }
                    return false;
                }

                await Task.Run(() =>
                {
                    //calculate weight gradients
                    var thetaGradients = ComputeThetaGradients(X, Y, Lambda);

                    //descent weights                    
                    Parallel.For(0, _thetas.Length, (l) =>  //for (int l = 0; l < _thetas.Length; l++)
                    {
                        var thetas = _thetas[l];
                        var thetaGrads = thetaGradients[l];
                        for (int j = 0; j < thetas.Length; j++)
                        {
                            var thetaVector = thetas[j];
                            var thetaGradVector = thetaGrads[j];
                            for (int i = 0; i < thetaVector.Length; i++)
                            {
                                thetaVector[i] -= Alpha * thetaGradVector[i];
                            }
                        }
                    });

                    //check cost function
                    cost = this.CostFunction(X, Y, Lambda);
                    Debug.WriteLine(string.Format("{0} - cost: {1}", t, cost));
                    if (TrainingEvent != null)
                    {
                        eventArgs.Loop = t;
                        eventArgs.Cost = cost;
                        eventArgs.Message = string.Format("{0} - cost: {1}", t, cost);
                        TrainingEvent(this, eventArgs);
                    }
                });
                if (cost <= CostThreshold)
                    break;
            }

            stopWatch.Stop();

            if (TrainingEvent != null)
            {
                eventArgs.Loop = t;
                eventArgs.Cost = cost;
                eventArgs.Message = "Training succeeded in " + stopWatch.ElapsedMilliseconds + "ms";
                TrainingEvent(this, eventArgs);
            }
            return true;
        }

        private double CostFunction(float[][] X, int[][] Y, float lambda)
        {
            int m = X.Length;

            double cost = 0;
            for (int i = 0; i < X.Length; i++)
            {
                var x = X[i];
                var y = Y[i];
                //compute hypothesis of sample x: h(x)
                var h = ComputeInputLayerActivations(x);
                for (int l = 1; l < Depth; l++)
                {
                    h = ComputeNonInputLayerActivations(l, h);
                }

                for (int k = 0; k < y.Length; k++)
                {
                    cost += y[k] * Math.Log(h[k + 1]) + (1 - y[k]) * Math.Log(1 - h[k + 1]);    //ignore h0
                }
            }
            cost = -cost / m;

            //regularization, exclude first column in Theta1, Theta2, ...
            double punish = 0;
            foreach (var thetasOfLayer in _thetas)
            {
                foreach (var inputThetas in thetasOfLayer)
                {
                    for (int i = 1; i < inputThetas.Length; i++)    //exclude inputThetas[j,0]
                    {
                        punish += inputThetas[i] * inputThetas[i];
                    }
                }
            }
            cost += lambda / (m + m) * punish;

            return cost;
        }

        /// <summary>
        /// Compute theta/weight gradients
        /// </summary>
        /// <param name="X">Samples</param>
        /// <param name="Y">Labels of samples X. Y[i][j]=1 when sample X[i] belongs to the j-th category, Y[i][j]=0 otherwise.</param>
        /// <param name="lambda">regulalization parameter</param>
        /// <returns></returns>
        private double[][][] ComputeThetaGradients(float[][] X, int[][] Y, float lambda)
        {
            int m = X.Length;   //the number of samples

            // Implement the backpropagation algorithm to compute the gradients of thetas.
            // They are the partial derivatives of the cost function with respect to each theta in the arrays _thetas[0], _thetas[1], ...
            double[][][] thetaGrads = AllocateAllLayerThetas(false);

            double[][] nodeOutputs = new double[Depth][];
            for (int s = 0; s < X.Length; s++)
            {
                var x = X[s];
                //Feedforward to compute node outputs: a1, a2, a3, ...
                Debug.WriteLine("Computer layer 0 activation");
                nodeOutputs[0] = ComputeInputLayerActivations(x);
                for (int l = 1; l < Depth; l++)
                {
                    Debug.WriteLine("Computer layer {0} activation", l);
                    nodeOutputs[l] = ComputeNonInputLayerActivations(l, nodeOutputs[l - 1]);
                }

                //Compute errors of nodes in output layer by deltas[i+1]-labels[i], ignore delta 0
                Debug.WriteLine("Computer layer {0} errors", Depth - 1);
                double[] deltas = nodeOutputs[Depth - 1];   //reuse the space of last layer's outputs to store its errors
                for (int j = 1; j < deltas.Length; j++)
                {
                    deltas[j] -= Y[s][j - 1];
                }

                //back-propagate errors to each hidden layer
                for (int l = Depth - 2; l >= 0; l--)
                {
                    //Compute theta grad between layer l and layer l+1
                    Debug.WriteLine("Update theta gradients between layer {0} and layer {1}", l + 1, l);
                    var thetaGrad = thetaGrads[l];
                    UpdateThetaGradients(thetaGrad, nodeOutputs[l], deltas);

                    if (l > 0)
                    {
                        //back-propagate errors to hidden layer l
                        Debug.WriteLine("Computer layer {0} errors", l);
                        deltas = ComputeHiddenLayerErrors(l, nodeOutputs[l], deltas);
                    }
                }
            }

            //average weight gradients
            double coef = lambda / m;
            //for (int l = 0; l < thetaGrads.Length; l++)
            Parallel.For(0, thetaGrads.Length, (l) =>
            {
                var thetas = _thetas[l];
                var grads = thetaGrads[l];
                for (int j = 0; j < grads.Length; j++)
                {
                    var g = grads[j];
                    for (int i = 0; i < g.Length; i++)
                    {
                        g[i] /= m;
                    }

                    // Implement regularization with gradients. Exclude first column in Theta1, Theta2, ...
                    var t = thetas[j];
                    for (int i = 1; i < g.Length; i++)
                    {
                        g[i] += coef * t[i];
                    }
                }
            });

            return thetaGrads;
        }

        private double[] ComputeInputLayerActivations(float[] inputs)
        {
            var a = new double[NodeCountInLayer(0) + 1];
            a[0] = 1;   //add bias node
            inputs.CopyTo(a, 1);
            return a;
        }

        private double[] ComputeNonInputLayerActivations(int layerIndex, double[] inputs)
        {
            var a = new double[NodeCountInLayer(layerIndex) + 1];
            a[0] = 1; //add bias node, it is useless in output layer

            Parallel.For(1, a.Length, (j) =>    //for (int j = 1; j < a.Length; j++)
            {
                a[j] = ComputeNonInputNodeActivation(layerIndex, j, inputs);
            });
            return a;
        }

        private double ComputeNonInputNodeActivation(int layerIndex, int nodeIndex, double[] inputs)
        {
            var inputThetas = _thetas[layerIndex - 1][nodeIndex - 1];    //The input weights of layer, nodeIndex-th node in layerIndex-th layer
            Debug.Assert(inputThetas.Length == inputs.Length);

            double totalWeightedInput = 0.0;
            for (int i = 0; i < inputs.Length; i++)
            {
                totalWeightedInput += inputs[i] * inputThetas[i];
            }

            return Sigmoid(totalWeightedInput);
        }

        /// <summary>
        /// Compute error of each node in the hidden layer, reuse hiddenLayerNodesOutputs as space to store node errors.
        /// </summary>
        /// <param name="hiddenLayer">Index of hidden layer (start with 0)</param>
        /// <param name="hiddenLayerNodesOutputs">As input, it stores activations/outputs of hidden layer nodes. As output, it stores errors of the same nodes.</param>
        /// <param name="higherLayerDeltas">The errors of nodes in next layer.</param>
        /// <returns></returns>
        private double[] ComputeHiddenLayerErrors(int hiddenLayer, double[] hiddenLayerNodesOutputs, double[] higherLayerDeltas)
        {
            var thetas = _thetas[hiddenLayer];

            double[] deltas = hiddenLayerNodesOutputs;
            //for (int i = 1; i < hiddenLayerNodesOutputs.Length; i++)    //ignore first hidden node
            Parallel.For(1, hiddenLayerNodesOutputs.Length, (i) =>
            {
                double totalWeightedErrorsFromHigherLayer = 0.0;
                for (int j = 1; j < higherLayerDeltas.Length; j++)  //ignore delta 0
                {
                    totalWeightedErrorsFromHigherLayer += thetas[j - 1][i] * higherLayerDeltas[j];
                }

                //Let a=sigmoid(x), its derivative sigmoid'(x) = a*(1-a)
                double a = hiddenLayerNodesOutputs[i];
                deltas[i] = totalWeightedErrorsFromHigherLayer * a * (1 - a);
            });

            return deltas;
        }

        private void UpdateThetaGradients(double[][] thetaGrad, double[] inputNodesActivations, double[] outputNodesDeltas)
        {
            //for (int j = 1; j < outputNodesDeltas.Length; j++) //ignore delta 0
            Parallel.For(1, outputNodesDeltas.Length, (j) =>
            {
                var outputNodeDelta = outputNodesDeltas[j];
                var g = thetaGrad[j - 1];
                for (int i = 0; i < inputNodesActivations.Length; i++)
                {
                    g[i] += inputNodesActivations[i] * outputNodeDelta;
                }
            });
        }

        private double[][][] AllocateAllLayerThetas(bool initialize)
        {
            var layers = new double[Depth - 1][][];
            for (int l = 0; l < layers.Length; l++)
            {
                int currentLayerNodeCount = _nodeCounts[l];
                int nextLayerNodeCount = _nodeCounts[l + 1];

                var layer = layers[l] = new double[nextLayerNodeCount][];
                for (int j = 0; j < layer.Length; j++)
                {
                    layer[j] = new double[currentLayerNodeCount + 1];
                }
            }

            if (initialize)
            {
                //init thetas with small random numbers in [-epsilon, epsilon], epsilon=sqrt(6)/[sqrt(s(layer))+sqrt(s(layer+1))]
                const double epsilonInit = 0.12;
                var random = new Random();
                foreach (var layer in layers)
                {
                    foreach (var inputThetas in layer)
                    {
                        for (int i = 0; i < inputThetas.Length; i++)
                        {
                            inputThetas[i] = random.NextDouble() * 2 * epsilonInit - epsilonInit;
                        }
                    }
                }
            }
            return layers;
        }
        #endregion
    }
}

The second part of the NeuralNetwork class, below, implements the CheckGradients() method; it is useful for verifying that the cost function and the partial derivatives with respect to the thetas are computed correctly.
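The numerical gradient it compares against is the standard central-difference approximation, computed for one theta at a time while all the others are held fixed:

\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta_1, \dots, \theta_i + e, \dots) - J(\theta_1, \dots, \theta_i - e, \dots)}{2e}, \qquad e = 10^{-4}.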

using System;
using System.Diagnostics;

namespace MachineLearning.ANN
{
    public partial class NeuralNetwork
    {
#if DEBUG
        private double[][][] ComputeNumericalGradients(float[][] X, int[][] Y)
        {
            const double e = 1e-4;
            double[][][] thetaGrads = new double[_thetas.Length][][];
            for (int l = 0; l < _thetas.Length; l++)
            {
                var theta = _thetas[l];
                var thetaGrad = thetaGrads[l] = new double[theta.Length][];
                for (int j = 0; j < theta.Length; j++)
                {
                    var t = theta[j];
                    var g = thetaGrad[j] = new double[t.Length];

                    for (int i = 0; i < t.Length; i++)
                    {
                        var oldTheta = t[i];

                        t[i] -= e;
                        double loss1 = CostFunction(X, Y, 0);
                        t[i] = oldTheta + e;
                        double loss2 = CostFunction(X, Y, 0);

                        g[i] = (loss2 - loss1) / (2 * e);

                        t[i] = oldTheta;
                    }
                }
            }

            return thetaGrads;
        }

        public static bool CheckGradients()
        {
            var nn = new NeuralNetwork(3, 5, 3);
            //update thetas using "sin"
            var thetas = nn._thetas;
            for (int l = 0; l < thetas.Length; l++)
            {
                var a = thetas[l];
                for (int x = 1, i = 0; i < a[0].Length; i++)
                {
                    for (int j = 0; j < a.Length; j++)
                    {
                        a[j][i] = Math.Sin(x++)/10;
                    }
                }
            }

            //create samples
            int m = 5; //sample count
            var X = new float[m][];
            var Y = new int[m][];

            for (int i = 0; i < X.Length; i++)
            {
                X[i] = new float[3];
            }
            for (int x = 1, j = 0; j < 3; j++)
            {
                for (int i = 0; i < X.Length; i++)
                {
                    X[i][j] = (float)Math.Sin(x++) / 10;
                }
            }

            for (int i = 0; i < Y.Length; i++)
            {
                Y[i] = new int[3];
                Y[i][(i+1) % 3] = 1;
            }

            var grad = nn.ComputeThetaGradients(X, Y, 0);
            var numGrad = nn.ComputeNumericalGradients(X, Y);
            for (int l = 0; l < grad.Length; l++)
            {
                var a = grad[l];
                var b = numGrad[l];
                for (int j = 0; j < a.Length; j++)
                {
                    var g = a[j];
                    var n = b[j];

                    for (int i = 0; i < g.Length; i++)
                    {
                        Debug.WriteLine("{0}, {1}", n[i], g[i]);
                    }
                }
            }

            double diff = numGrad.Minus(grad).Norm() / numGrad.Add(grad).Norm();
            Debug.WriteLine("Relative defference should be less than 1e-9: {0}", diff);
            return diff < 1e-9;
        }
#endif
    }
}

In addition, the IBinaryClassifier interface is posted here; classes that implement it can be plugged into the AdaBoost algorithm as weak classifiers (a rough sketch of such a boosting loop follows the interface).
    public interface IBinaryClassifier
    {
        event EventHandler<ClassifierTrainingEventArgs> TrainingEvent;


        void CopySettingsFrom(IBinaryClassifier template);
        bool Train(float[][] samples, bool[] responses);
        bool Train(float[][] samples, bool[] responses, float[] weights);
        Task<bool> TrainAsync(float[][] samples, bool[] responses, CancellationToken ct);
        Task<bool> TrainAsync(float[][] samples, bool[] responses, float[] weights, CancellationToken ct);
        bool Predict(float[] sample);
    }
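
For context, here is a rough sketch of the kind of boosting loop that consumes this interface. It is illustrative only: the class name and the factory delegate are hypothetical, and my real AdaBoost code may differ in the details.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    /// Illustrative AdaBoost-style booster over IBinaryClassifier weak learners.
    /// "createWeakLearner" is a hypothetical factory, e.g. () => new NeuralNetwork(2, 10, 1).
    public class AdaBoostSketch
    {
        private readonly List<IBinaryClassifier> _learners = new List<IBinaryClassifier>();
        private readonly List<double> _alphas = new List<double>();

        public void Train(Func<IBinaryClassifier> createWeakLearner, float[][] samples, bool[] responses, int rounds)
        {
            int m = samples.Length;
            var w = Enumerable.Repeat(1.0 / m, m).ToArray();   // start with uniform sample weights

            for (int round = 0; round < rounds; round++)
            {
                var h = createWeakLearner();
                h.Train(samples, responses, w.Select(x => (float)x).ToArray());

                // weighted error of this weak learner
                var predictions = samples.Select(h.Predict).ToArray();
                double eps = Enumerable.Range(0, m).Where(i => predictions[i] != responses[i]).Sum(i => w[i]);
                if (eps <= 0 || eps >= 0.5) break;             // stop when perfect or no better than chance

                double alpha = 0.5 * Math.Log((1 - eps) / eps);
                _learners.Add(h);
                _alphas.Add(alpha);

                // increase the weight of misclassified samples, decrease the rest, then normalize
                for (int i = 0; i < m; i++)
                    w[i] *= Math.Exp(predictions[i] == responses[i] ? -alpha : alpha);
                double sum = w.Sum();
                for (int i = 0; i < m; i++)
                    w[i] /= sum;
            }
        }

        public bool Predict(float[] sample)
        {
            // weighted vote of the weak learners, mapping true/false to +1/-1
            double score = 0;
            for (int t = 0; t < _learners.Count; t++)
                score += _alphas[t] * (_learners[t].Predict(sample) ? 1.0 : -1.0);
            return score > 0;
        }
    }

The weak learners only need the weighted Train overload and Predict; the NeuralNetwork class above provides both.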

Monday, September 12, 2016

Artificial Neural Network

1. One Neuron
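
In formula form, a single neuron with inputs x_1, ..., x_n, weights \theta_1, ..., \theta_n and a bias weight \theta_0 (multiplying the constant bias input a_0 = 1) computes a sigmoid of a weighted sum:

a = g(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}.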




2. Multi-layer Neural Network Feedforward Prediction

2.1. One class neural network


2.2. Multi-class neural network
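
In formulas, the feedforward pass for a 3-layer network (the computation the Octave prediction code below performs for each example, with a bias unit prepended at each layer) is:

a^{(1)} = [1;\, x], \qquad a^{(2)} = [1;\, g(\Theta^{(1)} a^{(1)})], \qquad h_\Theta(x) = a^{(3)} = g(\Theta^{(2)} a^{(2)}),

and for the multi-class case the predicted label is \arg\max_k \bigl(h_\Theta(x)\bigr)_k.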


Prediction code in Octave/MATLAB:
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);                         % m is the number of examples
num_labels = size(Theta2, 1);

A1 = X';
A1 = [ones(1, m); A1];    % add a0=1

A2 = sigmoid(Theta1 * A1);
A2 = [ones(1, m); A2];    % add a0=1

A3 = sigmoid(Theta2 * A2);

[v, i] = max(A3, [], 1);    % Obtain the max value and index for each column

p = i';                             %  set p to a vector containing labels between 1 and num_labels.

end

The matrix X contains the examples in rows. When you complete the code in predict.m, you will need to add the column of 1's to the matrix. The matrices Theta1 and Theta2 contain the parameters for each unit in rows. Specifically, the first row of Theta1 corresponds to the first hidden unit in the second layer.

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   J = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));
end

3. Multi-layer Neural Network Training Algorithm

3.1. Traditional BP algorithm without regularization

This part is translated from a Chinese-language reference; it gives a very good mathematical derivation of why the derivative of the error is \dfrac{\partial E}{\partial w_{ij}} = \delta_{j} o_{i}.

Overview

The backpropagation (BP) algorithm consists of two phases: propagation and weight update.

Phase 1: Propagation

The propagation step in each iteration has two parts:
  1. (Forward pass) Feed a training input into the network to obtain its activations.
  2. (Backward pass) Compare the activations with the target output for that training input to obtain the errors of the hidden and output layers.

Phase 2: Weight update

Update the weight on each synapse as follows:
  1. Multiply the input activation by the output error to obtain the gradient of the weight.
  2. Scale the gradient by a factor, negate it, and add it to the weight.
This factor (a percentage) affects the speed and quality of training, and is therefore called the "learning rate". The gradient points in the direction in which the error grows, so it must be negated when updating the weight in order to reduce the error.
Phases 1 and 2 are repeated until the network's response to the inputs reaches a satisfactory, predefined target range.
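
In symbols, a single weight update is (with learning rate \eta; the derivation below shows why the gradient factors as \delta_j o_i):

\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} = -\eta\, \delta_j\, o_i.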

Algorithm

Algorithm for a three-layer network (with only one hidden layer):
  initialize the network weights (usually to small random values)
  do
     forEach training example ex
        prediction = neural-net-output(network, ex)  // forward pass
        actual = teacher-output(ex)
        compute the error at the output units (prediction - actual)
        compute the gradients for all weights from the hidden layer to the output layer   // backward pass
        compute the gradients for all weights from the input layer to the hidden layer    // backward pass, continued
        update the network weights // the input layer is not modified by the error estimates
  until all examples are classified correctly or another stopping criterion is satisfied
  return the network
The name of the algorithm reflects the fact that errors propagate backwards from the output nodes to the input nodes. Strictly speaking, backpropagation computes the gradient of the network error with respect to the network's modifiable weights. This gradient is then typically used by simple stochastic gradient descent to find the weights that minimize the error. In practice, however, "backpropagation" is often used in a broader sense, covering the whole procedure of computing the gradient and using it in stochastic gradient descent. In networks where it applies, backpropagation usually converges quickly to a satisfactory local minimum.
BP networks are multi-layer perceptrons (usually with one input layer, one hidden layer and one output layer). For the hidden layer to be able to represent arbitrary useful functions, a multi-layer network must use non-linear activation functions in multiple layers: a multi-layer network using only linear activation functions is equivalent to a single-layer linear network. Commonly used non-linear activation functions include the logistic function, the softmax function and the Gaussian function.

Derivation

Since backpropagation uses gradient descent, we need the derivative of the squared error function with respect to the network weights. For a single output neuron, the squared error function is

E = \tfrac{1}{2}(t - y)^2

where
 E is the squared error,
 t is the target output of the training example,
 y is the actual output of the output neuron.
The factor \tfrac{1}{2} is included to cancel the exponent when differentiating; since the expression is later multiplied by an arbitrary learning rate, a constant coefficient here does not matter. For each neuron j, its output o_j is defined as

o_j = \varphi(\mathrm{net}_j) = \varphi\left(\sum_{k=1}^{n} w_{kj}\, o_k\right).

The input \mathrm{net}_j to a neuron is the weighted sum of the outputs o_k of the preceding neurons. If the neuron is in the first layer after the input layer, the outputs o_k of the input layer are simply the network inputs x_k. The neuron has n inputs, and the variable w_{kj} denotes the weight between neurons k and j.
The activation function \varphi is generally a non-linear, differentiable function. A common choice is the logistic (sigmoid) function

\varphi(z) = \frac{1}{1 + e^{-z}},

whose derivative has a convenient form:

\frac{d\varphi}{dz}(z) = \varphi(z)\,\bigl(1 - \varphi(z)\bigr).

Finding the derivative of the error

The partial derivative of the error with respect to a weight w_{ij} is obtained by applying the chain rule twice:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial \mathrm{net}_j}\,\frac{\partial \mathrm{net}_j}{\partial w_{ij}}.

In the last factor on the right-hand side, only the weighted sum \mathrm{net}_j depends on w_{ij}, so

\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n} w_{kj}\, o_k\right) = o_i.

The derivative of the output of neuron j with respect to its input is simply the derivative of the activation function (assuming the logistic function here):

\frac{\partial o_j}{\partial \mathrm{net}_j} = \varphi(\mathrm{net}_j)\,\bigl(1 - \varphi(\mathrm{net}_j)\bigr) = o_j (1 - o_j).

This is why backpropagation requires a differentiable activation function.
If the neuron is in the output layer, then o_j = y and

\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = y - t,

so the first factor can be computed directly.
If j is any inner (hidden) neuron of the network, however, the derivative of E with respect to o_j is less obvious.
Treating E as a function of the inputs of all neurons L = \{u, v, \dots, w\} that receive input from neuron j,
and taking the total derivative with respect to o_j, we obtain a recursive expression for the derivative:

\frac{\partial E}{\partial o_j} = \sum_{l \in L}\left(\frac{\partial E}{\partial \mathrm{net}_l}\,\frac{\partial \mathrm{net}_l}{\partial o_j}\right) = \sum_{l \in L}\left(\frac{\partial E}{\partial o_l}\,\frac{\partial o_l}{\partial \mathrm{net}_l}\, w_{jl}\right).

Therefore, the derivative with respect to o_j can be computed once all the derivatives with respect to the outputs o_l of the next layer (the layer closer to the output neuron) are known.
Putting it all together:

\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i

where

\delta_j = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial \mathrm{net}_j} =
\begin{cases}
(o_j - t_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output neuron,} \\
\bigl(\sum_{l \in L} \delta_l\, w_{jl}\bigr)\, o_j (1 - o_j) & \text{if } j \text{ is an inner neuron.}
\end{cases}

To update the weight w_{ij} using gradient descent, a learning rate \eta must be chosen. The change in weight, which is added to the old weight, equals the learning rate times the gradient, multiplied by -1:

\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} = -\eta\, \delta_j\, o_i.

The factor of -1 is there because we want to move towards a minimum of the error function, not a maximum.

3.2. Regularized Training

See study note below for details:
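
In short, regularized training minimizes the cross-entropy cost plus a penalty on all non-bias thetas, which is exactly what CostFunction() above and nnCostFunction() below compute:

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Bigl[\, y^{(i)}_k \log\bigl(h_\Theta(x^{(i)})\bigr)_k + \bigl(1 - y^{(i)}_k\bigr)\log\bigl(1 - (h_\Theta(x^{(i)}))_k\bigr)\Bigr] + \frac{\lambda}{2m}\sum_{l}\sum_{j}\sum_{i \ge 1}\bigl(\Theta^{(l)}_{j,i}\bigr)^2,

where the bias column (i = 0) of each Theta matrix is excluded from the penalty.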

4. Practice

4.1. Code

Below is my implementation of a 3-layer full connected regularized NN.
function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices. 

%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);
J = 0;
         
% Part 1: Feedforward the neural network and return the cost in the variable J.
% recode the labels as vectors containing only values 0 or 1
labels = zeros(num_labels, m);
for i = 1:m,
  k = y(i);
  labels(k, i) = 1;
end

% feedforward to calculate J
for i = 1:m,
  a1 = X(i, :)';
  a1 = [1; a1];
  
  z2 = Theta1 * a1;
  a2 = sigmoid(z2);
  a2 = [1; a2];
  
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);
  
  J = J + sum((labels(:,i) .* log(a3)) + ((1 - labels(:,i)) .* log(1-a3)));
end
J = (-1/m) * J;

% Regularize J, exclude first column in Theta1 and Theta2
T1 = Theta1(:, 2:end);
T2 = Theta2(:, 2:end);
J = J + lambda/(2*m) *  (sum(sum(T1.*T1)) + sum(sum(T2.*T2)));


% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad.
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
for i = 1:m,
  %feedforward to compute a1, a2, a3
  a1 = X(i, :)';
  a1 = [1; a1];
  
  z2 = Theta1 * a1;
  a2 = sigmoid(z2);
  a2 = [1; a2];
  
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);
  
  %get errors of output nodes
  delta3 = a3 - labels(:,i);
  %back-propagate errors to hidden layer nodes
  delta2 = (T2' * delta3) .* sigmoidGradient(z2);
  
  %accumulate weight gradients between layer 2 and layer 3
  Theta2_grad = Theta2_grad + delta3 * a2';
  %accumulate weight gradients between layer 1 and layer 2
  Theta1_grad = Theta1_grad + delta2 * a1';
end
%average weight gradients
Theta2_grad = Theta2_grad / m;
Theta1_grad = Theta1_grad / m;


% Part 3: Implement regularization with gradients. Exclude first column in Theta1 and Theta2
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + lambda/m * T2;
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + lambda/m * T1;

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   J = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));
end

function g = sigmoidGradient(z)
%SIGMOIDGRADIENT Compute the gradient of the sigmoid function evaluated at
%each value of z (z can be a matrix, vector or scalar).

a = sigmoid(z);
g = a .* (1-a);
end

function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

m = size(X, 1);
num_labels = size(Theta2, 1);

h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);
end

4.2. Training Accuracy

Apply the code above to the task of handwritten digit recognition: feed the NN 5,000 20x20 images with lambda=1 for 500 iterations. The training accuracy reached 99.22%.

Here is the code snippet to kick off training:

options = optimset('MaxIter', 500);  % change the MaxIter to a larger value to see how more training helps.
lambda = 1;                                     %  You should also try different values of lambda
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

%  Use the neural network to predict the labels of the training set and compute the training set accuracy.
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

4.3. Visualizing hidden layer

Now look at the 400 hidden nodes. What representations on earth did the neural network capture?!