Monday, October 17, 2016

Resumable functions in VC++ 2015

Resumable functions are co-routines. People are expecting C++17 proposal define its usage to be as simple as async/await in C# language.

How to enable resumable functions in VC++2015:

  • Add /await option

Visual C++ 2015 has a new compiler option called /await that can be used to write and call resumable functions, and you would need to manually add it via the Additional Options box under the C/C++ property page.

  • Change  /ZI option into /Zi option in order to disable the debugger’s edit-and-continue feature
  • Disable SDL checks with the /sdl- option
  • Avoid the /RTC option


Here gives C++ code snippets of resumable functions and usage.

The async function below will execute the function asynchronously, typically in a separate thread, and return a future that will hold the result of the call when it’s completed.


#include <future>
#include <iostream>
#include <experimental\resumable>

using namespace std::chrono; 

std::future<int> sum_async(int x, int y)
{
  return std::async([x, y] 
  {
    std::this_thread::sleep_for(2s);
    return x + y; 
  });
}



Now, here’s how this method can be called. The await keyword calls that method asynchronously and waits until the method executes to completion. And then control goes to the next line of code.


std::future<void> call_sum_async(int x, int y)
{
  std::cout << "** async func - before call" << std::endl;
  auto sum = await sum_async(x, y);
  std::cout << "** async func - after call" << std::endl;
}

void main_async()
{
  std::cout << "<< main func - before call" << std::endl;
  auto result = call_sum_async(7, 9);
  std::cout << "<< main func - after call" << std::endl;

  {
    using namespace std::chrono;
    std::cout << "<< main func - before sleep" << std::endl;
    std::this_thread::sleep_for(3s);
    std::cout << "<< main func - after sleep" << std::endl;
  }
  result.get();
  std::cout << "<< main func - after get" << std::endl;
}

Tuesday, October 4, 2016

SVM Study Notes

SVM with Gaussian Kernel: How to find the best training parameters C and sigma that give lowest cross-validation error?

This code find the best C and sigma based on given cross-validation set (Xval, yval):

function [C, sigma] = FindBestSVMParameters(X, y, Xval, yval)

C_vec = [0.01 0.03 0.1 0.3 1 3 10 30]';
sigma_vec = [0.01 0.03 0.1 0.3 1 3 10 30]';
ErrorMatrix = zeros(8, 8);

for i = 1:length(C_vec)
  C = C_vec(i);

  for j = 1:length(sigma_vec)
    sigma = sigma_vec(j);
    model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma)); 
    predictions = svmPredict(model, Xval);
    ErrorMatrix(i,j) = mean(double(predictions ~= yval)); %prediction error rate    
  end
end

% surf(C_vec, sigma_vec, ErrorMatrix);

[i,j] = find(ismember(ErrorMatrix, min(ErrorMatrix(:))));
C = C_vec(i);
sigma = sigma_vec(j);

end
----------------------------------------------------------------------------------
function sim = gaussianKernel(x1, x2, sigma)
% returns a radial basis function (RBF) kernel between x1 and x2

% Ensure that x1 and x2 are column vectors
x1 = x1(:);
x2 = x2(:);
sim = exp( - ((x1 - x2)' * (x1 - x2)) / (2*sigma*sigma) );

end

Validation Curve to Find the Best Regularization Parameter Lambda

Function to create lambdas and their training errors and validation errors:

function [lambda_vec, error_train, error_val] = validationCurve(X, y, Xval, yval)

% Selected values of lambda (you should not change this)
lambda_vec = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10]';

% You need to return these variables correctly.
error_train = zeros(length(lambda_vec), 1);
error_val = zeros(length(lambda_vec), 1);

for i = 1:length(lambda_vec)
  lambda = lambda_vec(i);
  
  % Train the linear model with X, y and lambda
  [theta] = trainLinearReg(X, y, lambda);
  
  % Compute training error
  [J, grad] = linearRegCostFunction(X, y, theta, 0);  %lambda=0 for error computing
  error_train(i) = J;
  % Compute cross valication error
  [J, grad] = linearRegCostFunction(Xval, yval, theta, 0);  %lambda=0 for error computing
  error_val(i) = J;  
end

end

Plot them:
[lambda_vec, error_train, error_val] = validationCurve(X, y, Xvalidation, yvalidation);
plot(lambda_vec, error_train, lambda_vec, error_val);
legend('Train', 'Cross Validation');
xlabel('lambda');
ylabel('Error');

Lambda = 3 gives the smallest cross validation error.

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples:

This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used. Note: In modifications of SGD in the rest of this post, we leave out the parameters x(i:i+n);y(i:i+n) for simplicity.
In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad

Challenges

Vanilla mini-batch gradient descent, however, does not guarantee good convergence, but offers a few challenges that need to be addressed:
  • Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
  • Learning rate schedules [11] try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics [10].
  • Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
  • Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [19] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
http://sebastianruder.com/optimizing-gradient-descent/