What is a neural network

  • A “neuron” in a neural network is a function. It accepts some inputs, applies some calculations on them, and then returns a single number.
  • For regression and classification we use standard (fully connected) NNs; for image applications, convolutional NNs are more common; for 1D sequence data, such as text and audio, we use recurrent NNs.
  • Traditionally, computers have been better at understanding structured data (tables) than unstructured data, e.g. audio, pictures, and text; humans are the opposite. Thanks to deep learning, computers have become much better at understanding unstructured data.
  • Why has deep learning only recently started to take off? Because the amount of labeled data and the available computing power have increased.
  • With small training sets, the ordering of algorithms is not well defined; performance depends more on how you handle the features.

Basics of neural network programming

Classification

  • A binary classification problem determines whether an image shows, for example, a cat or not.
  • To create the inputs from an image, we concatenate all of its RGB pixel values into a single vector.
  • It is easier to implement a neural network if we consider our X (input) matrix column-wise, where each column is one training example and each row is a feature. In other ML implementations, we consider X row-wise.
  • Likewise, it is easier to stack y (output/labels) row-wise, instead of the common column-wise implementation.
  • Given the input \(X\) (with examples as columns) and parameters \(w\) and \(b\), where \(w \in \Re^{n}\) and \(b \in \Re\), the output has the form \(\hat{y} = w^T X + b\), which is a linear function of the input.
  • But since we have a binary classification problem, we want \(\hat{y}\) to be between zero and one.
  • Therefore, as in logistic regression, the output is a sigmoid applied to the linear function: \(\hat{y} = \sigma (w^T X + b)\).
  • The formula for Sigmoid function is: \(\sigma (z) = \frac{1}{1 + e^{-z}}\).
  • If \(z\) is very large, \(\sigma(z)\) will be very close to 1.
  • If \(z\) is a very large negative number, then \(\sigma(z)\) will be very close to 0.
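  • A minimal sketch of this prediction step in Julia (the names σ, predict, w, b, and X are illustrative, not from any library):

    # sigmoid squashes any real number into the interval (0, 1)
    σ(z) = 1 / (1 + exp(-z))

    # X is (n features × m samples); w is a length-n vector; b is a scalar
    predict(X, w, b) = σ.(w' * X .+ b)    # returns a 1 × m row of probabilities

    X = randn(3, 5); w = randn(3); b = 0.0   # toy data: 3 features, 5 samples
    ŷ = predict(X, w, b)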

Cost function

  • Our goal is \(\hat{y}^{(i)} \approx y^{(i)}\).
  • If you use a linear regression loss (error) function, i.e. mean squared error, your optimization problem becomes non-convex.
  • In logistic regression we used the following loss function: \(L(\hat{y}, y) = -(y log(\hat{y}) + (1-y) log(1-\hat{y}))\).
  • Intuition: if \(y=1\), then \(L(\hat{y}, y) = -log(\hat{y})\) (the second term, \((1-y) log(1-\hat{y})\), is zero). To make the loss small you want \(log(\hat{y})\) to be large, i.e. \(\hat{y}\) as large as possible, and with a sigmoid the largest possible value is 1.
  • If \(y=0\), the first term of the loss function, \(y log(\hat{y})\), is zero, so \(L(\hat{y}, y) = -log(1-\hat{y})\). To make the loss small you want \(log(1-\hat{y})\) to be large, i.e. \(\hat{y}\) as small as possible, and with a sigmoid it cannot be less than zero.
  • The loss function applies to each sample. The cost function averages all loss functions applied to each example.
  • Our cost function is \(C(w,b) = \frac{1}{m}\sum\limits_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})\), which when expanded is \(C(w,b) = -\frac{1}{m}\sum\limits_{i=1}^m [y^{(i)} log(\hat{y}^{(i)}) + (1-y^{(i)}) log(1-\hat{y}^{(i)})]\).
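  • Translated directly into Julia, the loss and cost could look like this (a sketch; ŷ and y here are plain vectors of predictions and labels, not library objects):

    using Statistics: mean

    # loss for a single example; the cost averages it over all m examples
    loss(ŷ, y) = -(y * log(ŷ) + (1 - y) * log(1 - ŷ))
    cost(ŷ, y) = mean(loss.(ŷ, y))

    cost([0.9, 0.2, 0.7], [1, 0, 1])   # small example: three predictions vs. three labels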

Gradient descent

  • GD changes the input parameters of the cost function in each step so that the output of the function approaches the minimum.
  • Specifically, at each iteration an input of the cost function, e.g. \(w\), is updated to \(w - \alpha \frac{dC(w)}{dw}\), where \(\alpha\) is the learning rate, which controls the step size.
  • The derivative is the slope of the function at a point (height over width). When the slope is positive, you subtract from \(w\) an amount proportional to the slope times \(\alpha\); when it is negative, you add to \(w\).
  • Since our cost function has two parameters, \(w\) and \(b\), we update both separately at each step: \(w = w - \alpha \frac{\partial C(w,b)}{\partial w}\) and \(b = b - \alpha \frac{\partial C(w,b)}{\partial b}\).
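  • As a sketch, one update step in Julia could look like this (the names gd_step, dw, db, and α are illustrative; the gradients are assumed to be computed elsewhere):

    # one gradient-descent step for a weight vector w and a scalar bias b,
    # given the gradients dw and db of the cost with respect to w and b
    function gd_step(w, b, dw, db; α = 0.01)
        w = w .- α .* dw
        b = b - α * db
        return w, b
    end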

Computation graph

  • Used for calculating derivatives.
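  • For example (a standard illustration, not from these notes): with \(J = 3(a + bc)\), the graph is built forward as \(u = bc\), \(v = a + u\), \(J = 3v\); going backward, the chain rule gives \(\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = 3 \cdot 1 = 3\).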

Gradient descent on m examples

  • The derivative of the cost function, i.e. the average of the loss over all training examples, is the average of the derivatives of the individual loss functions.
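  • In symbols: \(\frac{\partial C(w,b)}{\partial w} = \frac{1}{m}\sum\limits_{i=1}^m \frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial w}\), and likewise for \(b\).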

Shallow neural networks

  • In logistic regression we used a single neuron. If we include multiple neurons we get a layer. Multiple layers can be connected to one another, where the output of each layer becomes the input of the next layer. What happens inside each neuron is similar to the logistic regression example.
  • To vectorize your code, always remember that samples build the columns of a matrix and features build the rows. In the hidden layers, each node is a feature and the inputs are the samples.
  • Outputs of a hidden layer are the inputs (i.e. samples/columns) of its succeeding layer.
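  • A vectorized forward pass for one hidden layer plus an output layer might look like this in Julia (a sketch; the layer sizes and names are illustrative):

    σ(z) = 1 / (1 + exp(-z))

    # X: (n0 features × m samples); W1: (n1 × n0); b1: (n1 × 1); W2: (1 × n1); b2: (1 × 1)
    function forward(X, W1, b1, W2, b2)
        A1 = σ.(W1 * X .+ b1)    # hidden layer activations, (n1 × m)
        A2 = σ.(W2 * A1 .+ b2)   # output layer activations, (1 × m)
        return A2
    end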

Activation functions

  • The sigmoid function makes training slow. Almost never use it, except for the final node/layer of the network, where you want to output a binary classification.
  • Other more popular activation functions are tanh (hyperbolic tangent) and ReLU (Rectified Linear Units).
  • tanh is a builtin Julia function.
  • ReLU is defined as max(0, z)
  • A problem with ReLU is that it outputs zero for all negative values of z. A better version, although still not as popular as ReLU, is Leaky ReLU.
  • Leaky ReLU is defined as max(0.01z, z)
  • Why not use a linear activation function? Using linear activation functions (\(ax + b\)) in every layer makes the whole network collapse into a single linear function. Therefore, you must use nonlinear activation functions on all hidden layers.
  • For regression problems, you may use a linear activation function (or no activation function at all) on the output layer.
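  • In plain Julia these can be written as scalar functions and broadcast over arrays (a sketch; Flux also ships its own relu and leakyrelu):

    relu(z) = max(0, z)
    leaky_relu(z) = max(0.01z, z)

    relu.([-2.0, -0.5, 0.0, 3.0])         # => [0.0, 0.0, 0.0, 3.0]
    leaky_relu.([-2.0, -0.5, 0.0, 3.0])   # => [-0.02, -0.005, 0.0, 3.0]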

Random initialization

  • For a NN, it is important to initialize the weights with random numbers rather than all zeros.
  • The reason is that otherwise all hidden units in a layer would compute the same value and receive the same gradient, so even after back-propagation they would remain identical.
  • How to initialize randomly:
    1. Produce random numbers for the weights and multiply them by a small number, say 0.01, so that all initial weights are very small. We use small random values because learning is faster when the activation function is sigmoid or tanh (large inputs push these functions into their flat, slow-learning regions). The factor 0.01 is good for shallow networks, but deep networks need a more carefully chosen scale.
    2. The bias vector can have all zeros.
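  • A sketch of this initialization for one layer in Julia (the layer sizes are illustrative):

    n_prev, n_curr = 64, 20              # sizes of the previous and current layer
    W = 0.01 .* randn(n_curr, n_prev)    # small random weights, (n[l] × n[l-1])
    b = zeros(n_curr, 1)                 # biases can start at zero, (n[l] × 1)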

Matrix dimensions

  • X has dimensions \((n, m)\), where n is the number of features and m is the number of samples.
  • \(n^{[l]}\) denotes the number of features (neurons) at layer \(l\) of the network.
  • The number of features at the input layer, \(n^{[0]}\), equals the number of rows of X.
  • Each neuron in a layer outputs a single number; therefore, the output of a layer has dimensions \((n^{[l]}, 1)\) for a single example (and \((n^{[l]}, m)\) for m examples).
  • For the activation of each layer, we multiply the weights by inputs \(w_1 x_1 + w_2 x_2 + ... + w_n x_n\). The dimensions of the \(W\) matrix is \((n^{[l]}, n^{[l-1]})\).
  • The bias term \(b\) is always of size \((n^{[l]}, 1)\).
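  • These shapes can be checked with a toy example in Julia (sizes are illustrative: 4 input features, a hidden layer of 3, one output, 5 samples):

    X  = randn(4, 5)                       # (n[0] × m)
    W1 = randn(3, 4);  b1 = zeros(3, 1)    # (n[1] × n[0]), (n[1] × 1)
    W2 = randn(1, 3);  b2 = zeros(1, 1)    # (n[2] × n[1]), (n[2] × 1)

    A1 = tanh.(W1 * X .+ b1)               # (3 × 5)
    A2 = W2 * A1 .+ b2                     # (1 × 5)
    size(A1), size(A2)                     # => ((3, 5), (1, 5))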

Exercise

  1. Write a neural network with two inner layers of 20 and 15 neurons and one output layer with 10 neurons to recognize handwritten digits in the dataset below.

    Note that you will not have to use the one-versus-rest algorithm as we did in logistic regression. Instead, you can take the mean of the cost across all output units and all samples.

    The following steps help you in writing the algorithm:

    • Split the data between test and training.
    • One-hot-encode the y_test and y_train, so that you have binary outputs for each digit.
    • Write two activation functions: relu for inner layers, and logistic for the output layer. These functions can be defined for a single number. When you want to apply them to arrays, you can use broadcasting.
    • Write a linear function that accepts the following arguments: X for the data, W for the weights or slopes (the same as A in linear regression), and b for the intercepts. The body of the function contains the equation of a line.
    • Write a function called model that makes predictions given the following arguments: X and all your model parameters (in this case, since we have a 3 layer network, the parameters are: W1, b1, W2, b2, W3, b3). This function uses the linear and activation functions you have defined. The first layer accepts X, W1, and b1, applies the linear function on them and then the relu function. The second layer takes the outputs of the first layer, and again applies the linear and relu functions on them. Finally, the last layer takes the outputs of the second layer and applies the linear and logistic functions on them.
    • Write a cost function that takes two Real numbers, actual \(y\) and predicted \(\hat{y}\), and returns the cost. This can be the same as the logistic cost function.
    • Write another cost function that accepts all your neural network parameters and X and Y as arguments. The function makes a prediction and uses the previous cost function with broadcasting to calculate the average cost of all predictions.
    • Finally, write a function that uses gradient descent algorithm to tune the parameters. You can use automatic differentiation to get the gradient of the cost function with respect to each parameter:
        using Flux
        grads = gradient(() -> cost(X, Y, W1, b1, W2, b2, W3, b3), params(W1, b1, W2, b2, W3, b3))
        gradient_W1 = grads[W1]

    • After your network is trained, make predictions on X_test. What is the accuracy of your algorithm?
    using Flux
    using Flux: onehotbatch, onecold
    using ScikitLearn
    using Statistics: mean
    
    ### Load the data
    @sk_import datasets: load_digits
    @sk_import model_selection: train_test_split
    digits = load_digits();
    X = Float32.(transpose(digits["data"]));  # make the X Float32 to save memory
    y = digits["target"];
    Y = Int32.(onehotbatch(y, 0:9));
    nfeatures, nsamples = size(X)
    

    This is how the same model is built using Flux:

            
     using Flux
     using Flux: onecold, crossentropy, onehotbatch
     using ScikitLearn
     using Base.Iterators: repeated
     using Statistics: mean
     @sk_import datasets: load_digits
     @sk_import model_selection: train_test_split
     digits = load_digits();
     X = Float32.(transpose(digits["data"]));  # make the X Float32 to save memory
     y = digits["target"];
     Y = Int32.(onehotbatch(y, 0:9));
     nfeatures, nsamples = size(X)
    
     ## Split train and test
     X_train, X_test, y_train, y_test = train_test_split(X', y, train_size=0.80, stratify=y)
     X_train = X_train'
     X_test = X_test'
    
     ## One hot encode y
     Y_train = onehotbatch(y_train, 0:9)
     Y_test = onehotbatch(y_test, 0:9)
    
     ## Build a network. The output layer uses softmax activation, which suits multiclass classification problems.
     model = Chain(
       Dense(nfeatures, 20, Flux.relu),
       Dense(20, 15, Flux.relu),
       Dense(15, 10),
       softmax
     )
    
     ## cross entropy cost function
     cost(x, y) = crossentropy(model(x), y)
    
     opt = Descent(0.005)  # Choose gradient descent optimizer with alpha=0.005
     dataset = repeated((X_train, Y_train), 2000)  # repeat the dataset 2000 times, equivalent to running 2000 iterations of gradient descent
     Flux.train!(cost, params(model), dataset, opt)
    
     accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))
     accuracy(X_test, Y_test)
    
  2. Optional exercise. Build a network to classify research articles given their abstracts (download the train and test datasets). This is a text classification problem. Working with text is different from the other types of data we have seen so far: text samples are messy, and in this case the articles have different lengths. Converting them into a fixed-size format that can be used for machine learning is the first step; this is a feature extraction task. A simple approach for categorizing text input is the Bag of Words (BOW) method. It creates a list of all the vocabulary across all the samples. Next, it goes through each sample and either records the presence/absence of each word or also counts the frequency of the present words. Used like this, BOW discards any contextual information, i.e. the order of words. To also account for word order in a limited way, we can use N-grams: for example, to create 2-grams we do not record single words but each word paired with the word after it. This improves the performance of the algorithm significantly. Additionally, we could remove unspecific words such as articles and pronouns, remove punctuation and white space, convert everything to lower case, and take the stem of each word with the help of a stemming library (e.g. “paper” instead of “papers”, and “run” instead of “running”). In Julia, the TextAnalysis package handles all of these steps (see this example).