# Journal

I was trying to implement multivariate linear regression using gradient descent[^1] in Octave today and got a bit stuck. It was actually a good thing because it made me work out the details and really understand what was happening. I often understand something at the conceptual level, and then get impatient and leave the details for later (later never happens, of course). Today I got stubborn and did manual matrix multiplication and plotted illustrative graphs and now I feel very comfortable with gradient descent.

The example question is: Given the square footage and number of bedrooms in a house, can we predict its price?

## Quick description of gradient descent

Imagine some hypothesis, $h(x) = y'$, where $x$ is your input features and $y'$ is your prediction. Define a cost function $J$ to describe the difference between your predictions and the actual values in a known training set. The most accurate hypothesis is the $h$ that gives the minimum value of $J$. We can find it by plotting $J$ as a function of the parameters of $h$ and finding the minimum value, i.e. where the derivative is 0.

In gradient descent, we start with an arbitrary $h$, calculate the derivative (the gradient) at that point, and go in the direction that takes us downhill (to 0) fastest.

## Some maths

Let:
$X =$ the $m \times n$ matrix of features, where $x_j^i$ is the value of feature $j$ in training example $i$, e.g. $x_1 =$ square footage
$\theta = [\theta_0, \theta_1, \theta_2, \dotsm, \theta_n]$ coefficients, i.e. parameters. $h$ is parameterised by $\theta$
$Y = [y^1, y^2, \dotsm, y^m] =$ expected values

There are $m$ training examples, and $n$ features.

$h_\theta(X) = \theta_0 + \theta_1 x_1 + \dotsm + \theta_n x_n = Y'$
$Y' - Y =$ difference between the predictions using a set of $\theta$s and the actual values of $Y$
$J(\theta) = \frac{1}{2} \cdot \frac{1}{m}\sum_{i = 1}^m (h_\theta(x^i) - y^i)^2$ the mean squared error, a way to take an average and get rid of negative values. We multiply by $\frac{1}{2}$ to make it easier to differentiate later. We sum across prediction errors for all $m$ training examples; this represents the total error of our hypothesis $h_\theta$
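As a quick sanity check of this formula, here is a minimal NumPy sketch (Python rather than Octave, and with made-up toy data, not the assignment's):

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared error J(theta), with the 1/2 factor for clean derivatives."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^i) - y^i for every example
    return (1 / (2 * m)) * np.sum(residuals ** 2)

# Tiny illustrative example: X already has a leading column of 1s for theta_0.
X = np.array([[1.0, 2.0], [1.0, 4.0]])
y = np.array([5.0, 9.0])
print(cost(np.array([1.0, 2.0]), X, y))  # theta = [1, 2] fits exactly, so J = 0
```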

The essence of the gradient descent method is to vary $\theta$ and recalculate $J(\theta)$, trying to find the minimum value.

This is how:

$\alpha$ = some learning rate, e.g. 0.03
repeat (updating all $\theta_j$ simultaneously) {
$\theta_0 = \theta_0 - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_0}$
$\theta_1 = \theta_1 - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_1}$
$\dotsm$
$\theta_n = \theta_n - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_n}$
}

for some number of iterations, or until a stopping condition is met. $J(\theta)$ should decrease throughout and eventually stabilise.
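The loop above can be sketched in NumPy (vectorised, so all the $\theta_j$ update simultaneously; the toy data and names here are mine, not the assignment's):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeatedly step every theta_j against the gradient of J."""
    m, n = X.shape
    theta = np.zeros(n)                        # arbitrary starting hypothesis
    for _ in range(iters):
        gradient = (X.T @ (X @ theta - y)) / m # dJ/dtheta_j for every j at once
        theta = theta - alpha * gradient       # simultaneous update of all thetas
    return theta

# Toy data on the line y = 1 + 2x, with a leading 1s column for theta_0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(gradient_descent(X, y))  # approaches [1, 2]
```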

## Now make it work!

I had already written an implementation of univariate linear regression that works with vectors, so I plugged it in, expecting it to work. Instead, I got NaN errors. Some investigation revealed that the gradient was oscillating between positive and negative (and getting further from 0 each time), and the error was increasing when it should have been decreasing. It turned out my maths was right, but this forced me to take the time to be sure:

$X * \theta = Y'$ ($m$ x $1$ size, same as $Y$)

$\frac{\partial J(\theta)}{\partial \theta_n} = \frac{1}{m} \sum_{i=1}^m [(h_\theta(x^i) - y^i) \cdot x_n^i]$

Noteworthy in the derivative is that we multiply by $x_n$, the feature corresponding to $\theta_n$ (for $\theta_0$, the corresponding $x_0$ is the constant 1). And we also sum across all $m$ training examples as usual. So we are working with a vector of gradients: one for each $\theta_n$, representing the error across all $m$ examples for that parameter.
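A habit that would have saved me some time here: checking the analytic gradient against a finite-difference approximation of $J$. A NumPy sketch with arbitrary made-up values (not the assignment code):

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared error J(theta)."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)

def analytic_gradient(theta, X, y):
    """The vector of dJ/dtheta_n: each error weighted by x_n, averaged over m."""
    m = len(y)
    return (X.T @ (X @ theta - y)) / m

# Compare against a central finite difference at an arbitrary theta.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
y = rng.normal(size=5)
theta = np.array([0.5, -1.0, 2.0])
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(analytic_gradient(theta, X, y) - numeric)))  # tiny: only finite-difference error remains
```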

I’ll leave out the actual implementation because that might violate Coursera’s honour code.

## Feature normalisation

So the maths was right, but my gradient was still swinging wildly between positive and negative, and $J(\theta)$ was increasing towards infinity. Tweaking the learning rate $\alpha$ and the number of iterations didn’t change anything. I started to imagine what that might look like, and realised that it was because the features had very different ranges, with square footage (in the thousands) compared to number of bedrooms (around 1 to 5). This gives $J(\theta)$ a very skinny elliptical shape, with the gradient heavily skewed in the “direction” of square footage. Every step of gradient descent was overshooting the minimum instead of converging towards it.

```
Running gradient descent …
gradient = -1.6920e+05 -3.7979e+08 -5.5684e+05   J = 3.3326e+23
gradient =  3.7992e+11  8.7748e+14  1.2674e+12   J = 1.7789e+36
gradient = -8.7778e+17 -2.0273e+21 -2.9282e+18   J = 9.4961e+48
gradient =  2.0280e+24  4.6840e+27  6.7653e+24   J = 5.0691e+61
gradient = -4.6856e+30 -1.0822e+34 -1.5631e+31   J = 2.7059e+74
```

J explodes, and the gradient swings between positive and negative.

To normalise the features, two steps:

- subtract the mean of all values from each value (per feature). This moves the mean to 0
- divide by the standard deviation. This makes the standard deviation 1
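A NumPy sketch of those two steps, applied to the first few rows of the dataset (a z-score per feature; note that Octave’s `std` divides by $m-1$ while NumPy’s divides by $m$, so the exact numbers differ slightly from the Octave output):

```python
import numpy as np

def normalise(X):
    """Per-feature z-score: subtract the mean, then divide by the standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# First five examples from the dataset: [square footage, bedrooms]
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0],
              [3000.0, 4.0]])
X_norm = normalise(X)
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # [1, 1]
```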

$X$ before and after normalisation (the 1s column is to allow for a constant parameter):

```
First 5 examples from the dataset:
x = [2104 3], y = 399900
x = [1600 3], y = 329900
x = [2400 3], y = 369000
x = [1416 2], y = 232000
x = [3000 4], y = 539900
Program paused. Press enter to continue.
Normalizing Features …
X =
   1.0000e+00   1.3001e-01  -2.2368e-01
   1.0000e+00  -5.0419e-01  -2.2368e-01
   1.0000e+00   5.0248e-01  -2.2368e-01
   1.0000e+00  -7.3572e-01  -1.5378e+00
   1.0000e+00   1.2575e+00   1.0904e+00
```

Now the gradient is more manageable and $J$ obediently decreases:

```
gradient = -1.4654e+04 -3.8848e+03 -3.3323e+02   J = 1.1619e+10
gradient = -1.3188e+04 -3.4863e+03 -8.7711e+01   J = 9.8516e+09
gradient = -1.1869e+04 -3.1403e+03  1.1194e+02   J = 8.4191e+09
gradient = -1.0682e+04 -2.8391e+03  2.7309e+02   J = 7.2572e+09
```

As an alternative to normalising the dataset, we can normalise the gradient instead. Without any normalisation, the problem is that the gradient vector is distorted by the overly large range of the input data. Imagine a plane that slopes gently in one direction and steeply in the other. We can normalise the gradient vector by dividing it by its own magnitude - this takes more iterations to converge, but the advantage is that the implementation is independent of the data set.

It looks like this:

```octave
gradient = gradient / norm(gradient)
```
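The effect, in a NumPy sketch with my own toy numbers: whatever the gradient’s magnitude, every update now moves $\theta$ by exactly $\alpha$.

```python
import numpy as np

def step(theta, X, y, alpha):
    """One gradient-descent step using only the direction of the gradient."""
    m = len(y)
    gradient = (X.T @ (X @ theta - y)) / m
    unit = gradient / np.linalg.norm(gradient)  # magnitude information discarded
    return theta - alpha * unit

# Starting from theta = 0, the step length always equals alpha.
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 3.0])
theta = step(np.zeros(2), X, y, alpha=0.5)
print(np.linalg.norm(theta))  # 0.5 exactly
```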

## How/why does normalisation work?

Let’s compare the methods. As a baseline, I calculate an expected value using the normal equation, which comes out at approximately 293081.
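For context, the normal equation gives the least-squares $\theta$ in closed form, with no iteration or learning rate. A NumPy sketch on toy data (not the house dataset):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solve (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data on the exact line y = 1 + 2x, so the solution recovers [1, 2].
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(normal_equation(X, y))  # [1. 2.]
```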

With gradient normalisation, we ignore $\alpha$ because the gradient regulates itself. It looks like J descends at almost a linear rate at first, and stabilises around 150 iterations with a prediction about 93% of the expected value. After that it decreases slowly but the prediction doesn’t reach the expected value even after 500,000 iterations.

With feature normalisation, J stabilises around 80 iterations and the prediction converges to the expected value.

With gradient and feature normalisation, the prediction again converges to the expected value but not until ~350,000 iterations, which is a couple of orders of magnitude slower - significant!

My hunch is that using gradient normalisation prevents convergence to the expected value because it forces the algorithm to take a fixed-size step in every iteration. As it gets close to the minimum, it steps around it instead of converging. Feature normalisation counteracts this somewhat because it evens out the $J$ function enough that there is a clear path to the minimum. But it still takes longer because normalising the gradient flattens out the gradient values near the minimum until they are pretty negligible.

I’m leaving that hunch unchecked because the time/payoff ratio doesn’t seem worthwhile, but at least knowing how the algorithm behaves under these different conditions is pretty useful.

**EDIT:**

Stephen Tu helpfully sent me a neat little proof that gradient descent always converges given the right step size (learning rate). I tried running again with no normalisation and a new, smaller step size of $1 \times 10^{-7}$, and it converged after about 6 iterations, although at ~93% of the expected value (similar to running with gradient normalisation). I was initially under the impression that gradient descent would always converge with a small enough step size, and had thought I was just confused - so it’s nice to have this cleared up.

I also experimented with different step sizes that converge, and made some surprising observations that would be too tedious to list here. Maybe an exercise for another day - plot the actual predictions as $\theta$ changes. The Coursera course is focused on getting you up and running with machine learning techniques, which is great, but I also feel like there is a lot more here I would like to understand.

[^1]: From Andrew Ng’s well-regarded machine-learning Coursera course, Week 2 programming assignment, optional section