Intro to Gradient Descent

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Parameter Space vs Feature Space

Let’s recall the equation of a line.

y = mx + b where m is the slope, and b is the y-intercept.

The equation of a line is a function that maps from inputs (x) to outputs (y). Internal to that model (the knobs inside the box) reside parameters like m and b that mold how the function works.

Any model k can be uniquely described by its parameters \{ m_k, b_k \}. Just as we can plot data in a data space using Cartesian coordinates (x, y), we can plot models in a parameter space using coordinates (m,b).

As we traverse parameter space, we can view the corresponding models in data space.

dualspace_explore.gif

As we proceed, it is very important to hold these concepts in mind.

Loss Functions

Consider the following two regression models. Which one is better?

GradientDescent_ Two Regression Models (1)

The answer comes easily: Model A. But as the data volume increases, choosing between models can become quite difficult. Is there a way to automate such comparisons?

Put another way, your judgment about model goodness is an intuition manufactured in your brain. Algorithms don’t have access to your intuitions. We need a loss function that translates intuitions into numbers.

Regression models are functions of the form \hat{y} = f(\textbf{x}) where x is the vector of features (predictors) used to generate the label (prediction). We can define error as \hat{y} - y. In fact, we typically reserve the word error for test data, and residual for train data. Here are the residuals for our two regression models:

GradientDescent_ Loss Function Residuals

The larger the residuals, the worse the model. Let’s use the residual vector to define a loss function. To do this, in the language of database theory, we need to aggregate the column down to a scalar. In the language of linear algebra, we need to compute the length of the vector.

Everyone agrees that residuals matter, when deriving the loss function. Not everyone agrees how to translate the residual vector into a single number. Let me add a couple examples to motivate:

Sum together all prediction errors.

But then an deeply flawed model with residual vector  [ -30, 30, -30, 30] earns the same score as a “perfect model” [0, 0, 0, 0]. The morale: positive and negative errors should not cancel each other out.

Sum together the magnitude of the prediction errors.

But then a larger dataset costs more than a small one. A good model against a large dataset with residual vector [ 1, -1, 1, -1, 1, -1 ] earns the same score as a poor model against small data [ 3, -3 ]. The morale: cost functions should be invariant to data volume.

Find the average magnitude of the prediction errors.

This loss function suffers from fewer bugs. It even has a name: Mean Absolute Error (MAE), also known as the L1-norm.

There are many other valid ways of defining a loss function that we will explore later. I  just used the L1-norm to motivate the topic.

Grading Parameter Space

Let’s return to the question of evaluating model performance. For the following five models, we intuitively judge their performance as steadily worsening:

dataspace_loss (2)

With loss functions, we convert these intuitive judgments into numbers. Let’s include these loss numbers in label space, and encode them as color.

dualspace_loss (2)

Still with me? Something important is going on here.

We have examined the loss of five models. What happens if we evaluate two hundred different models? One thousand? A million? With enough samples, we can gain a high-resolution view of the loss surface. This loss surface can be expressed with loss as color, or loss as height along the z-axis.

sampling_loss_surface (2)

In this case, the loss surface is convex: it is in the shape of a bowl.

Navigating the Loss Surface

The notion of a loss surface takes a while to digest. But it is worth the effort. The loss surface is the reason machine learning is possible.

By looking at a loss surface, you can visually identify the global minimum: the model instantiation with the least amount of loss. In our example above, that is (7,-14) encodes the model y = 7x - 14 with the smallest loss L(\theta) = 2.

Unfortunately, computing the loss surface is computationally intractable. It takes too long to calculate the loss of every possible model. How can we do better?

  1. Start with an arbitrary model.
  2. Figure out how to improve it.
  3. Repeat.

One useful metaphor for this kind of algorithm is a flashlight in the dark. We can’t see the entire landscape, but our flashlight provides information about our immediate surroundings.

flashlight_local_search

But what local information can we use to decide where to move in parameter space? Simple: the gradient (i.e., the slope)! If we move downhill in this bowl-life surface, we will come to a rest at the set of best parameters.

A ball rolling down a hill.

This is how gradient descent works, in both spaces:

This is how prediction machines learn from data.

Until next time.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s