**Part Of:** Machine Learning sequence

**Followup To**: Prediction Machines: Regression vs Classification

**Content Summary**: 1400 words, 14 min read

Model Complexity

Last time, we discussed how to create **prediction machines**. For example, given an animal’s height and weight, we might inject these facts into a prediction machine to guess what kind of animal it is. While this sounds complicated, in this case prediction machines are simple region-color maps, like these:

These maps have different strengths: the first is cleaner, the second more precise. It turns out that this tradeoff is ubiquitous. Whenever we create a prediction machine, we run the risk of it being overly inaccurate, or overly messy. Let us call the former **underfitted**, the latter **overfitted**, and the tradeoff between them the **complexity tradeoff**.

We motivated the above example with a labeling problem (a **classification problem**). To help show how the above tradeoff generalizes, we move to a different class of problems entirely. A **regression problem** describes the work of inferring a continuous functional relationship from data. Concretely, it often involves reconstructing a function f(x) that fits the dataset.

Here is an example of the complexity tradeoff in a regression scenario:

The equations above describe the underlying models. The rightmost function is a polynomial terminating in x^{4}; the leftmost only goes up to x. For *any* polynomial { a*x^{n} + b*x^{n-1} + … + k }, n is called the **polynomial degree**. Polynomial degree is a proxy for **model complexity**: we want to say that f(x) = x^{4} is *more complex than* f(x) = x.
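To make the notion of polynomial degree concrete, here is a minimal sketch in Python. The `poly_eval` helper and the specific coefficient lists are illustrative inventions, not part of the post; they simply show that a degree-4 model and a degree-1 model are evaluated the same way, differing only in how many coefficients (and hence how much complexity) they carry:

```python
def poly_eval(coeffs, x):
    """Evaluate a*x^n + b*x^(n-1) + ... + k via Horner's rule.
    coeffs are ordered from the highest-degree term down."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# Degree-1 model: f(x) = x  (the leftmost, simplest model)
linear = [1.0, 0.0]
# Degree-4 model: f(x) = x^4  (the rightmost, most complex model)
quartic = [1.0, 0.0, 0.0, 0.0, 0.0]

print(poly_eval(linear, 2.0))   # 2.0
print(poly_eval(quartic, 2.0))  # 16.0
```

The degree is just `len(coeffs) - 1`: each extra coefficient buys the curve another potential bend.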

How Noise Defeats Complexity

For many, I suspect, the undesirability of underfitting is obvious: *of course* models with insufficient complexity fail to accurately represent reality. But why should overfitting (excess model complexity) be undesirable, *especially* since such complexity is so useful in reducing errors?

Consider again the x^{4} model above. What is it getting wrong? I say it fails to account for the nature of noise. For now, let us interpret noise as a band of probability density surrounding the blue line, one that drops off the further you diverge from its trajectory.

If the amount of noise in the system is very low, this “variance band” would be very narrow, and we would expect future data samples to reside very close to the predictions of our model. But noise is not always negligible: in such cases we must accommodate a wider variance band.
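The variance band can be simulated directly. In this sketch (the function `sample_with_noise` and the choice of f(x) = x² are my own illustrative assumptions), sigma controls the band's width: with a small sigma, samples hug the true curve; with a large one, they scatter widely around it:

```python
import random

def sample_with_noise(f, xs, sigma, seed=0):
    """Draw noisy observations y = f(x) + e, where e ~ N(0, sigma^2).
    sigma controls the width of the 'variance band' around f."""
    rng = random.Random(seed)
    return [f(x) + rng.gauss(0.0, sigma) for x in xs]

f = lambda x: x ** 2
xs = [i / 10 for i in range(11)]

narrow = sample_with_noise(f, xs, sigma=0.01)  # low noise: samples hug the curve
wide = sample_with_noise(f, xs, sigma=1.0)     # high noise: wide variance band
```

A model that fits the `wide` samples exactly is not tracing f at all; it is tracing the noise.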

Let us here divide the different causes of noise into two categories:

- We have an accurate model of the primary **data generation process** (DGP), but some unknown physical process subtly deflects our data from their intended path.
- We have a complete understanding of *all* data generation processes, but an idiosyncrasy within the sampling technology distorts our sensors.

Perhaps overfitting is best thought of as an exchange. By increasing model complexity, we purchase a reduced error rate on the data we have, at the price of narrowing our expected variance bands, implicitly betting that the noise is smaller than it really is.

Hyperparameters

In our classification example, complexity was the resolution of the kernel. In our regression example, complexity was the polynomial degree. Time to get explicit about the mathematics:

| Classification SVM Kernel | Polynomial Regression |
| --- | --- |
| K(x_{i}, x_{j}) = exp(-γ * \|\|x_{i} – x_{j}\|\|^{2}) | h(x) = Θ_{0} + Θ_{1}*x + Θ_{2}*x^{2} + … + Θ_{n}*x^{n} |

On the left, gamma (γ) controls the granularity of the kernel (“the width of the finger”). On the right, n controls the polynomial degree (“the number of bends in the curve”). Contrast these with, for example, the array of thetas (Θ): [Θ_{0}..Θ_{n}] are components of the model; they are **parameters**, trained from data. In contrast, γ & n represent **hyperparameters**: mathematical objects that govern the shape of the first-order parameters.
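The kernel formula above is small enough to compute by hand. A minimal sketch (the function name and sample points are my own): note that γ arrives as an argument, fixed before any training happens, whereas the thetas of a regression would be outputs of training:

```python
import math

def rbf_kernel(xi, xj, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    gamma is a hyperparameter: it shapes the kernel but is never learned from data."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

# A point close to the reference scores near 1; a distant point scores near 0.
near = rbf_kernel([0.0, 0.0], [0.1, 0.1], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [2.0, 2.0], gamma=0.5)
```

Raising γ makes the kernel's influence fall off faster around each support vector, which is exactly why large γ translates into a more intricate (more complex) decision boundary.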

Thus we have a new way of capturing complexity within an algorithm:

For these hyperparameters, large values create high complexity.

Data Partitioning

So you now appreciate the importance of striking a balance between precision and simplicity. That’s all very nice conceptually, but how might you go about *building* a well-balanced prediction machine? For example, given a support vector machine doing classification, how can you tell when it’s overfit (when gamma is too large)?

This is not straightforward, because variables like “simplicity” and “severity of noise” are not particularly **measurement-affine** (amenable to direct measurement). In practice, error is the lifeblood of machine learning. Suppose we are given 10 gigabytes of (historical) data to train our machine, and suppose we train our model parameters against the entire dataset, sampling from a range of possible hyperparameters.

The problem is that, measured against the data it trained on, the model’s error count can only ever decrease as complexity grows. It is *only* when you loose your prediction machine on live data – the test data, the stuff it didn’t train on – that the bitter fruits of overfitting make themselves known:

This graph represents errors vs complexity.

- The blue line represents errors while the model is being trained
- The red line represents errors when the model was released into the world

The complexity tradeoff is only apparent when the machine is given new data! “*If only* I had practiced against unseen test data earlier”, the statistician might say, “then I could have discovered where to set my hyperparameter before it was too late”.
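The diverging red and blue curves can be reproduced in a few lines. This sketch swaps in a k-nearest-neighbour regressor as a stand-in for the polynomial models (the regressor, the function names, and the data are my own illustrative choices; none appear in the post). Here small k plays the role of high complexity: at k=1 the model memorizes the training set, so its training error is exactly zero, which tells us nothing about how it will fare on unseen data:

```python
import random

def knn_predict(train, x, k):
    """Predict y at x by averaging the k nearest training points.
    Small k = high complexity (k=1 memorizes the training set)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_sq_error(model_data, eval_data, k):
    return sum((knn_predict(model_data, x, k) - y) ** 2
               for x, y in eval_data) / len(eval_data)

rng = random.Random(0)
f = lambda x: 2 * x
train = [(x, f(x) + rng.gauss(0, 1)) for x in [i / 10 for i in range(50)]]
test = [(x, f(x) + rng.gauss(0, 1)) for x in [i / 10 + 0.05 for i in range(50)]]

train_err_complex = mean_sq_error(train, train, k=1)   # memorization: exactly 0
train_err_simple = mean_sq_error(train, train, k=15)   # smoother model: > 0
test_err_complex = mean_sq_error(train, test, k=1)     # the noise comes due here
```

Training error alone would crown k=1 the winner every time; only the held-out `test` list can reveal the price of that memorization.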

Read the above regret again. It is the germinating seed of a truly enormous idea.

Many decades ago, some creative mind took the above regret and sought to reform it: “What stops me from treating some of my old, pre-processed data *as if* it were new? Can I not set part of my data aside, and use it to calibrate my model complexity?”

This approach, known as **data partitioning**, is now ubiquitous in the machine learning community. Datasets are typically split into 60-20-20 percent chunks:

With data partitions in place, parameter optimization is performed against the training data, and hyperparameter optimization is done against separate **cross-validation (CV) data**. Data partitioning automates protection from overfitting: an algorithm simply selects the hyperparameter values that correspond to the global minimum of the red curve above.
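The partitioning recipe itself is a few lines of code. A minimal sketch, with my own function names and a 60-20-20 split; `fit` and `cv_error` are hypothetical stand-ins for whatever training and error-measuring routines a given model uses:

```python
import random

def three_way_split(data, seed=0):
    """Shuffle, then split into 60% training, 20% cross-validation, 20% test."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a, b = int(0.6 * n), int(0.8 * n)
    return rows[:a], rows[a:b], rows[b:]

def pick_hyperparameter(candidates, fit, cv_error, train, cv):
    """Train parameters on `train` for each candidate hyperparameter value;
    keep the value whose model scores lowest on the CV partition."""
    return min(candidates, key=lambda h: cv_error(fit(train, h), cv))

data = list(range(100))
train, cv, test = three_way_split(data)
```

Note the division of labor: the training set tunes parameters, the CV set tunes hyperparameters, and the test set is touched only once, at the very end.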

Data Partitioning Example: The Penn Treebank

In closing, let’s look at how data partitioning is used in practice.

As a field, **natural language processing** deals with computational models of language. One subfield that has always interested me is the automatic construction of grammatical trees.

For example, the simple sentence “John loves Mary” is structured primarily as a sentence (S). This sentence is, in turn, composed of a noun phrase (NP) and a verb phrase (VP).

We can further differentiate different kinds of nouns and verbs:

- “John” is a singular proper noun (NNP),
- the verb phrase is headed by “loves”, a third-person singular, present-tense verb (VBZ),
- “Mary” is also a singular proper noun (NNP).

Thus, the sentence “John loves Mary” carries with it the following structure:
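Such a tree has a natural representation in code. A minimal sketch, using nested (label, children) tuples of my own devising (the NNP and VBZ tags follow Penn Treebank conventions; the `leaves` helper is illustrative):

```python
# The parse of "John loves Mary", as nested (label, children...) tuples.
tree = ("S",
        ("NP", ("NNP", "John")),
        ("VP",
         ("VBZ", "loves"),
         ("NP", ("NNP", "Mary"))))

def leaves(node):
    """Recover the sentence by collecting leaf words left to right."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    return [w for child in children for w in leaves(child)]

print(" ".join(leaves(tree)))  # John loves Mary
```

A treebank is, in essence, tens of thousands of such structures paired with their sentences.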

Possession of grammar trees is a form of power; they yield their bearer an improved ability to produce language translations, for example. But acquiring these trees is hard: it takes expertise in linguistics to construct such artifacts… is it possible to automate the task?

The answer is yes, of course, by the miracle of machine learning. After feeding a prediction machine enough known grammar trees, it becomes able to construct *novel* trees. But to get there, machine learning algorithms require *lots* of data… tens of thousands of pre-existing grammar trees… and no such database existed.

The solution? Hire a team of linguists, lock them in a room for months on end, and pay them to convert sentences into grammar trees until they are blue in the face. This really happened. The end result was a resource called the Penn Treebank.

After the Penn Treebank was constructed, hundreds of research articles were written, all drawing from this dataset. It was, of course, carefully split into three partitions (training, cross-validation, and test).

Scientists using the Treebank are expected to execute their code against the test set only *once*, in the final analysis of their systems. Despite these precautions, the community has noticed artifacts of overfitting creeping into its literature. While perhaps no one researcher has “looked at” the test set long enough to corrupt the community’s models, it seems that even one-time use has, over the decades, been “leaking” information about the “shape” of the test set. It may not be long before linguists are recruited to generate a “replacement dataset”, and thus secure meaningful research for the future.

Takeaways

We have learned two interesting facts:

- Explainers who solely optimize against prediction error are in a state of sin.
- Data partitions immunize abductive processes against overfitting.

In our next post, we will explore unsettling implications of these ideas.