Occam’s razor suggests that in machine learning, we should prefer simpler models with fewer coefficients over complex models like ensembles.

Taken at face value, the razor is a heuristic suggesting that more complex hypotheses make more assumptions, which in turn makes them too narrow and prevents them from generalizing well. In machine learning, it suggests that complex models like ensembles will overfit the training dataset and perform poorly on new data.

In practice, ensembles are almost universally the type of model chosen on projects where predictive skill is the most important consideration. Further, empirical results show a continued reduction in generalization error as the complexity of an ensemble learning model is incrementally increased. These findings are at odds with the Occam’s razor principle taken at face value.

In this tutorial, you will discover how to reconcile Occam’s Razor with ensemble machine learning.

After completing this tutorial, you will know:

  • Occam’s razor is a heuristic that suggests choosing simpler machine learning models as they are expected to generalize better.
  • The heuristic can be divided into two razors, one of which is true and remains a useful tool and the other that is false and should be abandoned.
  • Ensemble learning algorithms like boosting provide a specific case of how the second razor fails and added complexity can result in lower generalization error.

Let’s get started.

Ensemble Learning Algorithm Complexity and Occam’s Razor
Photo by dylan_odonnell, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Occam’s Razor for Model Selection
  2. Occam’s Two Razors for Machine Learning
  3. Occam’s Razor and Ensemble Learning

Occam’s Razor for Model Selection

Model selection is the process of choosing one from among possibly many candidate machine learning models for a predictive modeling project.

It is often straightforward to select a model based on its expected performance, e.g. choose the model with the highest accuracy or lowest prediction error.

Another important consideration is to choose simpler models over complex models.

Simpler models are typically defined as models that make fewer assumptions or have fewer elements, most commonly characterized as fewer coefficients (e.g. rules, layers, weights, etc.). The rationale for choosing simpler models is tied back to Occam’s Razor.
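As a rough illustration of this notion of complexity, the sketch below compares the number of fitted coefficients in a logistic regression to the total number of decision nodes across the trees of a random forest. It assumes scikit-learn and a small synthetic dataset, and the specific models and counts are illustrative only, not a standard definition of complexity.

# rough sketch: counting model "elements" as a crude proxy for complexity
# the dataset and the two models are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# small synthetic classification problem
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# a simpler model: one coefficient per input feature, plus an intercept
simple = LogisticRegression(max_iter=1000).fit(X, y)
n_simple = simple.coef_.size + simple.intercept_.size

# a more complex model: many trees, each with many decision nodes
ensemble = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
n_complex = sum(tree.tree_.node_count for tree in ensemble.estimators_)

print('logistic regression elements:', n_simple)
print('random forest decision nodes:', n_complex)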

The idea is that the best scientific theory is the smallest one that explains all the facts.

— Page 197, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Occam’s Razor is an approach to problem-solving and is commonly invoked to mean that if all else is equal, we should prefer the simpler solutions.

  • Occam’s Razor: If all else is equal, the simplest solution should be preferred.

It is named for William of Ockham and was proposed to counter ever more elaborate philosophy without equivalent increases in predictive power.

William of Occam’s famous razor states that “Nunquam ponenda est pluralitas sin necesitate,” which, approximately translated, means “Entities should not be multiplied beyond necessity”.

— Occam’s Two Razors: The Sharp and the Blunt, 1998.

It is not a rule but rather a heuristic for problem-solving, and it is commonly invoked in science to prefer simpler hypotheses that make fewer assumptions over more complex hypotheses that make more assumptions.

There is a long-standing tradition in science that, other things being equal, simple theories are preferable to complex ones. This is known as Occam’s Razor after the medieval philosopher William of Occam (or Ockham).

— Page 197, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

The problem with complex hypotheses that make more assumptions is that they are likely too specific.

They may include details of the specific cases that are at hand or easily available and, in turn, may not generalize to new cases. That is, the more assumptions a hypothesis has, the narrower we expect its application to be. Conversely, fewer assumptions suggest a more general hypothesis with greater predictive power across more cases.

  • Simple Hypothesis: Fewer assumptions, and in turn, broad applicability.
  • Complex Hypothesis: More assumptions, and in turn, narrow applicability.

This has implications in machine learning, as we are specifically trying to generalize to new unseen cases from specific observations, referred to as inductive reasoning.

If Occam’s Razor suggests that more complex models don’t generalize well, then in applied machine learning, it suggests we should choose simpler models as they will have lower prediction errors on new data.

If this is true, then how can we justify using an ensemble machine learning algorithm?

By definition, ensemble machine learning algorithms are more complex than a single machine learning model, as they are composed of many individual machine learning models.

Occam’s razor suggests that the added complexity of ensemble learning algorithms means that they will not generalize as well as simpler models fit on the same dataset.

Yet ensemble machine learning algorithms are the dominant solution when predictive skill on new data is the most important concern, such as in machine learning competitions. Ensembles have been studied at great length, and study after study has shown that they do not overfit the training dataset.

It has been empirically observed that certain ensemble techniques often do not overfit the model, even when the ensemble contains thousands of classifiers.

— Page 40, Pattern Classification Using Ensemble Methods, 2010.

How can this inconsistency be reconciled?

Occam’s Two Razors for Machine Learning

The conflict between the expectation of simpler models generalizing better in theory and complex models like ensembles generalizing better in practice was mostly ignored as an inconvenient empirical finding for a long time.

In the late 1990s, the problem was specifically studied by Pedro Domingos and published in the award-winning 1998 paper titled “Occam’s Two Razors: The Sharp and the Blunt,” and the follow-up 1999 journal article “The Role of Occam’s Razor in Knowledge Discovery.”

In this work, Domingos defines the problem as two specific, commonly asserted implications of Occam’s Razor in applied machine learning, which he refers to as “Occam’s Two Razors.” They are (taken from the paper):

  • First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself.
  • Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.

Domingos then enumerates a vast number of examples for and against each razor from both theory and empirical studies in machine learning.

The first razor suggests if two models have the same expected performance on data not seen during training, we should prefer the simpler model. Domingos highlights that this razor holds and provides a good heuristic on machine learning projects.

The second razor suggests that if two models have the same performance on a training dataset, then the simpler model should be chosen because it is expected to generalize better when used to make predictions on new data.

This seems sensible on the surface.

It is the argument behind not adopting ensemble algorithms in a machine learning project: they are very complex compared to other models and are therefore expected not to generalize as well.

It turns out that this razor cannot be supported by the evidence from the machine learning literature.

All of this evidence points to the conclusion that not only is the second razor not true in general; it is also typically false in the types of domains KDD has been applied to.

— Occam’s Two Razors: The Sharp and the Blunt, 1998.

Occam’s Razor and Ensemble Learning

The finding begins to sound intuitive once you mull it over for a while.

For example, in practice, we would not choose a machine learning model based on its performance on the training dataset alone. Intuitively, or perhaps after a lot of experience, we expect performance on a training set to be a poor estimate of performance on a holdout dataset.

We have this expectation because the model can overfit the training dataset.
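As a minimal sketch of this expectation (assuming scikit-learn, a synthetic dataset, and an unpruned decision tree, all chosen purely for illustration), a model with enough capacity can reach perfect accuracy on the data it was fit on while scoring noticeably lower on held-out data:

# sketch: training accuracy can be a poor estimate of holdout accuracy
# the dataset and model choice are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# an unpruned tree can memorize the training data
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print('train accuracy:', accuracy_score(y_train, model.predict(X_train)))  # typically 1.0
print('test accuracy:', accuracy_score(y_test, model.predict(X_test)))     # typically lower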

Yet, less intuitively, a model that already fits the training dataset perfectly can continue to improve its performance on a holdout test set as training continues and complexity grows. This has been observed many times in practice in systematic studies.

A common situation involves plotting the performance of a model on the training dataset and on a holdout test dataset at each iteration of learning, such as each training epoch or boosting iteration, for models that learn incrementally.

If learning on the training dataset is allowed to continue for a large number of training iterations and the curves are observed, it can often be seen that the error on the training dataset falls to zero. This is to be expected, as we might assume that the model will overfit the training dataset given enough capacity and time to train. Yet the performance on the test set continues to improve, even while the performance on the training set remains fixed at zero error.
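One way such curves might be produced is sketched below, assuming scikit-learn’s AdaBoostClassifier and a synthetic dataset (both assumptions for illustration). The staged_predict() method returns the ensemble’s predictions after each boosting iteration, so the error on both datasets can be tracked as members are added.

# sketch: track train and test error at each boosting iteration
# AdaBoostClassifier and the synthetic dataset are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# fit a boosting ensemble with many members
model = AdaBoostClassifier(n_estimators=500, random_state=1).fit(X_train, y_train)

# classification error after each boosting iteration
train_error = [1.0 - accuracy_score(y_train, yhat) for yhat in model.staged_predict(X_train)]
test_error = [1.0 - accuracy_score(y_test, yhat) for yhat in model.staged_predict(X_test)]

# plot both learning curves versus the number of ensemble members
pyplot.plot(train_error, label='train')
pyplot.plot(test_error, label='test')
pyplot.xlabel('boosting iterations')
pyplot.ylabel('classification error')
pyplot.legend()
pyplot.show()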

… occasionally, the generalization error would continue to improve long after the training error had reached zero.

— Page 40, Ensemble Methods in Data Mining, 2010.

This behavior can be observed with ensemble learning algorithms like boosting and bagging, where performance on the holdout dataset will continue to improve as additional model members are added to the ensemble.

One very surprising finding is that performing more boosting iterations can reduce the error on new data long after the classification error of the combined classifier on the training data has dropped to zero.

— Page 489, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

That is, the model complexity is incrementally increased, which systematically decreases the error on unseen data, i.e. the generalization error. The additional training cannot improve performance on the training dataset; there is no possible improvement left to make.

Performing more boosting iterations without reducing training error does not explain the training data any better, and it certainly adds complexity to the combined classifier.

— Page 490, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

This finding directly contradicts the second razor and supports Domingos’ argument about abandoning the second razor.

The first one is largely uncontroversial, while the second one, taken literally, is false.

— Occam’s Two Razors: The Sharp and the Blunt, 1998.

This problem has been studied and can generally be explained by the ensemble algorithm learning to be more confident in its predictions on the training dataset, a confidence that carries over to the holdout data.

The contradiction can be resolved by considering the classifier’s confidence in its predictions.

— Page 490, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

The first razor remains an important heuristic in applied machine learning.

The key aspect of this razor is the predicate “all else being equal.” That is, if two models are compared, they must be compared using their generalization error, measured on a holdout dataset or estimated using k-fold cross-validation. If their performance is equal under these circumstances, then the razor can come into effect and we can choose the simpler solution.
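A minimal sketch of applying the first razor is given below, assuming scikit-learn, a synthetic dataset, and an arbitrarily chosen pair of models; the “roughly equal” tolerance is also an assumption. Each model’s generalization error is estimated with k-fold cross-validation, and the simpler model is preferred only if the estimates are close.

# sketch: apply the first razor using k-fold cross-validation
# the models, dataset, and tolerance are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)

simple = LogisticRegression(max_iter=1000)
ensemble = RandomForestClassifier(n_estimators=100, random_state=1)

# estimate generalization performance with 10-fold cross-validation
simple_score = cross_val_score(simple, X, y, cv=10, scoring='accuracy').mean()
ensemble_score = cross_val_score(ensemble, X, y, cv=10, scoring='accuracy').mean()
print('simple model: %.3f, ensemble model: %.3f' % (simple_score, ensemble_score))

# all else being equal (here, within a small tolerance), prefer the simpler model
if abs(simple_score - ensemble_score) < 0.01:
    print('performance is roughly equal: prefer the simpler model')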

This is not the only way to choose models.

We may choose a simpler model because it is easier to interpret, and this remains valid if model interpretability is a more important project requirement than predictive skill.

Ensemble learning algorithms are unambiguously a more complex type of model when the number of model parameters is taken as the measure of complexity. This suggests that parameter counts may not be the right measure of complexity, and finding alternative measures of model complexity remains an open problem in machine learning.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • Occam’s Two Razors: The Sharp and the Blunt, 1998.
  • The Role of Occam’s Razor in Knowledge Discovery, 1999.

Books

  • Data Mining: Practical Machine Learning Tools and Techniques, 2016.
  • Pattern Classification Using Ensemble Methods, 2010.
  • Ensemble Methods in Data Mining, 2010.

Summary

In this tutorial, you discovered how to reconcile Occam’s Razor with ensemble machine learning.

Specifically, you learned:

  • Occam’s razor is a heuristic that suggests choosing simpler machine learning models as they are expected to generalize better.
  • The heuristic can be divided into two razors, one of which is true and remains a useful tool, and the other that is false and should be abandoned.
  • Ensemble learning algorithms like boosting provide a specific case of how the second razor fails and added complexity can result in lower generalization error.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.