Stacked generalization, or stacking, may be a less popular machine learning ensemble because it describes a framework more than a specific model. Perhaps the reason it has been less popular in mainstream machine learning is that it can be tricky to train a stacking model correctly without suffering data leakage. This has meant that the technique has mainly been used by highly skilled experts in high-stakes environments, such as machine learning competitions, and given new names like blending ensembles. Nevertheless, modern machine learning frameworks make stacking routine to implement and evaluate for classification and regression predictive modeling problems. As such, we can review ensemble learning methods related to stacking through the lens of the stacking framework. This broader family of stacking techniques can also help us see how to tailor the configuration of the technique later when exploring our own predictive modeling projects. In this tutorial, you will discover the essence of the stacked generalization approach to machine learning ensembles. After completing this tutorial, you will know:
  • The stacking ensemble method for machine learning uses a meta-model to combine predictions from contributing members.
  • How to distill the essential elements from the stacking method and how popular extensions like blending and the super learner ensemble are related.
  • How to devise new extensions to stacking by selecting new procedures for the essential elements of the method.
Let’s get started.
Essence of Stacking Ensembles for Machine Learning

Essence of Stacking Ensembles for Machine Learning Photo by Thomas, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:
  1. Stacked Generalization
  2. Essence of Stacking Ensembles
  3. Stacking Ensemble Family
    1. Voting Ensembles
    2. Weighted Average
    3. Blending Ensemble
    4. Super Learner Ensemble
  4. Customized Stacking Ensembles

Stacked Generalization

Stacked Generalization, or stacking for short, is an ensemble machine learning algorithm. Stacking involves using a machine learning model to learn how to best combine the predictions from contributing ensemble members. In voting, ensemble members are typically a diverse collection of model types, such as a decision tree, naive Bayes, and support vector machine. Predictions are made by combining the predictions, such as selecting the class with the most votes (the statistical mode) or the largest summed probability.
… (unweighted) voting only makes sense if the learning schemes perform comparably well.
— Page 497, Data Mining: Practical Machine Learning Tools and Techniques, 2016. An extension to voting is to weigh the contribution of each ensemble member in the prediction, providing a weighted sum prediction. This allows more weight to be placed on models that perform better on average and less on those that don’t perform as well but still have some predictive skill. The weight assigned to each contributing member must be learned, such as from the performance of each model on the training dataset or a holdout dataset. Stacking generalizes this approach and allows any machine learning model to be used to learn how to best combine the predictions from contributing members. The model that combines the predictions is referred to as the meta-model, whereas the ensemble members are referred to as base-models.
The problem with voting is that it is not clear which classifier to trust. Stacking tries to learn which classifiers are the reliable ones, using another learning algorithm—the metalearner—to discover how best to combine the output of the base learners.
— Page 497, Data Mining: Practical Machine Learning Tools and Techniques, 2016. In the language taken from the paper that introduced the technique, base models are referred to as level-0 learners, and the meta-model is referred to as a level-1 model. Naturally, the stacking of models can continue to any desired level.
Stacking is a general procedure where a learner is trained to combine the individual learners. Here, the individual learners are called the first-level learners, while the combiner is called the second-level learner, or meta-learner.
— Page 83, Ensemble Methods, 2012. Importantly, the way that the meta-model is trained is different from the way the base-models are trained. The input to the meta-model is the predictions made by the base-models, not the raw inputs from the dataset. The target is the same expected target value. The predictions made by the base-models used to train the meta-model are for examples not used to train the base-models, meaning that they are out of sample. For example, the dataset can be split into train, validation, and test datasets. Each base-model can then be fit on the training set and make predictions on the validation dataset. The predictions from the validation set are then used to train the meta-model. This means that the meta-model is trained to best combine the capabilities of the base-models when they are making out-of-sample predictions, e.g. examples not seen during training.
… we reserve some instances to form the training data for the level-1 learner and build level-0 classifiers from the remaining data. Once the level-0 classifiers have been built they are used to classify the instances in the holdout set, forming the level-1 training data.
— Page 498, Data Mining: Practical Machine Learning Tools and Techniques, 2016. Once the meta-model is trained, the base models can be re-trained on the combined training and validation datasets. The whole system can then be evaluated on the test set by passing examples first through the base models to collect base-level predictions, then passing those predictions through the meta-model to get final predictions. The system can be used in the same way when making predictions on new data. This approach to training, evaluating, and using a stacking model can be further generalized to work with k-fold cross-validation. Typically, base models are prepared using different algorithms, meaning that the ensembles are a heterogeneous collection of model types providing a desired level of diversity to the predictions made. However, this does not have to be the case, and different configurations of the same models can be used or the same model trained on different datasets.
The first-level learners are often generated by applying different learning algorithms, and so, stacked ensembles are often heterogeneous
— Page 83, Ensemble Methods, 2012. On classification problems, the stacking ensemble often performs better when base-models are configured to predict probabilities instead of crisp class labels, as the added uncertainty in the predictions provides more context for the meta-model when learning how to best combine the predictions.
… most learning schemes are able to output probabilities for every class label instead of making a single categorical prediction. This can be exploited to improve the performance of stacking by using the probabilities to form the level-1 data.
— Page 498, Data Mining: Practical Machine Learning Tools and Techniques, 2016. The meta-model is typically a simple linear model, such as a linear regression for regression problems or a logistic regression model for classification. Again, this does not have to be the case, and any machine learning model can be used as the meta learner.
… because most of the work is already done by the level-0 learners, the level-1 classifier is basically just an arbiter and it makes sense to choose a rather simple algorithm for this purpose. […] Simple linear models or trees with linear models at the leaves usually work well.
— Page 499, Data Mining: Practical Machine Learning Tools and Techniques, 2016. This is a high-level summary of the stacking ensemble method, yet we can generalize the approach and extract the essential elements.
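Before doing so, a minimal sketch may help ground the procedure. The following is one way to implement canonical stacking with scikit-learn's StackingClassifier, assuming a synthetic binary classification dataset; the specific base-models, the logistic regression meta-model, and the 5-fold internal cross-validation are illustrative choices, not prescriptions.

```python
# Minimal sketch: canonical stacking with scikit-learn (illustrative settings)
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic dataset standing in for a real problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# diverse base-models (level-0 learners)
base_models = [
    ('tree', DecisionTreeClassifier()),
    ('nb', GaussianNB()),
    ('svm', SVC(probability=True)),
]

# logistic regression meta-model (level-1 learner), trained on
# out-of-fold predicted probabilities from internal 5-fold cross-validation
model = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)

scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean accuracy: %.3f' % scores.mean())
```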

Essence of Stacking Ensembles

The essence of stacking is about learning how to combine contributing ensemble members. In this way, we might think of stacking as assuming that a simple “wisdom of crowds” (e.g. averaging) is good but not optimal and that better results can be achieved if we can identify and give more weight to experts in the crowd. The experts and lesser experts are identified based on their skill in new situations, e.g. on out-of-sample data. This is an important distinction from simple averaging and voting, although it introduces a level of complexity that makes the technique challenging to implement correctly and avoid data leakage, and in turn, incorrect and optimistic performance. Nevertheless, we can see that stacking is a very general ensemble learning approach. Broadly conceived, we might think of a weighted average of ensemble models as a generalization of and improvement upon voting ensembles, and stacking as a further generalization of a weighted average model. As such, the structure of the stacking procedure can be divided into three essential elements; they are:
  • Diverse Ensemble Members: Create a diverse set of models that make different predictions.
  • Member Assessment: Evaluate the performance of ensemble members.
  • Combine With Model: Use a model to combine predictions from members.
We can map canonical stacking onto these elements as follows:
  • Diverse Ensemble Members: Use different algorithms to fit each contributing model.
  • Member Assessment: Evaluate model performance on out-of-sample predictions.
  • Combine With Model: Machine learning model to combine predictions.
This provides a framework where we could consider related ensemble algorithms. Let’s take a closer look at other ensemble methods that may be considered a part of the stacking family.

Stacking Ensemble Family

Many ensemble machine learning techniques may be considered precursors or descendants of stacking. As such, we can map them onto our framework of essential stacking. This is a helpful exercise as it both highlights the differences between methods and the uniqueness of each technique. Perhaps more importantly, it may also spark ideas for additional variations that you may want to explore on your own predictive modeling project. Let’s take a closer look at four of the more common ensemble methods related to stacking.

Voting Ensembles

Voting ensembles are one of the simplest ensemble learning techniques. A voting ensemble typically involves using a different algorithm to prepare each ensemble member, much like stacking. Instead of learning how to combine predictions, a simple statistic is used. On regression problems, a voting ensemble may predict the mean or median of the predictions from ensemble members. For classification problems, the label with the most votes is predicted, called hard voting, or the label that receives the largest summed probability is predicted, called soft voting. The important difference from stacking is that there is no weighting of models based on their performance. All models are assumed to have the same skill level on average.
  • Member Assessment: Assume all models are equally skillful.
  • Combine with Model: Simple statistics.
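As a rough illustration, the sketch below shows hard and soft voting using scikit-learn's VotingClassifier on an assumed synthetic dataset; the member models are arbitrary illustrative choices.

```python
# Minimal sketch: hard vs. soft voting with scikit-learn (illustrative settings)
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

members = [('tree', DecisionTreeClassifier()),
           ('nb', GaussianNB()),
           ('svm', SVC(probability=True))]

# hard voting: predict the class label with the most votes
hard_vote = VotingClassifier(estimators=members, voting='hard')
# soft voting: predict the class label with the largest summed probability
soft_vote = VotingClassifier(estimators=members, voting='soft')

hard_vote.fit(X, y)
soft_vote.fit(X, y)
print(hard_vote.predict(X[:5]), soft_vote.predict(X[:5]))
```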

Weighted Average Ensemble

A weighted average might be considered one step above a voting ensemble. Like stacking and voting ensembles, a weighted average uses a diverse collection of model types as contributing members. Unlike voting, a weighted average assumes that some contributing members are better than others and weighs contributions from models accordingly. The simplest weighted average ensemble weighs each model based on its performance on a training dataset. An improvement over this naive approach is to weigh each member based on its performance on a hold-out dataset, such as a validation set or out-of-fold predictions during k-fold cross-validation. One step further might involve tuning the coefficient weightings for each model using an optimization algorithm and performance on a holdout dataset. These continued improvements of a weighted average model begin to resemble a primitive stacking model with a linear model trained to combine the predictions.
  • Member Assessment: Member performance on training dataset.
  • Combine With Model: Weighted average of predictions.
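A minimal sketch of this idea, assuming each member is weighted by its accuracy on a holdout validation set (one simple choice among many), might look as follows; the dataset and member models are illustrative only.

```python
# Minimal sketch: weighted average ensemble, weighting by holdout accuracy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# synthetic dataset standing in for a real problem
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=1)

members = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
for model in members:
    model.fit(X_train, y_train)

# weight each member by its accuracy on the holdout validation set
weights = [model.score(X_val, y_val) for model in members]

# weighted average of predicted probabilities on the test set
probas = np.array([model.predict_proba(X_test) for model in members])
avg = np.average(probas, axis=0, weights=weights)
yhat = np.argmax(avg, axis=1)
print('Weighted average accuracy: %.3f' % np.mean(yhat == y_test))
```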

Blending Ensemble

Blending is explicitly a stacked generalization model with a specific configuration. A limitation of stacking is that there is no generally accepted configuration. This can make the method challenging for beginners, as essentially any models can be used as the base-models and meta-model, and any resampling method can be used to prepare the training dataset for the meta-model. Blending is a specific stacking ensemble that makes two prescriptions. The first is to use a holdout validation dataset to prepare the out-of-sample predictions used to train the meta-model. The second is to use a linear model as the meta-model. The technique was born out of the requirements of practitioners working on machine learning competitions that involve the development of a very large number of base learner models, perhaps from different sources (or teams of people), that in turn may be too computationally expensive and too challenging to coordinate to validate using the k-fold cross-validation partitions of the dataset.
  • Member Predictions: Out-of-sample predictions on a validation dataset.
  • Combine With Model: Linear model (e.g. linear regression or logistic regression).
Given the popularity of blending ensembles, stacking has sometimes come to specifically refer to the use of k-fold cross-validation to prepare out-of-sample predictions for the meta-model.
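The sketch below shows one manual implementation of blending under these two prescriptions, assuming a synthetic dataset and an arbitrary choice of base-models; the data splits and model settings are illustrative only.

```python
# Minimal sketch: blending with a holdout validation set (illustrative settings)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# split the training data again into a base-model training set and a holdout validation set
X_base, X_val, y_base, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)

base_models = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
for model in base_models:
    model.fit(X_base, y_base)

# meta-model training data: out-of-sample predicted probabilities on the validation set
meta_X = np.hstack([m.predict_proba(X_val) for m in base_models])
meta_model = LogisticRegression()
meta_model.fit(meta_X, y_val)

# evaluate: pass test examples through the base models, then the meta-model
meta_X_test = np.hstack([m.predict_proba(X_test) for m in base_models])
print('Blending accuracy: %.3f' % meta_model.score(meta_X_test, y_test))
```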

Super Learner Ensemble

Like blending, the super learner ensemble is a specific configuration of a stacking ensemble. The meta-model in super learning is prepared using out-of-fold predictions for the base learners collected during k-fold cross-validation. As such, we might think of the super learner ensemble as a sibling to blending where the main difference is the choice of how out-of-sample predictions are prepared for the meta learner.
  • Diverse Ensemble Members: Use different algorithms and different configurations of the same algorithms.
  • Member Assessment: Out-of-fold predictions during k-fold cross-validation.
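A minimal sketch of this style of procedure, using scikit-learn's cross_val_predict to collect out-of-fold probabilities on an assumed synthetic dataset, might look as follows; the base-models and the number of folds are illustrative choices.

```python
# Minimal sketch: super learner style stacking with out-of-fold predictions
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

base_models = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]

# collect out-of-fold predicted probabilities for each base model via k-fold CV
meta_X = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=10, method='predict_proba')
    for m in base_models
])
meta_model = LogisticRegression()
meta_model.fit(meta_X, y_train)

# refit each base model on all of the training data for making new predictions
for m in base_models:
    m.fit(X_train, y_train)

meta_X_test = np.hstack([m.predict_proba(X_test) for m in base_models])
print('Super learner accuracy: %.3f' % meta_model.score(meta_X_test, y_test))
```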

Customized Stacking Ensembles

We have looked at canonical stacking as a framework for combining predictions from a diverse collection of model types. Stacking is a broad method, which can make it hard to start using. We can see how voting ensembles and weighted average ensembles are a simplification of the stacking method, and blending ensembles and the super learner ensemble are a specific configuration of stacking. This review highlighted that the focus of different stacking approaches is on the sophistication of the meta-model, such as using statistics, a weighted average, or a true machine learning model. The focus has also been on the manner in which the meta-model is trained, e.g. on out-of-sample predictions from a validation dataset or k-fold cross-validation. An underexplored area of stacking might be the diversity of the ensemble members beyond simply using different algorithms. Stacking is not prescriptive in the types of models it uses, compared to boosting and bagging, which both prescribe using decision trees. This allows for a lot of flexibility in customizing and exploring the use of the method on a dataset. For example, we could imagine fitting a large number of decision trees on bootstrap samples of the training dataset, as we do in bagging, then testing a suite of different models to learn how to best combine the predictions from the trees.
  • Diverse Ensemble Members: Decision trees trained on bootstrap samples.
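A rough sketch of this idea is shown below, assuming a synthetic dataset, ten bootstrap-sampled decision trees, and a logistic regression meta-model; all of these choices are illustrative rather than recommended.

```python
# Rough sketch: decision trees fit on bootstrap samples, combined by a meta-model
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=1)

# fit each tree on a different bootstrap sample of the base training data, as in bagging
trees = []
for i in range(10):
    Xb, yb = resample(X_base, y_base, replace=True, random_state=i)
    trees.append(DecisionTreeClassifier().fit(Xb, yb))

# the meta-model learns to combine the trees' out-of-sample probabilities
meta_X_val = np.hstack([tree.predict_proba(X_val) for tree in trees])
meta_model = LogisticRegression().fit(meta_X_val, y_val)

# evaluate the whole system on data unseen by both levels
meta_X_test = np.hstack([tree.predict_proba(X_test) for tree in trees])
print('Stacked bagged trees accuracy: %.3f' % meta_model.score(meta_X_test, y_test))
```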
Alternatively, we can imagine grid searching a large number of configurations for a single machine learning model, which is common on a machine learning project, and keeping all of the fit models. These models could then be used as members in a stacking ensemble.
  • Diverse Ensemble Members: Different configurations of the same algorithm.
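For example, a minimal sketch using different values of k for a k-nearest neighbors classifier as the members of a StackingClassifier might look as follows; the algorithm and the grid of configurations are arbitrary illustrative choices.

```python
# Minimal sketch: different configurations of one algorithm as stacking members
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# members differ only in a hyperparameter, e.g. the number of neighbors
members = [('knn%d' % k, KNeighborsClassifier(n_neighbors=k)) for k in (1, 3, 5, 7, 9)]
model = StackingClassifier(estimators=members, final_estimator=LogisticRegression(), cv=5)

scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean accuracy: %.3f' % scores.mean())
```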
We might also see the “mixture of experts” technique as fitting into the stacking method. Mixture of experts, or MoE for short, is a technique that explicitly partitions a problem into subproblems and trains a model on each subproblem, then uses a model to learn how to best weigh or combine the predictions from the experts. The important differences between stacking and mixture of experts are the explicit divide-and-conquer approach of MoE and the more complex manner in which predictions are combined using a gating network. Nevertheless, we can imagine partitioning an input feature space into a grid of subspaces, training a model on each subspace, and using a meta-model that takes the predictions from the base-models as well as the raw input sample and learns which base-model to trust or weigh the most conditional on the input data.
  • Diverse Ensemble Members: Partition input feature space into uniform subspaces.
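A very rough sketch of this idea follows, assuming a crude quantile-based partition of a single input feature into three subspaces, one decision-tree expert per subspace, and a meta-model that sees both the experts' probabilities and the raw inputs. This is not a true mixture of experts with a gating network, and a careful implementation would use out-of-sample predictions for the meta-model as discussed above.

```python
# Rough sketch: experts trained on partitions of the feature space, plus a meta-model
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# crude partition: bin the first input feature into three quantile-based subspaces
edges = np.quantile(X[:, 0], [1 / 3, 2 / 3])
bins = np.digitize(X[:, 0], edges)

# one expert decision tree per subspace
experts = []
for b in range(3):
    idx = bins == b
    experts.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# the meta-model sees every expert's probabilities plus the raw inputs
# (for simplicity this uses in-sample predictions; out-of-sample would avoid leakage)
expert_probs = np.hstack([e.predict_proba(X) for e in experts])
meta_X = np.hstack([expert_probs, X])
meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y)
print('Meta-model input width: %d' % meta_X.shape[1])
```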
This could be further extended to first select the one model type that performs well among many for each subspace, keeping only those top-performing experts for each subspace, then learning how to best combine their predictions. Finally, we might think of the meta-model as a correction of the base models. We might explore this idea and have multiple meta-models attempt to correct overlapping or non-overlapping pools of contributing members, with additional layers of models stacked on top of them. This deeper stacking of models is sometimes used in machine learning competitions and can become complex and challenging to train, but may offer additional benefit on prediction tasks where better model skill vastly outweighs the ability to interpret the model. We can see that the generality of the stacking method leaves a lot of room for experimentation and customization, where ideas from boosting and bagging may be incorporated directly.
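As one final sketch, deeper stacking can be approximated by nesting ensembles, for example by using one StackingClassifier as the meta-model of another; the models and settings below are illustrative only, and such configurations can quickly become expensive to train and evaluate.

```python
# Minimal sketch: deeper stacking by using another stacking ensemble as the meta-model
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# a small stacking ensemble used as the combiner (second layer)
combiner = StackingClassifier(
    estimators=[('svm', SVC(probability=True)), ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(), cv=5)

# level-0 members feed their predictions into the stacked combiner
model = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()), ('nb', GaussianNB())],
    final_estimator=combiner, cv=5)

scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
print('Mean accuracy: %.3f' % scores.mean())
```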

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Books
  • Data Mining: Practical Machine Learning Tools and Techniques, 2016.
  • Ensemble Methods, 2012.

Summary

In this tutorial, you discovered the essence of the stacked generalization approach to machine learning ensembles. Specifically, you learned:
  • The stacking ensemble method for machine learning uses a meta-model to combine predictions from contributing members.
  • How to distill the essential elements from the stacking method and how popular extensions like blending and the super learner ensemble are related.
  • How to devise new extensions to stacking by selecting new procedures for the essential elements of the method.
Do you have any questions? Ask your questions in the comments below and I will do my best to answer.