Essence of Boosting Ensembles for Machine Learning
By Nick Cotes
Boosting is a powerful and popular class of ensemble learning techniques.
Historically, boosting algorithms were challenging to implement, and it was not until AdaBoost demonstrated how to implement boosting that the technique could be used effectively. AdaBoost and modern gradient boosting work by sequentially adding models that correct the residual prediction errors of the model. As such, boosting methods are known to be effective, but constructing models can be slow, especially for large datasets.
More recently, extensions designed for computational efficiency have made the methods fast enough for broader adoption. Open-source implementations, such as XGBoost and LightGBM, have meant that boosting algorithms have become the preferred and often top-performing approach in machine learning competitions for classification and regression on tabular data.
In this tutorial, you will discover the essence of boosting to machine learning ensembles.
After completing this tutorial, you will know:
The boosting ensemble method for machine learning incrementally adds weak learners trained on weighted versions of the training dataset.
The essential idea that underlies all boosting algorithms and the key approach used within each boosting algorithm.
How the essential ideas that underlie boosting could be explored on new predictive modeling projects.
Kick-start your project with my new book Ensemble Learning Algorithms With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
Essence of Boosting Ensembles for Machine Learning Photo by Armin S Kowalski, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
Boosting Ensembles
Essence of Boosting Ensembles
Family of Boosting Ensemble Algorithms
AdaBoost Ensembles
Classic Gradient Boosting Ensembles
Modern Gradient Boosting Ensembles
Customized Boosting Ensembles
Boosting Ensembles
Boosting is a powerful ensemble learning technique.
As such, boosting is popular and may be the most widely used ensemble technique at the time of writing.
Boosting is one of the most powerful learning ideas introduced in the last twenty years.
— Page 337, The Elements of Statistical Learning, 2016.
As an ensemble technique, it can read and sound more complicated than sibling methods, such as bootstrap aggregation (bagging) and stacked generalization (stacking). The implementations can, in fact, be quite complicated, yet the ideas that underlie boosting ensembles are very simple.
Boosting can be understood by contrasting it to bagging.
In bagging, an ensemble is created by drawing multiple different samples of the same training dataset and fitting a decision tree on each. Given that each sample of the training dataset is different, each decision tree is different, in turn making slightly different predictions and prediction errors. The predictions of all of the decision trees are combined, resulting in lower error than fitting a single tree.
Boosting operates in a similar manner. Multiple trees are fit on different versions of the training dataset, and the predictions from the trees are combined using simple voting for classification or averaging for regression, resulting in a better prediction than fitting a single decision tree.
… boosting […] combines an ensemble of weak classifiers using simple majority voting …
— Page 13, Ensemble Machine Learning, 2012.
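To make the contrast concrete, the sketch below fits a bagging ensemble and a boosting ensemble of decision trees on the same data and compares their cross-validated accuracy. The synthetic dataset, ensemble sizes, and choice of scikit-learn estimators are illustrative assumptions, not values from this tutorial.

```python
# A minimal sketch contrasting bagging and boosting on the same assumed
# synthetic dataset; hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# bagging: trees fit independently on bootstrap samples of the dataset
bagging = BaggingClassifier(n_estimators=50, random_state=1)
# boosting: trees fit sequentially on a re-weighted version of the dataset
boosting = AdaBoostClassifier(n_estimators=50, random_state=1)

for name, model in [('bagging', bagging), ('boosting', boosting)]:
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
    print(name, round(scores.mean(), 3))
```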
There are some important differences; they are:
Instances in the training set are assigned a weight based on difficulty.
Learning algorithms must pay attention to instance weights.
Ensemble members are added sequentially.
The first difference is that the same training dataset is used to train each decision tree. No sampling of the training dataset is performed. Instead, each example in the training dataset (each row of data) is assigned a weight based on how easy or difficult the ensemble finds that example to predict.
The main idea behind this algorithm is to give more focus to patterns that are harder to classify. The amount of focus is quantified by a weight that is assigned to every pattern in the training set.
— Pages 28-29, Pattern Classification Using Ensemble Methods, 2010.
This means that rows that are easy for the ensemble to predict correctly are given a small weight, and rows that are difficult to predict correctly are given a much larger weight.
Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
— Page 322, An Introduction to Statistical Learning with Applications in R, 2014.
The second difference from bagging is that the base learning algorithm, e.g. the decision tree, must pay attention to the weightings of the training dataset. In turn, this means that boosting is specifically designed to use decision trees as the base learner, or other algorithms that support a weighting of rows when constructing the model.
The construction of the model must pay more attention to training examples in proportion to their assigned weight. This means that ensemble members are constructed in a biased way to make (or work hard to make) correct predictions on heavily weighted examples.
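As a small illustration of what "supporting a weighting of rows" means in practice, scikit-learn decision trees accept a sample_weight argument when fitting, which is the hook a boosting procedure can use. The dataset and the weight values below are arbitrary choices for illustration.

```python
# Illustrative only: a decision tree that honors per-row weights via the
# sample_weight argument, as a boosting procedure requires of its base learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=1)

# pretend the ensemble found the last 20 rows hard to predict
weights = np.ones(len(y))
weights[-20:] = 5.0

# tree construction now pays proportionally more attention to those rows
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y, sample_weight=weights)
```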
Finally, the boosting ensemble is constructed slowly. Ensemble members are added sequentially, one, then another, and so on, until the ensemble has the desired number of members.
Importantly, the weighting of the training dataset is updated based on the capability of the entire ensemble after each ensemble member is added. This ensures that each subsequently added member works hard to correct errors made by the whole model on the training dataset.
In boosting, however, the training dataset for each subsequent classifier increasingly focuses on instances misclassified by previously generated classifiers.
— Page 13, Ensemble Machine Learning, 2012.
The contribution of each model to the final prediction is weighted based on the performance of that model, e.g. a weighted average or weighted vote.
This incremental addition of ensemble members to correct errors on the training dataset sounds like it would eventually overfit the training dataset. In practice, boosting ensembles can overfit the training dataset, but often the effect is subtle and overfitting is not a major problem.
Unlike bagging and random forests, boosting can overfit if [the number of trees] is too large, although this overfitting tends to occur slowly if at all.
— Page 323, An Introduction to Statistical Learning with Applications in R, 2014.
This is a high-level summary of the boosting ensemble method and closely resembles the AdaBoost algorithm, yet we can generalize the approach and extract the essential elements.
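To make that summary concrete, here is a from-scratch sketch in the spirit of AdaBoost: decision stumps are added sequentially, each fit on a re-weighted training set, and their predictions are combined with a weighted vote. The dataset, the 50 members, and the specific weight-update rule are illustrative assumptions, not a reference implementation of any library.

```python
# A teaching sketch of the boosting loop described above (AdaBoost-style),
# assuming a binary problem with labels mapped to {-1, +1}.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
y = np.where(y == 0, -1, 1)

n_rows = X.shape[0]
weights = np.full(n_rows, 1.0 / n_rows)  # start with equal instance weights
stumps, alphas = [], []

for _ in range(50):  # add 50 ensemble members sequentially
    # fit a weak learner (decision stump) on the weighted dataset
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # weighted error rate of this member on the training set
    miss = (pred != y).astype(float)
    error = np.clip(np.sum(weights * miss) / np.sum(weights), 1e-10, 1 - 1e-10)

    # contribution (vote weight) of this member in the final ensemble
    alpha = 0.5 * np.log((1.0 - error) / error)

    # increase the weight of rows this member got wrong, decrease the rest
    weights *= np.exp(-alpha * y * pred)
    weights /= np.sum(weights)

    stumps.append(stump)
    alphas.append(alpha)

# final prediction is a weighted vote over all members
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
yhat = np.sign(scores)
print('train accuracy: %.3f' % np.mean(yhat == y))
```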
Want to Get Started With Ensemble Learning?
Take my free 7-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Download Your FREE Mini-Course
Essence of Boosting Ensembles
The essence of boosting sounds like it might be about correcting predictions.
This is how all modern boosting algorithms are implemented, and it is an interesting and important idea. Nevertheless, correcting prediction errors might be considered an implementation detail for achieving boosting (a large and important detail) rather than the essence of the boosting ensemble approach.
The essence of boosting is the combination of multiple weak learners into a strong learner.
The strategy of boosting, and ensembles of classifiers, is to learn many weak classifiers and combine them in some way, instead of trying to learn a single strong classifier.
— Page 35, Ensemble Machine Learning, 2012.
A weak learner is a model that has very modest skill, often meaning that its performance is slightly above that of a random classifier for binary classification, or slightly better than predicting the average value for regression. Traditionally, this means a decision stump, which is a decision tree that considers one value of one variable and makes a prediction.
A weak learner (WL) is a learning algorithm capable of producing classifiers with probability of error strictly (but only slightly) less than that of random guessing …
— Page 35, Ensemble Machine Learning, 2012.
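A quick way to see what "weak" means in practice is to compare a single decision stump to a majority-class baseline. The synthetic dataset below is an assumption for illustration; the exact scores will vary, but the stump should land only modestly above the baseline.

```python
# Compare a single decision stump against a majority-class baseline
# on an assumed synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

baseline = DummyClassifier(strategy='most_frequent')
stump = DecisionTreeClassifier(max_depth=1)  # one split on one variable

for name, model in [('baseline', baseline), ('stump', stump)]:
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
    print(name, round(scores.mean(), 3))
```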
A weak learner can be contrasted to a strong learner that performs well on a predictive modeling problem. In general, we seek a strong learner to address a classification or regression problem.
… a strong learner (SL) is able (given enough training data) to yield classifiers with arbitrarily small error probability.
— Page 35, Ensemble Machine Learning, 2012.
Although we seek a strong learner for a given predictive modeling problem, strong learners are challenging to train, whereas weak learners are very fast and easy to train.
Boosting recognizes this distinction and proposes explicitly constructing a strong learner from multiple weak learners.
Boosting is a class of machine learning methods based on the idea that a combination of simple classifiers (obtained by a weak learner) can perform better than any of the simple classifiers alone.
— Page 35, Ensemble Machine Learning, 2012.
Many approaches to boosting have been explored, but only one has been truly successful. That is the method described in the previous section, where weak learners are added sequentially to the ensemble to specifically address or correct the residual errors for regression, or the class label prediction errors for classification. The result is a strong learner.
Let’s take a closer look at ensemble methods that may be considered part of the boosting family.
Family of Boosting Ensemble Algorithms
There are a large number of boosting ensemble learning algorithms, although they all work in generally the same way.
Namely, they involve sequentially adding simple base learner models that are trained on (re-)weighted versions of the training dataset.
The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners.
— Page 23, Ensemble Methods, 2012.
We might consider three main families of boosting methods; they are: AdaBoost, Classic Gradient Boosting, and Modern Gradient Boosting.
The division is somewhat arbitrary, as there are techniques that may span all groups, or implementations that can be configured to realize an example from each group and even bagging-based methods.
AdaBoost Ensembles
Initially, naive boosting methods explored training weak classifiers on separate samples of the training dataset and combining the predictions.
These methods were not successful compared to bagging.
Adaptive Boosting, or AdaBoost for short, was the first successful implementation of boosting.
Researchers struggled for a time to find an effective implementation of boosting theory, until Freund and Schapire collaborated to produce the AdaBoost algorithm.
— Page 204, Applied Predictive Modeling, 2013.
It was not just a successful realization of the boosting principle; it was an effective algorithm for classification.
Boosting, especially in the form of the AdaBoost algorithm, was shown to be a powerful prediction tool, usually outperforming any individual model. Its success drew attention from the modeling community and its use became widespread …
— Page 204, Applied Predictive Modeling, 2013.
Although AdaBoost was developed initially for binary classification, it was later extended for multi-class classification and regression, along with a myriad of other extensions and specialized versions.
It was the first method to maintain the same training dataset and to introduce weighting of training examples, along with the sequential addition of models trained to correct the prediction errors of the ensemble.
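For reference, the sketch below shows how an AdaBoost ensemble might be fit using the scikit-learn implementation, which uses decision stumps as the default base learner. The dataset and hyperparameters are illustrative assumptions.

```python
# A hedged sketch of fitting an AdaBoost classifier with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# sequentially adds weighted decision stumps (the default base learner)
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=1)
model.fit(X, y)
print(model.predict(X[:5]))
```

An AdaBoostRegressor class is also provided for regression problems.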
Classic Gradient Boosting Ensembles
After the success of AdaBoost, a lot of attention was paid to boosting methods.
Gradient boosting was a generalization of the AdaBoost group of techniques that allowed the training of each subsequent base learner to be achieved using arbitrary loss functions.
Rather than deriving new versions of boosting for every different loss function, it is possible to derive a generic version, known as gradient boosting.
— Page 560, Machine Learning: A Probabilistic Perspective, 2012.
The “gradient” in gradient boosting refers to the prediction error from a chosen loss function, which is minimized by adding base learners.
The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function.
— Page 204, Applied Predictive Modeling, 2013.
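The additive, residual-correcting idea can be sketched directly. For squared-error regression, the negative gradient of the loss is simply the residual, so each new tree is fit to the residuals of the current ensemble and added with a small learning rate. The dataset, tree depth, learning rate, and number of trees below are assumptions for illustration.

```python
# A from-scratch sketch of the gradient boosting idea for squared-error
# regression: fit each new tree to the residuals of the current model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

learning_rate = 0.1
prediction = np.full(len(y), np.mean(y))  # start from a constant model
trees = []

for _ in range(100):
    residual = y - prediction          # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)              # fit the next member to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print('train MSE: %.3f' % np.mean((y - prediction) ** 2))
```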
After the initial reframing of AdaBoost as gradient boosting and the use of alternative loss functions, there was a lot of further innovation, such as Multivariate Adaptive Regression Trees (MART), Tree Boosting, and Gradient Boosting Machines (GBM).
If we combine the gradient boosting algorithm with (shallow) regression trees, we get a model known as MART […] after fitting a regression tree to the residual (negative gradient), we re-estimate the parameters at the leaves of the tree to minimize the loss …
— Page 562, Machine Learning: A Probabilistic Perspective, 2012.
The technique was extended to include regularization in an attempt to further slow down the learning, and the sampling of rows and columns for each decision tree in order to add some independence to the ensemble members, based on ideas from bagging, referred to as stochastic gradient boosting.
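These extensions are exposed as hyperparameters in the scikit-learn implementation of classic gradient boosting, as sketched below; the specific values are illustrative assumptions, not recommendations.

```python
# A sketch of classic gradient boosting with shrinkage and row/column
# subsampling (stochastic gradient boosting) in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,   # shrinkage, to further slow down the learning
    subsample=0.8,       # fraction of rows sampled for each tree
    max_features=0.8,    # fraction of columns considered at each split
    random_state=1,
)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(round(scores.mean(), 3))
```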
Modern Gradient Boosting Ensembles
Gradient boosting and variations of the method were shown to be very effective, yet were often slow to train, especially for large training datasets.
This was mainly due to the sequential training of ensemble members, which could not be parallelized. This was unfortunate, as training ensemble members in parallel and the computational speed-up it provides is an often described desirable property of using ensembles.
As such, much effort was put into improving the computational efficiency of the method.
This resulted in highly optimized open-source implementations of gradient boosting that introduced innovative techniques that both accelerated the training of the model and offered further improved predictive performance.
Notable examples included both the Extreme Gradient Boosting (XGBoost) and the Light Gradient Boosting Machine (LightGBM) projects. Both were so effective that they became the de facto methods used in machine learning competitions when working with tabular data.
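Both libraries provide scikit-learn compatible wrappers, so a basic usage sketch looks much like the examples above. It assumes the xgboost and lightgbm packages are installed, and leaves hyperparameters at or near their defaults for illustration.

```python
# A minimal sketch of fitting XGBoost and LightGBM models via their
# scikit-learn style wrappers (assumes both packages are installed).
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

for model in [XGBClassifier(n_estimators=200), LGBMClassifier(n_estimators=200)]:
    model.fit(X, y)
    print(type(model).__name__, round(model.score(X, y), 3))
```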
Customized Boosting Ensembles
We have briefly looked at the canonical types of boosting algorithms.
Modern implementations of algorithms like XGBoost and LightGBM provide sufficient configuration hyperparameters to realize many different types of boosting algorithms.
Although boosting was initially challenging to realize compared to simpler-to-implement methods like bagging, the essential ideas may be useful in exploring or extending ensemble methods on your own predictive modeling projects in the future.
The essential idea of constructing a strong learner from weak learners could be implemented in many different ways. For example, bagging with decision stumps or other similarly weak configurations of standard machine learning algorithms could be considered a realization of this approach.
Bagging weak learners.
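A sketch of this idea is a bagging ensemble whose base learner is a decision stump rather than a full decision tree. The dataset and ensemble size below are illustrative assumptions.

```python
# A hedged sketch of "bagging weak learners": bagging over decision stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

model = BaggingClassifier(
    # the weak base learner ('base_estimator' in older scikit-learn releases)
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=1,
)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(round(scores.mean(), 3))
```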
It also provides a contrast to other ensemble types, such as stacking, that attempt to combine multiple strong learners into a slightly stronger learner. Even then, perhaps some success could be achieved on a project by stacking diverse weak learners instead.
Stacking weak learners.
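For example, a stacking ensemble could combine a few deliberately weak models with a simple meta-model, as in the sketch below. The choice of weak learners and meta-model are illustrative assumptions.

```python
# A hedged sketch of "stacking weak learners": weak base models combined
# by a logistic regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

weak_learners = [
    ('stump', DecisionTreeClassifier(max_depth=1)),
    ('nb', GaussianNB()),
]
model = StackingClassifier(estimators=weak_learners,
                           final_estimator=LogisticRegression())
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(round(scores.mean(), 3))
```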
The path to effective boosting involves specific implementation details: weighted training examples, models that can honor the weightings, and the sequential addition of models fit under some loss minimization method.
Nevertheless, these principles could be harnessed in ensemble models more generally.
For example, perhaps members can be added to a bagging or stacking ensemble sequentially and only kept if they result in a useful lift in skill, drop in prediction error, or change in the distribution of predictions made by the model.
Sequential bagging or stacking.
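One way to sketch this is to grow a voting ensemble greedily, keeping a candidate member only if it improves the cross-validated score. The candidate pool, dataset, and acceptance rule below are illustrative assumptions.

```python
# A hedged sketch of sequentially growing a voting ensemble, keeping a
# candidate member only if it lifts cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

candidates = [
    ('stump', DecisionTreeClassifier(max_depth=1)),
    ('knn', KNeighborsClassifier()),
    ('nb', GaussianNB()),
    ('lr', LogisticRegression()),
]

members, best_score = [], 0.0
for name, model in candidates:
    trial = members + [(name, model)]
    ensemble = VotingClassifier(estimators=trial) if len(trial) > 1 else model
    score = cross_val_score(ensemble, X, y, scoring='accuracy', cv=5).mean()
    if score > best_score:        # keep the member only if it lifts skill
        members, best_score = trial, score

print([name for name, _ in members], round(best_score, 3))
```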
In some senses, stacking offers a realization of the idea of correcting the predictions of other models. The meta-model (level-1 learner) attempts to best combine the predictions of the base models (level-0 learners). In a sense, it is attempting to correct the predictions of those models.
Levels of the stacked model could be added to address specific requirements, such as the minimization of some or all prediction errors.
Deep stacking of models.
These are a few perhaps obvious examples of how the essence of the boosting method can be explored, hopefully inspiring further ideas. I would encourage you to brainstorm how you might adapt the methods to your own specific project.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Related Tutorials
Books
Articles
Summary
In this tutorial, you discovered the essence of boosting to machine learning ensembles.
Specifically, you learned:
The boosting ensemble method for machine learning incrementally adds weak learners trained on weighted versions of the training dataset.
The essential idea that underlies all boosting algorithms and the key tideway used within each boosting algorithm.
How the essential ideas that underlie boosting could be explored on new predictive modeling projects.
Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
Get a Handle on Modern Ensemble Learning!
Improve Your Predictions in Minutes
...with just a few lines of python code
Discover how in my new Ebook: Ensemble Learning Algorithms With Python
It provides self-study tutorials with full working code on: Stacking, Voting, Boosting, Bagging, Blending, Super Learner,
and much more...
Bring Modern Ensemble Learning Techniques to Your Machine Learning Projects