Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.

Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions.

Regression predictive modeling problems involve predicting a numerical value such as a dollar value or a height. **XGBoost** can be used directly for **regression predictive modeling**.

In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.

After completing this tutorial, you will know:

- XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
- How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
- How to fit a final model and use it to make a prediction on new data.

Let’s get started.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- Extreme Gradient Boosting
- XGBoost Regression API
- XGBoost Regression Example

## Extreme Gradient Boosting

**Gradient boosting** refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “*gradient boosting*,” as the loss gradient is minimized as the model is fit, much like a neural network.
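To make the idea concrete, here is a minimal from-scratch sketch of boosting for regression with a squared error loss. It is not how XGBoost is implemented. It assumes a single input feature, one-split trees (stumps), and synthetic data, and all function names are hypothetical. Each new stump is fit to the residuals, which are the negative gradient of the squared error loss.

```python
# A from-scratch sketch of gradient boosting for regression (not XGBoost itself).
import numpy as np


def fit_stump(x, residuals):
    """Find the threshold split on x that best fits the residuals (least squares)."""
    best = None
    # exclude the largest value so the right-hand side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def fit_boosting(x, y, n_estimators=50, learning_rate=0.1):
    """Add stumps one at a time, each fit to the errors of the current ensemble."""
    base = y.mean()
    prediction = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_estimators):
        residuals = y - prediction  # negative gradient of 0.5 * (y - prediction)^2
        stump = fit_stump(x, residuals)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict_boosting(model, x):
    prediction = np.full(len(x), model["base"], dtype=float)
    for stump in model["stumps"]:
        prediction += model["learning_rate"] * predict_stump(stump, x)
    return prediction


# synthetic data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

model = fit_boosting(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - predict_boosting(model, x)) ** 2)
print("Training MSE before boosting: %.3f, after: %.3f" % (mse_start, mse_end))
```

The training error falls as stumps are added because each one corrects part of what the current ensemble gets wrong. XGBoost builds on the same principle with deeper trees, regularization, and a much faster implementation.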

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

— XGBoost: A Scalable Tree Boosting System, 2016.

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at how we can use it in our regression predictive modeling projects.

## XGBoost Regression API

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip Python package manager on most platforms; for example:

```
sudo pip install xgboost
```

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

```python
# check xgboost version
import xgboost
print(xgboost.__version__)
```

Running the script will print the version of the XGBoost library you have installed.

Your version should be the same or higher. If not, you must upgrade your version of the XGBoost library.

It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:

```
sudo pip install xgboost==1.0.1
```

The XGBoost library has its own custom API, although we will use the method via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

An XGBoost regression model can be specified by creating an instance of the *XGBRegressor* class; for example:

```python
...
# create an xgboost regression model
model = XGBRegressor()
```

You can specify hyperparameter values to the class constructor to configure the model.

Perhaps the most commonly configured hyperparameters are the following:

- **n_estimators**: The number of trees in the ensemble, often increased until no further improvements are seen.
- **max_depth**: The maximum depth of each tree, often values are between 1 and 10.
- **eta**: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller.
- **subsample**: The fraction of samples (rows) used in each tree, set to a value between 0 and 1, often 1.0 to use all samples.
- **colsample_bytree**: The fraction of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.

For example:

```python
...
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)
```

Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop an XGBoost ensemble for regression.

## XGBoost Regression Example

In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.

First, let’s introduce a standard regression dataset.

We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.
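For reference, a naive baseline of this kind can be estimated with scikit-learn's DummyRegressor. The sketch below is an assumption about what "naive" means here (always predicting the training median) and uses synthetic data rather than the housing dataset, so its score will not match the figures quoted above.

```python
# Sketch: estimate a naive baseline MAE with repeated k-fold cross-validation.
from numpy import absolute
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# synthetic stand-in data with 13 inputs (an assumption, not the housing data)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1)

# a model that ignores the inputs and always predicts the training median
model = DummyRegressor(strategy="median")

# 10 folds repeated 3 times gives 30 scores
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv)

# scores are negated MAE values, so make them positive before reporting
scores = absolute(scores)
print("Baseline MAE: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

Any model worth using should achieve a lower MAE than this baseline on the same test harness.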

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

```python
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example confirms the 506 rows of data, with 13 input variables and a single numeric target variable (14 columns in total). We can also see that all input variables are numeric.

```
(506, 14)
         0     1     2  3      4      5  ...  8      9    10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
```

Next, let’s evaluate a regression XGBoost model with default hyperparameters on the problem.

First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.

```python
...
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
```
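If the slicing notation is unfamiliar, the small sketch below applies the same split to a made-up 3-by-4 array: every column except the last becomes the input, and the last column becomes the output.

```python
# Sketch: how the column slicing separates inputs from the target.
import numpy as np

# a tiny made-up array: 3 rows, 4 columns
data = np.arange(12).reshape(3, 4)

# all rows, all columns but the last / all rows, last column only
X, y = data[:, :-1], data[:, -1]

print(X.shape)  # (3, 3)
print(y)        # [ 3  7 11]
```

The same expression works unchanged on the housing data, where it yields 13 input columns and one target column.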

Next, we can create an instance of the model with a default configuration.

```python
...
# define model
model = XGBRegressor()
```

We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.

This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling the cross_val_score() function to evaluate the model using the procedure and collect the scores.

Model performance will be evaluated using mean absolute error (MAE). Note that scikit-learn negates MAE so that it can be maximized rather than minimized. As such, we can ignore the sign and assume all errors are positive.
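As a quick illustration of the metric and the sign convention, the sketch below computes MAE by hand on a few made-up values, negates it the way a "higher is better" scorer would, and then recovers the positive value.

```python
# Sketch: mean absolute error by hand, and the negated form used for scoring.
import numpy as np

# made-up values for illustration only
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.5, 5.0, 4.0, 7.5])

# MAE is the average of the absolute differences
mae = np.mean(np.abs(y_true - y_pred))

# a "higher is better" scorer reports the negated error
neg_mae = -mae

# taking the absolute value recovers the positive error
recovered = np.absolute(neg_mae)

print("MAE: %.3f, negated: %.3f, recovered: %.3f" % (mae, neg_mae, recovered))
```

Here the absolute errors are 0.5, 0.0, 2.0, and 0.5, so the MAE is 0.75 and the negated score is -0.75.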

```python
...
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
```

Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.

In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.

We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.

```python
...
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.

```python
# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Running the example evaluates the XGBoost Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a MAE of about 2.1.

This is a good score, better than the baseline, meaning the model has skill, and close to the best score of 1.9.

```
Mean MAE: 2.109 (0.320)
```

We may decide to use the XGBoost Regression model as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

For example:

```python
...
# make a prediction
yhat = model.predict(new_data)
```

We can demonstrate this with a complete example, listed below.

```python
# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split dataset into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
new_data = asarray([row])
# make a prediction
yhat = model.predict(new_data)
# summarize prediction (yhat is an array with one value per input row)
print('Predicted: %.3f' % yhat[0])
```

Running the example fits the model and makes a prediction for the new row of data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a value of about 24.

```
Predicted: 24.019
```

## Summary

In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.

Specifically, you learned:

- XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
- How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
- How to fit a final model and use it to make a prediction on new data.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.