Road to ML Engineer #12 - XGBoost Cont'd

Last Edited: 8/17/2024

This blog post introduces the basics of how to use XGBoost in Python.


DISCLAIMER: If you are looking to implement XGBoost for serious use cases, make sure you understand the details by going through the official XGBoost documentation. This article by no means covers every aspect of the XGBoost Python package; it is intended for casual readers who want to implement XGBoost for the first time.

Step 1 & 2. Data Exploration & Preprocessing

In this article, we will be using a new dataset: the California Housing Price regression dataset by Keras, which contains 20,640 samples of block groups in California (the smallest geographical unit for which the U.S. Census Bureau publishes sample data) with 8 features each (median income, median house age, average number of rooms, etc.). Let's download the dataset from Keras using load_data.

import keras
 
(X_train, y_train), (X_test, y_test) = keras.datasets.california_housing.load_data(
    version="large", path="california_housing.npz", test_split=0.2, seed=113
)
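As a quick sanity check, you can inspect the shapes of the returned arrays; with test_split=0.2, the 20,640 samples should end up as roughly 16,512 training rows and 4,128 test rows, each with 8 feature columns.

print(X_train.shape, X_test.shape)  # expected: (16512, 8) and (4128, 8)
print(y_train.shape, y_test.shape)  # the matching target arrays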

The dataset has already been split into training and testing datasets as numpy arrays, so there is no need for further preprocessing. Let's visualize the relationships between the features using a pair plot.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 
columns = [
    "Longitude", "Latitude", "AveOccup", "Population",
    "AveBedrms", "AveRooms", "HouseAge", "MedInc"
]
 
train_df = pd.DataFrame(X_train, columns=columns)
train_df['MedHouseVal'] = y_train
 
sns.pairplot(train_df)
plt.show()
 
Cal Housing Pair Plot

While this plot is difficult to interpret, we can still glean some insights. There appear to be outliers in the average number of bedrooms, the population, and the average number of household members that make some of the plots hard to read. The Keras documentation partly explains these outliers: block groups with few households and many empty homes, such as vacation resorts, can show surprisingly large values in these columns. There are also strong positive correlations among those variables, as expected, which suggests some redundancy in including all of them as features when making predictions.

As for the median house values that we are trying to predict, there is no clear relationship with most of the features (as we would expect in the real world), except for the median income, which appears to have a slight positive correlation. It makes sense that higher-income households tend to live in higher-valued homes. Looking at the unusual pattern in the latitude and longitude plots, my guess is that houses near the coastline tend to have higher values. Let's plot that to see if this is the case.

plt.figure(figsize=(10, 6))
scatter = plt.scatter(train_df['Longitude'], train_df['Latitude'], 
                      c=train_df['MedHouseVal'], cmap='viridis', s=10)
plt.colorbar(scatter, label='Median House Value')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Latitude and Longitude with Median House Value')
plt.show()
Cal Housing Color Plot

As we can see, the houses on the coastline do seem to have higher values. (We would need to perform statistical analysis to test this hypothesis rigorously, but I won't do that here as our objective is to build the best XGBoost predictor.) Therefore, we expect XGBoost to pick up on this pattern.

Step 3. Model

Before we engage in cross-validation to perform hyperparameter tuning, we need to understand the set of hyperparameters for basic XGBoost models. Below are some of the hyperparameters exposed by the XGBoost package.

Hyperparameters of XGBoost

- Learning Rate (eta): Multiplied by the output value of each tree to take smaller steps (the ν in our formal definition). The default is 0.3.
- Gamma (gamma): The minimum gain (γ) a split must provide to be kept, used for pruning. The default is 0.
- Lambda (lambda): The L2 regularization rate (λ) added to the denominator of the output value and similarity score. The default is 1.
- Alpha (alpha): The L1 regularization rate (α). We tend not to use this over L2 unless overfitting is severe. The default is 0.
- Maximum Depth (max_depth): The maximum depth allowed for a tree. The default is 6.
- Minimum Child Weight (min_child_weight): The minimum weight allowed for a leaf. For regression, the weight is the number of residuals, whereas for classification, it is ∑ p(1−p). The default is 1.
- Subsample (subsample): The ratio of samples (rows) used to find splits in a tree. The default is 1, meaning all the samples are used.
- Column Subsampling (colsample_by*): The ratio of features (columns) used to find splits in a tree; you can subsample by tree, by level, or by node. The default is 1 for bytree, bylevel, and bynode, meaning all the features are used for every split.
- Tree Method (tree_method): Whether to use the exact greedy algorithm that tests every possible split (exact), or an approximate greedy algorithm with a weighted quantile sketch (approx or hist). The default is auto.

There are other hyperparameters, such as max_leaves and max_bin, which dictate the maximum number of leaves allowed in each tree and the number of bins used for the weighted quantile sketch, but the above are the main ones. As the dataset is not large, we will mainly tune eta, gamma, and lambda. (max_depth and min_child_weight are for pruning, which can be managed with gamma and lambda, while subsample and colsample_by* are for handling large datasets.)
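For reference, here is a minimal sketch of how these names map onto the scikit-learn-style XGBRegressor constructor; the wrapper exposes eta, lambda, and alpha under the names learning_rate, reg_lambda, and reg_alpha, and the values below are just the defaults from the table above, not recommendations.

from xgboost import XGBRegressor
 
# A sketch of the constructor arguments corresponding to the table above
model = XGBRegressor(
    learning_rate=0.3,     # eta
    gamma=0,               # minimum gain required to keep a split
    reg_lambda=1,          # L2 regularization (lambda)
    reg_alpha=0,           # L1 regularization (alpha)
    max_depth=6,
    min_child_weight=1,
    subsample=1.0,         # row subsampling ratio
    colsample_bytree=1.0,  # column subsampling ratio per tree
    tree_method='hist'
)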

Let's define a function for cross-validation. This time, I will use RandomizedSearchCV from sklearn, which takes probability distributions over the hyperparameters and samples random combinations to compare their scores. I could have used GridSearchCV, but I figured it would take too much time.

from sklearn.model_selection import KFold, RandomizedSearchCV
from xgboost import XGBRegressor, plot_importance
from sklearn.metrics import make_scorer, mean_squared_error
 
def randomsearchCV_XGBR(X_train, y_train, parameters, n_iter=20, n_splits=10, shuffle=True):
    estimator = XGBRegressor(objective='reg:squarederror',
                             n_estimators=100,
                             tree_method='hist',
                             random_state=42)
 
    ## KFold (not StratifiedKFold), since the target is continuous
    cv = KFold(n_splits=n_splits, shuffle=shuffle)
 
    rand_search = RandomizedSearchCV(
        estimator=estimator,
        param_distributions=parameters,
        ## greater_is_better=False so that RandomizedSearchCV minimizes the MSE
        scoring=make_scorer(mean_squared_error, greater_is_better=False),
        n_iter=n_iter, ## number of hyperparameter sets to sample
        ## n_jobs=10, ## if multiple threads are available
        cv=cv,
        verbose=0
    )
 
    rand_search.fit(X_train, y_train)
 
    ## best_score_ is the negated MSE, so flip the sign back
    return rand_search.best_params_, -rand_search.best_score_

To pass the probability distributions of the hyperparameters, I will use scipy.stats, specifically the uniform distribution, since I have no prior idea which hyperparameter values would produce the best outcome. (Note that uniform(loc, scale) samples from the range [loc, loc + scale].)

from scipy.stats import uniform
 
parameters = {
      'eta': uniform(loc=0.001, scale=1),
      'gamma': uniform(loc=0, scale=1),
      'lambda': uniform(loc=0, scale=1)
}
 
params, score = randomsearchCV_XGBR(X_train, y_train, parameters)
 
print("best parameters: ", params)
print("best score: ", score)
 
# =>
# best parameters:  {'eta': 0.013192469939280582, 'gamma': 0.4946327995662789, 'lambda': 0.1587494731756135}
# best score:  4679027302.4

From the above hyperparameters, we can build an XGBoost Regressor and fit the model to the training data.

xgb = XGBRegressor(objective='reg:squarederror',
                   eta=params['eta'],
                   gamma=params['gamma'],
                   reg_lambda=params['lambda'],
                   n_estimators=100,
                   tree_method='hist',
                   random_state=42)
 
xgb.fit(X_train, y_train)

Step 4. Model Evaluation

To evaluate the model, we make predictions on the test dataset using the model we just trained.

pred = xgb.predict(X_test)
 
mean_squared_error(y_test, pred)
# => 4806821000.0
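One quick way to put this number on a more tangible scale is to take its square root: the RMSE is expressed in the same units as the house values themselves, and the square root of the MSE above is roughly 69,000.

rmse = mean_squared_error(y_test, pred) ** 0.5  # same units as the target
print(rmse)  # roughly 69,000 given the MSE above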

Raw error metrics like these are still hard to judge on their own, so let's use the r2_score, or R-squared value, which describes how much of the variation in the target variable can be explained by the explanatory variable(s) through the model. (If you want to know more about it, I recommend checking out R-squared, Clearly Explained!!! by StatQuest.)

from sklearn.metrics import r2_score
 
r2_score(y_test, pred)
# => 0.6384312017848871
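For intuition, R-squared compares the model's squared errors against a baseline that always predicts the mean of y_test. Here is a minimal sketch of the same computation with numpy, which should reproduce the r2_score value above.

import numpy as np
 
y_true = np.ravel(y_test)                       # ensure a 1-D array of targets
ss_res = np.sum((y_true - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
print(1 - ss_res / ss_tot)                      # should match r2_score(y_test, pred)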

After fitting a linear regression model to the same data with feature engineering, the R-squared value was around 0.6. Hence, we can say that the XGBoost Regressor achieved better predictions without too much work! (We need to measure the performance multiple times to confirm the significance of this if we want to be more rigorous.)

Another advantage of XGBoost is how easy it is to visualize the tree or the importance of features. Here, I will visualize the importance of features (not the tree, as the tree is too big to visualize here).
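A plot like the one below can be produced with the plot_importance helper we imported earlier; here is a minimal sketch (since we trained on raw numpy arrays, the bars will be labeled f0 through f7, which you can map back to the column list from Step 1 & 2).

fig, ax = plt.subplots(figsize=(10, 6))
plot_importance(xgb, ax=ax)  # importance_type defaults to 'weight', the number of splits per feature
plt.show()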

Feature Importance

The F score measures how many times splits are made on a feature across all the trees. The plot shows that splits were made most frequently on latitude, longitude, and median income, which makes sense given the geographic pattern and the correlation with income we observed in Steps 1 & 2. The other features have relatively low importance, which also matches our expectations.

Conclusion

I hope you got a sense of how a machine learning pipeline with XGBoost looks and how to implement it in Python. I recommend working on a classification problem on your own using XGBoost to sharpen your knowledge and skills. (It will be a great addition to your portfolio!)

NOTE: If you are following the "Road to C Programmer" series, I challenge you to implement XGBoost in C by looking at the official XGBoost documentation. Good luck!

Resources