Revisiting NHL Salary Prediction with XGBoost


Objective:

In the previous post in this series, I attempted to predict NHL player salaries based on player characteristics and performance measures from the 2016-2017 NHL season. If you haven’t already read that post, you should check it out here first.

Some of the models we tried gave us decent predictions, but I know we can do better. I think it’s time to bring out the big guns with the XGBoost (eXtreme Gradient Boosting) algorithm. XGBoost is currently one of the most popular algorithms in Kaggle competitions, and for good reason: it’s a scalable tree boosting system that performs extremely well on both classification and regression problems. Now, without further ado, let’s get started.

Analysis:

We’ll use the same dataset and variables that we did in our first attempts at predicting NHL salaries.

library(caret)
library(xgboost)
library(data.table)

setwd("C:\\Users\\Viktor\\Desktop")
dat = read.csv("NHL_2016-17_Cleaned_v2.csv", header = T)
dat = na.omit(dat)

set.seed(42)
#Create subset of dataset using features from the last post
new_dat = dat[,c(152, 1, 24, 14, 36, 38, 17, 60, 12, 42)]
summary(new_dat)

# hold out 10% of the data for a final evaluation set
holdout_idx = createDataPartition(new_dat$Salary, p = 0.9, list = FALSE)
modeldat = new_dat[holdout_idx, ]
holdout = new_dat[-holdout_idx, ]

# split the remaining data 80/20 into training and validation (test) sets
idx = createDataPartition(modeldat$Salary, p = 0.8, list = FALSE)
dat_trn = modeldat[idx, ]
dat_tst = modeldat[-idx, ]

The only difference is that this time we’ll have a training set, a test set for tuning parameters (model validation), and a final holdout set that isn’t used to train or tune the model at all and is reserved for the end test results (no cheating!).

Now we’re ready to prepare and preprocess our data for XGBoost.

# convert to data.table for convenient column handling
setDT(dat_trn)
setDT(dat_tst)

# separate the labels and one-hot encode the predictors
labels <- dat_trn$Salary
ts_label <- dat_tst$Salary
new_tr <- model.matrix(~.+0, data = dat_trn[,-c("Salary"), with = F])
new_ts <- model.matrix(~.+0, data = dat_tst[,-c("Salary"), with = F])
labels <- as.numeric(labels)
ts_label <- as.numeric(ts_label)

# build the DMatrix objects that xgboost expects
xgb_train <- xgb.DMatrix(data = new_tr, label = labels)
xgb_test <- xgb.DMatrix(data = new_ts, label = ts_label)

We’ll start out by using a model with the default parameters. Training with cross-validation will allow us to determine the optimal number of boosting iterations; then we’ll run the model with that setting and evaluate it on held-out data it has never seen before.

default_params <- list(booster = "gbtree", objective = "reg:linear",
               eta=0.3, gamma=0, max_depth=6, min_child_weight=1,
               subsample=1, colsample_bytree=1)

xgbcv_untuned <- xgb.cv( params = default_params, data = xgb_train, nrounds = 100,
                 nfold = 5, showsd = T, stratified = T,
                 print_every_n = 10, early_stopping_rounds = 20, maximize = F)

# find optimal number of iterations
minval = min(xgbcv_untuned$evaluation_log[,4])
errors = as.data.frame(xgbcv_untuned$evaluation_log[,4])
niter = match(minval, errors$test_rmse_mean)
cv_error = as.data.frame(xgbcv_untuned$evaluation_log[niter,2])
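# (aside: which.min() would give the same index in a single step)
# niter = which.min(errors$test_rmse_mean)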

# run model with optimal number of iterations
xgb_model_untuned <- xgb.train(params = default_params, data = xgb_train, nrounds = niter,
                   watchlist = list(val=xgb_test,train=xgb_train), print_every_n = 10,
                   early_stopping_rounds = 10, maximize = F, eval_metric = "rmse")
                   
# evaluate performance on heldout data
setDT(holdout)
holdout_inputs = model.matrix(~.+0,data = holdout[,-c("Salary"),with=F])
holdout_labels = holdout$Salary
holdout_dat <- xgb.DMatrix(data = holdout_inputs,label=holdout_labels)
sqrt((mean((predict(xgb_model_untuned,holdout_dat) - holdout_labels)^2)))

We get an RMSE of 1,475,553, which is already better than the best-performing model from the previous post (the lowest test RMSE there was 1,559,010, from the K-NN model). I’d say we’re off to a good start. Let’s see if tuning the hyperparameters can give us even better performance.
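
Since we’ll be computing holdout RMSE the same way again after tuning, it can be handy to wrap the calculation in a small helper function (a minimal sketch; the rmse() name is just something I’m introducing here, not part of the xgboost package):

# root mean squared error between predictions and actual values
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# equivalent to the inline calculation above
rmse(predict(xgb_model_untuned, holdout_dat), holdout_labels)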

We’ll start off by tuning max_depth and min_child_weight, while keeping the default values for everything else. max_depth is the maximum depth allowed for a tree, and min_child_weight is the minimum sum of observation weights required in a child node (it tells the model to stop splitting once the sample size in a node drops below a certain threshold). We start with these hyperparameters since they have a large impact on model performance. We’ll also continue using a high learning rate for now to speed up training, and we’ll stick with the optimal number of iterations for this learning rate that we established with the untuned model.

searchGridSubCol <- expand.grid(subsample = 1, 
                                colsample_bytree = 1,
                                max_depth = seq(2,8),
                                min_child = seq(1,4), 
                                eta = 0.3,
                                gamma = 0,
                                alpha = 0,
                                lambda = 1
)

  rmseErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
    
    #test the parameters
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentDepth <- parameterList[["max_depth"]]
    currentEta <- parameterList[["eta"]]
    currentMinChild <- parameterList[["min_child"]]
    currentGamma <- parameterList[["gamma"]]
    currentAlpha <- parameterList[["alpha"]]
    currentLambda <- parameterList[["lambda"]]
    
    xgboostModelCV <- xgb.cv(data =  xgb_train, nrounds = niter, nfold = 5, showsd = TRUE, 
                             metrics = "rmse", verbose = TRUE, "eval_metric" = "rmse",
                             "objective" = "reg:linear", "max.depth" = currentDepth, "eta" = currentEta,                               
                             "subsample" = currentSubsampleRate, "colsample_bytree" = currentColsampleRate,
                             print_every_n = 10, "min_child_weight" = currentMinChild,
                             "gamma" = currentGamma, "alpha" = currentAlpha,
                             "lambda" = currentLambda, booster = "gbtree",
                             early_stopping_rounds = 20)
    
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    rmse <- tail(xvalidationScores$test_rmse_mean, 1)
    trmse <- tail(xvalidationScores$train_rmse_mean,1)
    return(c(rmse, trmse, currentSubsampleRate,
             currentColsampleRate, currentDepth, currentEta, currentMinChild,
             currentGamma, currentAlpha, currentLambda))})

output1 <- as.data.frame(t(rmseErrorsHyperparameters))
varnames <- c("TestRMSE", "TrainRMSE", "SubSampRate", "ColSampRate", "Depth", "eta",
              "MinChild", "Gamma", "Alpha", "Lambda")
names(output1) <- varnames

# parameters with smallest error in our validation set
minError = min(output1$TestRMSE)
optimalParams = output1[match(minError, output1$TestRMSE),]
optimalParams
#output
   TestRMSE TrainRMSE SubSampRate ColSampRate Depth eta MinChild Gamma Alpha Lambda
3   1420395  693091.6           1           1     4 0.3        1     0     0      1
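
If you want to look at more than just the single best combination, it can also be useful to sort the grid results by validation RMSE and compare the top few candidates (a quick sketch using the output1 data frame built above):

# five parameter combinations with the lowest cross-validated RMSE
head(output1[order(output1$TestRMSE), ], 5)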

Now we can use these values for max_depth and min_child_weight to tune gamma, which is the minimum loss reduction required to make a split.

searchGridSubCol <- expand.grid(subsample = 1, 
                                colsample_bytree = 1,
                                max_depth = optimalParams[,5],
                                min_child = optimalParams[,7], 
                                eta = 0.3,
                                gamma = seq(0, 10, by=0.05),
                                alpha = 0,
                                lambda = 1
)

rmseErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
  
  #test the parameters
  currentSubsampleRate <- parameterList[["subsample"]]
  currentColsampleRate <- parameterList[["colsample_bytree"]]
  currentDepth <- parameterList[["max_depth"]]
  currentEta <- parameterList[["eta"]]
  currentMinChild <- parameterList[["min_child"]]
  currentGamma <- parameterList[["gamma"]]
  currentAlpha <- parameterList[["alpha"]]
  currentLambda <- parameterList[["lambda"]]
  
  xgboostModelCV <- xgb.cv(data =  xgb_train, nrounds = niter, nfold = 5, showsd = TRUE, 
                           metrics = "rmse", verbose = TRUE, "eval_metric" = "rmse",
                           "objective" = "reg:linear", "max.depth" = currentDepth, "eta" = currentEta,                               
                           "subsample" = currentSubsampleRate, "colsample_bytree" = currentColsampleRate,
                           print_every_n = 10, "min_child_weight" = currentMinChild,
                           "gamma" = currentGamma, "alpha" = currentAlpha,
                           "lambda" = currentLambda, booster = "gbtree",
                           early_stopping_rounds = 20)
  
  xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
  rmse <- tail(xvalidationScores$test_rmse_mean, 1)
  trmse <- tail(xvalidationScores$train_rmse_mean,1)
  return(c(rmse, trmse, currentSubsampleRate,
           currentColsampleRate, currentDepth, currentEta, currentMinChild,
           currentGamma, currentAlpha, currentLambda))})

output2 <- as.data.frame(t(rmseErrorsHyperparameters))
names(output2) <- varnames

# parameters with smallest error in our validation set
minError = min(output2$TestRMSE)
optimalParams = output2[match(minError, output2$TestRMSE),]
optimalParams
#output
    TestRMSE TrainRMSE SubSampRate ColSampRate Depth eta MinChild Gamma Alpha Lambda
157  1365061    697513           1           1     4 0.3        1   7.8     0      1

We can then use all of these parameters to tune subsample, which is the proportion of all observations that will be randomly selected in each tree iteration, and colsample_bytree, which is the proportion of all features that will be made available as candidates for splitting in each tree.

searchGridSubCol <- expand.grid(subsample = seq(0.5,1,by=0.1), 
                                colsample_bytree = seq(0.5,1,by=0.1),
                                max_depth = optimalParams[,5],
                                min_child = optimalParams[,7], 
                                eta = 0.3,
                                gamma = optimalParams[,8],
                                alpha = 0,
                                lambda = 1
)

rmseErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
  
  #test the parameters
  currentSubsampleRate <- parameterList[["subsample"]]
  currentColsampleRate <- parameterList[["colsample_bytree"]]
  currentDepth <- parameterList[["max_depth"]]
  currentEta <- parameterList[["eta"]]
  currentMinChild <- parameterList[["min_child"]]
  currentGamma <- parameterList[["gamma"]]
  currentAlpha <- parameterList[["alpha"]]
  currentLambda <- parameterList[["lambda"]]
  
  xgboostModelCV <- xgb.cv(data =  xgb_train, nrounds = niter, nfold = 5, showsd = TRUE, 
                           metrics = "rmse", verbose = TRUE, "eval_metric" = "rmse",
                           "objective" = "reg:linear", "max.depth" = currentDepth, "eta" = currentEta,                               
                           "subsample" = currentSubsampleRate, "colsample_bytree" = currentColsampleRate,
                           print_every_n = 10, "min_child_weight" = currentMinChild,
                           "gamma" = currentGamma, "alpha" = currentAlpha,
                           "lambda" = currentLambda, booster = "gbtree",
                           early_stopping_rounds = 20)
  
  xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
  rmse <- tail(xvalidationScores$test_rmse_mean, 1)
  trmse <- tail(xvalidationScores$train_rmse_mean,1)
  return(c(rmse, trmse, currentSubsampleRate,
           currentColsampleRate, currentDepth, currentEta, currentMinChild,
           currentGamma, currentAlpha, currentLambda))})

output3 <- as.data.frame(t(rmseErrorsHyperparameters))
names(output3) <- varnames

# parameters with smallest error in our validation set
minError = min(output3$TestRMSE)
optimalParams = output3[match(minError, output3$TestRMSE),]
optimalParams
#output
   TestRMSE TrainRMSE SubSampRate ColSampRate Depth eta MinChild Gamma Alpha Lambda
26  1385484  808073.9         0.6         0.9     4 0.3        1   7.8     0      1

Finally, we will tune the regularization parameters alpha (L1 regularization) and lambda (L2 regularization). The optimal gamma we found earlier should already do a good job of controlling model complexity, but we may as well tune these parameters too.

searchGridSubCol <- expand.grid(subsample = optimalParams[,3], 
                                colsample_bytree = optimalParams[,4],
                                max_depth = optimalParams[,5],
                                min_child = optimalParams[,7], 
                                eta = 0.3,
                                gamma = optimalParams[,8],
                                alpha = seq(0,1,by=0.1),
                                lambda = seq(0,1,by=0.1)
)

rmseErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
  
  #test the parameters
  currentSubsampleRate <- parameterList[["subsample"]]
  currentColsampleRate <- parameterList[["colsample_bytree"]]
  currentDepth <- parameterList[["max_depth"]]
  currentEta <- parameterList[["eta"]]
  currentMinChild <- parameterList[["min_child"]]
  currentGamma <- parameterList[["gamma"]]
  currentAlpha <- parameterList[["alpha"]]
  currentLambda <- parameterList[["lambda"]]
  
  xgboostModelCV <- xgb.cv(data =  xgb_train, nrounds = niter, nfold = 5, showsd = TRUE, 
                           metrics = "rmse", verbose = TRUE, "eval_metric" = "rmse",
                           "objective" = "reg:linear", "max.depth" = currentDepth, "eta" = currentEta,                               
                           "subsample" = currentSubsampleRate, "colsample_bytree" = currentColsampleRate,
                           print_every_n = 10, "min_child_weight" = currentMinChild,
                           "gamma" = currentGamma, "alpha" = currentAlpha,
                           "lambda" = currentLambda, booster = "gbtree",
                           early_stopping_rounds = 20)
  
  xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
  rmse <- tail(xvalidationScores$test_rmse_mean, 1)
  trmse <- tail(xvalidationScores$train_rmse_mean,1)
  return(c(rmse, trmse, currentSubsampleRate,
           currentColsampleRate, currentDepth, currentEta, currentMinChild,
           currentGamma, currentAlpha, currentLambda))})

output4 <- as.data.frame(t(rmseErrorsHyperparameters))
names(output4) <- varnames

# parameters with smallest error in our validation set
minError = min(output4$TestRMSE)
optimalParams = output4[match(minError, output4$TestRMSE),]
optimalParams
#output
   TestRMSE TrainRMSE SubSampRate ColSampRate Depth eta MinChild Gamma Alpha Lambda
95  1408861  810632.2         0.6         0.9     4 0.3        1   7.8   0.6    0.8

Results:

Now we’ll reduce the learning rate and calculate the new optimal number of iterations for the final model using the hyperparameters we just tuned. We’ll run the model with our tuned hyperparameters and the new optimal number of iterations for the lower learning rate, then see how we did by evaluating it on the held-out data that it has not been trained or tuned on.

params <- list(booster = "gbtree", objective = "reg:linear",
               subsample = optimalParams[,3], 
               colsample_bytree = optimalParams[,4],
               max_depth = optimalParams[,5],
               min_child_weight = optimalParams[,7], 
               eta = 0.001,
               gamma = optimalParams[,8],
               alpha = optimalParams[,9],
               lambda = optimalParams[,10])

xgbcv <- xgb.cv( params = params, data = xgb_train, nrounds = 10000,
                 nfold = 5, showsd = T, stratified = T,
                 print_every_n = 10, early_stopping_rounds = 20, maximize = F)

# find optimal number of iterations
minval = min(xgbcv$evaluation_log[,4])
errors = as.data.frame(xgbcv$evaluation_log[,4])
niter = match(minval, errors$test_rmse_mean)
cv_error = as.data.frame(xgbcv$evaluation_log[niter,2])

# run model with optimal number of iterations

xgb_model <- xgb.train(params = params, data = xgb_train, nrounds = niter,
                       watchlist = list(val=xgb_test,train=xgb_train), print_every_n = 10,
                       early_stopping_rounds = 10, maximize = F, eval_metric = "rmse")

#final evaluation on heldout data not used to train or tune the model parameters
sqrt((mean((predict(xgb_model,holdout_dat) - holdout_labels)^2)))

We now get an RMSE of 1,366,490, which is even better than our untuned model. For reference, the improvement in RMSE from this tuned XGBoost model over the best-performing model in the previous post (the K-NN model) is larger than the difference in RMSE between that post’s best and worst performing models. It’s clear that XGBoost has given us a major performance increase.

Of course, this XGBoost model still suffers from some of the issues the models in the previous post had. Contract salaries are “sticky”: a player’s statistics and characteristics in the current year are not always strong predictors of a salary that was set by a contract signed a number of years in the past. Collecting more years of data, especially data from the year in which each player signed their contract, would likely improve our model more than anything else could. Even the best model in the world can’t fix bad data (which is a good reminder of the importance of feature engineering and domain knowledge).

Despite this, the performance increase we see from XGBoost shows how powerful a carefully tuned model can be. There are certainly use cases for quick, out-of-the-box models, but a more powerful and flexible model will usually perform better if you’ve got the time and patience to tune it.