Unfortunately CatBoost isn’t on CRAN, and the repository isn’t structured in a way that lets you install it via remotes::install_github(). Instead you have to visit the GitHub releases page, find the URL of the newest R release for your OS, and install it via:
binary_url <- "https://github.com/catboost/catboost/releases/download/v0.26.1/catboost-R-Darwin-0.26.1.tgz"
remotes::install_url(
  binary_url,
  INSTALL_opts = c("--no-multiarch", "--no-test-load")
)
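The release assets follow a consistent catboost-R-&lt;OS&gt;-&lt;version&gt;.tgz naming pattern, so you can sketch the URL for your own platform. This is a sketch that assumes the asset names on the releases page keep the format seen in v0.26.1:
# A sketch: build the binary URL for the current OS. The asset naming
# pattern is an assumption based on the v0.26.1 release assets.
version <- "0.26.1"
os <- switch(
  Sys.info()[["sysname"]],
  Darwin = "Darwin",
  Linux = "Linux",
  Windows = "Windows"
)
binary_url <- sprintf(
  "https://github.com/catboost/catboost/releases/download/v%s/catboost-R-%s-%s.tgz",
  version, os, version
)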
While you’re at it, consider installing treesnip, which lets you use CatBoost (and lightgbm) with tidymodels:
remotes::install_github("curso-r/treesnip")
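treesnip registers a "catboost" engine for parsnip’s boost_tree(), so a model specification can look like the following sketch (the argument names are parsnip’s; the spec isn’t fit here):
# A minimal sketch of a tidymodels spec backed by CatBoost via treesnip
library(parsnip)
library(treesnip)
catboost_spec <- boost_tree(trees = 500, tree_depth = 8, learn_rate = 0.08) %>%
  set_engine("catboost") %>%
  set_mode("regression")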
We’ll be using the Ames housing dataset, available from the AmesHousing package.
library(rsample)
library(catboost)
library(dplyr)
library(purrr)
library(ggplot2)
set.seed(1234)
ames <- AmesHousing::make_ames() %>%
janitor::clean_names()
ames_split <- initial_split(ames, strata = "sale_price")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
The Ames dataset is nice for our purposes as it contains many factor variables:
ames %>% map(class) %>% keep(. == "factor") %>% length()
## [1] 46
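For a peek at a few of their names:
# A handful of the factor columns, for illustration
ames %>%
  select(where(is.factor)) %>%
  names() %>%
  head()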
One thing you should be aware of is that CatBoost prefixes each function it exports with catboost.:
ls("package:catboost") %>% stringr::str_detect("^catboost\\.") %>% all()
## [1] TRUE
To use CatBoost, you’ll have to convert the datasets into its own pool format. If you’ve used lightgbm, this will feel familiar.
ames_train_pool <- catboost.load_pool(
  data = ames_train %>% select(-sale_price),
  label = ames_train$sale_price
)
ames_test_pool <- catboost.load_pool(
  data = ames_test %>% select(-sale_price),
  label = ames_test$sale_price
)
You could also give it a CSV file to read, supply row or group weights, and so on. The documentation indicates that you should specify categorical variable indices with the cat_features argument, but if you do you’ll be greeted with a message that this is unnecessary and that you should convert your categorical features to factors instead.
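For example, per-row weights can be attached when the pool is built. A sketch, assuming the weight argument described in the catboost.load_pool() documentation:
# A sketch: attach per-row weights (the `weight` argument is taken from the
# catboost.load_pool() docs; uniform weights here are purely illustrative)
weighted_pool <- catboost.load_pool(
  data = ames_train %>% select(-sale_price),
  label = ames_train$sale_price,
  weight = rep(1, nrow(ames_train))
)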
Training a model is now straightforward:
params <- list(
  loss_function = "RMSE",
  custom_loss = "R2",
  iterations = 500,
  depth = 8,
  learning_rate = 0.08
)
model <- catboost.train(
  learn_pool = ames_train_pool,
  test_pool = ames_test_pool,
  params = params
)
See the documentation for catboost.train() for more information on the possible configuration parameters.
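A few other parameters often come in handy. Here’s a sketch using the names as I understand them from the CatBoost documentation (not used in the model above):
# A sketch of common extras, assuming the parameter names in the CatBoost docs
quiet_params <- modifyList(params, list(
  logging_level = "Silent", # suppress the per-iteration log
  random_seed = 1234,       # make training reproducible
  od_type = "Iter",         # overfitting detector: stop early...
  od_wait = 50              # ...after 50 iterations with no test-metric gain
))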
I’ve hidden the output of catboost.train() because it runs to roughly 2,000 lines, but at the end it gives the following message:
bestTest = 23935.587
bestIteration = 492
Shrink model to first 493 iterations.
You can shrink the model with catboost.shrink():
catboost.shrink(model, 493)
## [1] TRUE
You don’t use base R’s predict() function for predictions, but rather catboost.predict(), which works on a CatBoost pool object:
test_predictions <- catboost.predict(model, ames_test_pool)
r2 <- cor(ames_test$sale_price, test_predictions)^2
r2
## [1] 0.9076073
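As a sanity check, the test RMSE computed by hand should be close to the bestTest value reported above:
# Test-set RMSE by hand; compare against the bestTest value from training
sqrt(mean((ames_test$sale_price - test_predictions)^2))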
You can also compute \(R^2\) using the catboost.eval_metrics() function, which returns the metric at each iteration:
catboost.eval_metrics(model, ames_test_pool, "R2") %>%
  as_tibble() %>%
  mutate(row = row_number()) %>%
  ggplot(aes(row, R2)) +
  geom_point() +
  ggtitle("R2 by Iteration")
You can also get feature importance:
feature_importance <- catboost.get_feature_importance(
  model,
  pool = ames_test_pool,
  type = "FeatureImportance"
)
tibble(
  column = rownames(feature_importance),
  importance = feature_importance[, 1]
) %>%
  arrange(desc(importance)) %>%
  head(5) %>%
  knitr::kable()
| column | importance |
|---|---|
| gr_liv_area | 13.958892 |
| overall_qual | 13.848104 |
| exter_qual | 10.109577 |
| total_bsmt_sf | 8.036430 |
| garage_cars | 4.350653 |
catboost.get_feature_importance() supports other types, such as ShapValues.
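For example, here’s a sketch of pulling per-observation SHAP values, assuming (per the CatBoost docs) that the result is a matrix with one column per feature plus a trailing bias column:
# A sketch: per-observation SHAP values (assuming type = "ShapValues")
shap_values <- catboost.get_feature_importance(
  model,
  pool = ames_test_pool,
  type = "ShapValues"
)
dim(shap_values) # one row per observation; feature columns plus a bias column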
You can find the k-fold cross validation error with catboost.cv(), which operates much like catboost.train():
cv_results <- catboost.cv(
  pool = ames_train_pool,
  params = params,
  fold_count = 3,
  partition_random_seed = 1024
)
cv_results[which.min(cv_results$test.RMSE.mean), ] %>%
  knitr::kable()
| | test.RMSE.mean | test.RMSE.std | train.RMSE.mean | train.RMSE.std | test.R2.mean | test.R2.std | train.R2.mean | train.R2.std |
|---|---|---|---|---|---|---|---|---|
| 492 | 25966.94 | 629.0298 | 5006.062 | 184.6111 | 0.8940723 | 0.0118855 | 0.9960802 | 0.0005536 |
Another function that might be useful (though we’ll not go over it here) is catboost.caret(), which can be passed as a method to caret::train().
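Though we won’t explore it in depth, a minimal sketch of that integration might look like this (untested here; interface as I understand it from the CatBoost docs):
# A minimal, untested sketch of the caret integration
library(caret)
caret_model <- train(
  x = ames_train %>% select(-sale_price) %>% as.data.frame(),
  y = ames_train$sale_price,
  method = catboost.caret,
  trControl = trainControl(method = "cv", number = 3)
)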