
Learning Note: Regression with Light GBM

The most recent team assignment in Analytics Software is a prediction problem to be solved in R. I first wanted to use keras to build a DNN, but after browsing Kaggle I found that gradient-boosting models such as XGBoost scored the highest, so I decided to learn the GBM family of algorithms. My partner used ARIMA, which I studied during my double degree; maybe I will write another post to record and review ARIMA.


First, open the LightGBM documentation to get a glimpse of it. LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. So I expect it will be suitable for my model.

 

Algorithm

Since it is based on decision trees, LightGBM grows each tree leaf-wise, splitting the leaf with the best fit, whereas most other boosting implementations grow the tree depth-wise (level-wise). For the same number of splits, the leaf-wise strategy can reduce more loss than the level-wise one and therefore tends to reach better accuracy. It is also surprisingly fast, hence the word 'Light'.


Leaf-wise splits increase model complexity and may lead to overfitting; this can be controlled with the max_depth parameter, which limits how deep the splitting can go (a rough parameter sketch is shown below). Then let's take a real example to practice.
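As an illustration only (not part of the assignment code, and the values are placeholders), leaf-wise growth is mainly governed by num_leaves, while max_depth caps tree depth:

# Sketch only: num_leaves bounds how many leaves a tree may grow (leaf-wise),
# while max_depth caps depth to keep complexity and overfitting in check
params_sketch <- list(objective = 'regression',
                      num_leaves = 31,   # LightGBM's default
                      max_depth  = 7)    # illustrative cap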

 

Record of Data processing


Our data come from the Kaggle competition Recruit Restaurant Visitor Forecasting and are used here purely for learning. In this competition, we are challenged to use reservation and visit data to predict the total number of visitors to a restaurant on future dates. This information can help restaurants be much more efficient and allow them to focus on creating an enjoyable dining experience for their customers.


The data are organized like a relational database, spread over 7 related csv tables, so there are two ways to explore them: the tidy-r approach, which follows the relationships between tables item by item as in a database and then builds a time series model; or the base-r approach, which merges as much useful information as possible into one table and runs a regression on it.


The tidy-r approach has more graceful syntax, but with the second approach we can do better here. First, load the packages. Since we stick to base-R style, the packages we choose are also relatively basic ones, which keeps the learning cost low for new projects once you know base R.

require(data.table)
require(stringr)
require(lubridate)
require(zoo)
require(lightgbm)

1.1. Since we are asked to predict the number of visitors by restaurant ID and date, we might as well think backwards: use the ID as the primary key and merge train and test so the restaurant information can be processed at the same time.

## data: train and test
xtrain <- fread('air_visit_data.csv')
xtest <- fread('sample_submission.csv')

Check it

head(xtrain,5)
           air_store_id visit_date visitors
1: air_ba937bf13d40fb24 2016-01-13       25
2: air_ba937bf13d40fb24 2016-01-14       32
3: air_ba937bf13d40fb24 2016-01-15       29
4: air_ba937bf13d40fb24 2016-01-16       22
5: air_ba937bf13d40fb24 2016-01-18        6
 head(xtest,5)
                                id visitors         air_store_id
1: air_00a91d42b08b08d9_2017-04-23        0 air_00a91d42b08b08d9
2: air_00a91d42b08b08d9_2017-04-24        0 air_00a91d42b08b08d9
3: air_00a91d42b08b08d9_2017-04-25        0 air_00a91d42b08b08d9
4: air_00a91d42b08b08d9_2017-04-26        0 air_00a91d42b08b08d9
5: air_00a91d42b08b08d9_2017-04-27        0 air_00a91d42b08b08d9

We should align the columns: in the test set the store id and the date are concatenated in the id column. str_sub from stringr is a handy way to split a character column.

xtest$air_store_id <- str_sub(xtest$id, 1,-12)
xtest$visit_date <- str_sub(xtest$id, -10)
xtest$id <- NULL

Now test and train share the same data structure.

head(xtest,5)
   visitors         air_store_id visit_date
1:        0 air_00a91d42b08b08d9 2017-04-23
2:        0 air_00a91d42b08b08d9 2017-04-24
3:        0 air_00a91d42b08b08d9 2017-04-25
4:        0 air_00a91d42b08b08d9 2017-04-26
5:        0 air_00a91d42b08b08d9 2017-04-27

1.2. Convert the date format from chr to Date and combine the two tables.

xtrain$visit_date <- as.Date(xtrain$visit_date)
xtest$visit_date <- as.Date(xtest$visit_date)
xtrain <- rbind(xtrain, xtest)

1.3. Deal with the reservation data. Because there is both a reservation time and an actual visit time, my first idea is to use their difference as a feature. For that, the times need to be converted into a date-time format that supports arithmetic. lubridate is a good choice.

reserve_air <- fread('air_reserve.csv')

# convert to datetime (from lubridate)
reserve_air$visit_datetime <- parse_date_time(reserve_air$visit_datetime, orders = '%Y-%m-%d H:M:S' )
reserve_air$reserve_datetime <- parse_date_time(reserve_air$reserve_datetime, orders = '%Y-%m-%d H:M:S' )

# calculate waiting time in hours between making the reservation and the visit
reserve_air$time_ahead <- as.double(difftime(reserve_air$visit_datetime,
                                             reserve_air$reserve_datetime,
                                             units = 'hours'))

# round both timestamps to the day
reserve_air$visit_date <- as.Date(reserve_air$visit_datetime)
reserve_air$reserve_datetime <- as.Date(reserve_air$reserve_datetime)

# aggregate to one row per store id x visit date combo
res_air_agg <- reserve_air[ , list(air_res_visitors = sum(reserve_visitors),
                                   air_mean_time_ahead = round(mean(time_ahead), 2)),
                            by = list(air_store_id, visit_date)]

rm(reserve_air)

Check it.

 head(res_air_agg,5)
           air_store_id visit_date air_res_visitors air_mean_time_ahead
1: air_877f79706adbfb06 2016-01-01                3                3.50
2: air_db4b38ebe7a7ceff 2016-01-01                9                0.00
3: air_db80363d35f10926 2016-01-01                5               19.00
4: air_db80363d35f10926 2016-01-02               34                7.56
5: air_3bb99a1fe0583897 2016-01-02                4               11.00

1.4. Deal with the store information. Convert the chr columns to numeric codes by re-encoding. Tip: first convert to factor, then relabel the levels, then convert to integer.

## store info: air
xstore <- fread('air_store_info.csv')
xstore$air_genre_name <- factor(xstore$air_genre_name)
levels(xstore$air_genre_name) <- 1:nlevels(xstore$air_genre_name)
xstore$air_genre_name <- as.integer(xstore$air_genre_name)

xstore$air_area_name <- factor(xstore$air_area_name)
levels(xstore$air_area_name) <- 1:nlevels(xstore$air_area_name)
xstore$air_area_name <- as.integer(xstore$air_area_name)

Design an HHI function: the HHI (Herfindahl-Hirschman index) here is the sum of the squared shares of each restaurant genre within an area, and its inverse serves as a measure of how rich the dining options in that area are.
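Written out (my reconstruction in LaTeX of the index described above):

\[ \mathrm{HHI} = \sum_{i=1}^{k} s_i^{2}, \qquad \text{diversity} = \frac{1}{\mathrm{HHI}} \]

where s_i is the share of genre i among all restaurants in an area and k is the number of genres.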

# HHI: sum of squared genre shares within an area; the function returns its inverse
HHI <- function(x){
  x <- as.numeric(x)
  hhi <- 0
  for (i in 1:30) {                    # 30 safely covers all genre codes
    share <- sum(x == i) / length(x)   # share of genre i within this area
    hhi <- hhi + share^2
  }
  return(1/hhi)                        # inverse HHI: effective number of genres
}
HHI_air <- tapply(xstore$air_genre_name, xstore$air_area_name, HHI)
HHI_air <- as.data.frame(HHI_air)
HHI_air <- cbind(HHI_air, as.integer(rownames(HHI_air)))  # area codes instead of a hard-coded 1:103
names(HHI_air) <- c('HHI','air_area_name')

xstore <- merge(xstore, HHI_air, by = 'air_area_name')

1.5. Deal with date information.

xdate <- fread('date_info.csv')
xdate$day_of_week <- NULL
xdate$calendar_date <- as.Date(xdate$calendar_date)

1.6. Data aggregation


xtrain <- merge(xtrain, res_air_agg, all.x = T)
xtrain <- merge(xtrain, xstore, all.x = T, by = 'air_store_id' )
xtrain <- merge(xtrain, xdate, by.x = 'visit_date', by.y = 'calendar_date')
rm(res_air_agg, xstore, xdate, HHI_air)

xtrain[is.na(xtrain)] <- 0
 

2.1. After exploration and visualization we can do feature engineering, so that the processed data can be fed to LightGBM. First, flag whether any of the last 3 days was a holiday.

xtrain[ , `:=`(h3a = rollapply(holiday_flg, width = 3, FUN = function(s) sign(sum(s, na.rm = T)),
                               partial = TRUE, fill = 0, align = 'right') ),
        by = c('air_store_id')]

2.2 Rolling sums of log visits over windows of 39, 46, 60 and 74 days per store, computed with rollapply from zoo (a generic function for applying a function to rolling margins of an array). Differencing these sums and dividing by the gap in days gives the average log number of guests per store over lagged windows of 7, 21 and 35 days; the windows are offset by 39 days, the length of the test horizon, so the same features can also be computed for the test rows.


xtrain[ , `:=`(vis14 = rollapply(log1p(visitors), width = 39, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]
xtrain[ , `:=`(vis21 = rollapply(log1p(visitors), width = 46, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]
xtrain[ , `:=`(vis28 = rollapply(log1p(visitors), width = 60, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]
xtrain[ , `:=`(vis35 = rollapply(log1p(visitors), width = 74, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]

xtrain[ , `:=`(vLag1 = round((vis21 - vis14)/7,2))]
xtrain[ , `:=`(vLag2 = round((vis28 - vis14)/21,2))]
xtrain[ , `:=`(vLag3 = round((vis35 - vis14)/35,2))]
xtrain[ , vis14 := NULL, with = TRUE]
xtrain[ , vis21 := NULL, with = TRUE]
xtrain[ , vis28 := NULL, with = TRUE]
xtrain[ , vis35 := NULL, with = TRUE]

2.3 Use the same method to process the number of reserved visitors.

# reservations 
xtrain[ , `:=`(res7 = rollapply(log1p(air_res_visitors), width = 7, FUN = function(s) sum(s, na.rm = T),
                                partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]
xtrain[ , `:=`(res14 = rollapply(log1p(air_res_visitors), width = 14, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]
xtrain[ , `:=`(res21 = rollapply(log1p(air_res_visitors), width = 21, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]
xtrain[ , `:=`(res28 = rollapply(log1p(air_res_visitors), width = 28, FUN = function(s) sum(s, na.rm = T),
                                 partial = TRUE, fill = 0, align = 'right') ) ,
        by = c('air_store_id')]

2.4 Split the combined table back into train and test; the test rows are those with visitors equal to zero.

xtest <- xtrain[visitors == 0]
xtrain <- xtrain[visitors > 0]
 

3.1 Validation of LightGBM. First, convert the dataset into 'label/feature' form, with y as the label and x as the feature matrix. We apply log1p to the visitor counts because the competition metric is RMSLE, so minimizing squared error on the log scale matches the evaluation.

x0 <- xtrain[visit_date <= '2017-03-09' & visit_date > '2016-04-01']
x1 <- xtrain[visit_date > '2017-03-09']
y0 <- log1p(x0$visitors)
y1 <- log1p(x1$visitors)

mx1 <- as.integer(max(x0$visit_date) -min(x0$visit_date) )
mx2 <- as.integer(x0$visit_date -min(x0$visit_date))


x0$visit_date <- x0$air_store_id <- x0$visitors <- NULL
x1$visit_date <- x1$air_store_id <- x1$visitors <- NULL

cat_features <- c('air_genre_name', 'air_area_name')

d0 <- lgb.Dataset(as.matrix(x0), label = y0, 
                  categorical_feature = cat_features, 
                  free_raw_data = TRUE)
d1 <- lgb.Dataset(as.matrix(x1), label = y1, 
                  categorical_feature = cat_features, 
                  free_raw_data = TRUE)

3.2 Set the parameters as follows: objective = 'regression', metric = 'mse', max_depth = 7, feature_fraction = 0.7, bagging_fraction = 0.8, min_data_in_leaf = 30, learning_rate = 0.02, num_threads = 4, weight = 'wgt'.

LightGBM can use categorical features directly (without one-hot encoding); according to its documentation this gives roughly an 8x speed-up compared with one-hot encoding.

params <- list(objective = 'regression', metric = 'mse', max_depth = 7,  
               feature_fraction = 0.7,  
               bagging_fraction = 0.8, 
               min_data_in_leaf = 30,
               learning_rate = 0.02, 
               num_threads = 4,
               weight = 'wgt')

ntrx <- 1000
valids <- list(valid = d1)
model <- lgb.train(params = params,  data = d0, valids = valids, nrounds = ntrx,
                   early_stopping_rounds = 10)

3.3 Prediction on the validation set. Submissions are evaluated on the root mean squared logarithmic error (RMSLE).

The RMSLE is calculated as

\[ \mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^{2}} \]

where:

n is the total number of observations, p_i is your prediction of visitors, and a_i is the actual number of visitors.
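As a small helper of my own (not part of the original script), the formula translates directly into R:

# RMSLE between raw predicted counts p and actual counts a
rmsle <- function(p, a) sqrt(mean((log1p(p) - log1p(a))^2))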

pred_val <- predict(model, as.matrix(x1))
# pred_val and y1 are already on the log1p scale, so RMSE here equals the RMSLE
print( paste('validation error:', round(sqrt(mean((pred_val - y1)^2)), 4), sep = ' ' ))
ntrx <- model$best_iter

Fortunately, the result is not bad: it reaches about 0.5863. So let us build a full LightGBM model on all the training data.


3.4 Full-Light GBM

x0 <- xtrain
x1 <- xtest
y0 <- log1p(x0$visitors)

x0$visit_date <- x0$air_store_id <- x0$visitors <- NULL
x1$visit_date <- x1$air_store_id <- x1$visitors <- NULL

cat_features <- c('air_genre_name', 'air_area_name')
d0 <- lgb.Dataset(as.matrix(x0), label = y0, 
                  categorical_feature = cat_features, 
                  free_raw_data = FALSE)

params <- list(objective = 'regression', metric = 'mse', max_depth = 7,  
               feature_fraction = 0.7,  
               bagging_fraction = 0.8, 
               min_data_in_leaf = 30,
               learning_rate = 0.02, 
               num_threads = 4, 
               weight = 'wgt')

model <- lgb.train(params = params,  data = d0, nrounds = ntrx)

importance <- lgb.importance(model)
lgb.plot.importance(importance,measure = 'Gain')

3.5 Feature importance analysis. From the importance plot, we can see that the model mainly relies on the rolling visit features created during our feature engineering.

So we can predict the test-set visits using the trained full LightGBM model.

pred_full <- predict(model, as.matrix(x1))

prx <- data.frame(id = paste(xtest$air_store_id, xtest$visit_date , sep = '_')  ,
                  visitors = expm1(pred_full))
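Finally, the predictions can be written out as a submission file (a minimal sketch; the file name is my assumption, not from the original script):

fwrite(prx, 'submission.csv')   # write the Kaggle submission; file name is illustrative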

 

Advantages

Time to draw some conclusions.

  1. Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training (see the parameter sketch after this list).

  2. Lower memory usage: replacing continuous values with discrete bins results in lower memory usage.

  3. Better accuracy than many other boosting implementations: it produces more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor behind the higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.

  4. Compatibility with large datasets: it performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.

  5. Parallel learning is supported.
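For illustration only (my own sketch, not from the assignment), the bin granularity of the histogram algorithm mentioned in point 1 is exposed through the max_bin parameter:

# Sketch: max_bin sets how many discrete bins continuous features are
# bucketed into by the histogram algorithm (255 is LightGBM's default)
params_hist <- list(objective = 'regression',
                    max_bin = 255)

Smaller max_bin values train faster and use less memory, at some cost in precision.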


In the next note, I will apply keras to the same dataset and reach a lower RMSLE.


The writing is a bit verbose and lengthy; thank you for reading!
