Tree boosting has proven empirically to be a highly effective and versatile approach to data-driven modelling. The core argument is that tree boosting adaptively determines the local neighbourhoods of the model, thereby taking the bias-variance trade-off into account during model fitting.

Recently, a tree boosting method known as XGBoost has gained popularity by providing higher accuracy. XGBoost introduces further improvements that allow it to handle the bias-variance trade-off even more carefully. In this research work, we demonstrate the use of an adaptive procedure, Learned Loss (LL), to update the loss function as boosting proceeds.

The accuracy of the proposed algorithm, XGBoost with the Learned Loss boosting function, is evaluated using a train/test split, K-fold cross-validation, and stratified cross-validation, and is compared with state-of-the-art algorithms, namely XGBoost, AdaBoost, AdaBoost-NN, Linear Regression (LR), Neural Network (NN), Decision Tree (DT), Support Vector Machine (SVM), bagging-DT, bagging-NN, and Random Forest. The metrics evaluated are accuracy, Type 1 error, and Type 2 error (in percentages). The study uses historical data from Jan 2007 to Aug 2017, roughly ten years, for two stock market indices, CNX Nifty and S&P BSE Sensex, both of which are highly voluminous.
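The evaluation setup described above can be illustrated with a minimal Python sketch. It is not the author's code. The function names `error_rates` and `stratified_folds` are hypothetical, labels are assumed to be binary (0/1), and Type 1 and Type 2 error are assumed to mean the false positive rate and the false negative rate respectively, since the abstract does not define them.

```python
from collections import defaultdict


def error_rates(y_true, y_pred):
    """Return (accuracy, type1, type2) in percent for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    # Assumption: Type 1 error = false positives as a share of actual negatives.
    type1 = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    # Assumption: Type 2 error = false negatives as a share of actual positives.
    type2 = 100.0 * fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, type1, type2


def stratified_folds(labels, k):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
    print(error_rates(y_true, y_pred))
    print(stratified_folds([0, 0, 0, 0, 1, 1], k=2))
```

In stratified cross-validation each fold in turn serves as the test set while the remaining folds are used for training, and the three metrics are averaged over the folds. Plain K-fold does the same without grouping by class first.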

Further, in this research work, we investigate how XGBoost differs from more traditional ensemble techniques. We also discuss the regularization techniques that these methods offer and the effect they have on the models.

In addition, we attempt to answer the question of why XGBoost seems to win so many competitions. To do this, we provide arguments for why tree boosting, and XGBoost in particular, is such a highly effective and versatile approach to predictive modelling. The core argument is that tree boosting adaptively determines the local neighbourhoods of the model and thus takes the bias-variance trade-off into account during model fitting, with XGBoost adding improvements that handle this trade-off even more carefully.

Excerpt

CHAPTER I

Theoretical Foundations

1.1 Outline

1.1.1 AdaBoost

1.1.2 Gradient boosting

1.1.3 XGBoost

1.1.5 Comparison of Boosting Algorithms

1.1.6 Loss Functions in Boosting Algorithms

1.2 Motivation

1.3 Problem Statement

1.4 Scope and Main Objectives

1.5 Impact to the Society

1.6 Organization of the Book

CHAPTER II

Literature Review

2.1 History

2.2 XGBoost

2.3 Random Forest

2.4 AdaBoost

2.5 Loss Function

CHAPTER III

Proposed Work

3.1 Outline

3.2 Proposed Approach

3.2.1 Objective of XGBoost

3.2.2 Parameters

3.2.3 Parameters for Tree Booster

3.2.4 Learning Task Parameters

3.2.5 Training & Parameter tuning

3.2.6 What XGBoost Brings to the Table

3.2.7 Square Logistics Loss Function (SqLL)

CHAPTER IV

Results & Discussions

4.1 Outline

4.2 Dataset

4.3 Tools and Platforms

4.4 Feature Construction

4.5 Feature Selection

4.6 Training the Model

4.7 Evaluation Techniques

4.8 Analysis of Results

CHAPTER V

Summary, Recommendations, and Future Directions

5.1 Overview

5.2 Summary

5.3 Recommendations

5.4 Future Research Directions

Research Goals and Objectives

The primary aim of this research is to enhance the performance and predictive accuracy of Extreme Gradient Boosting (XGBoost) for large datasets by implementing a novel "Learned Loss" (LL) function, specifically utilizing a Squared Logistic Loss (SqLL) approach. The study seeks to investigate how this adaptation optimizes the bias-variance trade-off and improves classification results compared to traditional ensemble techniques like AdaBoost, Random Forest, and standard Gradient Boosting.

Application of Squared Logistic Loss (SqLL) to optimize Extreme Gradient Boosting for stock market index prediction.
Comparative performance analysis against state-of-the-art algorithms including Neural Networks, SVM, and Random Forest.
Evaluation of model efficacy using various statistical validation methods (K-fold, Stratified Cross Validation, and Train/Test splits).
Investigation into regularization techniques and their impact on reducing overfitting in complex modeling scenarios.

Excerpt from the Book

3.2.7 Square Logistics Loss Function (SqLL)

In this section, we introduce the loss function. The loss function measures the prediction error that we define for the problem at hand. We are ultimately interested in minimizing the expected loss, which is known as the risk. The function that minimizes the risk is known as the target function. This is the optimal prediction function we would like to obtain.
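In symbols, the risk and the target function can be written as follows. The notation is an assumption made here for illustration, since the excerpt does not fix one: L is the loss, (X, Y) is a random input-output pair, and f is a candidate prediction function.

```latex
R(f) = \mathbb{E}\left[ L\big(Y, f(X)\big) \right],
\qquad
f^{*} = \arg\min_{f} R(f)
```

Because the joint distribution of (X, Y) is unknown in practice, the risk is typically approximated by the average loss over the training sample.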

SqLL is convex, and convexity means that any local minimum is also a global minimum.
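The excerpt does not give the formula for SqLL. As a sketch only, the code below assumes one plausible reading, the square of the logistic loss log(1 + exp(-y * margin)) with labels y in {-1, +1}. The function names are hypothetical. It also returns the first and second derivatives with respect to the margin, which is the kind of information a gradient boosting objective needs. Under this assumed definition the second derivative is never negative, which is consistent with the convexity claim.

```python
import math


def logistic_loss(y, margin):
    """Logistic loss log(1 + exp(-y * margin)) for a label y in {-1, +1}."""
    z = y * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def squared_logistic_loss(y, margin):
    """Assumed SqLL: the logistic loss squared (not confirmed by the source)."""
    return logistic_loss(y, margin) ** 2


def sqll_grad_hess(y, margin):
    """First and second derivative of the assumed SqLL w.r.t. the margin."""
    z = y * margin
    ell = logistic_loss(y, margin)
    # s = 1 / (1 + exp(z)), computed without overflow for large |z|.
    if z >= 0:
        e = math.exp(-z)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(z))
    d1 = -s              # derivative of the logistic loss w.r.t. z
    d2 = s * (1.0 - s)   # second derivative of the logistic loss w.r.t. z
    grad = 2.0 * ell * d1 * y            # chain rule, dz/dmargin = y
    hess = 2.0 * (d1 * d1 + ell * d2)    # y * y == 1 for labels in {-1, +1}
    return grad, hess


if __name__ == "__main__":
    for m in (-2.0, 0.0, 2.0):
        print(m, squared_logistic_loss(1, m), sqll_grad_hess(1, m))
```

The Hessian is a sum of two non-negative terms, so it cannot be negative for any margin.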

Summary of Chapters

CHAPTER I: Provides an overview of machine learning, data mining, and the theoretical foundations of boosting techniques including AdaBoost, Gradient Boosting, and XGBoost.

CHAPTER II: Reviews the historical development of boosting algorithms, comparing Random Forest and various boosting methods in terms of performance and computational processes.

CHAPTER III: Details the proposed approach, including the definition of the XGBoost objective, parameter tuning, and the introduction of the Squared Logistics Loss Function (SqLL).

CHAPTER IV: Presents the experimental results and discussions, evaluating the proposed model against other algorithms using various datasets and validation techniques.

CHAPTER V: Summarizes the findings of the research, provides recommendations for utilizing the proposed approach in data mining, and suggests directions for future study.

Keywords

XGBoost, Gradient Boosting, Machine Learning, Data Mining, Squared Logistic Loss, SqLL, AdaBoost, Random Forest, Stock Market Prediction, Bias-Variance Trade-off, Ensemble Techniques, Predictive Modelling, Regularization, Cross Validation, Classification Accuracy.

Frequently Asked Questions

What is the core focus of this research?

This work primarily focuses on improving the predictive accuracy of the Extreme Gradient Boosting (XGBoost) algorithm by applying an adaptive procedure known as Learned Loss (LL), realized here as a Squared Logistic Loss (SqLL).

Which algorithms are used for comparison in this study?

The proposed algorithm is compared against standard XGBoost, AdaBoost, AdaBoost-NN, Linear Regression, Neural Networks, Decision Trees, Support Vector Machines, bagging (bagging-DT and bagging-NN), and Random Forest models.

What is the main research objective?

The goal is to increase the confidence and accuracy of classifications on large, voluminous datasets, specifically in stock market index modeling, by minimizing model error more effectively than standard loss functions do.

What scientific methodology is applied?

The study employs empirical data-driven modeling, testing models via K-fold cross-validation, Stratified cross-validation, and basic training/testing dataset splits to evaluate performance metrics like accuracy and Type 1/Type 2 errors.

What is covered in the main body of the work?

The main body covers the theoretical foundations of boosting, a literature review of ensemble techniques, the detailed methodology of the proposed SqLL-based XGBoost approach, and an empirical results section based on historical stock market data.

Which keywords best characterize this work?

Key terms include XGBoost, Gradient Boosting, Squared Logistic Loss (SqLL), Predictive Modelling, and Stock Market Prediction.

How does XGBoost handle overfitting according to the author?

XGBoost manages overfitting through specific mechanisms like early stopping, tree-level penalization, and the tuning of regularization parameters which control model complexity.

What is the significance of the Squared Logistic Loss (SqLL)?

SqLL is introduced because it is convex (so any local minimum is also a global minimum) and is reported in the results to achieve higher accuracy and stability in stock market forecasting than the traditional logistic loss function.

Excerpt out of 52 pages

Details

Title: XGBoost. The Extreme Gradient Boosting for Mining Applications
Grade: 8
Author: Nonita Sharma (Author)
Publication Year: 2017
Pages: 52
Catalog Number: V415839
ISBN (eBook): 9783668660601
ISBN (Book): 9783668660618
Language: English
Tags: xgboost extreme gradient boosting mining applications
Product Safety: GRIN Publishing GmbH

Quote paper: Nonita Sharma (Author), 2017, XGBoost. The Extreme Gradient Boosting for Mining Applications, Munich, GRIN Verlag, https://www.hausarbeiten.de/document/415839
