from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, MaxAbsScaler

def data_scaling(scaling_strategy, scaling_data, scaling_columns):
    if scaling_strategy == "RobustScaler":
        scaling_data[scaling_columns] = RobustScaler().fit_transform(scaling_data[scaling_columns])
    elif scaling_strategy == "StandardScaler":
        scaling_data[scaling_columns] = StandardScaler().fit_transform(scaling_data[scaling_columns])
    elif scaling_strategy == "MinMaxScaler":
        scaling_data[scaling_columns] = MinMaxScaler().fit_transform(scaling_data[scaling_columns])
    elif scaling_strategy == "MaxAbsScaler":
        scaling_data[scaling_columns] = MaxAbsScaler().fit_transform(scaling_data[scaling_columns])
    else:
        # If any other strategy is sent by mistake, still perform Robust scaling
        scaling_data[scaling_columns] = RobustScaler().fit_transform(scaling_data[scaling_columns])
    return scaling_data

# RobustScaler is better at handling outliers :
scaling_strategy = ["RobustScaler", "StandardScaler", "MinMaxScaler", "MaxAbsScaler"]
X_train_scale = data_scaling(scaling_strategy[0], X_train_encode, X_train_encode.columns)
X_test_scale = data_scaling(scaling_strategy[0], X_test_encode, X_test_encode.columns)

# Display Scaled Train and Test Features :
display(X_train_scale.head())
display(X_train_scale.columns)
display(X_test_scale.head())
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance, as StandardScaler does.
Outliers, however, can strongly influence the sample mean and variance. RobustScaler, which uses the median and the interquartile range instead, often gives better results, as it did for this dataset.
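To see why, here is a minimal sketch (with made-up numbers) comparing the two scalers on a column containing one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy column with one extreme outlier (values are made up for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit_transform(x).ravel()
rob = RobustScaler().fit_transform(x).ravel()

# The outlier inflates the mean and standard deviation, so StandardScaler
# squashes the four inliers into a narrow band; RobustScaler centres on the
# median (3.0 -> 0) and scales by the IQR, keeping the inliers well spread.
print("StandardScaler inlier spread:", np.ptp(std[:4]))
print("RobustScaler   inlier spread:", np.ptp(rob[:4]))
```

The median maps exactly to 0 under RobustScaler, and the inlier spread stays much larger than under StandardScaler, which is why it coped better with this dataset's outliers.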
8. Create Baseline Machine Learning Model for Binary Classification Problem
# Baseline Model Without Hyperparameters :
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

Classifiers = {'0.XGBoost' : XGBClassifier(),
               '1.CatBoost' : CatBoostClassifier(),
               '2.LightGBM' : LGBMClassifier()}
# Fine Tuned Model With Hyperparameters :
Classifiers = {'0.XGBoost' : XGBClassifier(learning_rate=0.1, n_estimators=494, max_depth=5,
                                           subsample=0.70, verbosity=0, scale_pos_weight=2.5,
                                           updater="grow_histmaker", base_score=0.2),
               '1.CatBoost' : CatBoostClassifier(learning_rate=0.15, n_estimators=494,
                                                 subsample=0.085, max_depth=5,
                                                 scale_pos_weight=2.5),
               '2.LightGBM' : LGBMClassifier(subsample_freq=2, objective="binary",
                                             importance_type="gain", verbosity=1, max_bin=60,
                                             num_leaves=300, boosting_type='dart',
                                             learning_rate=0.15, n_estimators=494, max_depth=5,
                                             scale_pos_weight=2.5)}
We have now reached Modelling. To me, Modelling is the most interesting and exciting part of the whole Hackathon, but we need to understand that it is only 5–10% of the Data Science lifecycle.
Top winners of Kaggle and Analytics Vidhya Data Science Hackathons mostly use Gradient Boosting Machines (GBM).
1. LightGBM and its Hyperparameters
1. What is LightGBM?
 LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
2. How does it differ from other tree-based algorithms?
 LightGBM grows trees vertically while other algorithms grow trees horizontally: this algorithm grows the tree leaf-wise, while other algorithms grow it level-wise.
3. How does it Work?
 It will choose the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm (it picks the leaf it believes will yield the largest decrease in loss), but it is prone to overfitting.
LightGBM's creators claim it is faster than XGBoost, up to 20 times faster with the same performance.
Key LightGBM Hyperparameter(s) Tuned in this Hackathon:
1. scale_pos_weight = 2.5
scale_pos_weight, default = 1.0, type = double, constraints: scale_pos_weight > 0.0
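As an aside, a common heuristic for choosing scale_pos_weight is the ratio of negative to positive samples; the sketch below uses a hypothetical label vector (the ~2.45 ratio is illustrative, not this Hackathon's exact imbalance):

```python
import numpy as np

# Hypothetical imbalanced binary labels: 71 negatives, 29 positives
y = np.array([0] * 71 + [1] * 29)

# Heuristic: weight the positive class by the negative/positive ratio,
# so the minority class contributes comparably to the loss
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(float(scale_pos_weight), 2))  # 2.45
```

A value around 2.5, as used above, therefore corresponds to roughly 2.5 negatives per positive.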
2. boosting_type = 'dart'
boosting_type, default = gbdt, type = enum, options: gbdt, rf, dart, goss, aliases: boosting_type, boost
gbdt : traditional Gradient Boosting Decision Tree, aliases: gbrt ( Stable and Reliable )
rf : Random Forest, aliases: random_forest
dart : Dropouts meet Multiple Additive Regression Trees ( Used 'dart' for better accuracy, as suggested in the Parameter Tuning Guide for LGBM, and it worked very well for this Hackathon, though 'dart' is slower than the default 'gbdt' )
goss : Gradient-based One-Side Sampling
Note: internally, LightGBM uses gbdt mode for the first 1 / learning_rate iterations
3. n_estimators = 494
As per the Parameter Tuning Guide for LGBM, for better accuracy use a small learning_rate with a large num_iterations.
num_iterations, default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
number of boosting iterations
Note: internally, LightGBM constructs num_class * num_iterations trees for multiclass classification problems
4. learning_rate = 0.15
learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
shrinkage rate
in dart, it also affects the normalization weights of dropped trees
5. max_depth = 5
max_depth, default = -1, type = int
limit the max depth of the tree model; this is used to deal with overfitting when data is small. The tree still grows leaf-wise
<= 0 means no restriction
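To illustrate why capping depth fights overfitting, here is a small sketch using a plain sklearn DecisionTreeClassifier on made-up noisy data (not the Hackathon data or LightGBM itself):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up noisy data: the label depends only weakly on the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)            # no depth limit
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# The unrestricted tree keeps splitting until it memorises the noise,
# while the depth-capped tree is forced to stay simple
print("unrestricted depth:", deep.get_depth())
print("capped depth      :", shallow.get_depth())
```

The unrestricted tree reaches perfect training accuracy by fitting noise; the depth-5 tree cannot, which is exactly the regularizing effect max_depth=5 has on each boosted tree here.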
2. XGBoost and its Hyperparameters
1. What is XGBoost?
 XGBoost (eXtreme Gradient Boosting) is an implementation of gradient boosted decision trees designed for speed and performance.
 XGBoost is an algorithm that has recently been dominating machine learning Kaggle competitions for tabular data.
2. How does it differ from other tree-based algorithms?
 XGBoost makes use of a greedy algorithm (in conjunction with many other features).
3. How does it Work?
 XGBoost has an implementation that can produce a high-performing model trained on large amounts of data in a very short amount of time.
"XGBoost wins you Hackathons most of the time" is what Kaggle and Analytics Vidhya Hackathon winners claim!
Key XGBoost Hyperparameter(s) Tuned in this Hackathon
1. subsample = 0.70
subsample, default = 1
Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, and this prevents overfitting. Subsampling occurs once in every boosting iteration.
range: (0, 1]
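Conceptually, subsample = 0.5 amounts to the following sketch (illustration only, not XGBoost's internal implementation):

```python
import numpy as np

# Sketch of what subsample = 0.5 means: before growing each tree, draw half
# of the training rows without replacement, and fit that tree only on them
rng = np.random.default_rng(0)
n_rows, subsample = 1000, 0.5

idx = rng.choice(n_rows, size=int(n_rows * subsample), replace=False)
print(len(idx))  # 500 distinct rows feed this boosting iteration
```

Each boosting iteration draws a fresh sample, so every tree sees a different half of the data, which is where the regularizing effect comes from.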
2. updater = "grow_histmaker"
updater, default = grow_colmaker,prune
A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by a user. The following updaters exist:
grow_colmaker : non-distributed column-based construction of trees.
grow_histmaker : distributed tree construction with row-based data splitting based on the global proposal of histogram counting.
grow_local_histmaker : based on local histogram counting.
grow_quantile_histmaker : grow tree using a quantized histogram.
grow_gpu_hist : grow tree with GPU.
sync : synchronizes trees in all distributed nodes.
refresh : refreshes the tree's statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
prune : prunes the splits.
3. base_score = 0.2
base_score, default = 0.5
The initial prediction score of all instances, global bias.
For a sufficient number of iterations, changing this value will not have too much effect.
3. CatBoost and its Hyperparameters :
1. What is CatBoost?
 CatBoost is a high-performance open-source library for gradient boosting on decision trees.
 CatBoost is derived from two words: Category and Boosting.
2. Advantages of CatBoost over the other 2 models?
 Very high performance with little parameter tuning, as you can see in the code above compared to the other 2.
 Automatic handling of categorical variables via a special hyperparameter, "cat_features".
 Fast and scalable GPU version of CatBoost.
 In my experiments with Hackathons and real-world data, CatBoost is the most robust algorithm among the 3; check the score below for this Hackathon too.
3. How does it work better?
 CatBoost can handle categorical variables through 6 different methods of quantization, a statistical method that finds the best mapping of classes to numerical quantities for the model.
 The CatBoost algorithm is built in such a way that very little tuning is necessary, which leads to less overfitting and better generalization overall.
Key CatBoost Hyperparameter(s) Tuned in this Hackathon :
1. subsample = 0.085
Also known as the "sample rate for bagging"; it can be used if one of the following bootstrap types is selected: Poisson, Bernoulli, MVS.
The default value depends on the dataset size and the bootstrap type:
Datasets with fewer than 100 objects : default = 1
Datasets with 100 objects or more : Poisson, Bernoulli : default = 0.66; MVS : default = 0.80
By default, the method for sampling the weights of objects is set to "Bayesian". The training is performed faster if the "Bernoulli" method is set and the value for the sample rate for bagging is smaller than 1.
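The default rules above can be summarised in a small hypothetical helper (illustration only; CatBoost applies these defaults internally, this function is not part of its API):

```python
def default_subsample(n_objects, bootstrap_type):
    """Hypothetical helper mirroring the documented CatBoost defaults for
    the bagging sample rate (CatBoost applies these rules internally)."""
    if bootstrap_type not in ("Poisson", "Bernoulli", "MVS"):
        raise ValueError("subsample is not used with this bootstrap type")
    if n_objects < 100:
        return 1.0  # tiny datasets: use every object
    return 0.66 if bootstrap_type in ("Poisson", "Bernoulli") else 0.80  # MVS
```

The tuned value of 0.085 used here is far below any of these defaults, i.e. each bagging round samples only a small fraction of the objects.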
9. Ensemble with Voting Classifier to Improve the "F1-Score" and Predict Target "is_promoted"
from sklearn.ensemble import VotingClassifier

voting_model = VotingClassifier(estimators=[('XGBoost_Best', list(Classifiers.values())[0]),
                                            ('CatBoost_Best', list(Classifiers.values())[1]),
                                            ('LightGBM_Best', list(Classifiers.values())[2])],
                                voting='soft', weights=[5, 5, 5.2])
voting_model.fit(X_train_scale, y_train)
predictions_of_voting = voting_model.predict_proba(X_test_scale)[:, 1]
Max Voting using Voting Classifier: Max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction.
Example: If we ask 5 of our Readers to rate this Article (out of 5): We’ll assume three of them rated it as 5 while two of them gave it a 4. Since the majority gave a rating of 5, the final rating of this article will be taken as 5 out of 5. You can consider this similar to taking the mode of all the predictions.
Voting Classifier supports two types of voting:
Hard Voting : In hard voting, the predicted output class is the class that receives the majority of votes from the individual classifiers. Suppose 5 classifiers predicted the output classes (A, B, A, A, B); here the majority predicted A, so A will be the final prediction.
Soft Voting : In soft voting, the output class is the prediction based on the average of the probabilities given to that class. Suppose, for some input, three models give prediction probabilities for class A = (0.30, 0.47, 0.53) and for class B = (0.20, 0.32, 0.40). The average for class A is 0.4333 and for B is 0.3067, so the winner is clearly class A, because it has the highest average probability across the classifiers.
Note: We need to make sure to feed the Voting Classifier a variety of models, so that the errors made by one may be corrected by the others.
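The hard- and soft-voting examples above can be reproduced in a few lines (using the same made-up votes and probabilities):

```python
import numpy as np
from collections import Counter

# Hard voting: majority class among the predicted labels
votes = ["A", "B", "A", "A", "B"]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average each class's probability across the three models
prob_A = float(np.mean([0.30, 0.47, 0.53]))  # ~0.4333
prob_B = float(np.mean([0.20, 0.32, 0.40]))  # ~0.3067
soft_winner = "A" if prob_A > prob_B else "B"

print(hard_winner, soft_winner)  # both pick class A
```

sklearn's VotingClassifier with voting='soft' does essentially this averaging, additionally applying the per-model weights passed in the code above.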
10. Result Submission, Check Leaderboard & Improve “F1” Score
# Round off the Probability Results :
predictions = [int(round(value)) for value in predictions_of_voting]
# Create a Dataframe Table for Submission Purpose :
Result_Promoted = pd.DataFrame({'employee_id': test["employee_id"], 'is_promoted': predictions})
Result_Promoted.to_csv("result.csv",index=False)
Finally, we make a result submission by converting the DataFrame to a .csv file in the sample-submission format, with the column "employee_id" and the predictions we made using the VotingClassifier passed as values to "is_promoted".