GBDTSelector is based on LightGBM, which is a gradient boosting framework that uses tree-based learning algorithms.
When passing the data into the GBDT model, the model will construct the boosting tree. And the feature importance comes from the score in construction, which indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model.
We could use this method as a strong baseline in Feature Selector, especially when using the GBDT model as a classifier or regressor.
For now, we support the
gain. But we will support customized
importance_type in the future, which means the user could define how to calculate the
feature score by themselves.
First you need to install dependency:
pip install lightgbm
from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector # load data ... X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) # initlize a selector fgs = GBDTSelector() # fit data fgs.fit(X_train, y_train, ...) # get improtant features # will return the index with important feature here. print(fgs.get_selected_features(10)) ...
And you could reference the examples in
Requirement of fit FuncArgs
X (array-like, require) - The training input samples which shape = [n_samples, n_features]
y (array-like, require) - The target values (class labels in classification, real numbers in regression) which shape = [n_samples].
lgb_params (dict, require) - The parameters for lightgbm model. The detail you could reference here
eval_ratio (float, require) - The ratio of data size. It’s used for split the eval data and train data from self.X.
early_stopping_rounds (int, require) - The early stopping setting in lightgbm. The detail you could reference here.
importance_type (str, require) - could be ‘split’ or ‘gain’. The ‘split’ means ‘ result contains numbers of times the feature is used in a model’ and the ‘gain’ means ‘result contains total gains of splits which use the feature’. The detail you could reference in here.
num_boost_round (int, require) - number of boost round. The detail you could reference here.
Requirement of get_selected_features FuncArgs
topk (int, require) - the topK impotance features you want to selected.