
New initial assumption #1350

Closed · wants to merge 161 commits

Conversation

@dmitryglhf (Collaborator) commented Dec 2, 2024

This is a 🔨 code refactoring.

Summary

This PR introduces the following key updates:

  • New Initial Assumptions: Updates initial assumptions by adding boosting-based solutions.

    Comparison table between old and new assumptions (validated on automlbenchmark):

    | Metric (mean)     | main     | gbm      |
    |-------------------|----------|----------|
    | auc               | 0.869263 | 0.879746 |
    | acc               | 0.84667  | 0.852339 |
    | balacc            | 0.805336 | 0.822745 |
    | logloss           | 0.449189 | 0.377827 |
    | training_duration | 242.554  | 251.445  |
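As a quick sanity check on the table above, the headline changes can be recomputed in a few lines of Python (the numbers are copied from the table; the snippet itself is illustrative and not part of the PR):

```python
# Metric values copied from the comparison table (mean over automlbenchmark).
main = {"auc": 0.869263, "acc": 0.84667, "balacc": 0.805336,
        "logloss": 0.449189, "training_duration": 242.554}
gbm = {"auc": 0.879746, "acc": 0.852339, "balacc": 0.822745,
       "logloss": 0.377827, "training_duration": 251.445}

def rel_change(metric):
    """Relative change of 'gbm' vs 'main' for a metric (positive = larger)."""
    return (gbm[metric] - main[metric]) / main[metric]

print(f"logloss: {rel_change('logloss'):+.1%}")                      # ~ -15.9%
print(f"training_duration: {rel_change('training_duration'):+.1%}")  # ~ +3.7%
```

So the new assumption cuts logloss by roughly 16% at the cost of about 4% extra training time.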

Context

Closes #1341

@dmitryglhf dmitryglhf requested a review from nicl-nno December 2, 2024 15:54
@pep8speaks commented Dec 2, 2024

Hello @dmitryglhf! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 419:67: W292 no newline at end of file

Line 567:69: W292 no newline at end of file

Line 146:74: W292 no newline at end of file

Line 278:78: W292 no newline at end of file

Comment last updated at 2025-01-10 09:38:30 UTC

```python
RIDGE = 'ridge'

# Parameters of models
models_params = {
```
Collaborator:
Are these some known-effective hyperparameters?

@dmitryglhf (Collaborator, author) replied Dec 3, 2024:

When testing pipelines on Kaggle with these hyperparameters, the metric value improved slightly. I stopped tuning whenever runtime degraded significantly or quality dropped. For CatBoost and the linear models the parameters were left at their defaults, since they performed well out of the box, while tuning worsened the metric and/or increased runtime (in CatBoost's case).

```python
.add_branch((CATBOOSTREG, models_params[CATBOOSTREG]),
            (XGBOOSTREG, models_params[XGBOOSTREG]),
            (LGBMREG, models_params[LGBMREG])) \
.join_branches(CATBOOSTREG, models_params[CATBOOSTREG])
```
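For readers unfamiliar with the builder API being discussed, the topology is three boosting branches trained in parallel and joined by a single ensembling node. A minimal plain-Python sketch of that structure (a hypothetical stand-in, not FEDOT's actual `PipelineBuilder`; the parameter dicts are placeholders):

```python
# Hypothetical stand-in for the pipeline builder, used only to illustrate
# the topology: three boosting branches joined by one CatBoost node.
class PipelineSketch:
    def __init__(self):
        self.branches = []     # parallel (model, params) branches
        self.join_node = None  # node that ensembles the branch outputs

    def add_branch(self, *nodes):
        self.branches.extend(nodes)
        return self

    def join_branches(self, node):
        self.join_node = node
        return self

# Placeholder params; the real PR carries tuned dicts per model.
models_params = {"catboostreg": {}, "xgboostreg": {}, "lgbmreg": {}}

pipeline = (PipelineSketch()
            .add_branch(("catboostreg", models_params["catboostreg"]),
                        ("xgboostreg", models_params["xgboostreg"]),
                        ("lgbmreg", models_params["lgbmreg"]))
            .join_branches(("catboostreg", models_params["catboostreg"])))
```

The review question below is exactly about the last call: whether the join node should be a linear model instead of CatBoost.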
Collaborator:
Why are the models joined by CATBOOSTREG rather than by a linear model?

@dmitryglhf (Collaborator, author):

I tried joining with Random Forest and with linear models. In both cases the metric got slightly worse. The result was similar when those models were placed before the branching instead. That testing was done on Kaggle rather than on the full benchmark, though; I think I should run more tests with a linear-model join here.

Collaborator:

Well, it was precisely the linear model that was praised in the presentation. Of course, both variants could be added to the population; with caching that is not very expensive computationally.

@dmitryglhf (Collaborator, author):

Alright, I'll test that.

@nicl-nno (Collaborator) commented Dec 2, 2024

[image]

Which model was here before, such that quality dropped this much?

@dmitryglhf (Collaborator, author) commented Dec 3, 2024

[image]

> Which model was here before, such that quality dropped this much?

Previously it was Scaling + RandomForest.

@nicl-nno (Collaborator) commented Dec 3, 2024

> Previously it was Scaling + RandomForest.

It seems it is still in the initial population. Is it not used at all, or does it just lose out during selection?

@dmitryglhf (Collaborator, author)

> It seems it is still in the initial population. Is it not used at all, or does it just lose out during selection?

It is still there, but it was not used: only the 'gbm' assumption was tested in the benchmark.

@dmitryglhf (Collaborator, author) commented Dec 11, 2024

| Metric            | main     | gbm_linear | linear_gbm_linear | linear_gbm_catboost |
|-------------------|----------|------------|-------------------|---------------------|
| auc               | 0.869263 | 0.879204   | 0.853338          | 0.848935            |
| acc               | 0.84667  | 0.851591   | 0.826598          | 0.821353            |
| balacc            | 0.805336 | 0.823641   | 0.79243           | 0.793106            |
| logloss           | 0.449189 | 0.381866   | 0.504397          | 0.492602            |
| training_duration | 242.554  | 1007.77    | 86.6531           | 99.8162             |
Full table and Details

Pipelines (ridge is used instead of logit in regression tasks):

'main' assumption: [image]

'gbm_linear' assumption: [image]

'linear_gbm_linear' assumption: [image]

'linear_gbm_catboost' assumption: [image]

main_vs_gl.csv
main_vs_lgl.csv
main_vs_lgc.csv

Chose gbm_linear as the result.

@dmitryglhf (Collaborator, author)

/fix-pep8

@aimclub aimclub deleted a comment from codecov bot Dec 11, 2024
codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.34%. Comparing base (b0618df) to head (50064a3).

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #1350   +/-   ##
=======================================
  Coverage   80.33%   80.34%           
=======================================
  Files         146      146           
  Lines       10464    10469    +5     
=======================================
+ Hits         8406     8411    +5     
  Misses       2058     2058           
```


@dmitryglhf (Collaborator, author)

Node fit times in gbm_linear for Kaggle_s4e6, n_jobs=16 (-1):

```
Node: scaling, fit time: 1.89 seconds
Node: catboost, fit time: 82.914 seconds
Node: lgbm, fit time: 7.416 seconds
Node: xgboost, fit time: 2.812 seconds
Node: logit, fit time: 0.551 seconds
2024-12-11 18:14:40,666 - ApiComposer - Initial pipeline was fitted in 106.3 sec.
```

@dmitryglhf (Collaborator, author)
| Metric (mean)     | main_scal_rf | gbm_linear | gbm_linear_catboost_with_rsm (new) | catboost_only_without_rsm (new) | rf_gbm_linear (new) |
|-------------------|--------------|------------|------------------------------------|---------------------------------|---------------------|
| auc               | 0.869263     | 0.879204   | 0.877655                           | 0.874172                        | 0.872379            |
| acc               | 0.84667      | 0.851591   | 0.852249                           | 0.84465                         | 0.839088            |
| balacc            | 0.805336     | 0.823641   | 0.824312                           | 0.815915                        | 0.80716             |
| logloss           | 0.449189     | 0.381866   | 0.383262                           | 0.36923                         | 0.647716            |
| training_duration | 242.554      | 1007.77    | 338.644                            | 883.467                         | 93.0425             |

Training duration is reduced relative to gbm_linear thanks to adding the parameter "rsm": 0.1 as a default for CatBoost (see "Speeding up CatBoost").
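For context: `rsm` (random subspace method) is CatBoost's per-split feature-sampling fraction, so `"rsm": 0.1` makes each split consider roughly 10% of the features. A minimal sketch of merging such a speed-oriented default into a parameter dict (the helper name is illustrative, not FEDOT code; the base params are placeholders):

```python
# Illustrative helper: add a speed-oriented CatBoost default to a params dict.
def with_rsm(params, fraction=0.1):
    """Return a copy of CatBoost params with per-split feature sampling.

    rsm=0.1 considers ~10% of features at each split, trading a little
    quality for a large cut in training time on wide datasets.
    """
    out = dict(params)
    out["rsm"] = fraction
    return out

base = {"n_jobs": -1, "num_trees": 3000}   # placeholder defaults
fast = with_rsm(base)                      # fast["rsm"] == 0.1
```

This matches the trade-off visible in the table: logloss moves only from 0.381866 to 0.383262, while training time drops from 1007.77 to 338.644 seconds.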

Full table and Details

catboost_only.csv
gbm_linear_ctbst_rsm.csv
rf_gbm_linear.csv

catboost_only_without_rsm: [image]

rf_gbm_linear: [image]

@dmitryglhf (Collaborator, author) commented Dec 24, 2024

| Metric (mean)     | main_scal_rf | gbm_linear | gbm_catboost_new_params | rf3_gbm_linear | xgb_lgbm_linear |
|-------------------|--------------|------------|-------------------------|----------------|-----------------|
| auc               | 0.869263     | 0.879204   | 0.879746                | 0.86393        | 0.877597        |
| acc               | 0.84667      | 0.851591   | 0.852339                | 0.837973       | 0.848727        |
| balacc            | 0.805336     | 0.823641   | 0.822745                | 0.805042       | 0.818386        |
| logloss           | 0.449189     | 0.381866   | 0.377827                | 0.625611       | 0.392734        |
| training_duration | 242.554      | 1007.77    | 251.445                 | 104.559        | 213.057         |
Full table and Details

catboost_new_params.csv
rf3_gbm_linear.csv
xgb_lgbm_linear.csv

gbm_catboost_new_params was tested with the following new default parameters for CatBoost:

```json
  "catboost": {
    "n_jobs": -1,
    "num_trees": 3000,
    "learning_rate": 0.03,
    "l2_leaf_reg": 1e-2,
    "bootstrap_type": "Bernoulli",
    "grow_policy": "SymmetricTree",
    "max_depth": 5,
    "min_data_in_leaf": 1,
    "one_hot_max_size": 10,
    "fold_permutation_block": 1,
    "boosting_type": "Plain",
    "od_type": "Iter",
    "od_wait": 100,
    "max_bin": 32,
    "feature_border_type": "GreedyLogSum",
    "nan_mode": "Min",
    "verbose": false,
    "allow_writing_files": false,
    "use_eval_set": true,
    "use_best_model": true,
    "enable_categorical": true
  },
  "catboostreg": {
    "n_jobs": -1,
    "num_trees": 3000,
    "learning_rate": 0.03,
    "l2_leaf_reg": 1e-2,
    "bootstrap_type": "Bernoulli",
    "grow_policy": "SymmetricTree",
    "max_depth": 5,
    "min_data_in_leaf": 1,
    "one_hot_max_size": 10,
    "fold_permutation_block": 1,
    "boosting_type": "Plain",
    "od_type": "Iter",
    "od_wait": 100,
    "max_bin": 32,
    "feature_border_type": "GreedyLogSum",
    "nan_mode": "Min",
    "verbose": false,
    "allow_writing_files": false,
    "use_eval_set": true,
    "use_best_model": true,
    "enable_categorical": true,
    "loss_function": "MultiRMSE"
  },
```
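As a side note (a sketch, not part of the PR), the two parameter blocks above are intended to differ only in the regression-specific loss; that invariant is easy to verify in plain Python:

```python
# Shared CatBoost defaults, values copied from the JSON fragment above.
shared = {
    "n_jobs": -1, "num_trees": 3000, "learning_rate": 0.03, "l2_leaf_reg": 1e-2,
    "bootstrap_type": "Bernoulli", "grow_policy": "SymmetricTree", "max_depth": 5,
    "min_data_in_leaf": 1, "one_hot_max_size": 10, "fold_permutation_block": 1,
    "boosting_type": "Plain", "od_type": "Iter", "od_wait": 100, "max_bin": 32,
    "feature_border_type": "GreedyLogSum", "nan_mode": "Min", "verbose": False,
    "allow_writing_files": False, "use_eval_set": True, "use_best_model": True,
    "enable_categorical": True,
}
catboost_params = dict(shared)
catboostreg_params = {**shared, "loss_function": "MultiRMSE"}

# The regression block should only add the loss function, nothing else.
extra_keys = set(catboostreg_params) - set(catboost_params)
```

Keeping the two blocks in sync this way avoids the classification and regression variants silently drifting apart.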

rf3_gbm_linear: [image]

xgb_lgbm_linear: [image]

@dmitryglhf dmitryglhf closed this Jan 23, 2025
Successfully merging this pull request may close these issues:

enh: Design effective initial assumption

3 participants