
New initial assumption #1350

Closed · wants to merge 161 commits

Conversation

@dmitryglhf (Collaborator) commented Dec 2, 2024

This is a 🔨 code refactoring.

Summary

This PR introduces the following key updates:

  • New Initial Assumptions: Updates initial assumptions by adding boosting-based solutions.

    Comparison table between old and new assumptions (validated on automlbenchmark):

    | Metric (mean)     | main     | gbm      |
    |-------------------|----------|----------|
    | auc               | 0.869263 | 0.879746 |
    | acc               | 0.84667  | 0.852339 |
    | balacc            | 0.805336 | 0.822745 |
    | logloss           | 0.449189 | 0.377827 |
    | training_duration | 242.554  | 251.445  |
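As a quick sanity check on the table above, the headline changes can be recomputed in a few lines of Python (the numbers are copied from the table; the snippet itself is illustrative and not part of the PR):

```python
# Metric values copied from the comparison table (mean over automlbenchmark).
main = {"auc": 0.869263, "acc": 0.84667, "balacc": 0.805336,
        "logloss": 0.449189, "training_duration": 242.554}
gbm = {"auc": 0.879746, "acc": 0.852339, "balacc": 0.822745,
       "logloss": 0.377827, "training_duration": 251.445}

def rel_change(metric):
    """Relative change of 'gbm' vs 'main' for a metric (positive = larger)."""
    return (gbm[metric] - main[metric]) / main[metric]

print(f"logloss: {rel_change('logloss'):+.1%}")                      # ~ -15.9%
print(f"training_duration: {rel_change('training_duration'):+.1%}")  # ~ +3.7%
```

So the new assumption cuts logloss by roughly 16% at the cost of about 4% extra training time.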

Context

Closes #1341

@dmitryglhf dmitryglhf requested a review from nicl-nno December 2, 2024 15:54
@pep8speaks commented Dec 2, 2024

Hello @dmitryglhf! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 419:67: W292 no newline at end of file

Line 567:69: W292 no newline at end of file

Line 146:74: W292 no newline at end of file

Line 278:78: W292 no newline at end of file

Comment last updated at 2025-01-10 09:38:30 UTC

```python
RIDGE = 'ridge'

# Parameters of models
models_params = {
```
Collaborator:
Are these some known-effective hyperparameters?

@dmitryglhf (Collaborator, author) replied Dec 3, 2024:

When testing pipelines on Kaggle with these hyperparameters, the metric value improved slightly. I stopped tuning whenever runtime degraded significantly or quality dropped. For CatBoost and the linear models the parameters were left at their defaults, since they performed well out of the box, while tuning worsened the metric and/or increased runtime (in CatBoost's case).

```python
.add_branch((CATBOOSTREG, models_params[CATBOOSTREG]),
            (XGBOOSTREG, models_params[XGBOOSTREG]),
            (LGBMREG, models_params[LGBMREG])) \
.join_branches(CATBOOSTREG, models_params[CATBOOSTREG])
```
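For readers unfamiliar with the builder API being discussed, the topology is three boosting branches trained in parallel and joined by a single ensembling node. A minimal plain-Python sketch of that structure (a hypothetical stand-in, not FEDOT's actual `PipelineBuilder`; the parameter dicts are placeholders):

```python
# Hypothetical stand-in for the pipeline builder, used only to illustrate
# the topology: three boosting branches joined by one CatBoost node.
class PipelineSketch:
    def __init__(self):
        self.branches = []     # parallel (model, params) branches
        self.join_node = None  # node that ensembles the branch outputs

    def add_branch(self, *nodes):
        self.branches.extend(nodes)
        return self

    def join_branches(self, node):
        self.join_node = node
        return self

# Placeholder params; the real PR carries tuned dicts per model.
models_params = {"catboostreg": {}, "xgboostreg": {}, "lgbmreg": {}}

pipeline = (PipelineSketch()
            .add_branch(("catboostreg", models_params["catboostreg"]),
                        ("xgboostreg", models_params["xgboostreg"]),
                        ("lgbmreg", models_params["lgbmreg"]))
            .join_branches(("catboostreg", models_params["catboostreg"])))
```

The review question below is exactly about the last call: whether the join node should be a linear model instead of CatBoost.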
Collaborator:
Why are the models joined by CATBOOSTREG rather than by a linear model?

@dmitryglhf (Collaborator, author):

I tried joining with Random Forest and with linear models. In both cases the metric got slightly worse. The result was similar when those models were placed before the branching instead. That testing was done on Kaggle rather than on the full benchmark, though; I think I should run more tests with a linear-model join here.

Collaborator:

Well, it was precisely the linear model that was praised in the presentation. Of course, both variants could be added to the population; with caching that is not very expensive computationally.

@dmitryglhf (Collaborator, author):

Alright, I'll test that.

@nicl-nno (Collaborator) commented Dec 2, 2024

[image]

Which model was here before, such that quality dropped this much?

@dmitryglhf (Collaborator, author) commented Dec 3, 2024

[image]

> Which model was here before, such that quality dropped this much?

Previously it was Scaling + RandomForest.

@nicl-nno (Collaborator) commented Dec 3, 2024

> Previously it was Scaling + RandomForest.

It seems it is still in the initial population. Is it not used at all, or does it just lose out during selection?

@dmitryglhf (Collaborator, author)

> It seems it is still in the initial population. Is it not used at all, or does it just lose out during selection?

It is still there, but it was not used: only the 'gbm' assumption was tested in the benchmark.

@dmitryglhf (Collaborator, author) commented Dec 11, 2024

| Metric            | main     | gbm_linear | linear_gbm_linear | linear_gbm_catboost |
|-------------------|----------|------------|-------------------|---------------------|
| auc               | 0.869263 | 0.879204   | 0.853338          | 0.848935            |
| acc               | 0.84667  | 0.851591   | 0.826598          | 0.821353            |
| balacc            | 0.805336 | 0.823641   | 0.79243           | 0.793106            |
| logloss           | 0.449189 | 0.381866   | 0.504397          | 0.492602            |
| training_duration | 242.554  | 1007.77    | 86.6531           | 99.8162             |
Full table and Details

Pipelines (ridge is used instead of logit in regression tasks):

'main' assumption: [image]

'gbm_linear' assumption: [image]

'linear_gbm_linear' assumption: [image]

'linear_gbm_catboost' assumption: [image]

main_vs_gl.csv
main_vs_lgl.csv
main_vs_lgc.csv

Chose gbm_linear as the result.

@dmitryglhf (Collaborator, author)

/fix-pep8

@aimclub aimclub deleted a comment from codecov bot Dec 11, 2024
codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.34%. Comparing base (b0618df) to head (50064a3).

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #1350   +/-   ##
=======================================
  Coverage   80.33%   80.34%           
=======================================
  Files         146      146           
  Lines       10464    10469    +5     
=======================================
+ Hits         8406     8411    +5     
  Misses       2058     2058           
```


@dmitryglhf (Collaborator, author)

Node fit times in gbm_linear for Kaggle_s4e6, n_jobs=16 (-1):

```
Node: scaling, fit time: 1.89 seconds
Node: catboost, fit time: 82.914 seconds
Node: lgbm, fit time: 7.416 seconds
Node: xgboost, fit time: 2.812 seconds
Node: logit, fit time: 0.551 seconds
2024-12-11 18:14:40,666 - ApiComposer - Initial pipeline was fitted in 106.3 sec.
```

@dmitryglhf (Collaborator, author)
| Metric (mean)     | main_scal_rf | gbm_linear | gbm_linear_catboost_with_rsm (new) | catboost_only_without_rsm (new) | rf_gbm_linear (new) |
|-------------------|--------------|------------|------------------------------------|---------------------------------|---------------------|
| auc               | 0.869263     | 0.879204   | 0.877655                           | 0.874172                        | 0.872379            |
| acc               | 0.84667      | 0.851591   | 0.852249                           | 0.84465                         | 0.839088            |
| balacc            | 0.805336     | 0.823641   | 0.824312                           | 0.815915                        | 0.80716             |
| logloss           | 0.449189     | 0.381866   | 0.383262                           | 0.36923                         | 0.647716            |
| training_duration | 242.554      | 1007.77    | 338.644                            | 883.467                         | 93.0425             |

Training duration is reduced relative to gbm_linear thanks to adding the parameter "rsm": 0.1 as a default for CatBoost (see "Speeding up CatBoost").
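For context: `rsm` (random subspace method) is CatBoost's per-split feature-sampling fraction, so `"rsm": 0.1` makes each split consider roughly 10% of the features. A minimal sketch of merging such a speed-oriented default into a parameter dict (the helper name is illustrative, not FEDOT code; the base params are placeholders):

```python
# Illustrative helper: add a speed-oriented CatBoost default to a params dict.
def with_rsm(params, fraction=0.1):
    """Return a copy of CatBoost params with per-split feature sampling.

    rsm=0.1 considers ~10% of features at each split, trading a little
    quality for a large cut in training time on wide datasets.
    """
    out = dict(params)
    out["rsm"] = fraction
    return out

base = {"n_jobs": -1, "num_trees": 3000}   # placeholder defaults
fast = with_rsm(base)                      # fast["rsm"] == 0.1
```

This matches the trade-off visible in the table: logloss moves only from 0.381866 to 0.383262, while training time drops from 1007.77 to 338.644 seconds.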

Full table and Details

catboost_only.csv
gbm_linear_ctbst_rsm.csv
rf_gbm_linear.csv

catboost_only_without_rsm: [image]

rf_gbm_linear: [image]

@dmitryglhf (Collaborator, author) commented Dec 24, 2024

| Metric (mean)     | main_scal_rf | gbm_linear | gbm_catboost_new_params | rf3_gbm_linear | xgb_lgbm_linear |
|-------------------|--------------|------------|-------------------------|----------------|-----------------|
| auc               | 0.869263     | 0.879204   | 0.879746                | 0.86393        | 0.877597        |
| acc               | 0.84667      | 0.851591   | 0.852339                | 0.837973       | 0.848727        |
| balacc            | 0.805336     | 0.823641   | 0.822745                | 0.805042       | 0.818386        |
| logloss           | 0.449189     | 0.381866   | 0.377827                | 0.625611       | 0.392734        |
| training_duration | 242.554      | 1007.77    | 251.445                 | 104.559        | 213.057         |
Full table and Details

catboost_new_params.csv
rf3_gbm_linear.csv
xgb_lgbm_linear.csv

gbm_catboost_new_params was tested with the following new default parameters for CatBoost:

```json
  "catboost": {
    "n_jobs": -1,
    "num_trees": 3000,
    "learning_rate": 0.03,
    "l2_leaf_reg": 1e-2,
    "bootstrap_type": "Bernoulli",
    "grow_policy": "SymmetricTree",
    "max_depth": 5,
    "min_data_in_leaf": 1,
    "one_hot_max_size": 10,
    "fold_permutation_block": 1,
    "boosting_type": "Plain",
    "od_type": "Iter",
    "od_wait": 100,
    "max_bin": 32,
    "feature_border_type": "GreedyLogSum",
    "nan_mode": "Min",
    "verbose": false,
    "allow_writing_files": false,
    "use_eval_set": true,
    "use_best_model": true,
    "enable_categorical": true
  },
  "catboostreg": {
    "n_jobs": -1,
    "num_trees": 3000,
    "learning_rate": 0.03,
    "l2_leaf_reg": 1e-2,
    "bootstrap_type": "Bernoulli",
    "grow_policy": "SymmetricTree",
    "max_depth": 5,
    "min_data_in_leaf": 1,
    "one_hot_max_size": 10,
    "fold_permutation_block": 1,
    "boosting_type": "Plain",
    "od_type": "Iter",
    "od_wait": 100,
    "max_bin": 32,
    "feature_border_type": "GreedyLogSum",
    "nan_mode": "Min",
    "verbose": false,
    "allow_writing_files": false,
    "use_eval_set": true,
    "use_best_model": true,
    "enable_categorical": true,
    "loss_function": "MultiRMSE"
  },
```
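As a side note (a sketch, not part of the PR), the two parameter blocks above are intended to differ only in the regression-specific loss; that invariant is easy to verify in plain Python:

```python
# Shared CatBoost defaults, values copied from the JSON fragment above.
shared = {
    "n_jobs": -1, "num_trees": 3000, "learning_rate": 0.03, "l2_leaf_reg": 1e-2,
    "bootstrap_type": "Bernoulli", "grow_policy": "SymmetricTree", "max_depth": 5,
    "min_data_in_leaf": 1, "one_hot_max_size": 10, "fold_permutation_block": 1,
    "boosting_type": "Plain", "od_type": "Iter", "od_wait": 100, "max_bin": 32,
    "feature_border_type": "GreedyLogSum", "nan_mode": "Min", "verbose": False,
    "allow_writing_files": False, "use_eval_set": True, "use_best_model": True,
    "enable_categorical": True,
}
catboost_params = dict(shared)
catboostreg_params = {**shared, "loss_function": "MultiRMSE"}

# The regression block should only add the loss function, nothing else.
extra_keys = set(catboostreg_params) - set(catboost_params)
```

Keeping the two blocks in sync this way avoids the classification and regression variants silently drifting apart.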

rf3_gbm_linear: [image]

xgb_lgbm_linear: [image]

@dmitryglhf dmitryglhf closed this Jan 23, 2025
Successfully merging this pull request may close these issues:

enh: Design effective initial assumption

3 participants