[Question] Imbalanced data for classifiers in classification tasks #79

omihub777 · 2024-09-26T11:27:31Z

Thank you for all your hard work.

I've noticed that the classifiers in ClassificationEvaluator, such as LogisticRegression and kNN, appear to be trained on the entire training set, even for extremely imbalanced data such as the amazon_counterfactual dataset, where 90% of the labels are 0 (stats-ja). The original MTEB addresses this by undersampling the training set to a balanced label distribution before fitting LogisticRegression.
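For reference, here is a minimal sketch of what I mean by undersampling to a balanced distribution before fitting. It is not the MTEB or JMTEB code: the helper name `undersample_balanced`, the toy random "embeddings", and the 900/100 split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def undersample_balanced(X, y, seed=0):
    """Hypothetical helper: keep the minority-class count for every label."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    n_per_label = counts.min()
    # Draw n_per_label row indices per label, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_per_label, replace=False)
        for label in labels
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Toy stand-in for sentence embeddings (assumption): 90% label 0, 10% label 1.
rng = np.random.default_rng(0)
y_train = np.array([0] * 900 + [1] * 100)
X_train = rng.normal(size=(1000, 16)) + y_train[:, None]

X_bal, y_bal = undersample_balanced(X_train, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```

With this toy split, the classifier is fitted on 200 examples (100 per label) instead of all 1000.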

Could you elaborate on your design choice for training on the entire dataset? Are there specific reasons for this approach? If I am missing something, feel free to correct me. Thank you!


lsz05 · 2024-11-11T10:17:07Z

Thank you for your question.

Simply put, we don't have enough evidence showing whether we should or should not undersample, so we chose the simpler option (doing nothing about the label balance). Running without undersampling didn't cause any critical problems, so we didn't consider it further.

Another point is that undersampling would greatly reduce the size of the training data. Since a statistical classifier is trained here, we prioritized data size over the risk posed by imbalanced labels.
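To make the size trade-off concrete, here is a rough calculation, assuming undersampling keeps the minority-class count for every label. The helper `balanced_size` and the absolute counts are hypothetical; only the 90/10 ratio comes from the question above.

```python
def balanced_size(label_counts):
    """Hypothetical helper: examples left after undersampling to the minority count."""
    return min(label_counts.values()) * len(label_counts)


# Hypothetical counts with the 90/10 label ratio mentioned above.
counts = {0: 9000, 1: 1000}
kept = balanced_size(counts)
fraction = kept / sum(counts.values())
```

Under these assumptions, `kept` is 2000 and `fraction` is 0.2, so four fifths of the training data would be discarded.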

If you have evidence that the label balance needs to be adjusted, would you share it with us?
