Review TextClassification task #1073

sdiazlor · 2024-11-28T12:37:13Z

No description provided.


github-actions · 2024-11-28T12:38:45Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1073/

codspeed-hq · 2024-11-28T12:42:58Z

CodSpeed Performance Report

Merging #1073 will improve performances by ×7.2

Comparing `feat/review-text-classification` (1e65f6a) with `develop` (63c75c5)

Summary

⚡ 1 improvements

Benchmarks breakdown

| | Benchmark | `develop` | `feat/review-text-classification` | Change |
|---|---|---|---|---|
| ⚡ | `test_cache_time` | 3,985.3 ms | 550.3 ms | ×7.2 |

plaguss · 2024-11-28T12:47:28Z

Hi @sdiazlor! The `n` attribute is needed for a task like TextClustering, where you want to enforce the same number of labels. I see that a single/multi-label switch can be useful, but both that option and a predefined number of `n` labels should be available, so an extra argument could work.

davidberenstein1957

Hi @sdiazlor, I think you should also update the prompt template to make this work smoothly.

davidberenstein1957 · 2024-11-28T12:52:34Z

src/distilabel/steps/tasks/text_classification.py

-            '"label"'
-            if self.n == 1
-            else "[" + ", ".join([f'"label_{i}"' for i in range(self.n)]) + "]"
+            "[" + ", ".join([f'"label_{i}"' for i in range(random.randint(1, 3))]) + "]"


Why `random.randint(1, 3)`?

Yes, I wasn't sure about the best approach, as I didn't want to imply a fixed number of labels. That's why I randomized it.

I think we need a more fundamental change to this implementation, at the level of the prompt template. Ideally, the model should pick an arbitrary number of labels from a set of candidates: anywhere from 0 to n labels, without the prompt forcing a fixed number.
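To make the "0 to n labels from a potential set" idea concrete, here is a minimal sketch. It is not the code in this PR: the helper name `parse_labels` and the `{"labels": [...]}` output shape are assumptions chosen for illustration. It accepts any number of labels, from none up to the whole candidate set, and drops anything outside that set.

```python
import json
from typing import List, Sequence


def parse_labels(raw_output: str, allowed: Sequence[str]) -> List[str]:
    """Hypothetical helper: read {"labels": [...]} and keep only allowed labels.

    Returns between 0 and len(allowed) labels; no fixed count is enforced.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output is treated as "no labels".
        return []

    labels = data.get("labels", []) if isinstance(data, dict) else []
    if isinstance(labels, str):
        # Tolerate a single label given as a bare string.
        labels = [labels]
    if not isinstance(labels, list):
        return []

    allowed_set = set(allowed)
    seen = set()
    result: List[str] = []
    for label in labels:
        # Keep known labels only, preserving order and dropping duplicates.
        if isinstance(label, str) and label in allowed_set and label not in seen:
            seen.add(label)
            result.append(label)
    return result
```

With this shape, an empty list is a valid answer, so the model is never pushed to invent a label just to reach a required count.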

davidberenstein1957 · 2024-11-28T13:03:32Z

@plaguss, perhaps we don't know enough about the context of the paper/implementation, but when would it be useful to set a fixed number? Normally, in a multi-label textcat setting, you would allow a variable number of labels without enforcing an exact count, because enforcing one leads to mislabelling.

sdiazlor · 2024-11-28T13:12:06Z

@plaguss Thanks! I read the paper, and it was more focused on structured generation than on textcat, so I guess we can more or less modify the task, right? Would it be possible to choose between `n` and `multi_label` on a mutually exclusive basis (updating the code/prompt conditionally)? That way TextClustering won't break, and for "standard textcat" we don't need to use `n`.

plaguss · 2024-11-28T14:53:30Z

> @plaguss, perhaps we don't know enough about the context of the paper/implementation, but when would it be useful to set a fixed number? Normally, in a multi-label textcat setting, you would allow a variable number of labels without enforcing an exact count, because enforcing one leads to mislabelling.

This is the task implementing TextClustering: https://github.com/argilla-io/distilabel/blob/main/src/distilabel/steps/clustering/text_clustering.py. That's the reason for specifying a given set of labels. The approach is a bit different from a typical text classification problem: instead, it reads a batch of texts and tries to obtain a set of representative labels.

plaguss · 2024-11-28T14:54:41Z

> @plaguss Thanks! I read the paper, and it was more focused on structured generation than on textcat, so I guess we can more or less modify the task, right? Would it be possible to choose between `n` and `multi_label` on a mutually exclusive basis (updating the code/prompt conditionally)? That way TextClustering won't break, and for "standard textcat" we don't need to use `n`.

Totally. As long as both options are available and work, it's perfect.
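The proposal agreed on here could look roughly like the sketch below. This is an illustration under assumptions, not the implementation in this PR: the function name `label_format_example` is hypothetical, `is_multilabel` is borrowed from the commit message "update n by is_multilabel", and the multi-label placeholder string is made up. The fixed-`n` branch mirrors the expression removed in the diff above.

```python
from typing import Optional


def label_format_example(n: Optional[int] = None, is_multilabel: bool = False) -> str:
    """Hypothetical helper: build the example output shown to the model.

    `n` requests a fixed number of labels (the TextClustering use case);
    `is_multilabel` requests a variable-length list. They are mutually exclusive.
    """
    if n is not None and is_multilabel:
        raise ValueError("Pass either `n` or `is_multilabel`, not both.")
    if n is not None and n < 1:
        raise ValueError("`n` must be a positive integer.")

    if is_multilabel:
        # Open-ended list: no fixed count is suggested to the model.
        return '["label_0", "label_1", ...]'
    if n is None or n == 1:
        # Single-label classification.
        return '"label"'
    # Fixed number of labels, same shape as the original `self.n` branch.
    return "[" + ", ".join(f'"label_{i}"' for i in range(n)) + "]"
```

Rejecting the combination up front keeps the two modes from silently overriding each other, which is the "exclusive basis" asked for above.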


sdiazlor and others added 3 commits November 28, 2024 13:35

update n by is_multilabel

d203e14

update tests

fcf20c9

[pre-commit.ci] auto fixes from pre-commit.com hooks

5988844

for more information, see https://pre-commit.ci

sdiazlor requested review from davidberenstein1957 and plaguss November 28, 2024 12:44

davidberenstein1957 reviewed Nov 28, 2024


sdiazlor and others added 2 commits December 12, 2024 10:13

update with n for TextClustering

b1f3443

[pre-commit.ci] auto fixes from pre-commit.com hooks

1e65f6a

for more information, see https://pre-commit.ci
