Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review TextClassification task #1073

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

sdiazlor
Copy link
Contributor

No description provided.

Copy link

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1073/

Copy link

codspeed-hq bot commented Nov 28, 2024

CodSpeed Performance Report

Merging #1073 will improve performances by ×7.2

Comparing feat/review-text-classification (1e65f6a) with develop (63c75c5)

Summary

⚡ 1 improvements

Benchmarks breakdown

Benchmark develop feat/review-text-classification Change
test_cache_time 3,985.3 ms 550.3 ms ×7.2

@plaguss
Copy link
Contributor

plaguss commented Nov 28, 2024

Hi @sdiazlor! The thing with the n attribute is that it's needed for a task like TextClustering, where you want to enforce the same number of labels. I see that it can be useful to have it as single/multi label, but both this option and a predefined set of n options should be available, so an extra argument could work

Copy link
Member

@davidberenstein1957 davidberenstein1957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sdiazlor, I think you should also update the prompt template to make this work smoothly.

'"label"'
if self.n == 1
else "[" + ", ".join([f'"label_{i}"' for i in range(self.n)]) + "]"
"[" + ", ".join([f'"label_{i}"' for i in range(random.randint(1, 3))]) + "]"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a randomint 1,3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was doubting about the best approach as I didn't want to implicit a fixed number of labels. So, that's why I randomized it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a more fundamental change to this implementation on the level of the prompt template. We ideally want to have an arbitrary number of labels based on a potential set, where we should allow for 0 to n labels without it forcefully setting a fixed number.

@davidberenstein1957
Copy link
Member

@plaguss perhaps we don't know enough about the context of the paper/implementation but when would it be useful to set a fixed number? Normally in a mulit-label textcat setting, you would go for a random number of labels without enforcing the exact required number because it leads to mis-labelling.

@sdiazlor
Copy link
Contributor Author

@plaguss Thanks! I read the paper and it was more focused on structured generation rather than texcat, I guess we can more or less modify the task, right? So, would it be possible to optionally select between n or multi_label on an exclusive basis (updating the code/prompt conditionally), so TexClustering won't be broken, but for "standard textcat" we don't need to use n.

@plaguss
Copy link
Contributor

plaguss commented Nov 28, 2024

@plaguss perhaps we don't know enough about the context of the paper/implementation but when would it be useful to set a fixed number? Normally in a mulit-label textcat setting, you would go for a random number of labels without enforcing the exact required number because it leads to mis-labelling.

This is the task implementing the TextClustering: https://github.com/argilla-io/distilabel/blob/main/src/distilabel/steps/clustering/text_clustering.py, that's the reason to specify a given set of labels, the approach is a bit different to a given text classification problem, but instead it reads a bunch of texts and try to obtain a set of representative labels.

@plaguss
Copy link
Contributor

plaguss commented Nov 28, 2024

@plaguss Thanks! I read the paper and it was more focused on structured generation rather than texcat, I guess we can more or less modify the task, right? So, would it be possible to optionally select between n or multi_label on an exclusive basis (updating the code/prompt conditionally), so TexClustering won't be broken, but for "standard textcat" we don't need to use n.

Totally. As long as both options are available work it's perfect

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants