
[Tutorial] Token classification tutorial for USPTO claims text with HF AutoTrain #5375

Open · wants to merge 45 commits into base: main

Conversation

bikash119
Contributor

Description

A tutorial on using Argilla to annotate USPTO claims text and training a token classification model on the annotated dataset with Hugging Face AutoTrain.

Closes #<issue_number>

Type of change

  • Documentation update

How Has This Been Tested?

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation


@bikash119 bikash119 changed the title token classification tutorial for USPTO claims text with HF AutoTrain [Tutorial] Token classification tutorial for USPTO claims text with HF AutoTrain Aug 5, 2024
Member

@davidberenstein1957 davidberenstein1957 left a comment


Hi, thanks for this PR. It looks very advanced. I believe the text is still split up by individual letters. Would you be able to fix that?

@bikash119
Contributor Author

Thank you @davidberenstein1957 for the review comments. I have modified the images and reran the notebook to display updated images.

@davidberenstein1957
Member

Hi @bikash119, could you also add an overview of how you can run inference with the model and log the predictions back into Argilla?

@bikash119
Contributor Author

Thank you for the suggestion @davidberenstein1957. I have updated the notebook to generate predictions and push them back to the Argilla dataset. Please share your feedback.
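For readers following along, the predict-and-log round trip could look roughly like the sketch below. The record fields, dataset name, and entity dicts are made-up placeholders for illustration (the entity dicts follow the transformers pipeline output format), and the Argilla v2 SDK calls are left commented out because they need a live Argilla server:

```python
from typing import Any

def predictions_to_records(
    texts: list[str], predictions: list[list[dict[str, Any]]]
) -> list[dict[str, Any]]:
    """Turn token-classification pipeline output into simple record dicts.

    Each prediction is a list of entity dicts with "start", "end", and
    "entity_group" keys, the format produced by a transformers pipeline
    with aggregation_strategy="simple".
    """
    records = []
    for text, entities in zip(texts, predictions):
        spans = [(e["start"], e["end"], e["entity_group"]) for e in entities]
        records.append({"text": text, "entities": spans})
    return records

# Hand-written sample in the pipeline output format (not real model output):
texts = ["A method comprising a widget."]
preds = [[{"start": 22, "end": 28, "entity_group": "COMPONENT",
           "score": 0.98, "word": "widget"}]]
records = predictions_to_records(texts, preds)

# Logging suggestions back requires a live Argilla server (v2 SDK):
# import argilla as rg
# client = rg.Argilla(api_url="https://...", api_key="...")
# dataset = client.datasets(name="uspto_claims")  # placeholder dataset name
# dataset.records.log(records)
```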

@davidberenstein1957
Member

Hi @bikash119, took some time to review again.

  • The "Create a Dataset with Argilla Python SDK" section does not contain an organized imports section; also, we might run all installs and organize all of the imports at the top of the notebook separately :)
  • Sometimes there is a bit too much output being printed, for example the "ERROR: pip's dependency resolver does " message and (None, None), which makes it a bit messy.
  • I think we can also reduce some clutter by not having too much commented-out code, for example the flow to switch from a dataset to a list, or comments like "# Load model directly".
  • You don't need a print statement to output a variable at the end of a notebook cell.
  • Sometimes you seem to use an indent of 4 spaces, and sometimes only 2.
  • Step 7: Push dataset to Hugging Face Hub pushes the dataset and loads it directly after, which seems a bit redundant; it also reprints the exact same data obtained before, so I think we can simplify that a bit.
  • It would be nice to add some type hints to functions.
  • Instead of from transformers import AutoTokenizer, AutoModelForTokenClassification, we might directly use the pipeline through from transformers import pipeline and load the model through there.
  • Some sections like "Using AutoTrain UI" and "Model Fine-tuning using AutoTrain" are very nicely documented, but others could use a bit more context, not too much but just enough to guide the story :)

Overall it is looking very nice! When we are done, we can post the blog on https://huggingface.co/blog and socials, and add a reference to it from our docs.
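The pipeline suggestion in the list above might look something like this sketch. The model id is a placeholder for whatever the AutoTrain run produced, and the model call is kept under a `__main__` guard because it downloads weights on first use:

```python
from typing import Any

def annotate(text: str, entities: list[dict[str, Any]]) -> str:
    """Render pipeline entity spans inline, e.g. 'a [widget](COMPONENT).'."""
    out, prev = [], 0
    for e in sorted(entities, key=lambda e: e["start"]):
        out.append(text[prev:e["start"]])
        out.append(f"[{text[e['start']:e['end']]}]({e['entity_group']})")
        prev = e["end"]
    out.append(text[prev:])
    return "".join(out)

if __name__ == "__main__":
    # One pipeline call replaces the AutoTokenizer/AutoModelForTokenClassification
    # pair; aggregation_strategy="simple" merges sub-word tokens into entity groups.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="your-username/your-autotrain-model",  # placeholder model id
        aggregation_strategy="simple",
    )
    claim = "A method comprising a widget."
    print(annotate(claim, ner(claim)))
```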

bikash119 added a commit to bikash119/argilla_autotrain that referenced this pull request Aug 9, 2024
@bikash119
Contributor Author

bikash119 commented Aug 9, 2024

Hi @davidberenstein1957 ,
Please let me know your feedback.

  • The "Create a Dataset with Argilla Python SDK" section does not contain an organized imports section; also, we might run all installs and organize all of the imports at the top of the notebook separately :)
  • Sometimes there is a bit too much output being printed, for example the "ERROR: pip's dependency resolver does " message and (None, None), which makes it a bit messy.
  • I think we can also reduce some clutter by not having too much commented-out code, for example the flow to switch from a dataset to a list, or comments like "# Load model directly".
  • You don't need a print statement to output a variable at the end of a notebook cell.

Added a DEBUG flag for the print statements. Let me know if this looks good; otherwise I will get rid of them. I wanted to keep them so that the audience can understand what each step is intended to do.
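A minimal sketch of the DEBUG-flag pattern described above; the flag name, env var, and helper are my own illustration, not necessarily what the notebook uses:

```python
import os

# Set to False (or export TUTORIAL_DEBUG=0) to silence the explanatory output.
DEBUG = os.environ.get("TUTORIAL_DEBUG", "1") == "1"

def debug_print(*args: object) -> None:
    """Print only when the notebook-level DEBUG flag is on."""
    if DEBUG:
        print(*args)

debug_print("step 1: tokenized claim into", 12, "tokens")
```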

  • Sometimes you seem to use an indent of 4 spaces, and sometimes only 2. Used 4 spaces consistently.
  • Step 7: Push dataset to Hugging Face Hub pushes the dataset and loads it directly after, which seems a bit redundant; it also reprints the exact same data obtained before, so I think we can simplify that a bit.
  • It would be nice to add some type hints to functions. Added docstrings for most of the functions.
  • Instead of from transformers import AutoTokenizer, AutoModelForTokenClassification, we might directly use the pipeline through from transformers import pipeline and load the model through there.
  • Some sections like "Using AutoTrain UI" and "Model Fine-tuning using AutoTrain" are very nicely documented, but others could use a bit more context, not too much but just enough to guide the story :)

Can you please help me with some pointers? I will add them. I feel I can add a few points for Argilla, but I am unable to come up with pointers to get started.

Thank you @davidberenstein1957 for the encouragement and guidance. I have learnt a lot in the process.

@bikash119
Contributor Author

Hi @davidberenstein1957, as we discussed during our meeting, I added context to:

  • the configure dataset step
  • the need for filter queries
  • the inference step
  • inserting predicted data into the Argilla dataset

Hope this aligns with our discussion points.

@davidberenstein1957
Member

Hi @bikash119, the text looks nice.

I would not use the DEBUG statements everywhere, but just print the outputs in certain cells where you feel that is needed. Also, you don't need a print statement to output a variable at the end of a cell; you can simply remove it.

```python
print(my_variable)  # will be printed
my_other_variable   # will not be printed (not the last expression in the cell)
my_variable         # will be printed (last expression in the cell)
```

@davidberenstein1957
Member

davidberenstein1957 commented Aug 19, 2024

We don't need to update the Dockerfile anymore:

> Update the Dockerfile:
> Go to https://huggingface.co/spaces///blob/main/Dockerfile
> Change FROM argilla/argilla-quickstart:v1.29.0 to FROM argilla/argilla-quickstart:v2.0.0rc2

In general, a redirect to https://docs.argilla.io/dev/getting_started/how-to-configure-argilla-on-huggingface/ might also be nice.

@bikash119
Contributor Author

bikash119 commented Aug 26, 2024

Hi @davidberenstein1957 ,
hope you're doing well! Just a friendly reminder that this PR is waiting for your review. Your input is valuable here, and we'd love to hear your thoughts. Let me know if you have any questions or need more context. Thanks!

Member

@davidberenstein1957 davidberenstein1957 left a comment


Hi, I think the blog looks great. Would you be able to request to join this organization: https://huggingface.co/blog-explorers? We can then copy the blog over to https://huggingface.co/blog and publish it there :)

@bikash119
Contributor Author

Thanks @davidberenstein1957. Request submitted. I will wait for the acceptance and follow up.

bikash119 and others added 25 commits September 4, 2024 18:10
Modified the markdown to get rid of colab style.
For some weird reason the colab styles are getting added to the notebook. Will check this later.