
Implement HuggingFace Language Modeling Estimators #2336

Open
f4str opened this issue Nov 27, 2023 · 1 comment
Labels: enhancement (New feature or request)


f4str commented Nov 27, 2023

Is your feature request related to a problem? Please describe.
The next step of integrating HuggingFace into ART is to add support for the language modeling estimators. This involves creating ART estimator wrappers for the HuggingFace text models. These estimators should support the common language modeling tasks (e.g., masked LM, sequence classification, next sentence prediction).

Describe the solution you'd like
A new module, art.estimators.language_modeling, will be created to house all of the new HuggingFace language modeling estimators.

A new estimator will be created for each language modeling task (e.g., masked LM, sequence classification, next sentence prediction, etc.). Each estimator will be named accordingly (e.g., HuggingFaceMaskedLM, HuggingFaceSequenceClassificationLM, HuggingFaceNextSentencePredictionLM, etc.), because the expected input and output for each task are unique, as sketched below.
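
As a rough sketch of why the per-task split matters, the `predict` signatures below differ by task. The class names follow the proposal above; the method name, argument types, and return types are illustrative assumptions, not a finalized API:

```python
from typing import List, Tuple

import numpy as np


class HuggingFaceMaskedLM:
    """Masked LM: text containing mask tokens in, filled-in text out."""

    def predict(self, text: List[str]) -> List[str]: ...


class HuggingFaceSequenceClassificationLM:
    """Sequence classification: raw text in, per-class scores out."""

    def predict(self, text: List[str]) -> np.ndarray: ...


class HuggingFaceNextSentencePredictionLM:
    """Next sentence prediction: sentence pairs in, is-next scores out."""

    def predict(self, sentence_pairs: List[Tuple[str, str]]) -> np.ndarray: ...
```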

Each estimator will take in a HuggingFace model and the corresponding tokenizer, so the model and tokenizer are coupled in the same wrapper (see the sketch below). This is the simplest approach since the tokenizer is specific to the text model and is not very useful standalone for ART's use cases.
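
A minimal sketch of the coupled design, assuming a PyTorch HuggingFace model; the `predict` method and its fill-mask behavior are illustrative assumptions rather than the final ART interface:

```python
from typing import List

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


class HuggingFaceMaskedLM:
    def __init__(self, model, tokenizer):
        # Model and tokenizer are coupled inside the same wrapper.
        self.model = model
        self.tokenizer = tokenizer

    def predict(self, text: List[str]) -> List[str]:
        # Tokenization happens inside the estimator, so callers work
        # with plain strings rather than token IDs.
        inputs = self.tokenizer(text, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # Replace each mask token with the highest-scoring prediction.
        predictions = inputs["input_ids"].clone()
        mask = predictions == self.tokenizer.mask_token_id
        predictions[mask] = logits.argmax(dim=-1)[mask]
        return self.tokenizer.batch_decode(predictions, skip_special_tokens=True)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
estimator = HuggingFaceMaskedLM(model=model, tokenizer=tokenizer)
print(estimator.predict(["Paris is the [MASK] of France."]))
```

Because the tokenizer lives inside the wrapper, callers (and ART attacks/defenses) never have to pair a model with the right tokenizer by hand.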

Describe alternatives you've considered
The tokenizer could be made its own standalone module that is passed into the ART wrapper (a contrasting sketch follows). However, the tokenizer by itself is not very useful since it depends on the model (BERT, GPT-2, T5, etc.), and decoupling it adds unnecessary complexity to creating the language model. If needed, the tokenizer can always be decoupled from the model and made standalone at a later point.
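
For contrast, a minimal sketch of the standalone-tokenizer alternative; the `HuggingFaceTokenizer` wrapper and its `encode`/`decode` methods are hypothetical:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer


class HuggingFaceTokenizer:
    """Hypothetical standalone ART wrapper around a HuggingFace tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode(self, text):
        return self.tokenizer(text, return_tensors="pt", padding=True)

    def decode(self, token_ids):
        return self.tokenizer.batch_decode(token_ids, skip_special_tokens=True)


class HuggingFaceMaskedLM:
    def __init__(self, model, tokenizer: HuggingFaceTokenizer):
        # Callers must now construct two wrappers and keep them in sync,
        # which is the extra complexity the issue argues against.
        self.model = model
        self.tokenizer = tokenizer


# Two objects to build and pair by hand:
tokenizer = HuggingFaceTokenizer(AutoTokenizer.from_pretrained("bert-base-uncased"))
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
estimator = HuggingFaceMaskedLM(model=model, tokenizer=tokenizer)
```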

Additional context
The naming for the module and estimators is not finalized and open to suggestions.


OrsonTyphanel93 commented Nov 28, 2023

Hi @f4str, incorporating NLP into ART isn't a bad idea! I hope the goal will be to backdoor or poison these LLMs to better understand their potential vulnerabilities and flaws, because if the goal is simply to wrap HuggingFace models that are themselves built on vulnerable pre-trained models...

In short, a technical trade-off to bear in mind: a standalone tokenizer is less useful for ART's use cases (I think) because it is specific to a particular HuggingFace model, yet decoupling the tokenizer may also introduce unnecessary complexity into the estimator creation process.

Possible improvements: a mechanism for dynamically selecting the appropriate tokenizer based on the specified model (see the sketch below); automatic model loading to streamline the model preparation process; and integration with ART's tuning capabilities to support future HuggingFace models and tasks, which change on an almost monthly or quarterly basis, without disrupting ART's existing structure.
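
A sketch of the dynamic-selection idea using the `transformers` `Auto*` classes, which resolve the matching tokenizer and model from a checkpoint name; the `from_model` factory is a hypothetical convenience, not an existing ART API:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer


class HuggingFaceMaskedLM:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    @classmethod
    def from_model(cls, model_name: str) -> "HuggingFaceMaskedLM":
        # AutoTokenizer picks the tokenizer matching the checkpoint, so the
        # caller never has to pair model and tokenizer by hand.
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForMaskedLM.from_pretrained(model_name)
        return cls(model=model, tokenizer=tokenizer)


# One line replaces manual model/tokenizer pairing:
estimator = HuggingFaceMaskedLM.from_model("bert-base-uncased")
```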

Thanks! :)
