Fine-tune cross-lingual translator for text2text generation #27

Open
artitw opened this issue Jul 31, 2021 · 32 comments

@artitw
Owner

artitw commented Jul 31, 2021

Fine-tune cross-lingual translator for text2text generation tasks, e.g. question generation, question answering, summarization, etc. to demonstrate cross-lingual alignment, zero-shot generation, etc.

For example, can we demonstrate question generation or question answering using the existing API? If not, what needs to get fixed?

https://github.com/artitw/text2text#training--finetuning

artitw changed the title from "Test fine-tuning module to ensure functionality" to "Fine-tune cross-lingual translator for text2text generation" on Jul 31, 2021
@johnanisere

I will be working on this.

@artitw
Owner Author

artitw commented Aug 1, 2021

Awesome. For question generation, one approach to get started is to use the SQuAD dataset and pre-process it into context + answer -> question. Likewise, for question answering, pre-process it into context + question -> answer. This could then be used for the fine-tuning.
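
As a rough sketch of that pre-processing (the field names are those of the official SQuAD v1.1 JSON; the function name and the choice of the first gold answer are my own):

import json

def load_squad_triples(path):
    # Parse the SQuAD v1.1 JSON into (context, answer, question) triples,
    # one triple per question.
    with open(path) as f:
        squad = json.load(f)
    triples = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]["text"]  # first gold answer
                triples.append((context, answer, qa["question"]))
    return triples

# For question generation: input = context + answer, target = question.
# For question answering:  input = context + question, target = answer.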

@johnanisere

Here is the link to the Colab notebook for the analysis of the question answering and question generation API: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing

@artitw
Owner Author

artitw commented Aug 14, 2021

Thanks very much for sharing the notebook. I can recommend two things to try:

  1. Use the existing pre-trained question-answering model to evaluate the performance of the question generation model.
  2. Use the text2text fine-tuning API to see if we can get question generation to work on a pre-trained translator. Although the documentation provides an example for translating, there's nothing stopping us from using it for question generation. Depending on the results, we can dig deeper to understand how to develop the model further.

@johnanisere

Thanks Art.

Feedback:
After evaluating the performance of the question generation model with the question-answering model, my conclusion is that it is quite accurate. I have documented it here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt#scrollTo=BtnFEHUlProe&line=1&uniqifier=1

Blocker: When using the text2text fine-tuning API on a pre-trained translator, I run out of space. I have experienced this on both AWS and Colab (I get a 'no space left on device' error message). I would appreciate any help I can get. I have attached screenshots here.

@artitw
Owner Author

artitw commented Aug 22, 2021

  1. Would you be able to report the test set accuracy so that we can establish a benchmark? This would be useful for researchers as a better way to measure question generation performance.
  2. It looks like you are using the default model, which takes up a lot of space and memory. Try using a smaller model with the setting
    t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"
    This was tested on Google's colab environment.
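
For reference, the override has to happen before the handler is constructed, roughly like this (a sketch; the fit() arguments mirror the fine-tuning example given later in this thread):

import text2text as t2t

# Use the smaller 418M-parameter checkpoint instead of the default translator
# to reduce disk and memory usage.
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"

t2t.Handler(["I will go to school today to take my math exam. [SEP] school [TGT] Where will you go to take your math exam?"],
            src_lang="en",
            tgt_lang="en",
            num_epochs=1,
            save_directory="model_dir"
            ).fit()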

@johnanisere

@artitw Oh okay, got it.

@johnanisere

Thanks Art.
The question generation API actually works on a pre-trained translator. I was able to demonstrate it here. https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing

The next step is for me to report the test set accuracy so that we can establish a benchmark.

@artitw
Owner Author

artitw commented Sep 3, 2021

Reviewed the notebook. It looks like the fine-tuning was not performed on question generation data; rather, it was done using the example for translation. Could you try the following format? I updated the API in the repo to avoid confusion with the [SEP] token.

result = t2t.Handler(["I will go to school today to take my math exam. [SEP] school [TGT] Where will you go to take your math exam?"], 
            src_lang="en",
            tgt_lang="en",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()

@johnanisere

Oh I see

@johnanisere

Hi @artitw, I have gone back and redone the work, and the question generation API does work on a pre-trained translator. See here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing

What strategy do you recommend for benchmarking the test set accuracy?

@artitw
Owner Author

artitw commented Sep 19, 2021

For benchmarking, we can start with lowercasing the text and then calculating the exact match accuracy.

For finetuning a pretrained translator, we would have to use the translate (not question generation) API to generate the finetuned results.
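
A minimal sketch of that metric (assuming predictions and references are lists of plain strings; the whitespace stripping is my addition):

def exact_match_accuracy(predictions, references):
    # Lowercase both sides and count exact string matches.
    matches = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)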

@artitw
Owner Author

artitw commented Sep 26, 2021

In addition to exact match accuracy, it would be good to calculate average BLEU scores over the answers as well. For reference, see https://en.wikipedia.org/wiki/BLEU
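
A rough sketch of the averaging, using NLTK's sentence-level BLEU (whitespace tokenization and smoothing are my assumptions; smoothing matters because short answers often have no higher-order n-gram overlap):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def average_bleu(predictions, references):
    # One reference per prediction; score each pair and average.
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([ref.lower().split()], pred.lower().split(),
                      smoothing_function=smooth)
        for pred, ref in zip(predictions, references)
    ]
    return sum(scores) / len(scores)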

@johnanisere

Hi @artitw, after training with about 33 data points, the pre-trained translator is still just translating the payload.
Do you suggest I train with even more data?
Here is my result: https://colab.research.google.com/drive/1vJ5U_UNFxeu92VVyhAhxKSur_BZSJSIJ?usp=sharing

@artitw
Owner Author

artitw commented Sep 26, 2021

Thanks for sharing the notebook. It looks like the right direction, but I would expect it to need much more training (>10k examples). I would also recommend saving the intermediate results in Google Drive so that you can pick up where you left off without starting over.
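
In Colab, saving to Drive could look roughly like this (the Drive path is a placeholder; the fit() call reuses the format from the example above):

from google.colab import drive
import text2text as t2t

drive.mount('/content/drive')

# Placeholder directory on Drive; checkpoints written here survive runtime resets.
save_dir = "/content/drive/MyDrive/text2text_qg"

training_strings = [
    "I will go to school today to take my math exam. [SEP] school [TGT] Where will you go to take your math exam?",
]

t2t.Handler(training_strings,
            src_lang="en",
            tgt_lang="en",
            num_epochs=1,
            save_directory=save_dir
            ).fit()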

@johnanisere

@artitw oh okay, got it

@lere01
Contributor

lere01 commented Jan 3, 2022

Hi @artitw, I would like to continue from where John stopped.

@artitw
Owner Author

artitw commented Jan 3, 2022

Great, I've assigned you to this issue. Please review what John has done and let us know of any questions here.

@lere01
Contributor

lere01 commented Jan 3, 2022

Noted. I have reviewed John's work and played with the notebooks he reported. It seems that my assignments are the following, in order:

  1. Get sufficient (> 10k) training data.
  2. Report exact match accuracy.
  3. Report average BLEU scores for the answers.

Am I right? Do you have any suggestions on getting training data?

Thank you.

@artitw
Owner Author

artitw commented Jan 3, 2022

What you describe sounds like the right track. I would recommend starting with the English SQuAD [1] dataset and then using XQuAD [2] once that is somewhat working.

[1] https://rajpurkar.github.io/SQuAD-explorer/
[2] https://github.com/deepmind/xquad

@lere01
Contributor

lere01 commented Jan 18, 2022

Hi @artitw,

After trying different options that did not work out, I opted for Amazon SageMaker.

  • I loaded the datasets (JSON) to AWS S3
  • I dockerized the fine-tuning script and pushed the image to AWS ECR
  • I then created a training job on SageMaker using the Docker image as a custom algorithm (see the sketch below)

The job has been running for some hours, taking the SQuAD [1] dataset as input. I will keep you updated.
I could not get access to an HPC cluster, so I followed this approach. Please let me know what you think.
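
For what it's worth, a sketch of how such a job might be launched with the SageMaker Python SDK (the image URI, role ARN, S3 paths, and instance type are all placeholders, not the actual values used here):

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/text2text-finetune:latest",  # custom algorithm image in ECR
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/text2text/output",
)

# Point the training channel at the SQuAD JSON uploaded to S3.
estimator.fit({"train": "s3://my-bucket/text2text/train-v1.1.json"})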

@artitw
Owner Author

artitw commented Jan 20, 2022

Hi @lere01, what you describe sounds interesting. I would recommend using a small dataset to test your setup before running any heavy jobs.

@lere01
Contributor

lere01 commented Jan 30, 2022

Hi @artitw,

I used a small dataset to test my setup as you suggested and it worked fine. But the larger dataset took too long to run. I set the job to run for 5 days and even that time frame was not enough.

However, you can see some sort of proof of concept at https://colab.research.google.com/drive/1Vvem1DqNJZQej4t2qAIkZN0DyCdUY_sM#scrollTo=RXf2UrMvSc25.

  1. I used 50 rows from the training set to fine-tune.
  2. I then performed the translation task (answering) on 50 rows of the dev set.
  3. I calculated the BLEU score using the NLTK implementation and reported BLEU-1 through BLEU-4.
  4. 84% of the answers generated by the model were perfect matches for the references.

This was just to show that the whole process works. I would like your suggestion on how to proceed.

@artitw
Owner Author

artitw commented Feb 1, 2022

Hi @lere01,

Thanks for sharing your work and the summary. It looks like a good start. The main issue I can see is that the notebook you shared uses the Answerer model, not the fine-tuned translator you fit. We would have to perform predictions using the translator model because we are using it for an unintended purpose.

@lere01
Contributor

lere01 commented Feb 16, 2022

Hi @artitw

Hope you have had a good day. Two things.

1. Before going further, I want to let you know that I am fine-tuning using


t2t.Handler([f"{CONTEXT} [TGT] {QUESTION}"], 
            src_lang="en",
            tgt_lang="en",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()

AND NOT


t2t.Handler([f"{CONTEXT} [SEP] {ANSWER} [TGT] {QUESTION}"], 
            src_lang="en",
            tgt_lang="en",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()

Am I on the right track?

@lere01
Contributor

lere01 commented Feb 16, 2022

2. I dug into the codebase and figured out a way to use the GPU.


By editing the Translator and doing this:

import text2text as t2t
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class Translator(t2t.Transformer):
    def __init__(self, **kwargs):
        pretrained_translator = self.__class__.PRETRAINED_TRANSLATOR
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        self.__class__.model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_translator).to(torch_device)
        self.__class__.tokenizer = AutoTokenizer.from_pretrained(pretrained_translator)

What do you think?

@artitw
Owner Author

artitw commented Feb 16, 2022

The second approach should work, as we want to generate questions that correspond to a context and an answer.

@artitw
Owner Author

artitw commented Feb 16, 2022

Nice find. I am referencing your pull request here: #31

@lere01
Contributor

lere01 commented Feb 17, 2022

Hi @artitw,

The dataset we are using for fine-tuning has multiple questions attached to each context, as opposed to one question per context. Do you think this might be affecting the model's learning?

@artitw
Owner Author

artitw commented Feb 18, 2022

Yes, I would suggest that the context be concatenated with the answer for each target question. This would ensure that each unique question is mapped to a unique input to the model.
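
Concretely, a sketch of that mapping (the [SEP]/[TGT] format follows the fine-tuning example earlier in this thread; the helper name is mine):

def build_training_strings(triples):
    # triples: list of (context, answer, question).
    # Concatenating the answer to the context makes each input unique,
    # even when several questions share the same context.
    return [
        f"{context} [SEP] {answer} [TGT] {question}"
        for context, answer, question in triples
    ]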

@lere01
Contributor

lere01 commented May 30, 2022

Hi @artitw

I have been able to fine-tune on up to 50,000 examples of the training data (SQuAD 1.0). At 10,000, 20,000, and 50,000 examples, I tried the model on the dev set but got a BLEU score of 0 in all cases. Is this expected? Would you be able to take a look at my code to confirm that I am doing things right? I ran the code locally, but you can find it here: https://colab.research.google.com/drive/1z3YTjOF1dllxqSQPLgxDDeKOf9wJFfG3?usp=sharing

@artitw
Copy link
Owner Author

artitw commented May 30, 2022

@lere01 thanks for your efforts on this and for sharing the notebook. Code looks fine to me, so good job with that. Can you share the prediction results after 50k training? If those don't look promising, we might have to put this project on hold until we can figure out how to train it more.
