
DistilBERTurk training for question answering failed #25

Open
ekandemir opened this issue Jan 19, 2021 · 8 comments

@ekandemir
Hey, I tried to train the DistilBERTurk model for question answering using the run_squad.py script. After training, I got the following error during the evaluation stage:

Traceback (most recent call last):
  File "run_squad.py", line 838, in <module>
    main()
  File "run_squad.py", line 827, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 344, in evaluate
    start_logits, end_logits = output
ValueError: too many values to unpack (expected 2)

When I tried to discard the last value with start_logits, end_logits, _ = output, the error became:

Traceback (most recent call last):
  File "run_squad.py", line 839, in <module>
    main()
  File "run_squad.py", line 828, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 323, in evaluate
    output = [to_list(output[i]) for output in outputs.to_tuple()]
  File "run_squad.py", line 323, in <listcomp>
    output = [to_list(output[i]) for output in outputs.to_tuple()]
IndexError: tuple index out of range

I checked the model with samples from the dataset and the confidence levels were really low, mostly below 0.001. I assume the training didn't go right either.

I tried to train the original DistilBERT with the same script and the same dataset; it trained without error and the confidence levels were high.
I compared the layers, but both models looked the same. I also tried to load the model as a QA model and save it again, but the error still occurred.

Thank you so much.

@stefan-it (Owner)

Hi @ekandemir,

thanks for your interest and for using the distilled version 🤗

Could you specify the exact Transformers version that you're using for fine-tuning? 🤔

I'm currently using:

python run_qa.py \
  --model_name_or_path dbmdz/distilbert-base-turkish-cased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

with Transformers 4.3.0.dev0 (latest master).

(Yes, it is not a Turkish QA dataset, but fine-tuning is running).

Could you also paste the exact training command that you use?

@stefan-it (Owner)

stefan-it commented Jan 19, 2021

Oh, I just saw that you're using the legacy script. Is there any chance that you could use the new run_qa.py script?

I would be very interested in the Turkish QA dataset that you're using. If it's not available in the awesome Hugging Face datasets library, then maybe we could integrate it 🤗

@ekandemir (Author)

ekandemir commented Jan 20, 2021

Thanks for the quick answer. I've been trying to run the new script, but due to my Windows machine and network restrictions I couldn't get datasets to run well. It also didn't work with the Turkish SQuAD and my customized local dataset files.
I installed Transformers 4.3.0.dev0 (latest master) and ran the command

python run_squad.py \
  --model_type distilbert \
  --model_name_or_path ../distilbert-base-turkish-cased  \
  --do_train   \
  --do_eval  \
  --train_file tquad/train-v1.1.json  \
  --predict_file tquad/dev-v1.1.json  \
  --per_gpu_train_batch_size 8  \
  --learning_rate 3e-5  \
  --num_train_epochs 1.0   \
  --max_seq_length 384  \
  --doc_stride 128  \
  --output_dir "./tmp/debug"

But I got the same error.
I should probably find a way to run the new script, but if you have any guess as to why the old one crashes, I would be thankful to hear it.

The Turkish QA dataset is available in the TQuAD repository, and there are some example BERT models on Hugging Face that were fine-tuned with this dataset.
PS: The dataset is not exactly in SQuAD format, so it needs a slight preprocessing step; a rough sketch follows.
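
Roughly, the preprocessing I mean looks like this. It's only a sketch: the exact field deviations depend on the TQuAD release, so treat the key names below as assumptions and check them against the raw JSON first:

import json

def normalize_to_squad(in_path, out_path):
    # Sketch only: rewrites a TQuAD-style file into the canonical
    # SQuAD v1.1 layout; adjust key names after inspecting the raw file.
    with open(in_path, encoding="utf-8") as f:
        raw = json.load(f)

    articles = []
    for article in raw["data"]:
        paragraphs = []
        for para in article["paragraphs"]:
            qas = []
            for qa in para["qas"]:
                qas.append({
                    "id": str(qa["id"]),  # SQuAD tooling expects string ids
                    "question": qa["question"],
                    "answers": [
                        {"text": a["text"], "answer_start": int(a["answer_start"])}
                        for a in qa["answers"]
                    ],
                })
            paragraphs.append({"context": para["context"], "qas": qas})
        articles.append({"title": article.get("title", ""), "paragraphs": paragraphs})

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"version": "1.1", "data": articles}, f, ensure_ascii=False)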

Thanks again.

@stefan-it (Owner)

Hi @ekandemir,

after some debugging I can confirm that there's something strange with the configuration of my DistilBERT model. The root cause is in the model configuration: output_hidden_states=True, to be precise. This option is not set in the "official" distilbert-base-cased model, for example. With it, the model additionally outputs the hidden states, which is exactly what your error message "too many values to unpack (expected 2)" points at.
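
A minimal sketch of the effect (the model name is from this thread; the snippet is illustrative, not the script's actual code):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("dbmdz/distilbert-base-turkish-cased")
model = AutoModelForQuestionAnswering.from_pretrained(
    "dbmdz/distilbert-base-turkish-cased", output_hidden_states=True
)

inputs = tokenizer("Soru?", "Bağlam metni.", return_tensors="pt")
outputs = model(**inputs)

# With output_hidden_states=True, to_tuple() yields
# (start_logits, end_logits, hidden_states), so two-value unpacking
# raises "too many values to unpack (expected 2)":
# start_logits, end_logits = outputs.to_tuple()

# Accessing the logits by name sidesteps the extra entry:
start_logits, end_logits = outputs.start_logits, outputs.end_logits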

I will remove this option from the config; evaluation should then be fine (I checked it locally), and you should be able to use the old QA script.

I've also written a new Hugging Face datasets recipe for TQuAD, which I will integrate into the datasets library soon.

I'll report back here once I've changed the model configuration, @ekandemir!

(Thanks also to @sgugger for providing more information about that issue 🤗 )

@sgugger

sgugger commented Jan 20, 2021

On our side, we'll work on fixing the scripts so the error does not appear if the option output_hidden_states=True is set in the config.
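
A sketch of one way the unpacking can be made tolerant (not necessarily the exact patch we'll ship):

# Slice instead of unpacking, so appended hidden_states or
# attentions entries in the output tuple are simply ignored:
start_logits, end_logits = outputs.to_tuple()[:2]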

@ekandemir (Author)

Changing the config file solved the problem when training from the main model, thanks.
But adding an output_hidden_states = false entry didn't help with the already fine-tuned QA model, so I added a model.config.output_hidden_states = False line to run_squad.py at line 747 as a temporary solution; see the snippet below.
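
For reference, the workaround is a single line placed after the model is loaded (the exact surrounding code at line 747 may differ in your copy of the script):

# run_squad.py, around line 747, after the model is created:
model.config.output_hidden_states = False  # restore the plain 2-tuple output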

Thanks for your help.

@stefan-it (Owner)

stefan-it commented Jan 24, 2021

Hi @ekandemir, great to hear that it works with the old script now.

Here's a first draft of a Hugging Face Datasets recipe.

Just create a folder named squad_tr and a file squad_tr.py inside it with the following content:

from __future__ import absolute_import, division, print_function

import json

import datasets


# BibTeX citation
_CITATION = """
"""

_DESCRIPTION = """\
TQuAD: Turkish Question Answering Dataset
"""

_URL = "https://raw.githubusercontent.com/TQuad/turkish-nlp-qa-dataset/master/"
_URLS = {
    "train": _URL + "train-v0.1.json",
    "dev": _URL + "dev-v0.1.json",
}


class SquadTrConfig(datasets.BuilderConfig):
    """BuilderConfig for TSQuAD."""

    def __init__(self, **kwargs):
        """BuilderConfig for TSQuAD.

        Args:
          **kwargs: keyword arguments forwarded to super.
        """
        super(SquadTrConfig, self).__init__(**kwargs)


class SquadTr(datasets.GeneratorBasedBuilder):
    """TSQuAD dataset."""

    VERSION = datasets.Version("0.1.0")

    BUILDER_CONFIGS = [
        SquadTrConfig(
            name="v1.1.0",
            version=datasets.Version("1.0.0", ""),
            description="Plain text Turkish squad version 1",
        ),
    ]

    def _info(self):
        # Specifies the datasets.DatasetInfo object
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            # datasets.features.FeatureConnectors
            features=datasets.Features(
                {
                    # These are the features of your dataset like images, labels ...
                    "id": datasets.Value("string"),
                    "title": datasets.Value("string"),
                    "context": datasets.Value("string"),
                    "question": datasets.Value("string"),
                    "answers": datasets.features.Sequence(
                        {
                            "text": datasets.Value("string"),
                            "answer_start": datasets.Value("int32"),
                        }
                    ),
                }
            ),
            # If there's a common (input, target) tuple from the features,
            # specify them here. They'll be used if as_supervised=True in
            # builder.as_dataset.
            supervised_keys=None,
            # Homepage of the dataset for documentation
            homepage="https://github.com/TQuad/turkish-nlp-qa-dataset",
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        # Downloads the data and defines the splits.
        # dl_manager is a datasets.download.DownloadManager that can be
        # used to download and extract URLs.
        dl_dir = dl_manager.download_and_extract(_URLS)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": dl_dir["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": dl_dir["dev"]},
            ),
        ]

    def _generate_examples(self, filepath):
        """Yields examples."""
        # Yields (key, example) tuples from the dataset
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for example in data["data"]:
                title = example.get("title", "").strip()
                for paragraph in example["paragraphs"]:
                    context = paragraph["context"].strip()
                    for qa in paragraph["qas"]:
                        question = qa["question"].strip()
                        id_ = str(qa["id"])

                        answer_starts = [answer["answer_start"] for answer in qa["answers"]]
                        answers = [answer["text"].strip() for answer in qa["answers"]]

                        yield id_, {
                            "title": title,
                            "context": context,
                            "question": question,
                            "id": id_,
                            "answers": {
                                "answer_start": answer_starts,
                                "text": answers,
                            },
                        }

Then you can use the shiny new run_qa.py script, like:

$ python3 run_qa.py \
  --model_name_or_path dbmdz/distilbert-base-turkish-cased \
  --dataset_name ./squad_tr \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./output-squad-tr
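
To sanity-check the recipe before starting a full training run, you can load it directly; the split names come from the recipe above:

from datasets import load_dataset

# Points at the local folder that contains squad_tr.py
dataset = load_dataset("./squad_tr")
print(dataset)                          # DatasetDict with train/validation splits
print(dataset["train"][0]["question"])  # peek at a single example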

You may ask about good baseline comparisons: I recently found a great paper by @xplip and @JoPfeiff, "How Good is Your Tokenizer?", that also uses this QA dataset with the "normal" BERTurk model.

@medical-projects

cheaters gonna cheat :D typical :D
