Steps to run the code #33

Open
sahulsumra opened this issue Aug 31, 2023 · 5 comments

Comments

@sahulsumra

Can you explain to me how to run this code?

@abertsch72
Owner

Hi @sahulsumra, thanks for your interest in our work! Have you tried following the instructions for running in the readme?

@patrickocal

patrickocal commented Nov 6, 2023

Hi @abertsch72, my first problem was getting the conda environment set up from "requirements.txt". Not sure if you are working within a conda environment? Doing so might help to isolate exactly what needs installing.

So, aside from some packages that were missing from your "requirements.txt" file, I managed to get inference_example.py working fine. I'm working with a decent-sized GPU on the cluster, so it's fast, and I'm very happy with that. (Thumbs up.)

But I had a bunch of problems with "src/run.py". When I try to run:

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --model_name_or_path facebook/bart-base \
    --max_source_length 1024 \
    --eval_max_source_length 1024 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore

I get:

  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 1180, in <module>
    main()
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 437, in main
    seq2seq_dataset = _get_dataset(data_args, model_args, training_args)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 943, in _get_dataset
    seq2seq_dataset = load_dataset(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 1657, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 1515, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1022, in __init__
    super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 259, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 366, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig SLEDConfig(name='gov_report', version=1.0.0, data_dir=None, data_files=None, description='\n@inproceedings{huang-etal-2021-efficient,\n    title = "Efficient Attentions for Long Document Summarization",\n    author = "Huang, Luyang  and\n      Cao, Shuyang  and\n      Parulian, Nikolaus  and\n      Ji, Heng  and\n      Wang, Lu",\n    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",\n    month = jun,\n    year = "2021",\n    address = "Online",\n    publisher = "Association for Computational Linguistics",\n    url = "https://aclanthology.org/2021.naacl-main.112",\n    doi = "10.18653/v1/2021.naacl-main.112",\n    pages = "1419--1436",\n    abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",\n}') 
doesn't have a 'verification_mode' key.

I found that this verification_mode argument is fairly new in Hugging Face datasets. (The workaround, for now, seems to be to go to the run.py file, comment "verification_mode" out, and replace it with the older "ignore_verifications=True", which is deprecated in newer releases.)
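
Rather than editing run.py by hand for each installed version, one option is to check which keyword the installed `load_dataset` actually accepts. The sketch below is only an illustration: `verification_kwargs` and the two stand-in loaders are hypothetical names (not part of the repository), and it assumes that newer `datasets` releases expose `verification_mode` (accepting the string "no_checks") while older ones expose `ignore_verifications`.

```python
import inspect


def verification_kwargs(load_fn):
    """Return the keyword that disables verification for the given loader.

    Inspects the loader's signature instead of hard-coding a version number.
    """
    params = inspect.signature(load_fn).parameters
    if "verification_mode" in params:
        # Assumed newer-style argument.
        return {"verification_mode": "no_checks"}
    if "ignore_verifications" in params:
        # Assumed older-style argument.
        return {"ignore_verifications": True}
    return {}


# Stand-ins that mimic the two signatures, so the sketch runs without `datasets`.
def new_style_loader(path, name=None, verification_mode=None, **config_kwargs):
    return path


def old_style_loader(path, name=None, ignore_verifications=False, **config_kwargs):
    return path


print(verification_kwargs(new_style_loader))
print(verification_kwargs(old_style_loader))

# Hypothetical usage with the real library:
# from datasets import load_dataset
# ds = load_dataset("tau/sled", "gov_report", **verification_kwargs(load_dataset))
```

Upgrading `datasets` to a release that knows `verification_mode` may be the simpler fix, if the rest of the environment allows it.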

Hoping the above is useful to some.

@patrickocal

It's strange because I would expect the following lines from src/run.py

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.
    if training_args.do_train:
        column_names = seq2seq_dataset["train"].column_names

to pick up the column names from the ccdv/govreport-summarization dataset itself. Why doesn't it?

@patrickocal

Anyway, I got it working by rewriting your deduplicate function so that it assigns an "id" column instead. Most first-time users would be using this with a standard dataset such as ccdv/govreport-summarization, so there is no need for deduping. I also needed to change the column_names assignment within the run.py file. It seems a bit of work is needed to make this more streamlined and accessible.
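
For anyone attempting the same workaround, here is a minimal sketch of assigning an "id" column. It is not the change that was actually made to the repository: `add_id` is a hypothetical helper, and the "report" and "summary" column names are an assumption about ccdv/govreport-summarization.

```python
def add_id(example: dict, idx: int) -> dict:
    """Return a copy of the example with a string "id" derived from its index."""
    return {**example, "id": str(idx)}


# Plain-Python demonstration on made-up rows (column names are assumed).
rows = [
    {"report": "first report text", "summary": "first summary"},
    {"report": "second report text", "summary": "second summary"},
]
rows_with_ids = [add_id(row, i) for i, row in enumerate(rows)]
print([row["id"] for row in rows_with_ids])

# Hypothetical usage with a Hugging Face dataset, where `map` passes the
# row index as a second argument when with_indices=True:
# dataset = dataset.map(add_id, with_indices=True)
```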

@patrickocal

patrickocal commented Nov 6, 2023

Ah okay, I now understand. In your original gov_report.json file:

"dataset_name": "tau/sled",
"dataset_config_name": "gov_report",
"max_source_length": 16384,
"generation_max_length": 1024,
"max_prefix_length": 0,
"pad_prefix": false,
"num_train_epochs": 10,
"metric_names": ["rouge"],
"metric_for_best_model": "rouge/geometric_mean",
"greater_is_better": true

This selects the gov_report dataset within "tau/sled". Okay, that now makes sense. I will leave the above trail for others who go down the same rabbit hole. (I am still confused about what exactly an epoch is here. Why don't I see 17.5k when I run ccdv/govreport-summarization with "num_train_epochs": 1? Instead, I see 1000/8759, 2000/8759, ...)
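
A likely explanation for the 8759 figure is that the progress bar counts optimizer steps (batches), not individual examples. The sketch below works through the arithmetic under stated assumptions: a train split of 17,517 examples (consistent with the "17.5k" above, but not confirmed in this thread), a single GPU, no gradient accumulation, and the `--per_device_train_batch_size 2` from the command earlier. The helper name is hypothetical, and the formula is a simplification of what the trainer does.

```python
import math


def optimizer_steps_per_epoch(
    num_examples: int,
    per_device_batch_size: int,
    num_devices: int = 1,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Approximate number of optimizer steps in one pass over the data."""
    # The last batch may be partial, hence the ceiling.
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(batches // gradient_accumulation_steps, 1)


# Assumed train-split size of 17,517 examples, batch size 2, one device.
steps = optimizer_steps_per_epoch(17_517, per_device_batch_size=2)
print(steps)  # 8759
```

Under these assumptions, one epoch still covers every example; it is simply reported as 8759 steps of two examples each.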

Finally, at the end of training, I got the following error:

Traceback (most recent call last):
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 1181, in <module>
    main()
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 802, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/utils/custom_seq2seq_trainer.py", line 300, in evaluate
    output.metrics.update(self.compute_metrics(*eval_preds))
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 45, in __call__
    return self._compute_metrics(id_to_pred, id_to_labels)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 60, in _compute_metrics
    result = metric(id_to_pred_decoded, id_to_labels_decoded, is_decoded=True)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 27, in __call__
    return self._compute_metrics(id_to_pred, id_to_labels)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 158, in _compute_metrics
    return self._metric.compute(**self.convert_from_map_format(id_to_pred, id_to_labels), **self.kwargs)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/metric.py", line 419, in compute
    os.remove(file_path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/uqpocall/.cache/huggingface/metrics/bert_score/default/default_experiment-1-0.arrow'

Any suggestions for fixing this last one? Thanks in advance!
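
No fix is confirmed in this thread. One thing that may be worth trying: the missing cache file in the traceback is named "default_experiment-1-0.arrow", which suggests the metric was loaded with the default experiment id, so two evaluations (or two jobs sharing the same home directory) could be writing and deleting the same file. The sketch below generates a unique id to pass as `experiment_id` to `datasets.load_metric`. The helper name is hypothetical, how the repository loads its metrics is not shown here, and whether this resolves the error is an untested assumption.

```python
import os
import uuid


def unique_experiment_id(prefix: str = "bertscore") -> str:
    """Build an id that differs per process and per call."""
    return f"{prefix}-{os.getpid()}-{uuid.uuid4().hex[:8]}"


print(unique_experiment_id())

# Hypothetical usage (requires the `datasets` library and network access):
# from datasets import load_metric
# metric = load_metric("bertscore", experiment_id=unique_experiment_id())
```

Passing `keep_in_memory=True` to `load_metric` is another option to consider, since it avoids the on-disk cache file altogether at the cost of more memory.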
