BookSum_Full BART Baseline script/code #66

Open

saxenarohit opened this issue Jul 12, 2024 · 4 comments

saxenarohit commented Jul 12, 2024

Hi,

Great work! Thanks for sharing the code.

I have been trying to replicate the simple BART baseline on BookSum_full, but I am unable to reproduce the results.

Could you share the code/script you used to train this model?
https://huggingface.co/abertsch/bart-base-booksum

I was able to replicate the BART baseline for all the other datasets except this one.
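
For concreteness, here is roughly the training setup I have been trying (a minimal sketch; the hyperparameters, dataset files, and column names below are my own guesses rather than values from the paper, which is why I'm asking for the exact script):

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Hypothetical loader: substitute however you load the BookSum-full splits,
# assumed here to be SCROLLS-style JSON with "input"/"output" columns.
dataset = load_dataset("json", data_files={"train": "booksum_train.json",
                                           "validation": "booksum_val.json"})

def preprocess(batch):
    # 1024 is BART's max encoder length; the label length is a guess.
    model_inputs = tokenizer(batch["input"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["output"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-base-booksum",
    learning_rate=3e-5,             # guess
    per_device_train_batch_size=2,  # guess
    gradient_accumulation_steps=8,  # guess
    num_train_epochs=10,            # guess
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```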

Thanks!

saxenarohit (Author) commented Jul 15, 2024

Hi @empanada11,

I meant that I was able to reproduce the BART baseline (not Unlimiformer) on the gov_report dataset. It seems to me that the BART baseline (not Unlimiformer) checkpoint for BookSum was first fine-tuned on BookSum.

I saw from another issue (#57) that people were not able to replicate the Unlimiformer results on gov_report; could you give an update on that issue? Let me try running that as well. Also, the datasets from tau/sled don't have a test set. Does that mean the paper reports the development scores?

abertsch72 (Owner) commented

Hey @saxenarohit! That's strange; thanks for flagging. What do you get when you train BART-base? And what library are you using to evaluate ROUGE?

> the datasets from tau/sled don't have a test set. Does that mean the paper reports the development score?

We report the test set scores using test sets from the original datasets, preprocessed to match the SCROLLS dataset formatting (e.g., for GovReport). We do this instead of submitting to the leaderboard because we didn't run all the SCROLLS tasks. There are also development set scores in the appendices, though, if you'd like to work off of those!
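
As a sketch of what that looks like (the file path here is illustrative, not the actual location of our preprocessed data):

```python
from datasets import load_dataset

# Illustrative path: a GovReport test split preprocessed into SCROLLS-style
# formatting, i.e. plain "input"/"output" columns.
test = load_dataset("json", data_files={"test": "gov_report_test_scrolls_format.json"})["test"]
print(test.column_names)  # expected, under this assumption: ['id', 'input', 'output']
```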

@empanada11, sorry to hear that! Can you share what your issue is?

saxenarohit (Author) commented Jul 16, 2024

Hi @abertsch72, thanks for your response.
I am using Hugging Face's `evaluate` library and getting `rouge1: 24.42, rouge2: 5.75, rougeL: 12.98, rougeLsum: 23.00` on the test set. These numbers are quite far off.
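
For completeness, this is essentially how I'm computing the scores (the prediction/reference lists below are placeholders for my generated and gold summaries):

```python
import evaluate

# Placeholders -- in my run these are the generated and gold full-book summaries.
preds = ["generated summary ..."]
refs = ["reference summary ..."]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
print({k: round(v * 100, 2) for k, v in scores.items()})  # scaled to match reported numbers
```
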
Can you please share the script/code/hyperparameters to replicate the results?

abertsch72 (Owner) commented

Hi @saxenarohit, sorry for the delay. That definitely sounds quite low. Is it possible you're generating outputs of fewer than 1024 tokens? I've been traveling, but I will dig up the BookSum code this weekend!
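
For reference, this is the kind of generation setting I mean; a sketch only, with an illustrative beam size rather than our exact decoding config:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("abertsch/bart-base-booksum")
model = AutoModelForSeq2SeqLM.from_pretrained("abertsch/bart-base-booksum")

book_text = "..."  # placeholder for the (truncated) book input

inputs = tokenizer(book_text, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    max_length=1024,  # BookSum references are long; a small cap truncates outputs and hurts ROUGE
    num_beams=4,      # illustrative
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```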
