reproducing your results #57
Comments
The default config must be that the model picks up an existing training schedule where it left off. This would explain the lack of learning (improvement) in this training report, since I had some prior runs within the same folder: https://api.wandb.ai/links/unlimiformer-kg/y29tbk1n. Here is the current run, which looks much better (and is still improving):
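If that is what is happening, the fix on my end seems to be to start each run from a clean output directory or to explicitly disable checkpoint resumption. A minimal sketch, assuming the run script ends up in a standard Hugging Face Seq2SeqTrainer setup (the actual flag names in the repo may differ):

```python
# Minimal sketch: force a fresh run instead of resuming from a stale checkpoint.
# Assumes a standard Hugging Face Seq2SeqTrainingArguments / Seq2SeqTrainer setup;
# the unlimiformer run script may wrap these differently.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output_govreport_fresh",   # hypothetical path: use a new folder per run
    overwrite_output_dir=True,             # don't reuse whatever is already in output_dir
    # ... other hyperparameters as before ...
)

# Alternatively, when calling trainer.train(), pass resume_from_checkpoint=False
# so the trainer does not silently pick up the last checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=False)
```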
I've found that the key to getting near your results on the evaluation set is the length of the generated summary of the Long Document (LD). The default Unlimiformer training settings (as above, after simply cloning your repo) lead to short summaries of 70-130 words for the GovReport dataset. Unlimiformer did improve on plain bart-base without Unlimiformer, which is great, but aside from high precision the summaries were close to meaningless because they were so short relative to the targets (400-1000 words, with a few exceptions). Note that BART didn't learn to generate longer summaries: it would start high and then drop (or start low and then rise), but it would consistently converge to around 120-130 words on average.

I hope you don't mind me explaining a little more about my findings. My team and I have generated knowledge graphs (KGs) for each example in the GovReport dataset (https://huggingface.co/datasets/patrickocal/gov_report_kg). We trained bart-base (with Unlimiformer enabled) with KGs as input and, as a third experiment, with KG and LD combined. Note that the KGs are fed into the model as a single string of sequences. The KG (and KG+LD) experiments resulted in significantly longer summaries (600-1000 words, and just under 900 on average). R1 was double the Unlimiformer baseline, at 40, but otherwise the R2/RL/BF1 scores were pretty similar across the board. (Summary generation was significantly slower.) I think this surprising difference has something to do with bart-base and its treatment of BOS and EOS tokens.

But I also found that summary length is highly dependent on the conda environment. (I didn't plan to run these experiments, but my original conda environment was somehow corrupted and I didn't realise that there was a So now I have three conda environments: two with For the two

In any case, as a result, I have very nearly matched your results with this bizarre training process. The only thing that remains is to run the new model on the test set and submit to SCROLLS. @urialon and @abertsch72 and team: your insights would be very welcome!
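In case it helps anyone reproduce this, the quickest diagnostic is to look at the word-count distribution of the generated summaries and at the generation-length controls. A rough sketch of the kind of check I mean, assuming standard Hugging Face generate() keyword arguments (the Unlimiformer wrapper and the repo's own generation config may override some of these):

```python
# Rough sketch: inspect generated summary lengths and push the decoder past
# its ~120-130-word attractor. Assumes a standard Hugging Face BART checkpoint;
# the unlimiformer wrapper may expose generation differently.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-base"  # or a fine-tuned checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def summarize(document: str) -> str:
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(
        **inputs,
        max_length=1024,        # allow summaries well beyond ~130 tokens
        min_length=400,         # force the decoder to keep going
        length_penalty=2.0,     # >1.0 favours longer sequences under beam search
        num_beams=4,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Word-count check over a handful of documents (sample_docs is a placeholder list):
# lengths = [len(summarize(doc).split()) for doc in sample_docs]
# print(sum(lengths) / len(lengths), min(lengths), max(lengths))
```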
PS: micromamba for resolving conda environments: it just rocks.
I think I now understand: I think it is the
Hey @patrickocal -- apologies for the lack of response earlier on this, but this is a really interesting thread. Your knowledge graph setting is cool -- how are you generating these knowledge graphs? That jump in performance from swapping PyTorch is really wild -- I wonder if this is a general issue (did you see this with bart-base as well)?
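If you get a chance, could you dump the key library versions from each of your environments? Even a quick fingerprint like the sketch below (adjust for whatever packages actually differ between the envs) would help narrow down whether it's the torch build:

```python
# Quick environment fingerprint to compare across the conda/micromamba envs.
import torch
import transformers

print("torch:", torch.__version__)
print("cuda (as built):", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("transformers:", transformers.__version__)
print("gpu available:", torch.cuda.is_available())
```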
Hi folks, thanks for your help with understanding unlimiformer so far. My team and I are trying to reproduce your training results from the paper using the following:
My understanding is that we should be reproducing Table 4: (56.6 / 26.3 / 27.6 / 68.2) for (ROUGE-1 / ROUGE-2 / ROUGE-L / BERTScore). Here is a link to a wandb report of a full run we have produced (it took about 11 hours):
https://api.wandb.ai/links/unlimiformer-kg/y29tbk1n
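For what it's worth, here is roughly how we score our own checkpoints against those Table 4 numbers. This is only a sketch using the evaluate library's rouge and bertscore metrics and may not match the official SCROLLS scorer exactly; predictions and references are placeholder lists of decoded summary strings:

```python
# Sketch: scoring generated summaries with the `evaluate` library.
# `predictions` / `references` are placeholders for decoded model outputs and targets.
import evaluate

predictions = ["generated summary for document one ..."]          # placeholder
references = ["reference (target) summary for document one ..."]  # placeholder

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print("ROUGE-1:", rouge_scores["rouge1"])
print("ROUGE-2:", rouge_scores["rouge2"])
print("ROUGE-L:", rouge_scores["rougeL"])
print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))
```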
The max_source_length of 16384 is a concern given that the training set contains some enormous documents. The dataset has a very long tail: plenty are over 50k tokens, and one even reaches 250k tokens.
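For anyone who wants to check that tail themselves, a rough sketch (this assumes the SCROLLS copy of GovReport on the Hub, tau/scrolls with the gov_report config and its "input" field, and the bart-base tokenizer; the repo may load a different preprocessed copy):

```python
# Rough sketch: length distribution of GovReport source documents in BART tokens.
# Dataset name and field names below are assumptions (SCROLLS copy of GovReport).
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("tau/scrolls", "gov_report", split="train")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

lengths = sorted(
    len(tokenizer(example["input"], truncation=False)["input_ids"])
    for example in dataset
)

n = len(lengths)
print("median:", lengths[n // 2])
print("p95:", lengths[int(0.95 * n)])
print("max:", lengths[-1])
print("over 16384 tokens:", sum(length > 16384 for length in lengths))
```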
I'll let you know how a second run goes overnight. I've just cloned your repo and, just to be sure, here is a screenshot of my Slurm job: