Dennis Aumiller*, Ashish Chouhan*, and Michael Gertz
Heidelberg University & SRH Hochschule Heidelberg
contact us at: {aumiller, chouhan, gertz}@informatik.uni-heidelberg.de
Find our dataset on the Hugging Face Hub: 🤗 eur-lex-sum
The data card also provides further insight into the acquisition process (and some limitations) of the data. Please refer to the Hugging Face Hub for more information.
A pre-print of our work is available. The paper has also been accepted at the main conference track of EMNLP 2022; the conference proceedings will be available in December 2022.
Install all necessary dependencies by running
python3 -m pip install -r requirements.txt
after cloning this repository.
This code base provides the scripts needed for the scraping process (Scraping/), the analysis of our corpus (Analysis/), and the final baseline experiments (Baselines/).
For a comparison of language-specific stats, please refer to Table 5 in our pre-print.
Dataset Name | Domain | Number of Languages | Average Tokens in Reference Text | Average Tokens in Summary Text (in words) | Compression Ratio | Dataset |
---|---|---|---|---|---|---|
EUR-Lex-Sum - Our Contribution | Legal | 24 | 12,200 (EN) | 799 (EN) | 16 | 🤗 |
BillSum (US) | Legal | 1 | 1382 | 2000 characters (words are not considered as tokens) | - | 🤗 |
BillSum (CA) | Legal | 1 | 1684 | 2000 characters (words are not considered as tokens) | - | 🤗 |
Global Voices | News | 15 | 359 | 51 | - | Paperswithcode |
WikiLingua | WikiHow | 18 | 391 | 39 | - | 🤗 |
Xwikis (comparable) | Wikipedia | 4 | 945 | 77 | EN: ~12.2 | 🤗 |
Xwikis (parallel) | Wikipedia | 4 | 972 | 76 | 18.35 | 🤗 |
Spektrum (Wiki) | Wikipedia | 2 | 1559 | 140 | 20 | |
Spektrum (Spektrum) | Scientific | 2 | 2337 | 361 | 30 | |
CLIDSUM (Chat) | Dialogue | 3 | 83.9 | 20.3 | - | |
CLIDSUM (Interview) | Dialogue | 3 | 1555.4 | 14.4 | - | |
MLSUM | News | 5 | (French) FR: 632.39 | FR: 29.5 | FR: 21.4 | 🤗 |
MLSUM | News | 5 | (German) DE: 570.6 | DE: 30.36 | DE: 18.8 | 🤗 |
MLSUM | News | 5 | (Spanish) ES: 800.50 | ES: 20.71 | ES: 38.7 | 🤗 |
MLSUM | News | 5 | (Russian) RU: 959.4 | RU: 14.57 | RU: 65.8 | 🤗 |
MLSUM | News | 5 | (Turkish) TU: 309.18 | TU: 22.88 | TU: 13.5 | 🤗 |
MLSUM | News | 5 | (English) EN: 790.24 | EN: 55.56 | EN: 14.2 | 🤗 |
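The compression ratio column relates reference length to summary length. The following is a minimal sketch of how such a ratio could be computed, not the procedure used for the table: it assumes whitespace tokenization and averages the per-document ratios, and the function names and toy texts are purely illustrative. Under that averaging choice, the ratio of the two column averages (for example 12,200 / 799 ≈ 15.3 for EUR-Lex-Sum) need not match the reported corpus-level value.

```python
from statistics import mean


def token_count(text: str) -> int:
    """Count whitespace-separated tokens (a simplifying assumption)."""
    return len(text.split())


def compression_ratio(reference: str, summary: str) -> float:
    """Reference length divided by summary length, both in tokens."""
    n_summary = token_count(summary)
    if n_summary == 0:
        raise ValueError("summary must contain at least one token")
    return token_count(reference) / n_summary


def corpus_compression_ratio(pairs) -> float:
    """Mean of the per-document ratios over (reference, summary) pairs."""
    return mean(compression_ratio(ref, summ) for ref, summ in pairs)


# Toy (reference, summary) pairs, invented for illustration only.
pairs = [
    ("a b c d e f g h", "a b"),                        # 8 / 2 = 4.0
    ("one two three four five six", "one two three"),  # 6 / 3 = 2.0
]

mean_of_ratios = corpus_compression_ratio(pairs)

# Alternative: divide the average reference length by the average summary length.
ratio_of_means = (
    mean(token_count(ref) for ref, _ in pairs)
    / mean(token_count(summ) for _, summ in pairs)
)

print(mean_of_ratios)  # 3.0
print(ratio_of_means)  # 2.8
```

The two aggregates differ even on this tiny example, which is why a corpus-level ratio should state which of the two it reports.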
If you use the dataset or other parts of this code base, please use the following citation for attribution:
@inproceedings{aumiller-etal-2022-eur,
title = {{EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain}},
author = "Aumiller, Dennis and
Chouhan, Ashish and
Gertz, Michael",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.519",
pages = "7626--7639"
}
The editorial content of the EUR-Lex website, the summaries of EU legislation, and the consolidated texts are owned by the EU and licensed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0), as stated on the official EUR-Lex website. Accordingly, any data artifacts remain licensed under CC BY 4.0.
Following the recommendation of Creative Commons, we apply a separate license to the software component of this repository: code artifacts are released under the standard MIT license.