EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain

Dennis Aumiller*, Ashish Chouhan*, and Michael Gertz
Heidelberg University & SRH Hochschule Heidelberg
contact us at: {aumiller, chouhan, gertz}@informatik.uni-heidelberg.de

Find our dataset on the Hugging Face Hub: 🤗 eur-lex-sum
The data card on the Hub also describes the acquisition process and some limitations of the data.
A pre-print of our work is available. The paper has also been accepted at the main conference track of EMNLP 2022; conference proceedings will be available in December 2022.

Installation

Install all necessary dependencies by running

python3 -m pip install -r requirements.txt

after cloning this repository.

This code base provides the scripts for the scraping process (Scraping/), the analysis of our corpus (Analysis/), and the final baseline experiments (Baselines/).

Comparison to Related Work

For a comparison of language-specific statistics, please refer to Table 5 in our pre-print.

Dataset Name	Domain	Number of Languages	Average Reference Length (word tokens)	Average Summary Length (word tokens)	Compression Ratio	Dataset Link
EUR-Lex-Sum - Our Contribution	Legal	24	12,200 (EN)	799 (EN)	16	🤗
BillSum (US)	Legal	1	1382	2000 characters (length is not measured in word tokens)	-	🤗
BillSum (CA)	Legal	1	1684	2000 characters (length is not measured in word tokens)	-	🤗
Global Voices	News	15	359	51	-	Paperswithcode
WikiLingua	WikiHow	18	391	39	-	🤗
Xwikis (comparable)	Wikipedia	4	945	77	EN: ~12.2	🤗
Xwikis (parallel)	Wikipedia	4	972	76	18.35	🤗
Spektrum (Wiki)	Wikipedia	2	1559	140	20
Spektrum (Spektrum)	Scientific	2	2337	361	30
CLIDSUM (Chat)	Dialogue	3	83.9	20.3	-
CLIDSUM (Interview)	Dialogue	3	1555.4	14.4	-
MLSUM	News	5	(French) FR: 632.39	FR: 29.5	FR: 21.4	🤗
			(German) DE: 570.6	DE: 30.36	DE: 18.8
			(Spanish) ES: 800.50	ES: 20.71	ES: 38.7
			(Russian) RU: 959.4	RU: 14.57	RU: 65.8
			(Turkish) TU: 309.18	TU: 22.88	TU: 13.5
			(English) EN: 790.24	EN: 55.56	EN: 14.2
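
The compression ratio column relates reference length to summary length. As a rough illustration, the sketch below computes it in two ways: as the mean of per-document ratios, and as the ratio of the corpus-level mean lengths. The two generally differ, which may be one reason a reported ratio does not equal the quotient of the two average-length columns (for example, 12,200 / 799 is roughly 15.3 for the EUR-Lex-Sum row). Whitespace tokenization, the helper names, and the toy data are assumptions made for illustration; please refer to the pre-print for the exact definition used for the table.

```python
from statistics import mean


def token_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplistic tokenizer)."""
    return len(text.split())


def compression_ratio(reference: str, summary: str) -> float:
    """Reference length divided by summary length, both in whitespace tokens."""
    summary_len = token_count(summary)
    if summary_len == 0:
        raise ValueError("summary must contain at least one token")
    return token_count(reference) / summary_len


def corpus_ratios(pairs):
    """Return (mean of per-document ratios, ratio of mean lengths)."""
    pairs = list(pairs)
    per_doc = mean(compression_ratio(ref, summ) for ref, summ in pairs)
    mean_ref = mean(token_count(ref) for ref, _ in pairs)
    mean_sum = mean(token_count(summ) for _, summ in pairs)
    return per_doc, mean_ref / mean_sum


# Toy (reference, summary) pairs; purely illustrative, not taken from the corpus.
toy_pairs = [
    ("a " * 100, "b " * 10),  # 100 reference tokens, 10 summary tokens
    ("a " * 600, "b " * 20),  # 600 reference tokens, 20 summary tokens
]
per_doc, of_means = corpus_ratios(toy_pairs)
print(f"mean of per-document ratios: {per_doc:.2f}")
print(f"ratio of mean lengths:       {of_means:.2f}")
```

On the toy data the two aggregates disagree, so it matters which one a table reports when comparing datasets.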

Cite our work

If you use the dataset or other parts of this code base, please use the following citation for attribution:

@inproceedings{aumiller-etal-2022-eur,
    title = {{EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain}},
    author = "Aumiller, Dennis  and
      Chouhan, Ashish  and
      Gertz, Michael",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.519",
    pages = "7626--7639"
}

License Information

The editorial content of the EUR-Lex website, the summaries of EU legislation, and the consolidated texts, for which the EU owns the copyright, are licensed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0), as stated on the official EUR-Lex website. Any data artifacts remain licensed under CC BY 4.0.

License for software component

Following the recommendation of Creative Commons, we apply a separate license to the software component of this repository. We use the standard MIT license for code artifacts.
