This repository contains data and links to resources related to the task of simplifying Dutch (municipal) texts.
According to recent research, 16% of people aged 16 to 65 in Amsterdam have low literacy skills. This hinders societal participation in tasks such as voting, paying taxes, reissuing documents, or applying for social benefits. Thus, as part of our Amsterdam for All project, we have set out on a mission to research the use of AI for measuring and improving the readability of municipal communication. We also hope to inspire others to work on text simplification and related technology.
Further information about the problem of measuring and improving the readability of municipal texts, and about how we use the datasets in this repository, can be found on the Amsterdam Intelligence website, as well as on Openresearch.
The City of Amsterdam has published a dataset for the task of sentence-level simplification of Dutch municipal texts. The dataset consists of 1311 sentence pairs automatically aligned from 50 documents manually simplified by communications experts.
The complex-simple-sentences folder contains the dataset, as well as further details about its creation.
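As a rough illustration of how aligned sentence pairs like these can be inspected, the sketch below reads complex/simple pairs and computes how much shorter the simplified sentences are on average. The tab-separated layout, the function names, and the two Dutch example pairs are assumptions made up for this sketch; the actual files in the folder may be structured differently.

```python
import csv
import io
from statistics import mean

# Hypothetical layout: one pair per line, complex sentence, a tab, simplified sentence.
# These two pairs are invented for illustration and are not taken from the dataset.
SAMPLE_TSV = (
    "U dient het formulier te retourneren.\tStuur het formulier terug.\n"
    "Indien u bezwaar wenst te maken, kunt u contact opnemen.\t"
    "Wilt u bezwaar maken? Neem dan contact op.\n"
)


def read_pairs(handle):
    """Return a list of (complex, simple) sentence pairs from a TSV file object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        # Skip malformed rows and rows where either side is empty.
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def length_ratio(pairs):
    """Mean ratio of simple to complex sentence length, in whitespace tokens."""
    return mean(len(simple.split()) / len(complex_.split()) for complex_, simple in pairs)


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
print(len(pairs), round(length_ratio(pairs), 2))
```

To run this on a real file, replace the `io.StringIO(SAMPLE_TSV)` object with an opened file handle, after checking the column order and delimiter against the folder's own documentation.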
In 2023, as part of her MSc AI thesis, Charlotte Van de Velde and other researchers from KU Leuven shared the following text simplification resources:
- a dataset of 1267 sentences automatically simplified using gpt-3.5-turbo
- base, small and large versions of a model fine-tuned on the above-mentioned dataset (using the corresponding UL2 models)
- a demo of the base version
The contextualized-lexical-simplification folder contains information about an evaluation dataset for Dutch Contextualized Lexical Simplification, as well as a reference to the corresponding paper proposing LSBertje, a new Dutch model for the task.
The City of Amsterdam maintains a list of complex words together with simpler alternatives.
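A word list of this kind can be used for a simple dictionary lookup that flags complex words and suggests alternatives. The sketch below is a minimal example under stated assumptions: the three lexicon entries and the function name `suggest_replacements` are made up for illustration and are not taken from the City of Amsterdam list.

```python
import re

# Hypothetical entries for illustration; not taken from the City of Amsterdam list.
SIMPLER = {
    "retourneren": "terugsturen",
    "indien": "als",
    "derhalve": "daarom",
}


def suggest_replacements(text, lexicon):
    """Return (word, alternative) for each word in text that appears in the lexicon."""
    suggestions = []
    for match in re.finditer(r"\w+", text):
        word = match.group(0)
        # Lookup is case-insensitive; the original casing is kept in the result.
        alternative = lexicon.get(word.lower())
        if alternative is not None:
            suggestions.append((word, alternative))
    return suggestions


print(suggest_replacements("Indien u het formulier wilt retourneren, bel ons.", SIMPLER))
# [('Indien', 'als'), ('retourneren', 'terugsturen')]
```

This lookup only matches exact word forms and ignores context, so inflected forms are missed and a suggestion may not fit the sentence. It is a way to flag candidates for review rather than to rewrite text automatically.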
A (partial) list of abbreviations from the municipal domain can be found on GitHub.
Due to the overlap in underlying technology, we propose taking inclusive language into account when detecting words or phrases that require substitution or providing suitable alternatives. The City of Amsterdam maintains a list of inclusive words as well as a full list of resources related to inclusive language.
According to its own website, "NT2Lex is a lexical database for Dutch as a foreign language (NT2) that includes frequency distributions of words observed in texts graded along the six-level scale of the Common European Framework of Reference for Languages. It is a receptive graded lexicon, with word frequencies observed in textbook reading activities and simplified readers targeting learners of Dutch."
There are also online tools for the analysis of the complexity of words in a text, as well as frequency distributions of words along the CEFR scale.
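One way to use per-level frequency distributions is to estimate the lowest CEFR level at which a word is attested. The sketch below assumes a hypothetical in-memory table with one frequency per level; the numbers are invented, and the real NT2Lex file format and columns are not reproduced here.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Made-up frequencies per CEFR level (A1..C2), for illustration only.
FREQUENCIES = {
    "huis": [120.0, 95.0, 80.0, 60.0, 40.0, 30.0],
    "bezwaar": [0.0, 0.0, 4.0, 9.0, 12.0, 10.0],
}


def first_attested_level(word, table):
    """Return the lowest CEFR level with a non-zero frequency, or None if unknown."""
    freqs = table.get(word.lower())
    if freqs is None:
        return None
    for level, freq in zip(CEFR_LEVELS, freqs):
        if freq > 0:
            return level
    return None


print(first_attested_level("bezwaar", FREQUENCIES))
```

Words that are missing from the table return `None`, which a readability tool could treat as a signal that the word is rare or domain-specific and worth reviewing.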
As part of a European Language Grid project, EDIA has developed a dataset of 1200 texts (from diverse data sources, including municipal texts such as the ones that can be found on the website of the City of Amsterdam). The texts are labelled with CEFR readability levels and are available under a CC-BY-NC license (academic purposes).
The dataset can be requested on the company's website.
Feel free to help out! Open an issue, submit a PR or contact us.
This repository was created by Amsterdam Intelligence for the City of Amsterdam.
The resources (linked) in this repository were compiled by colleagues at the City of Amsterdam, researchers from the University of Amsterdam and the Vrije Universiteit Amsterdam, and external partners and organizations.
We owe a special thank you to the Communications Department of the City of Amsterdam for providing us with valuable resources, and to Eliza Hobo, Daniel Vlantis and Ayoub Abdelouarit for their dedication.
This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).