Detection, Evaluation and Mitigation of Language Artefacts in the Competition On Legal Information Extraction and Entailment Dataset
This repository contains the code for the thesis work on detecting, evaluating, and mitigating language (dataset) artefacts in the Competition on Legal Information Extraction and Entailment (COLIEE) dataset.
The codebase is organized into separate folders, each containing the Python notebooks used to run the experiments:
data -> Placeholder folder for the datasets needed for the analysis.
src/data scripts -> Scripts for data analysis and data preprocessing.
src/detection -> Scripts for detecting artefacts in the dataset.
src/evaluation -> Scripts for evaluating the robustness of the BERT-based models.
src/mitigation -> Scripts for data augmentation that mitigates the contradiction-word and word-overlap artefacts.
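
To give a rough idea of what the two artefacts named above look like in a premise/hypothesis pair, here is a minimal sketch. It is not the code from the notebooks: the function names, the tokenizer, and the NEGATION_CUES list are all hypothetical and chosen only for illustration. It computes the share of hypothesis words that also appear in the premise (word overlap) and checks whether the hypothesis contains a negation cue (contradiction word).

```python
import re

# Hypothetical cue list for illustration; the thesis may use a different vocabulary.
NEGATION_CUES = {"not", "no", "never", "without", "cannot", "neither", "nor"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_overlap(premise, hypothesis):
    """Fraction of distinct hypothesis tokens that also occur in the premise."""
    hyp_tokens = set(tokenize(hypothesis))
    if not hyp_tokens:
        return 0.0
    return len(hyp_tokens & set(tokenize(premise))) / len(hyp_tokens)


def has_negation_cue(hypothesis):
    """True if any token of the hypothesis is in the (assumed) cue list."""
    return any(tok in NEGATION_CUES for tok in tokenize(hypothesis))


# Made-up example pair, not taken from the dataset.
premise = "A contract is formed when an offer is accepted."
hypothesis = "A contract is not formed when an offer is accepted."

overlap = word_overlap(premise, hypothesis)
negated = has_negation_cue(hypothesis)
print(f"overlap={overlap:.2f}, negation cue present={negated}")
```

A model that predicts the label from these two surface features alone, without reading the legal content, would be relying on the artefacts rather than on entailment reasoning.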