Train - test leakage #13

YannDubs · 2023-07-18T15:38:27Z

Hi, I looked into the OOD results, and many examples in the test sets also appear in the train sets. For example, DengueFilipino has identical train and test sets, and KirundiNews has about 90% overlap.

To reproduce:

from data import *
dataloaders = dict(DengueFilipino=load_filipino, 
                   KirundiNews=load_kirnews, 
                   KinyarwandaNews=load_kinnews, 
                   SwahiliNews=load_swahili)

for data_name, loader in dataloaders.items():
    train, test = loader()
    # Share of unique test examples that also appear in the train set.
    overlap = 1 - len(set(test) - set(train)) / len(set(test))
    print(data_name, f"train<->test overlap: {overlap * 100:.1f}%")

DengueFilipino train<->test overlap: 100.0%
KirundiNews train<->test overlap: 90.4%
KinyarwandaNews train<->test overlap: 23.8%
SwahiliNews train<->test overlap: 0.5%
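
For readers without the repo's loaders, here is a minimal, self-contained sketch of the same overlap measure on toy data. The function name `overlap_fraction` and the toy lists are hypothetical and not part of `data.py`. It assumes examples are hashable (strings or tuples) and that only exact matches count as overlap.

```python
def overlap_fraction(train, test):
    """Fraction of unique test examples that also occur in train (exact match)."""
    train_set, test_set = set(train), set(test)
    if not test_set:
        return 0.0  # avoid dividing by zero on an empty test set
    return len(test_set & train_set) / len(test_set)


# Toy data (hypothetical), not taken from any of the datasets above.
train = ["a", "b", "b", "c"]
test = ["b", "c", "c", "d", "e"]

# The same quantity written the way the reproduction script writes it.
script_form = 1 - len(set(test) - set(train)) / len(set(test))

print(f"{overlap_fraction(train, test) * 100:.1f}%")  # prints 50.0%
```

Because both forms work on sets, duplicates inside a split do not change the result: the denominator is the number of unique test examples.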

kts · 2023-07-18T16:54:25Z

Wow. It looks like the issue is in the original Hugging Face datasets. There are lots of duplicates too.

Here are the stats using only Hugging Face's datasets.load_dataset():

from datasets import load_dataset
from tabulate import tabulate

# dataset info from data.py:
# [name, args to load_dataset(), keys used on each item]
ds_info = [

    ("filipino",
     ['dengue_filipino'],
     ['text', 'absent', 'dengue', 'health', 'mosquito', 'sick']),

    ("kirnews",
     ["kinnews_kirnews","kirnews_cleaned"],
     ['label','title','content']),
    
    ("kinnews",
     ["kinnews_kirnews", "kinnews_cleaned"],
     ['label','title','content']),

    ("swahili",
     ['swahili_news'],
     ['label','text']),

]

lines = []
for name,args,keys in ds_info:

    ds = load_dataset(*args)

    # convert to list-of-tuples:
    train = [tuple([item[key] for key in keys]) for item in ds['train']]
    test  = [tuple([item[key] for key in keys]) for item in ds['test']]

    lines.append(name)

    n_overlap = len(set(train).intersection(test))
    lines.append(tabulate([
        ("train:",        len(train)),
        ("train unique:", len(set(train))),
        ("test:",         len(test)),
        ("test unique:",  len(set(test))),
        
        ("train/test overlap:", n_overlap,
         "%.1f%%" % (100.0 * n_overlap / len(set(test)))),
    ]))
    lines.append("\n")
    
print("\n".join(lines))

filipino
-------------------  ----  ------
train:               4015
train unique:        3947
test:                4015
test unique:         3947
train/test overlap:  3947  100.0%
-------------------  ----  ------


kirnews
-------------------  ----  -----
train:               3689
train unique:        1791
test:                 923
test unique:          698
train/test overlap:   631  90.4%
-------------------  ----  -----


kinnews
-------------------  -----  -----
train:               17014
train unique:         9199
test:                 4254
test unique:          2702
train/test overlap:    643  23.8%
-------------------  -----  -----


swahili
-------------------  -----  ----
train:               22207
train unique:        22207
test:                 7338
test unique:          7338
train/test overlap:     34  0.5%
-------------------  -----  ----
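
One possible way to work around splits like these is to drop duplicates and leaked examples from the train split before training. The sketch below is only an illustration under stated assumptions: `clean_train` is a hypothetical helper (not part of this repo or of the `datasets` library), examples are assumed to be hashable tuples like the ones built in the script above, and only exact matches are removed.

```python
def clean_train(train, test):
    """Return train without duplicates and without any example found in test.

    The order of first occurrences is preserved.
    """
    blocked = set(test)  # examples that must not leak into train
    seen = set()         # train examples already kept
    cleaned = []
    for example in train:
        if example in blocked or example in seen:
            continue
        seen.add(example)
        cleaned.append(example)
    return cleaned


# Toy data (hypothetical): (label, text) tuples.
train = [("sports", "t1"), ("sports", "t1"), ("news", "t2"), ("news", "t3")]
test = [("news", "t2"), ("health", "t4")]

cleaned = clean_train(train, test)
print(cleaned)  # [('sports', 't1'), ('news', 't3')]
```

Note that on a split like `filipino` above, where every unique train example also appears in test, this filter would leave an empty train set, so it is no substitute for obtaining the original data.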

ljvmiranda921 · 2023-07-18T23:40:26Z

I suggest that the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines, and I noticed the same upload issue (basically, the train and test sets are a 1:1 match).

I wrote a parser and some personal notes (file docstring) here. The parser uses some spaCy primitives but feel free to use this as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py

bazingagin · 2023-07-20T20:51:03Z

Hi @YannDubs, wow, thanks for pointing this out!!! I was only aware of the dataset issue with DengueFilipino. Thanks @kts for verifying the Hugging Face dataset issue. People should be aware of it and use the original links for those datasets. I will redo the Filipino experiment using the original link @ljvmiranda921 provided.
I will also check whether the KirundiNews overlap is present in the original dataset.
Thanks again!

bazingagin · 2023-08-01T02:11:51Z

Here are results using the original DengueFilipino dataset.
I also checked the original Kirundi dataset, and it still has the data contamination issue.

maoxuxu · 2024-07-18T11:32:47Z

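I suggest that the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I am also working on some Tagalog pipelines, and I noticed the same upload issue (basically, the train and test sets are a 1:1 match).

I wrote a parser and some personal notes (in the file docstring) here. The parser uses some spaCy primitives, but feel free to use it as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py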

Hello, could you provide the Filipino dataset? I cannot download it from the original link. Thank you very much.

kts mentioned this issue Jul 20, 2023

Problem with accuracy calculation? #3

Open

EliahKagan mentioned this issue Jul 29, 2023

V0.0.1 packaging #30

Merged
