Train - test leakage #13
Comments
Wow. It looks like the issue is in the original huggingface datasets. Lots of duplicates too. Here are the stats using only huggingface:

from datasets import load_dataset
from tabulate import tabulate

# dataset info from data.py:
# [name, args to load_dataset(), keys used on each item]
ds_info = [
    ("filipino",
     ['dengue_filipino'],
     ['text', 'absent', 'dengue', 'health', 'mosquito', 'sick']),
    ("kirnews",
     ["kinnews_kirnews", "kirnews_cleaned"],
     ['label', 'title', 'content']),
    ("kinnews",
     ["kinnews_kirnews", "kinnews_cleaned"],
     ['label', 'title', 'content']),
    ("swahili",
     ['swahili_news'],
     ['label', 'text']),
]

lines = []
for name, args, keys in ds_info:
    ds = load_dataset(*args)
    # convert to list-of-tuples:
    train = [tuple([item[key] for key in keys]) for item in ds['train']]
    test = [tuple([item[key] for key in keys]) for item in ds['test']]
    lines.append(name)
    n_overlap = len(set(train).intersection(test))
    lines.append(tabulate([
        ("train:", len(train)),
        ("train unique:", len(set(train))),
        ("test:", len(test)),
        ("test unique:", len(set(test))),
        ("train/test overlap:", n_overlap,
         "%.1f%%" % (100.0 * n_overlap / len(set(test)))),
    ]))
    lines.append("\n")
print("\n".join(lines))
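For anyone who wants to sanity-check the overlap metric without downloading anything, here is a minimal sketch of the same calculation on toy in-memory data. The helper name `overlap_stats` and the toy rows are made up for illustration and are not from the repo; the percentage is taken over unique test items, as in the script above.

```python
def overlap_stats(train, test):
    """Count duplicates and train/test overlap for two lists of hashable items."""
    train_unique = set(train)
    test_unique = set(test)
    # Items that appear in both splits (exact match on every field).
    n_overlap = len(train_unique & test_unique)
    # Percentage of unique test items that also occur in train.
    overlap_pct = 100.0 * n_overlap / len(test_unique) if test_unique else 0.0
    return {
        "train": len(train),
        "train_unique": len(train_unique),
        "test": len(test),
        "test_unique": len(test_unique),
        "overlap": n_overlap,
        "overlap_pct": overlap_pct,
    }


# Hypothetical toy data: (label, text) tuples, with one duplicate in train
# and one test item leaked from train.
train = [("sports", "a"), ("sports", "a"), ("politics", "b"), ("health", "c")]
test = [("politics", "b"), ("health", "d")]

stats = overlap_stats(train, test)
print(stats)
```

Items must be hashable (tuples of strings or ints work), and only exact matches are counted, so near-duplicates that differ in whitespace or casing would not show up here.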
I suggest that the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines and I noticed the same upload issues (basically, the train and test sets are a 1:1 match). I wrote a parser and some personal notes (file docstring) here. The parser uses some spaCy primitives, but feel free to use this as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py
Hi @YannDubs, wow, thanks for pointing this out!!! I was only aware of the dataset issue with DengueFilipino. Thanks @kts for verifying the Hugging Face dataset issue. People should be aware of that and use the original link for those datasets. I will redo the experiment on Filipino using the original link @ljvmiranda921 provided.
Hello, may I ask if you can provide the Filipino dataset? I cannot download it from the original link. Thank you very much.
Hi, I looked into the OOD results and many examples in the test sets seem to be in the train set. E.g. DengueFilipino has the same train and test set. KirundiNews has 90% overlap...
to reproduce: