WIP

OFAI · Apr 9, 2024 · fafa166
1 parent 62cdc64
commit fafa166
Showing 5 changed files with 110 additions and 18 deletions.
diff --git a/site/closed-track.md b/site/closed-track.md
@@ -0,0 +1 @@
+In the closed tracks, participants agree to use only the annotated data provided within this task to develop their model. Neither (i) additional data labelled for sexism or misogyny nor (ii) additional models trained on data labelled for sexism or misogyny are allowed. Participants who have made at least one submission in a closed track during the Test Phase will be invited to submit a paper describing their system for the Shared Task at KONVENS 2024. If they have also made at least one submission in an open track, they may include a comparison of their approaches in the paper.
diff --git a/site/open-track.md b/site/open-track.md
@@ -0,0 +1 @@
+In the open tracks, participants are encouraged to use additional data or models trained on labelled data. These additional labelled data, embeddings or models must be open source and must be provided by the participants upon request and, if possible, via the submission page. Participants submitting to open tracks are only invited to submit a paper describing their system for the Shared Task at KONVENS 2024 if they have also made a submission in a closed track during the Test Phase and compare their approaches in the paper. Due to reproducibility issues, e.g. when using generative LLMs such as GPT-3.5, we do not accept papers that solely present approaches for the open tracks.
diff --git a/site/subtask1.md b/site/subtask1.md
@@ -1 +1,47 @@
+# Submission Instructions
+## How to participate
+
+For the Development Phase of subtask 1, we provide all participants with the following data:
+* the labeled training set containing 'id', 'text', and 'annotations'
+* the unlabeled dev set containing 'id' and 'text'
+
+You can download the data here: [add-link](link-tbd)
+
+**note**: do we provide example submissions?
+
+The **goal** of subtask 1 is to solve four binary classification tasks and to predict the majority label.
+
+For each submission:
+* save your predictions to a separate csv file. The file needs to contain the following columns:
+  * 'id': the unique ID of each text, as specified in the dev/test data
+  * 'bin_maj': predict 1 if a majority of annotators assigned a non-0 score (1-4), and 0 if a majority of annotators assigned 0
+  * 'bin_one': predict 1 if at least one annotator assigned a non-0 score, 0 otherwise
+  * 'bin_all': predict 1 if all annotators assigned a non-0 score, 0 otherwise
+  * 'multi_maj': predict the majority label if there is one
+  * 'disagree_bin': predict 1 if annotators disagree on 0 vs. non-0, 0 otherwise
+* compress this csv file into a zip file.
+* under My Submissions, fill out the submission form and submit the zip file.
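To make the column definitions above concrete, the following Python sketch derives all five targets from the scores of a single text, e.g. to build training targets from the labeled training set. It assumes, hypothetically, that the 'annotations' field can be reduced to a list with one integer score (0-4) per annotator; the actual data format is not fixed here. Tie handling is also not specified above, so the sketch makes two labeled assumptions: a 0 vs. non-0 tie counts as 0 for 'bin_maj', and 'multi_maj' is `None` when no label is chosen by more than half of the annotators. `subtask1_labels` is an illustrative name.

```python
from collections import Counter


def subtask1_labels(scores: list[int]) -> dict:
    """Derive the five subtask 1 targets from one text's annotator scores (0-4).

    Assumptions (not from the task description): one integer score per
    annotator; a 0 vs. non-0 tie maps to 0 for 'bin_maj'; 'multi_maj' is
    None when no label is chosen by more than half of the annotators.
    """
    if not scores:
        raise ValueError("need at least one annotator score")
    n = len(scores)
    non_zero = sum(1 for s in scores if s != 0)
    label, count = Counter(scores).most_common(1)[0]
    return {
        # 1 if more than half of the annotators assigned a score of 1-4
        "bin_maj": int(non_zero > n / 2),
        # 1 if at least one annotator assigned a score of 1-4
        "bin_one": int(non_zero >= 1),
        # 1 if every annotator assigned a score of 1-4
        "bin_all": int(non_zero == n),
        # strict-majority label, or None if there is none (assumption)
        "multi_maj": label if count > n / 2 else None,
        # 1 if annotators are split between 0 and non-0
        "disagree_bin": int(0 < non_zero < n),
    }
```

For example, `subtask1_labels([0, 2, 2, 3])` gives 1 for 'bin_maj', 'bin_one' and 'disagree_bin', 0 for 'bin_all', and `None` for 'multi_maj', since no score is chosen by more than two of the four annotators.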
+
+**note**: do we want the data in .csv format?
+
+For the Development Phase, multiple submissions are allowed and they serve the purpose of developing the model.
+
+For the Test Phase, participants may submit at most twice, which allows them to correct a mistake in the first submission. Please note that only the latest valid submission determines the final task ranking.
+
+**note**: for EDOS, they restricted the submission in the test phase to 2. Do we want that as well?
+
+## Evaluation
+
+### Evaluation Data
+
+For the Development Phase, systems will be evaluated on the development data labels. For the Test Phase, systems will be evaluated on the test labels. The development data is available here: [add link](add-link). The test sets will be available as soon as the corresponding test phase starts.
+
+### Evaluation Metrics
+
+TBD
+
+## Submission errors
+
+A submission is successful if its status is 'finished'. The cause of a 'failed' submission can be investigated by clicking on the '?' next to 'failed' and looking at LOGS > scoring logs > stderr.
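One way to reduce the chance of a 'failed' status is to check the zip file locally before uploading it. The Python sketch below is a hypothetical helper, not the organizers' scoring program. It assumes the zip contains exactly one CSV file with a header row, and it only checks that the subtask 1 columns listed above are present, that the IDs match a set you supply (e.g. the IDs of the dev set), and that the four binary columns contain 0 or 1. The actual scoring program may apply other or stricter checks.

```python
import csv
import io
import zipfile

# Columns listed in the subtask 1 submission instructions.
REQUIRED = ["id", "bin_maj", "bin_one", "bin_all", "multi_maj", "disagree_bin"]
BINARY = ["bin_maj", "bin_one", "bin_all", "disagree_bin"]


def check_submission(zip_path: str, expected_ids: set[str]) -> list[str]:
    """Return the problems found in a subtask 1 submission zip (empty list if none).

    Assumption: the zip holds exactly one CSV file with a header row.
    """
    with zipfile.ZipFile(zip_path) as zf:
        csv_names = [name for name in zf.namelist() if name.endswith(".csv")]
        if len(csv_names) != 1:
            return [f"expected exactly one .csv file, found {len(csv_names)}"]
        with zf.open(csv_names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = list(csv.DictReader(text))
    if not rows:
        return ["the CSV file contains no data rows"]
    missing = [col for col in REQUIRED if col not in rows[0]]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    seen = {row["id"] for row in rows}
    if len(seen) != len(rows):
        problems.append("duplicate ids")
    if seen != expected_ids:
        problems.append(
            f"{len(expected_ids - seen)} ids missing, "
            f"{len(seen - expected_ids)} unexpected ids"
        )
    # The four binary targets must be exactly 0 or 1.
    for row in rows:
        for col in BINARY:
            if row[col] not in {"0", "1"}:
                problems.append(f"id {row['id']}: {col} must be 0 or 1, got {row[col]!r}")
    return problems
```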
+
 
diff --git a/site/subtask2.md b/site/subtask2.md
@@ -1,10 +1,57 @@
-# How to participate
-
-Please submit your results as a .tsv file. It needs to contain the following columns: 
-* *ID* : the ID of the post
-* *0_rate* : percentage of annotators rating this post as not sexist
-* *1_rate* : percentage of annotators rating this post as 1 on a scale from 0-4 with 4 being extremely sexist
-* *2_rate* : percentage of annotators rating this post as 2 on a scale from 0-4 with 4 being extremely sexist
-* *3_rate* : percentage of annotators rating this post as 3 on a scale from 0-4 with 4 being extremely sexist
-* *4_rate* : percentage of annotators rating this post as 4 on a scale from 0-4 with 4 being extremely sexist
-* *soft_label* : rating this post as sexist on a scale from 0 to 1.
+# Submission Instructions
+## How to participate
+
+For the Development Phase of subtask 2, we provide all participants with the following data:
+* the labeled training set containing 'id', 'text', and 'annotations'
+* the unlabeled dev set containing 'id' and 'text'
+
+You can download the data here: [add-link](link-tbd)
+
+**note**: do we provide example submissions?
+
+The **goals** of this subtask are to predict both (i) the binary distribution ('dist_bin') and (ii) the multi-score distribution ('dist_multi'):
+  * dist_bin: predict the proportion of annotators choosing sexist ('dist_bin_1') and not sexist ('dist_bin_0')
+  * dist_multi: predict the proportion of annotators choosing each possible label, i.e. five values in [0, 1] for the scores 0 ('dist_multi_0'), 1 ('dist_multi_1'), 2 ('dist_multi_2'), 3 ('dist_multi_3') and 4 ('dist_multi_4')
+
+The two values of 'dist_bin' must sum to 1, and the five values of 'dist_multi' must sum to 1.
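The following Python sketch shows how both target distributions relate to the annotator scores of a single text, e.g. to build training targets from the labeled training set. It assumes, hypothetically, one integer score (0-4) per annotator and counts any score from 1 to 4 as sexist; `subtask2_distributions` is an illustrative name. Because each value is a count divided by the number of annotators, each group sums to 1 up to floating-point rounding.

```python
from collections import Counter


def subtask2_distributions(scores: list[int]) -> dict[str, float]:
    """Return the dist_bin_* and dist_multi_* values for one text.

    Assumption: `scores` holds one integer score in 0..4 per annotator.
    """
    if not scores or any(s not in range(5) for s in scores):
        raise ValueError("expected a non-empty list of scores between 0 and 4")
    n = len(scores)
    counts = Counter(scores)
    # Proportion of annotators who chose each score 0-4.
    row = {f"dist_multi_{k}": counts[k] / n for k in range(5)}
    # Binary view: score 0 = not sexist, scores 1-4 = sexist.
    row["dist_bin_0"] = row["dist_multi_0"]
    row["dist_bin_1"] = (n - counts[0]) / n
    return row
```

For example, `subtask2_distributions([0, 0, 1, 3])` yields 0.5 for both 'dist_bin_0' and 'dist_bin_1', and 0.5, 0.25, 0.0, 0.25, 0.0 for 'dist_multi_0' to 'dist_multi_4'.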
+
+For each submission:
+* save your predictions to a separate csv file. The file needs to contain the following columns:
+  * 'id': the unique ID of each text, as specified in the dev/test data
+  * 'dist_bin_0'
+  * 'dist_bin_1'
+  * 'dist_multi_0'
+  * 'dist_multi_1'
+  * 'dist_multi_2'
+  * 'dist_multi_3'
+  * 'dist_multi_4'
+* compress this csv file into a zip file.
+* under My Submissions, fill out the submission form and submit the zip file.
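The Python sketch below writes subtask 2 predictions to a CSV file with the columns listed above and compresses it into a zip file, refusing rows whose distributions do not sum to 1. The helper name `write_submission`, the file name 'predictions.csv' and the tolerance of 1e-6 are assumptions, not requirements of the task. The CSV is built in memory and written directly into the archive.

```python
import csv
import io
import zipfile

# Column order from the subtask 2 submission instructions.
COLUMNS = ["id", "dist_bin_0", "dist_bin_1",
           "dist_multi_0", "dist_multi_1", "dist_multi_2", "dist_multi_3", "dist_multi_4"]


def write_submission(rows: list[dict], zip_path: str,
                     csv_name: str = "predictions.csv") -> None:
    """Write one CSV row per text into a zip archive at `zip_path`.

    Assumptions: each dict in `rows` has exactly the keys in COLUMNS;
    'predictions.csv' and the 1e-6 tolerance are placeholders.
    """
    for row in rows:
        # Each group of probabilities must add up to 1.
        for prefix, size in (("dist_bin_", 2), ("dist_multi_", 5)):
            total = sum(row[f"{prefix}{i}"] for i in range(size))
            if abs(total - 1.0) > 1e-6:
                raise ValueError(f"id {row['id']}: {prefix}* sums to {total}, not 1")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(csv_name, buffer.getvalue())
```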
+
+**note**: do we want submissions as a .csv file or as a .json file?
+
+For the Development Phase, multiple submissions are allowed and they serve the purpose of developing the model.
+
+For the Test Phase, participants may submit at most twice, which allows them to correct a mistake in the first submission. Please note that only the latest valid submission determines the final task ranking.
+
+**note**: for EDOS, they restricted the submission in the test phase to 2. Do we want that as well?
+
+## Submission errors
+
+A submission is successful if its status is 'finished'. The cause of a 'failed' submission can be investigated by clicking on the '?' next to 'failed' and looking at LOGS > scoring logs > stderr.
+
+
+## Evaluation
+
+### Evaluation Data
+
+For the Development Phase, systems will be evaluated on the development data labels. For the Test Phase, systems will be evaluated on the test labels. The development data is available here: [add link](add-link). The test sets will be available as soon as the corresponding test phase starts.
+
+### Evaluation Metrics
+
+System performance on subtask 2 (in both the open and the closed tracks) is evaluated using the Jensen-Shannon distance for both (i) the prediction of the binary distribution and (ii) the prediction of the multi-score distribution. We chose the Jensen-Shannon distance because it is a standard measure of
+the distance between two probability distributions. It is the square root of the Jensen-Shannon divergence, which is based on the Kullback-Leibler divergence but, unlike it, is symmetric and always finite.
+
+We compute the Jensen-Shannon distance using SciPy's `scipy.spatial.distance.jensenshannon` function. The full evaluation script used on Codabench is available on GitHub: [add-link](add-link).
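For illustration, the Python sketch below reimplements the Jensen-Shannon distance without SciPy, following the definition above: the square root of the average Kullback-Leibler divergence of each distribution from their pointwise mean. It is not the official evaluation script. Two details are assumptions: the logarithm base (SciPy's `scipy.spatial.distance.jensenshannon` uses the natural logarithm unless `base` is given; base 2 is used here, which bounds the distance to [0, 1]) and averaging the per-text distances to obtain a single score. Unlike SciPy's function, the sketch does not normalize its inputs, so both vectors must already sum to 1.

```python
import math


def kl_divergence(p: list[float], q: list[float], base: float = 2.0) -> float:
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p: list[float], q: list[float], base: float = 2.0) -> float:
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    # m_i >= p_i / 2, so m_i > 0 wherever p_i > 0 (no division by zero).
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = (kl_divergence(p, m, base) + kl_divergence(q, m, base)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def mean_js_distance(gold: list[list[float]], pred: list[list[float]],
                     base: float = 2.0) -> float:
    """Average the per-text distances (aggregation by mean is an assumption)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(js_distance(g, p, base) for g, p in zip(gold, pred)) / len(gold)
```

Identical distributions have distance 0, and with base 2 two distributions with disjoint support, such as `[1.0, 0.0]` and `[0.0, 1.0]`, have the maximum distance of 1.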
+
+**note**: do we publish the evaluation script when the competition starts or when it has ended?
diff --git a/site/terms.md b/site/terms.md
@@ -1,18 +1,15 @@
 # Terms and Conditions
 
-**Participation in the competition**: Any interested person may freely participate in the competition. By participating in the competition, you agree to the terms and conditions in their entirety, without amendment or provision. By participating in the competition, you consent to the public release of your scores and submissions at the GermEval-2024 workshop and in the associated proceedings. Participation is understood as any direct or indirect contributions to this site or the shared task organizers, such as, but not limited to: results of automatic scoring programs; manual, qualitative and quantitative assessments of the data submitted; task and systems papers submitted.
+**Participation in the competition**: Any interested person may participate in the competition. By your participation, you agree to the terms and conditions in their entirety, without amendment or provision. By participating in the competition, you consent to the public release of your scores and submissions at the GermEval-2024 workshop and in the associated proceedings. Participation is understood as any direct or indirect contributions to this site or the shared task organizers, such as, but not limited to: results of automatic scoring programs; manual, qualitative and quantitative assessments of the data submitted; task and systems papers submitted.
 
-**Individual and Team Participation**: Participants may create teams, but participants may not be part of more than one team. Teams and individual participants must create exactly one account to participate in the Codabench competition. Team composition may not be changed once the Test Phase starts. Your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
+**Individual and Team Participation**: Participants may take part either individually or as a member of one team. Each team or individual participant must create exactly one account to participate in the Codabench competition. Team composition may not be changed once the Test Phase starts. Your system will be named according to the team name provided at the time of submission, or by a suitable shorthand determined by the task organizers.
 
-**Scoring of submissions**: Submissions may be evaluated with automatic and manual quantitative judgements, qualitative judgements, and any other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers. Organizers are under no obligation to release scores. Official scores may be withheld if organizers judge the submission incomplete, erroneous, deceptive, or violating the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science. If multiple submission files are uploaded during the Test Phase, the last submission file per group will be understood as the team's or participant's definitive submission and ranked as such in the task description paper.
+**Scoring of submissions**: Submissions are evaluated with automatic and manual quantitative judgements, qualitative judgements, and any other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers. Organizers are under no obligation to release scores. Official scores may be withheld if organizers judge the submission incomplete, erroneous, deceptive, or violating the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission. If multiple submission files are uploaded during the Test Phase, the last submission file per group will be understood as the team's or participant's definitive submission and ranked as such in the task description paper. 
 
 **Data usage**: The provided data should be used responsibly and ethically. Do not attempt to misuse it in any way, including, but not limited to, reconstructing test sets, any non-scientific use of the data, or any other unconscionable usage of the data. You may not redistribute the task data except in the manner prescribed by its licence.
 
-**Submission of systems description papers**: Participants having made at least one submission during the Test Phase will be invited to submit a paper describing their system. We strongly encourage a link to the code of systems being described will be made available to organizers or the public at large. We also encourage you to upload any systems and models to an open-source repository such as the HuggingFace Hub.
-
-**Specific conditions for closed tasks**: Participants agree to use no (i) existing models trained on additional data labelled for sexism or misogyny or (ii) additional data labelled for sexism or misogyny.
-
-**Specific conditions for open tasks**: If participants use additional models or embeddings trained on data labelled for sexism or misogyny, these models and data need to be open source and provided by the participants upon request. Participating in the closed task is a precondition to participate in the open task. However, participating in the open task is no precondition for participating in the closed task.
+**Specific conditions for closed and open tracks**: Participants agree to follow the specific conditions for [closed tracks](link-tbd) and [open tracks](link-tbd), which specify which additional data and models may be used to develop a system.
 
+**Submission of systems description papers**: Participants who have made at least one submission to a closed track during the Test Phase will be invited to submit a paper describing their system. Participants who have made submissions only to open tracks will not be invited to submit a paper (see the specific conditions for closed and open tracks). For both tracks, we strongly encourage participants to provide a link to the code of their system(s) to the organizers or the public at large (on the submission page?). We also encourage you to upload any systems and models to an open-source repository such as the HuggingFace Hub.
 
 **Acknowledgements**: This shared task was created by OFAI with funding from the FFG project EKIP.