Commit f43cb18

WIP
1 parent 7d99097 commit f43cb18

File tree

5 files changed: +151 -56 lines changed

site/announcement.md

+22
@@ -1,2 +1,24 @@
+GermEval 2024: GerMS Sexism Detection in German Online News Fora
+
+CALL FOR PARTICIPATION
+
+GermEval 2024: GerMS
+(Sexism Detection in German Online News Fora)
+
+9 September 2024 at KONVENS 2024, Vienna, Austria
+
+[https://ofai.github.io/GermEval2024-GerMS/](https://ofai.github.io/GermEval2024-GerMS/)
+
+---- Introduction ----
+
+---- Task description ----
+
+---- Timeline ----
+
+---- Organizers ----

site/download.md

+13
@@ -0,0 +1,13 @@
+# GermEval2024 GerMS - Download
+
+On this page, the files for training and labeling can be downloaded
+for each of the phases of the GermEval2024 GerMS competition.
+
+## Trial Phase
+
+## Development Phase
+
+## Competition Phase

site/index.md

+2-2
@@ -54,11 +54,11 @@ are organized into two different tracks:
## Timeline

-* **Trial phase**: April 14 - April 30, 2024
+* **Trial phase**: April 16 - April 30, 2024
  * A small labeled dataset for training and a small unlabeled dataset to use for the submission are provided. This phase is for getting to know the
    problem, the dataset format, how to submit predictions, how submissions are evaluated, how the evaluation shows up on the leaderboard, etc.
* **Development phase**: May 1 - June 6, 2024
-  * During this phase, a labeled training set and an unlabeled test set are made available. The training set will contain the labeled versions of the
+  * During this phase, a labeled training set and an unlabeled test set are made available. The training set will contain the updated labeled versions of the
    training and test set of the previous phase plus additional labeled examples. Submissions have to contain the predictions for the unlabeled test set,
    and the evaluation of the submission will show up on the leaderboard.
* **Competition phase**: June 7 - June 25, 2024

site/subtask1.md

+21-14
@@ -18,7 +18,7 @@ While the annotation guidelines define what kind of sexism/misogyny should get a
give rules about how to decide on the strength. For this reason, if an annotator decided that sexism/misogyny is present in a text,
the strength assigned is a matter of personal judgement.

-The labels to predict in subtask one reflect different strategies for how multiple labels from annotators can be use to derive a final
+The labels to predict in subtask 1 reflect different strategies for how multiple labels from annotators can be used to derive a final
target label:

* `bin_maj`: predict `1` if a majority of annotators assigned a label other than `0-Kein`, predict `0` if a majority of annotators assigned a label
@@ -31,11 +31,18 @@ target label:
## Data

-For the development phase of subtask 1, we provide all participants with the following data:
+For the *trial phase* of subtask 1, we provide two small datasets:
+* a small labeled dataset containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
+* a small unlabeled dataset containing 'id', 'text' and 'annotators' (annotator ids)
+
+For the *development phase* of subtask 1, we provide all participants with the following data:
* the labeled training set containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
* the unlabeled dev set containing 'id', 'text' and 'annotators' (annotator ids)

-Both files are in JSONL format (one JSON-serialized object per line) where each object is a dictionary with the following
+For the *competition phase* of subtask 1, we provide:
+* the unlabeled test set containing 'id', 'text' and 'annotators' (annotator ids)
+
+All five files are in JSONL format (one JSON-serialized object per line) where each object is a dictionary with the following
fields:

* `id`: a hash that identifies the example
@@ -46,8 +53,7 @@ fields:
  * Note that the number of annotations and the specific annotators who assigned labels vary between examples
* `annotators` (only in the unlabeled dataset): an array of annotator ids who labeled the example

-You can [download](download.md) the labeled and unlabeled data for the development phase and for the competition phase.
+You can [download](download.md) the data for each phase as soon as the corresponding phase starts.
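
To illustrate the format, here is a minimal sketch of loading the labeled JSONL data and deriving the `bin_maj` target from the raw annotations. The file name is hypothetical, and since tie-breaking is not specified above, the sketch simply maps ties to `0`:

```python
import json

def read_jsonl(path):
    # one JSON-serialized object per line; skip blank lines
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def bin_maj(annotations):
    # 1 if a majority of annotators assigned a label other than 0-Kein
    sexist = sum(1 for a in annotations if a["label"] != "0-Kein")
    return 1 if sexist > len(annotations) / 2 else 0  # ties -> 0 (assumption)

# "train.jsonl" is a placeholder; the real file names come from the download page
for example in read_jsonl("train.jsonl"):
    print(example["id"], bin_maj(example["annotations"]))
```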

## Submission

@@ -79,21 +85,22 @@ To submit your predictions to the competition:
## Phases

-* For the Development Phase, multiple submissions are allowed and they serve the purpose of developing and improving the model(s).
-* For the Test Phase, participants may only submit a limited number of times. Please note that only the latest valid submission determines the final task ranking.
+* For the *trial phase*, multiple submissions are allowed for getting to know the problem and the subtask.
+* For the *development phase*, multiple submissions are allowed and they serve the purpose of developing and improving the model(s).
+* For the *competition phase*, participants may only submit a limited number of times. Please note that only the latest valid submission determines the final task ranking.

## Evaluation

-### Evaluation Data
-
-For the Development Phase, systems will be evaluated on the development data labels. For the Test Phase, systems will be evaluated on the test labels. The development data is available [add link](add-link). The test sets will be available as soon as the corresponding test phase starts.
+System performance on all five predicted labels (`bin_maj`, `bin_one`, `bin_all`, `multi_maj`, `disagree_bin`) is evaluated using the macro-averaged F1 score over all classes.
+
+The final `score`, which is used for ranking the submissions, is calculated as the unweighted average over all five scores.
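
As a rough sketch of this scoring scheme, assuming gold and predicted labels are available as parallel lists per target (scikit-learn's `f1_score` stands in here for the actual CodaBench scoring code, which is not shown):

```python
from sklearn.metrics import f1_score

TARGETS = ["bin_maj", "bin_one", "bin_all", "multi_maj", "disagree_bin"]

def overall_score(gold, pred):
    # gold/pred: dicts mapping each target name to a list of labels
    per_target = {t: f1_score(gold[t], pred[t], average="macro") for t in TARGETS}
    # the ranking score is the unweighted average of the five macro-F1 values
    return sum(per_target.values()) / len(per_target), per_target
```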

-### Evaluation Metrics
-
-TBD
-
-## Submission errors
+## Submission errors and warnings

A submission is successful if it has the submission status 'finished'. 'Failed' submissions can be investigated for error sources by clicking the '?' next to 'failed' and looking at LOGS > scoring logs > stderr.

+If you experience any issue, such as a submission file stuck with a "scoring" status, please cancel the submission and try again. If the problem persists, you can contact us using the Forum.
+
+Following a successful submission, you need to refresh the web page in order to see your score and your result on the leaderboard.

site/subtask2.md

+93-40
@@ -1,56 +1,109 @@
# GermEval2024 GerMS - Subtask 2

-For the development phase of subtask 1, we provide all participants with the following data:
-* the labeled training set containing 'id', 'text', and 'annotations'
-* the unlabeled dev set containing 'id' and 'annotations'
-
-You can download the data [add-link](link-tbd)
+IMPORTANT: please note that there is a [closed](closed-track.md) and an [open](open-track.md) track for this subtask!
+
+In subtask 2, the goal is to predict, for each text in a dataset, a distribution derived from the original distribution of labels assigned by several human annotators.
+
+The human annotators assigned (according to the [annotation guidelines](guidelines.md))
+the strength of misogyny/sexism present in the given text via the following labels:
+
+* `0-Kein`: no sexism/misogyny present
+* `1-Gering`: mild sexism/misogyny
+* `2-Vorhanden`: sexism/misogyny present
+* `3-Stark`: strong sexism/misogyny
+* `4-Extrem`: extreme sexism/misogyny
+
+While the annotation guidelines define what kind of sexism/misogyny should get annotated, no attempt has been made to give rules about how to decide on the strength. For this reason, if an annotator decided that sexism/misogyny is present in a text, the strength assigned is a matter of personal judgement.
+
+The distributions to predict in subtask 2 are:
+* the binary distribution (`dist_bin`): two values are predicted, which add up to 1.
+  * `dist_bin_0`: the portion of annotators labeling the text as 'not sexist' (`0-Kein`)
+  * `dist_bin_1`: the portion of annotators labeling the text as 'sexist' (`1-Gering`, `2-Vorhanden`, `3-Stark`, or `4-Extrem`)
+* the multi score distribution (`dist_multi`): five values are predicted, which add up to 1.
+  * `dist_multi_0`: the portion of annotators labeling the text as `0-Kein`
+  * `dist_multi_1`: the portion of annotators labeling the text as `1-Gering`
+  * `dist_multi_2`: the portion of annotators labeling the text as `2-Vorhanden`
+  * `dist_multi_3`: the portion of annotators labeling the text as `3-Stark`
+  * `dist_multi_4`: the portion of annotators labeling the text as `4-Extrem`
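
For illustration, the two target distributions for one text can be derived from its raw annotator labels as in the following minimal sketch (the function name is made up for this example; the counting follows the definitions above):

```python
from collections import Counter

LABELS = ["0-Kein", "1-Gering", "2-Vorhanden", "3-Stark", "4-Extrem"]

def target_distributions(annotator_labels):
    # fraction of annotators per label, in the fixed label order above
    counts = Counter(annotator_labels)
    n = len(annotator_labels)
    dist_multi = [counts[label] / n for label in LABELS]
    # binary split: 0-Kein vs. all other labels
    dist_bin = [dist_multi[0], 1.0 - dist_multi[0]]
    return dist_bin, dist_multi

# four annotators: ["0-Kein", "0-Kein", "1-Gering", "3-Stark"]
# -> dist_bin = [0.5, 0.5], dist_multi = [0.5, 0.25, 0.0, 0.25, 0.0]
```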
+## Data
+
+For the *trial phase* of subtask 2, we provide two small datasets:
+* a small labeled dataset containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
+* a small unlabeled dataset containing 'id', 'text' and 'annotators' (annotator ids)
+
+For the *development phase* of subtask 2, we provide all participants with the following data:
+* the labeled training set containing 'id', 'text', and 'annotations' (annotator ids and the label assigned by them)
+* the unlabeled dev set containing 'id', 'text' and 'annotators' (annotator ids)
+
+For the *competition phase* of subtask 2, we provide:
+* the unlabeled test set containing 'id', 'text' and 'annotators' (annotator ids)
+
+All five files are in JSONL format (one JSON-serialized object per line) where each object is a dictionary with the following fields:
+
+* `id`: a hash that identifies the example
+* `text`: the text to classify. The text can contain arbitrary Unicode and new lines
+* `annotations` (only in the labeled dataset): an array of dictionaries which contain the following key/value pairs:
+  * `user`: a string in the form "A003" which is an anonymized id for the annotator who assigned the label
+  * `label`: the label assigned by the annotator
+  * Note that the number of annotations and the specific annotators who assigned labels vary between examples
+* `annotators` (only in the unlabeled dataset): an array of annotator ids who labeled the example
+
+You can [download](download.md) the data for each phase as soon as the corresponding phase starts.
+## Submission
+
+Your submission must be a file in TSV (tab separated values) format which contains the following columns, in any order:
+
+* `id`: the id of the example in the unlabeled dataset for which the predictions are submitted
+* `dist_bin_0`: prediction of one value between 0 and 1 (all `dist_bin` values need to add up to 1)
+* `dist_bin_1`: prediction of one value between 0 and 1 (all `dist_bin` values need to add up to 1)
+* `dist_multi_0`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1)
+* `dist_multi_1`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1)
+* `dist_multi_2`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1)
+* `dist_multi_3`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1)
+* `dist_multi_4`: prediction of one value between 0 and 1 (all `dist_multi` values need to add up to 1)
+Note that how you derive those values is up to you (as long as the rules for the closed or open track are followed):
+
+* you can train several models or a single model to get the predicted distribution
+* you can derive the model-specific training set in any way from the labeled training data
+* you can use the information of which annotator assigned the label, or ignore it
+
+To submit your predictions to the competition (a packaging sketch follows below):
+
+* the file MUST have the file name extension `.tsv`
+* the TSV file must get compressed into a ZIP file with extension `.zip`
+* the ZIP file should then get uploaded as a submission to the correct competition
+  * !! Please make sure you submit to the competition that corresponds to the correct subtask (1 or 2) and correct track (Open or Closed)!
+* under "My Submissions" make sure to fill out the form and:
+  * enter the name of your team which has been registered for the competition
+  * give a name to your method
+  * confirm that you have checked that you are indeed submitting to the correct competition for the subtask and track desired
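
A minimal packaging sketch along these lines (the file names are arbitrary; only the `.tsv` extension inside the `.zip` is mandated above, and a header row naming the columns is an assumption):

```python
import csv
import zipfile

COLUMNS = ["id", "dist_bin_0", "dist_bin_1", "dist_multi_0",
           "dist_multi_1", "dist_multi_2", "dist_multi_3", "dist_multi_4"]

def write_submission(rows, tsv_name="predictions.tsv", zip_name="submission.zip"):
    # rows: list of dicts with one value per column
    with open(tsv_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    # the ZIP file is what gets uploaded under "My Submissions"
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(tsv_name)
```
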
**note**: do we provide example submissions?
-**Goal** of this subtask are to predict both (i) the binary distribution ('dist_bin'), and (ii) the multi score distribution ('dist_multi'):
-* dist_bin: predict the percentage of annotators choosing sexist ('dist_bin_1') and not sexist ('dist_bin_0')
-* dist_multi: predict the percentage of annotators for each possible label, so a list of 5 values [0,1] for the scores 0 ('dist_multi_0'), 1 ('dist_multi_1'), 2 ('dist_multi_2'), 3 ('dist_multi_3'), 4 ('dist_multi_4')
-
-Both values of 'dist_bin' need to add up to 1 and all 5 values of 'dist_multi' need to add up to 1.
+## Phases

-For each submission:
-* save your predictions to a separate csv file. The file needs to contain the following columns:
-  * 'id': the unique ID of each text, as specified in the dev/test data
-  * 'dist_bin_0'
-  * 'dist_bin_1'
-  * 'dist_multi_0'
-  * 'dist_multi_1'
-  * 'dist_multi_2'
-  * 'dist_multi_3'
-  * 'dist_multi_4'
-* compress this csv file into a zip file.
-* under My Submissions, fill out the submission form and submit the zip file.
+* For the *trial phase*, multiple submissions are allowed for getting to know the problem and the subtask.
+* For the *development phase*, multiple submissions are allowed and they serve the purpose of developing and improving the model(s).
+* For the *competition phase*, participants may only submit a limited number of times. Please note that only the latest valid submission determines the final task ranking.

-**note**: do we want submissions as a .csv file or as a .json file?
+## Evaluation

-For the Development Phase, multiple submissions are allowed and they serve the purpose of developing the model.
+System performance on subtask 2 is evaluated using the Jensen-Shannon distance for both (i) the prediction of the binary distribution and (ii) the prediction of the multi score distribution. We chose the Jensen-Shannon distance because it is a standard method for measuring the similarity between two probability distributions and it is a proper distance metric bounded between 0 and 1. It is the square root of the Jensen-Shannon divergence, which is based on the Kullback-Leibler divergence.

-For the Test Phase, participants may only submit two times, to allow for a mistake in the first submission. Please note that only the latest valid submission determines the final task ranking.
+The overall score, which is used for ranking the submissions, is calculated as the unweighted average of the two JS distances.
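
A sketch of this evaluation using scipy's `jensenshannon` function (scipy's spatial distance module is mentioned in the removed draft text further down; per-example averaging and base 2, which bounds the distance by 1 as stated above, are assumptions about the official script):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_js_distance(gold, pred, base=2.0):
    # gold/pred: arrays of shape (n_examples, n_classes), each row summing to 1
    return float(np.mean([jensenshannon(g, p, base=base)
                          for g, p in zip(gold, pred)]))

def overall_score(gold_bin, pred_bin, gold_multi, pred_multi):
    # ranking score: unweighted average of the two JS distances
    return (mean_js_distance(gold_bin, pred_bin)
            + mean_js_distance(gold_multi, pred_multi)) / 2
```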

-**note**: for EDOS, they restricted the submission in the test phase to 2. Do we want that as well?
-## Submission errors
+## Submission errors and warnings

A submission is successful if it has the submission status 'finished'. 'Failed' submissions can be investigated for error sources by clicking the '?' next to 'failed' and looking at LOGS > scoring logs > stderr.

+If you experience any issue, such as a submission file stuck with a "scoring" status, please cancel the submission and try again. If the problem persists, you can contact us using the Forum.
-## Evaluation
-
-### Evaluation Data
-
-For the Development Phase, systems will be evaluated on the development data labels. For the Test Phase, systems will be evaluated on the test labels. The development data is available [add link](add-link). The test sets will be available as soon as the corresponding test phase starts.
-
-### Evaluation Metrics
-
-System performance on subtask 2 (both the open and the closed track) is evaluated using the Jensen-Shannon distance for both (i) the prediction of the binary distribution, and (ii) the prediction of the multi score distribution. We chose the Jensen-Shannon distance as it is a standard method for measuring the similarity between two probability distributions. It is the square root of the Jensen-Shannon divergence, which is based on the Kullback-Leibler divergence, but is symmetric and always has a finite value.
-
-We compute the Jensen-Shannon distance using scipy's spatial distance function. The full evaluation script on CodaBench is available on GitHub [add-link](add-link).
+Following a successful submission, you need to refresh the submission page in order to see your score and your result on the leaderboard.

-**note**: do we publish the evaluation script when the competition starts or when it has ended?