# GermEval2024 GerMS - Subtask 2

IMPORTANT: please note that there is a [closed](closed-track.md) and an [open](open-track.md) track for this subtask!

In subtask 2, the goal is to predict, for each text in the dataset, two label distributions that are derived from the original distribution of labels assigned by several human annotators.

The human annotators assigned (according to the [annotation guidelines](guidelines.md)) the strength of misogyny/sexism present in the given text via the following labels:

* `0-Kein`: no sexism/misogyny present
* `1-Gering`: mild sexism/misogyny
* `2-Vorhanden`: sexism/misogyny present
* `3-Stark`: strong sexism/misogyny
* `4-Extrem`: extreme sexism/misogyny

While the annotation guidelines define what kind of sexism/misogyny should be annotated, no attempt was made to give rules for deciding on the strength. For this reason, once an annotator decided that sexism/misogyny is present in a text, the strength they assigned is a matter of personal judgement.

The distributions to predict in subtask 2 are (a sketch of how they relate to the raw annotations follows this list):
* the binary distribution (`dist_bin`): two values are predicted, which add up to 1.
  * `dist_bin_0`: the portion of annotators labeling the text as 'not sexist' (`0-Kein`)
  * `dist_bin_1`: the portion of annotators labeling the text as 'sexist' (`1-Gering`, `2-Vorhanden`, `3-Stark`, or `4-Extrem`)
* the multi score distribution (`dist_multi`): five values are predicted, which add up to 1.
  * `dist_multi_0`: the portion of annotators labeling the text as `0-Kein`
  * `dist_multi_1`: the portion of annotators labeling the text as `1-Gering`
  * `dist_multi_2`: the portion of annotators labeling the text as `2-Vorhanden`
  * `dist_multi_3`: the portion of annotators labeling the text as `3-Stark`
  * `dist_multi_4`: the portion of annotators labeling the text as `4-Extrem`

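To make the targets concrete, here is a minimal sketch (our own illustration, not an official script) of how both distributions can be derived from the raw labels of one example, following the definitions above:

```python
from collections import Counter

LABELS = ["0-Kein", "1-Gering", "2-Vorhanden", "3-Stark", "4-Extrem"]

def derive_distributions(labels):
    """Turn the raw annotator labels of one example into dist_bin and dist_multi."""
    counts = Counter(labels)
    n = len(labels)
    dist_multi = [counts[label] / n for label in LABELS]
    dist_bin = [dist_multi[0], sum(dist_multi[1:])]  # 0-Kein vs. any sexism/misogyny
    return dist_bin, dist_multi

# Hypothetical example with four annotators:
dist_bin, dist_multi = derive_distributions(["0-Kein", "1-Gering", "1-Gering", "3-Stark"])
print(dist_bin)    # [0.25, 0.75]
print(dist_multi)  # [0.25, 0.5, 0.0, 0.25, 0.0]
```
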
## Data

For the *trial phase* of subtask 2, we provide two small datasets:
* a small labeled dataset containing 'id', 'text', and 'annotations' (annotator ids and the labels they assigned)
* a small unlabeled dataset containing 'id', 'text', and 'annotators' (annotator ids)

For the *development phase* of subtask 2, we provide all participants with the following data:
* the labeled training set containing 'id', 'text', and 'annotations' (annotator ids and the labels they assigned)
* the unlabeled dev set containing 'id', 'text', and 'annotators' (annotator ids)

For the *competition phase* of subtask 2, we provide
* the unlabeled test set containing 'id', 'text', and 'annotators' (annotator ids)

All five files are in JSONL format (one JSON-serialized object per line), where each object is a dictionary with the following fields:

* `id`: a hash that identifies the example
* `text`: the text to classify; it can contain arbitrary Unicode and newlines
* `annotations` (only in the labeled datasets): an array of dictionaries, each with the following key/value pairs:
  * `user`: a string of the form "A003", an anonymized id for the annotator who assigned the label
  * `label`: the label assigned by that annotator
  * Note that the number of annotations and the specific annotators who assigned labels vary between examples
* `annotators` (only in the unlabeled datasets): an array of the ids of the annotators who labeled the example

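For illustration, a labeled record and an unlabeled record might look like the following (the ids, texts, annotator ids, and labels are invented, not taken from the real data):

```json
{"id": "abc123", "text": "example text", "annotations": [{"user": "A003", "label": "0-Kein"}, {"user": "A017", "label": "1-Gering"}]}
{"id": "def456", "text": "another example text", "annotators": ["A003", "A017", "A021"]}
```
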
You can [download](download.md) the data for each phase as soon as the corresponding phase starts.

## Submission

Your submission must be a file in TSV (tab-separated values) format which contains the following columns, in any order:

* `id`: the id of the example in the unlabeled dataset for which the predictions are submitted
* `dist_bin_0`: a predicted value between 0 and 1 (the two `dist_bin` values need to add up to 1)
* `dist_bin_1`: a predicted value between 0 and 1 (the two `dist_bin` values need to add up to 1)
* `dist_multi_0`: a predicted value between 0 and 1 (the five `dist_multi` values need to add up to 1)
* `dist_multi_1`: a predicted value between 0 and 1 (the five `dist_multi` values need to add up to 1)
* `dist_multi_2`: a predicted value between 0 and 1 (the five `dist_multi` values need to add up to 1)
* `dist_multi_3`: a predicted value between 0 and 1 (the five `dist_multi` values need to add up to 1)
* `dist_multi_4`: a predicted value between 0 and 1 (the five `dist_multi` values need to add up to 1)

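A made-up submission file could therefore start like this (the values are purely illustrative, and we assume a header row so that the columns can be identified; columns are separated by single tab characters):

```
id	dist_bin_0	dist_bin_1	dist_multi_0	dist_multi_1	dist_multi_2	dist_multi_3	dist_multi_4
abc123	0.25	0.75	0.25	0.5	0.0	0.25	0.0
```
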
Note that how you derive those values is up to you (as long as the rules for the closed or open track are followed):

* you can train a single model or several models to get the predicted distributions
* you can derive the model-specific training set in any way from the labeled training data
* you can use the information about which annotator assigned a label, or ignore it

To submit your predictions to the competition:

* the file MUST have the file name extension `.tsv`
* the TSV file must be compressed into a ZIP file with the extension `.zip`
* the ZIP file should then be uploaded as a submission to the correct competition
* !! Please make sure you submit to the competition that corresponds to the correct subtask (1 or 2) and the correct track (Open or Closed)!
* under "My Submissions", make sure to fill out the form and:
  * enter the name of your team as registered for the competition
  * give a name to your method
  * confirm that you have checked that you are indeed submitting to the correct competition for the desired subtask and track

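A minimal packaging sketch (the file names `predictions.tsv` and `submission.zip` are placeholders of our choosing):

```python
import zipfile

# Compress the predictions file; it keeps its .tsv extension inside the archive.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("predictions.tsv")
```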

## Phases

* For the *trial phase*, multiple submissions are allowed; they serve to get to know the problem and the subtask.
* For the *development phase*, multiple submissions are allowed; they serve to develop and improve the model(s).
* For the *competition phase*, participants may only submit a limited number of times. Please note that only the latest valid submission determines the final task ranking.

## Evaluation

System performance on subtask 2 is evaluated using the Jensen-Shannon distance, for both (i) the prediction of the binary distribution and (ii) the prediction of the multi score distribution. We chose the Jensen-Shannon distance because it is a standard method for measuring the similarity between two probability distributions, and it is a proper distance metric that lies between 0 and 1. It is the square root of the Jensen-Shannon divergence, which in turn is based on the Kullback-Leibler divergence.

The overall score used for ranking the submissions is calculated as the unweighted average of the two JS distances.

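As an unofficial sketch of the metric (we assume base-2 logarithms, for which the Jensen-Shannon distance lies in [0, 1]; the actual scoring script may differ in such details), the score for a single example could be computed like this:

```python
from scipy.spatial.distance import jensenshannon

# Gold and predicted distributions for one hypothetical example.
gold_bin, pred_bin = [0.25, 0.75], [0.30, 0.70]
gold_multi, pred_multi = [0.25, 0.5, 0.0, 0.25, 0.0], [0.20, 0.55, 0.05, 0.20, 0.0]

# scipy returns the Jensen-Shannon *distance* (the square root of the divergence).
js_bin = jensenshannon(gold_bin, pred_bin, base=2)
js_multi = jensenshannon(gold_multi, pred_multi, base=2)

# Unweighted average of the two distances; lower is better.
score = (js_bin + js_multi) / 2
print(round(float(score), 4))
```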

## Submission errors and warnings

A submission is successful if it has the submission status 'finished'. 'Failed' submissions can be investigated for error sources by clicking the '?' next to 'failed' and looking at LOGS > scoring logs > stderr.

If you experience an issue, such as a submission stuck in the "scoring" status, please cancel the submission and try again. In case the problem persists, you can contact us using the Forum.

Following a successful submission, you need to refresh the submission page in order to see your score and your result on the leaderboard.