This pipeline generates a JSON file for every sentence in the WSJ corpus. It depends on OntoNotes 4.0, the OntoNotes DB Tool, Stanford CoreNLP, NomBank, and the BBN Pronoun Coreference and Entity Type Corpus, among other resources; see the shell scripts for the paths to these resources. The JSON files it produces can serve as input to the AMR Generation pipeline.
The preprocessing pipeline is divided between the stage0 and stage1 directories. stage0 runs several tools to produce a set of per-document files; stage1 then generates per-sentence JSON files from them, with its second portion (stage1b) refining the JSON files output by the first.
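As a rough picture of how the stages fit together, the following driver sketch loops over WSJ documents and invokes the two stage scripts named in the diagram below. The prefix-argument convention here is hypothetical; consult the actual shell scripts for their real interfaces and hard-coded resource paths.

    #!/usr/bin/env python
    """Hypothetical driver for the two preprocessing stages.

    Assumes each stage script accepts a WSJ document prefix such as
    "wsj_0001"; the real scripts may take different arguments and
    resolve resource paths internally.
    """
    import glob
    import os
    import subprocess

    ONTONOTES_WSJ = "/path/to/ontonotes-4.0/wsj"  # adjust for your install

    def document_prefixes(root):
        """Yield a prefix (e.g. wsj_0001) for every .onf file under root."""
        pattern = os.path.join(root, "**", "*.onf")
        for onf in sorted(glob.glob(pattern, recursive=True)):
            yield os.path.splitext(os.path.basename(onf))[0]

    for prefix in document_prefixes(ONTONOTES_WSJ):
        # stage0: per-document files (.txt, .json, .parse, .dep{,_basic}, .nom)
        subprocess.check_call(["bash", "stage0/prep_and_parse.sh", prefix])
        # stage1: per-sentence JSON files, refined afterwards by stage1b
        subprocess.check_call(["bash", "stage1/generate_sent_json.sh", prefix])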
Schematic of the data flow:
Original OntoNotes 4
S | prep_and_parse.sh -or- generate_json.sh
T | + wsj_????.onf ---> onf2txt.py ---> wsj_????.txt
A | + wsj_????.txt ---> parse_txt.py ---> wsj_????.json
G | + wsj_????.onf with no .parse ---> onf2parse.py ---> wsj_????.parse
E | + nombank.1.0 ---> distribute-nombank.py ---> wsj_????.nom
0 v + wsj_????.parse ---> wsj_????.dep{,_basic} (converter not named; see the sketches below) [TODO: what about generate_json.sh?]
wsj_????.{txt,json,parse,dep,dep_basic,nom}
S | generate_sent_json.sh
T | + run_prefix.sh
A | + Predicate.py
G | (uses files created above + OntoNotes DB Tool API)
E | + stage1b scripts: further refine the resulting wsj_????.*.json
1 v
wsj_????.*.json
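To make the stage0 steps concrete, here are minimal sketches of three of them. First, extracting plain sentences from a .onf file in the spirit of onf2txt.py. This assumes the usual .onf layout, where each "Plain sentence:" header is followed by a dashed underline and the (possibly wrapped) sentence text; the real script may handle more cases.

    import sys

    def onf_to_sentences(onf_path):
        """Extract the text under each "Plain sentence:" header of a .onf file."""
        with open(onf_path) as f:
            lines = f.readlines()
        sentences = []
        i = 0
        while i < len(lines):
            if lines[i].strip() == "Plain sentence:":
                i += 2  # skip the header and its dashed underline
                parts = []
                while i < len(lines) and lines[i].strip():
                    parts.append(lines[i].strip())
                    i += 1
                sentences.append(" ".join(parts))
            else:
                i += 1
        return sentences

    if __name__ == "__main__":
        for sentence in onf_to_sentences(sys.argv[1]):
            print(sentence)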
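Second, the distribute-nombank.py step splits the single nombank.1.0 file into per-document .nom files. Assuming the standard NomBank proposition format, in which each line begins with the source file path (e.g. wsj/00/wsj_0012.mrg), a minimal version might be:

    import os
    from collections import defaultdict

    def distribute_nombank(nombank_path, out_dir):
        """Split nombank.1.0 into one .nom file per WSJ document."""
        by_doc = defaultdict(list)
        with open(nombank_path) as f:
            for line in f:
                if not line.strip():
                    continue
                source = line.split()[0]             # e.g. wsj/00/wsj_0012.mrg
                prefix, _ = os.path.splitext(os.path.basename(source))
                by_doc[prefix].append(line)          # prefix is e.g. wsj_0012
        for prefix, props in by_doc.items():
            with open(os.path.join(out_dir, prefix + ".nom"), "w") as out:
                out.writelines(props)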
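Third, the diagram leaves the .parse-to-dependencies converter unnamed. A plausible realization (an assumption, not necessarily what the pipeline uses) is Stanford CoreNLP's constituency-to-dependency converter, which can emit both the basic and the CC-processed dependency representations:

    import subprocess

    def parse_to_dep(parse_file, basic=True):
        """Convert a Penn Treebank .parse file to Stanford dependencies.

        Sketch only: assumes stanford-corenlp.jar sits in the working
        directory; the pipeline's actual converter and flags may differ.
        """
        out_file = parse_file.rsplit(".", 1)[0] + (".dep_basic" if basic else ".dep")
        cmd = [
            "java", "-cp", "stanford-corenlp.jar",
            "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
            "-treeFile", parse_file,
            # -basic emits basic dependencies; -CCprocessed emits the
            # collapsed representation with conjunct dependency propagation
            "-basic" if basic else "-CCprocessed",
        ]
        with open(out_file, "w") as out:
            subprocess.check_call(cmd, stdout=out)

    parse_to_dep("wsj_0001.parse", basic=True)   # -> wsj_0001.dep_basic
    parse_to_dep("wsj_0001.parse", basic=False)  # -> wsj_0001.dep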