Latest commit

06e9b45 · Aug 31, 2012

ONTONOTES WSJ PREPROCESSING PIPELINE

This pipeline generates a JSON file for every sentence in the WSJ corpus. It depends on OntoNotes 4.0, the OntoNotes DB Tool, Stanford CoreNLP, NomBank, and the BBN Pronoun Coreference and Entity Type Corpus, among other resources; see the shell scripts for the paths to these resources. The JSON files it generates can serve as input to the AMR Generation pipeline.

The preprocessing pipeline lives in the stage0 and stage1 directories. stage0 runs various tools to generate several files for each document. stage1 then generates per-sentence JSON files in two portions: the first produces the JSON files, and the second (stage1b) refines them.
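
Before running stage1, it can help to confirm that stage0 left behind every per-document file. The following is a minimal sketch, not part of the pipeline itself: the function name missing_stage0_outputs is hypothetical, and the default extension list is simply copied from the stage0 summary line of the schematic below (add others, such as nom, if your run needs them).

```python
from pathlib import Path

# Extensions taken from the schematic's stage0 summary line
# (wsj_????.{txt,json,parse,dep,dep_basic}); this list is an assumption
# and can be overridden by the caller.
STAGE0_EXTENSIONS = ("txt", "json", "parse", "dep", "dep_basic")


def missing_stage0_outputs(directory, doc_id, extensions=STAGE0_EXTENSIONS):
    """Return the expected stage0 file names for doc_id that are not in directory."""
    directory = Path(directory)
    expected = [f"{doc_id}.{ext}" for ext in extensions]
    return [name for name in expected if not (directory / name).is_file()]
```

For example, missing_stage0_outputs("some/output/dir", "wsj_0001") returns an empty list when all five files are present, and otherwise lists the ones still to be generated.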

Schematic of the data flow:

Original OntoNotes 4

  S  | prep_and_parse.sh -or- generate_json.sh
  T  |  + wsj_????.onf --->                  onf2txt.py ---> wsj_????.txt
  A  |  + wsj_????.txt --->                parse_txt.py ---> wsj_????.json
  G  |  + wsj_????.onf with no .parse ---> onf2parse.py ---> wsj_????.parse
  E  |  + nombank.1.0 -->         distribute-nombank.py ---> wsj_????.nom
  0  v  + wsj_????.parse --->                         wsj_????.dep{,_basic} [TODO: what about generate_json.sh?]
     
wsj_????.{txt,json,parse,dep,dep_basic}

  S  | generate_sent_json.sh
  T  |  + run_prefix.sh
  A  |      + Predicate.py
  G  |           (uses files created above + OntoNotes DB Tool API)
  E  |  + stage1b scripts: further refine the resulting wsj_????.*.json
  1  v
 
wsj_????.*.json
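
A downstream consumer needs to tell the per-sentence files (wsj_????.*.json) apart from the per-document wsj_????.json files that stage0 writes. The sketch below is one way to do that, under stated assumptions: the function name load_sentence_json is hypothetical, all files are assumed to sit in a single directory, and nothing is assumed about what the `*` part of the name contains or about the structure of the JSON itself.

```python
import json
from fnmatch import fnmatchcase
from pathlib import Path

# Naming pattern taken from the schematic; "?" matches one character and
# "*" matches any run of characters, so wsj_0001.json does NOT match.
SENTENCE_PATTERN = "wsj_????.*.json"


def load_sentence_json(directory):
    """Map each per-sentence file name in directory to its parsed JSON content."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and fnmatchcase(path.name, SENTENCE_PATTERN):
            with path.open(encoding="utf-8") as handle:
                results[path.name] = json.load(handle)
    return results
```

Matching on the file name rather than on a bare "*.json" glob keeps the per-document parser output out of the result.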