prep

06e9b45 · Aug 31, 2012

Name	Last commit message	Last commit date
stage0	Preprocessing Pipeline	Aug 31, 2012
stage1	Preprocessing Pipeline	Aug 31, 2012
README.md	Preprocessing Pipeline	Aug 31, 2012

README.md

ONTONOTES WSJ PREPROCESSING PIPELINE

This pipeline generates one JSON file for every sentence in the WSJ corpus. It depends on several external resources, including OntoNotes 4.0, the OntoNotes DB Tool, Stanford CoreNLP, NomBank, and the BBN Pronoun Coreference and Entity Type Corpus. See the shell scripts for the paths to these resources. The generated JSON files can serve as input to the AMR Generation pipeline.

The preprocessing pipeline consists of the stage0 and stage1 directories. stage0 runs various tools to generate several files for each document. stage1 then generates the per-sentence JSON files in two parts: the first part produces the JSON files, and the second part (stage1b) refines them.

Schematic of the data flow:

Original OntoNotes 4

  S  | prep_and_parse.sh -or- generate_json.sh
  T  |  + wsj_????.onf --->                  onf2txt.py ---> wsj_????.txt
  A  |  + wsj_????.txt --->                parse_txt.py ---> wsj_????.json
  G  |  + wsj_????.onf with no .parse ---> onf2parse.py ---> wsj_????.parse
  E  |  + nombank.1.0 -->         distribute-nombank.py ---> wsj_????.nom
  0  v  + wsj_????.parse --->                         wsj_????.dep{,_basic} [TODO: what about generate_json.sh?]
     
wsj_????.{txt,json,parse,dep,dep_basic}

  S  | generate_sent_json.sh
  T  |  + run_prefix.sh
  A  |      + Predicate.py
  G  |           (uses files created above + OntoNotes DB Tool API)
  E  |  + stage1b scripts: further refine the resulting wsj_????.*.json
  1  v
 
wsj_????.*.json
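
The file naming in the schematic above can be checked with a short Python sketch. It is not part of the pipeline; it only illustrates the conventions the diagram implies. The helper names (`missing_stage0_outputs`, `sentence_json_files`) and the example file names such as `wsj_0001.0.json` are hypothetical: the schematic gives only the pattern `wsj_????.*.json`, so the exact form of the middle component is an assumption. The extension list follows the schematic's summary line, which does not mention the `.nom` files, so those are not checked here.

```python
import glob
import os
import tempfile

# Extensions listed in the schematic's summary line for the stage0 outputs.
STAGE0_EXTENSIONS = ("txt", "json", "parse", "dep", "dep_basic")


def missing_stage0_outputs(data_dir, doc_id):
    """Return the stage0 extensions for which doc_id has no file in data_dir."""
    return [
        ext
        for ext in STAGE0_EXTENSIONS
        if not os.path.isfile(os.path.join(data_dir, f"{doc_id}.{ext}"))
    ]


def sentence_json_files(data_dir, doc_id):
    """Return sorted paths matching <doc_id>.*.json (the stage1 outputs).

    The per-document stage0 file <doc_id>.json does not match, because the
    pattern requires an extra dot-separated component before ".json".
    """
    pattern = os.path.join(glob.escape(data_dir), f"{doc_id}.*.json")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Demonstration on an empty temporary directory with made-up file names.
    with tempfile.TemporaryDirectory() as data_dir:
        for name in ("wsj_0001.txt", "wsj_0001.json", "wsj_0001.0.json"):
            open(os.path.join(data_dir, name), "w").close()
        print("missing:", missing_stage0_outputs(data_dir, "wsj_0001"))
        print("sentences:", sentence_json_files(data_dir, "wsj_0001"))
```

A check like this could run between the two stages to flag documents whose stage0 outputs are incomplete before stage1 starts.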
