Definition of Requirement

The notation @attr means that attr is an attribute rather than a tag.

Tag and attribute names (e.g., @form and @characterOffsetBegin) are chosen according to the following principles:

Use the naming convention of Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
Otherwise, follow Stanford CoreNLP if available (e.g., @characterOffsetBegin).

Structure of Requirement:

There are two types of requirements: generic requirements and language-specific requirements.
Generic requirements provide a common interface across languages. Language-specific or tool-specific annotations are handled by language-specific requirements.
Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes given by Tokenize, POS, and Lemma (e.g., @form, @pos, and @lemma) are provided, in addition to the annotations of TokenizeWithIPA itself.
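
The hierarchy described above can be sketched as a small lookup over parent links. This is only an illustration: the names REQUIREMENTS, ancestors, and satisfies are hypothetical and are not taken from any tool's API. The parent lists themselves follow the "Parent = (...)" lines on this page.

```python
# Hypothetical model of the requirement hierarchy (illustrative names only).
# Each requirement maps to the list of its direct parents.
REQUIREMENTS = {
    "Tokenize": [],
    "POS": [],
    "Lemma": [],
    "TokenizeWithIPA": ["Tokenize", "POS", "Lemma"],
    "TokenizeWithJumandic": ["Tokenize", "POS", "Lemma"],
    "Juman": ["TokenizeWithJumandic"],
}


def ancestors(name):
    """Return the requirement itself plus everything reachable via its parents."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(REQUIREMENTS[current])
    return seen


def satisfies(provided, required):
    """True if an annotator providing `provided` also guarantees `required`."""
    return required in ancestors(provided)


print(satisfies("Juman", "Lemma"))   # True: Juman -> TokenizeWithJumandic -> Lemma
print(satisfies("Tokenize", "POS"))  # False: Tokenize has no parents
```

Because the check walks parents transitively, a requirement such as Juman satisfies everything its parent TokenizeWithJumandic satisfies.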

Generic requirements (universal across languages)

Ssplit

Provide sentence boundaries.
Assigned tags:
<sentences> in <document>
<sentence id="*"> in <sentences>

Tokenize

Segment a sentence into tokens.
<tokens> is added below <sentence>, and each <tokens> contains one or more <token> elements.
Each <token> has the following attributes:
@id, @form, @characterOffsetBegin, and @characterOffsetEnd
Example:
<token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
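
As a rough sketch of how these attributes fit together, the following Python builds a <tokens> element from a sentence string. Whitespace splitting, the function name tokenize, and the id pattern (sentence id, underscore, token index) are assumptions made for the illustration. The end offset is treated as exclusive, which matches the example above ("The" spans 0 to 3).

```python
import re
import xml.etree.ElementTree as ET


def tokenize(sentence_id, text):
    """Build a <tokens> element by whitespace splitting (an assumption of this sketch)."""
    tokens = ET.Element("tokens")
    for index, match in enumerate(re.finditer(r"\S+", text)):
        ET.SubElement(tokens, "token", {
            "id": f"{sentence_id}_{index}",
            "form": match.group(),
            # Offsets are relative to the text passed in; the end is exclusive.
            "characterOffsetBegin": str(match.start()),
            "characterOffsetEnd": str(match.end()),
        })
    return tokens


tokens = tokenize("s0", "The dog barks")
print(ET.tostring(tokens, encoding="unicode"))
```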

POS

Assign a part-of-speech tag to each token.
@pos is provided on <token>.

UPOS

Assign a universal part-of-speech tag to each token.
@upos is provided on <token>.

Lemma

Assign a lemma to each token.
@lemma is provided on <token>.

UDFeatures

Assign universal features (http://universaldependencies.org/u/feat/index.html) to each token.
@feats is provided on <token>.

Parse

Constituency parse of a sentence.
Add <parse> below <sentence>; <parse> contains several <span> elements.
@root on <parse> points to the id of the root <span>.
Each <span> corresponds to one rule and has the following attributes:
@symbol: the nonterminal symbol (e.g., NP) governing the span.
@children: space-separated ids (e.g., sp2 sp3), each pointing to another <span> or a <token>. Consequently, <parse> has no preterminal nodes; preterminals are represented by the @pos of <token> instead.
Example:

<sentence id="s0" characterOffsetBegin="0" characterOffsetEnd="8">
  dogs run
  <tokens>
    <token pos="NNS" characterOffsetEnd="4" characterOffsetBegin="0" id="t0" form="dogs"/>
    <token pos="VBP" characterOffsetEnd="8" characterOffsetBegin="5" id="t1" form="run"/>
  </tokens>
  <parse root="s0_berksp0">
    <span id="s0_berksp0" symbol="S" children="s0_berksp1 s0_berksp2"/> <!-- these point to other spans -->
    <span id="s0_berksp1" symbol="NP" children="t0"/> <!-- this points to the first token -->
    <span id="s0_berksp2" symbol="VP" children="t1"/>
  </parse>
</sentence>
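
To make the role of @root and @children concrete, here is a minimal sketch that turns a fragment in the same shape as the example above into a bracketed tree. The function name to_bracketed is hypothetical. Preterminals are taken from the @pos of each <token>, since <parse> itself carries none.

```python
import xml.etree.ElementTree as ET

# A fragment in the same shape as the example above.
SENTENCE_XML = """
<sentence id="s0" characterOffsetBegin="0" characterOffsetEnd="8">
  dogs run
  <tokens>
    <token id="t0" form="dogs" pos="NNS" characterOffsetBegin="0" characterOffsetEnd="4"/>
    <token id="t1" form="run" pos="VBP" characterOffsetBegin="5" characterOffsetEnd="8"/>
  </tokens>
  <parse root="s0_berksp0">
    <span id="s0_berksp0" symbol="S" children="s0_berksp1 s0_berksp2"/>
    <span id="s0_berksp1" symbol="NP" children="t0"/>
    <span id="s0_berksp2" symbol="VP" children="t1"/>
  </parse>
</sentence>
"""


def to_bracketed(sentence):
    """Render the <parse> of a <sentence> as a bracketed tree string."""
    tokens = {t.get("id"): t for t in sentence.iter("token")}
    spans = {s.get("id"): s for s in sentence.iter("span")}

    def render(node_id):
        # A child id refers either to a token (leaf) or to another span.
        if node_id in tokens:
            token = tokens[node_id]
            return f"({token.get('pos')} {token.get('form')})"
        span = spans[node_id]
        children = " ".join(render(child) for child in span.get("children").split())
        return f"({span.get('symbol')} {children})"

    return render(sentence.find("parse").get("root"))


sentence = ET.fromstring(SENTENCE_XML.strip())
print(to_bracketed(sentence))  # (S (NP (NNS dogs)) (VP (VBP run)))
```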

Dependencies

Dependency parse of a sentence.
Add <dependencies> below <sentence>; <dependencies> contains several <dependency> elements.
Each <dependency> has the following attributes:
@head: @id of the head token of the dependency link. If the @dependent of the link is the root token, the special symbol ROOT is used (e.g., <dependency head="ROOT" dependent="t0" ...> means that t0 is the root token of the dependency tree).
@dependent: @id of the dependent token of the dependency link.
@deprel: label of the dependency link.
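
The following sketch shows one way a consumer might read these attributes. The XML fragment, its deprel labels, and the function name read_dependencies are illustrative assumptions, not output of a particular parser. The only convention taken from the text above is that the root token's link has @head="ROOT".

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: token ids and deprel labels are illustrative only.
SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="t0" form="dogs"/>
    <token id="t1" form="run"/>
  </tokens>
  <dependencies>
    <dependency head="ROOT" dependent="t1" deprel="root"/>
    <dependency head="t1" dependent="t0" deprel="nsubj"/>
  </dependencies>
</sentence>
"""


def read_dependencies(sentence):
    """Return (root token id, {dependent id: (head id, deprel)})."""
    root = None
    heads = {}
    for dep in sentence.find("dependencies").iter("dependency"):
        head, dependent = dep.get("head"), dep.get("dependent")
        if head == "ROOT":
            # The special ROOT symbol marks the root token of the tree.
            root = dependent
        else:
            heads[dependent] = (head, dep.get("deprel"))
    return root, heads


sentence = ET.fromstring(SENTENCE_XML.strip())
root, heads = read_dependencies(sentence)
print(root)   # t1
print(heads)  # {'t0': ('t1', 'nsubj')}
```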

BasicDependencies

Parent = (Dependencies)
Add a <dependencies> element with the attribute @type="basic".
This is intended to replicate the basic dependencies of Stanford CoreNLP, but it leaves one problem open: the requirement says nothing about the annotation scheme of the dependencies, which may be Stanford typed dependencies (SD) or Universal Dependencies (UD). We defer this issue, since Stanford CoreNLP does not distinguish the annotation scheme either.

CollapsedDependencies

Parent = (Dependencies)
Add a <dependencies> element with the attribute @type="collapsed", which corresponds to the collapsed dependencies in the Stanford parser output.

CollapsedCCProcessedDependencies

Parent = (Dependencies)
Add a <dependencies> element with the attribute @type="collapsed-ccprocessed", which corresponds to the collapsed cc-processed dependencies in the Stanford parser output.
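
Since a sentence may carry several <dependencies> elements that differ only in @type, a consumer has to select the one it needs. The sketch below assumes a sentence annotated with both the basic and the collapsed variants. The fragment and the helper name dependencies_of_type are hypothetical, and the labels (prep, pobj, prep_in) are given only to illustrate how collapsing folds a preposition into the relation label in the Stanford typed dependencies scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation of "dogs run in parks" (labels are illustrative).
SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="t0" form="dogs"/>
    <token id="t1" form="run"/>
    <token id="t2" form="in"/>
    <token id="t3" form="parks"/>
  </tokens>
  <dependencies type="basic">
    <dependency head="ROOT" dependent="t1" deprel="root"/>
    <dependency head="t1" dependent="t0" deprel="nsubj"/>
    <dependency head="t1" dependent="t2" deprel="prep"/>
    <dependency head="t2" dependent="t3" deprel="pobj"/>
  </dependencies>
  <dependencies type="collapsed">
    <dependency head="ROOT" dependent="t1" deprel="root"/>
    <dependency head="t1" dependent="t0" deprel="nsubj"/>
    <dependency head="t1" dependent="t3" deprel="prep_in"/>
  </dependencies>
</sentence>
"""


def dependencies_of_type(sentence, dep_type):
    """Return the <dependencies> child with the given @type, or None if absent."""
    return sentence.find(f"dependencies[@type='{dep_type}']")


sentence = ET.fromstring(SENTENCE_XML.strip())
basic = dependencies_of_type(sentence, "basic")
collapsed = dependencies_of_type(sentence, "collapsed")
print(len(basic), len(collapsed))  # 4 3
```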

Chunk

Result of chunking (shallow parsing).
Add <chunks> below <sentence>; <chunks> contains several <chunk> elements.
Each <chunk> has a @tokens attribute listing the ids of the tokens that the chunk spans (e.g., @tokens="t0 t1").

NER

Add <NEs> below <sentence>; <NEs> contains several <NE> elements.
Each <NE> is a named entity and has the following attributes:
@tokens: ids of the tokens in the entity
@label: entity label (e.g., ORGANIZATION)
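
As an illustration, the sketch below resolves each <NE> back to its surface text. It assumes that @tokens holds space-separated token ids, as in the Chunk example (@tokens="t0 t1"); the fragment and the function name named_entities are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; @tokens is assumed to be space-separated token ids.
SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="t0" form="Stanford"/>
    <token id="t1" form="University"/>
    <token id="t2" form="opened"/>
  </tokens>
  <NEs>
    <NE tokens="t0 t1" label="ORGANIZATION"/>
  </NEs>
</sentence>
"""


def named_entities(sentence):
    """Return a list of (surface text, label) pairs, one per <NE>."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    return [
        (" ".join(forms[token_id] for token_id in ne.get("tokens").split()),
         ne.get("label"))
        for ne in sentence.iter("NE")
    ]


sentence = ET.fromstring(SENTENCE_XML.strip())
print(named_entities(sentence))  # [('Stanford University', 'ORGANIZATION')]
```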

Coreference

Add <coreferences>, a set of <coreference> chains for a document. Each chain consists of several mentions, which may lie in different sentences.
Attributes on <coreference>:
@mentions: entity ids, which may belong to one sentence or to several sentences.

PredArg

Add <predargs> below <sentence>, a set of <predarg> elements. Each <predarg> represents a predicate-argument link between a predicate in the sentence and a predicted argument (possibly in another sentence).
Attributes on <predarg>:
@pred: id of the predicate (a basic phrase in the case of KNP).
@arg: id of the argument (a coreference, i.e., a set of mentions, in the case of KNP).
@deprel: relation label of this predicate-argument link.

Language-specific requirements

Japanese

TokenizeWithIPA

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on the IPA dictionary.
In addition to the attributes provided by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@pos1, @pos2, @pos3, @cType, @cForm, @yomi, @pron
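
A consumer could check this guarantee roughly as follows. This is a sketch under assumptions: the table REQUIRED_ATTRIBUTES, the helper names, and the sample token values are all illustrative, and only the attribute names come from this page.

```python
import xml.etree.ElementTree as ET

# Attributes each requirement adds by itself (names taken from this page).
REQUIRED_ATTRIBUTES = {
    "Tokenize": {"id", "form", "characterOffsetBegin", "characterOffsetEnd"},
    "POS": {"pos"},
    "Lemma": {"lemma"},
    "TokenizeWithIPA": {"pos1", "pos2", "pos3", "cType", "cForm", "yomi", "pron"},
}
PARENTS = {"TokenizeWithIPA": ["Tokenize", "POS", "Lemma"]}


def required_attributes(requirement):
    """Collect the attributes of a requirement and of all its parents."""
    attrs = set(REQUIRED_ATTRIBUTES[requirement])
    for parent in PARENTS.get(requirement, []):
        attrs |= required_attributes(parent)
    return attrs


def missing_attributes(token, requirement):
    """Return the guaranteed attributes that are absent from a <token>."""
    return required_attributes(requirement) - set(token.attrib)


# Illustrative token; the attribute values are made up for the sketch.
complete = ET.fromstring(
    '<token id="s0_0" form="走る" characterOffsetBegin="0" characterOffsetEnd="2" '
    'pos="動詞" lemma="走る" pos1="自立" pos2="*" pos3="*" '
    'cType="五段・ラ行" cForm="基本形" yomi="ハシル" pron="ハシル"/>'
)
incomplete = ET.fromstring(
    '<token id="s0_0" form="走る" characterOffsetBegin="0" characterOffsetEnd="2" '
    'pos="動詞" lemma="走る" pos1="自立" pos2="*" pos3="*" '
    'cType="五段・ラ行" cForm="基本形"/>'
)

print(missing_attributes(complete, "TokenizeWithIPA"))            # set()
print(sorted(missing_attributes(incomplete, "TokenizeWithIPA")))  # ['pron', 'yomi']
```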

TokenizeWithJumandic

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on the jumandic dictionary.
In addition to the attributes provided by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@pos1, @cType, @cForm, @yomi, @misc

TokenizeWithUnidic

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on the unidic dictionary.
In addition to the attributes provided by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@pos1, @pos2, @pos3, @cType, @cForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm

Juman

Parent = (TokenizeWithJumandic)
Guarantee that morphological analysis is performed by juman itself, not by another analyzer that uses jumandic (that case corresponds to TokenizeWithJumandic).
This requirement is necessary because, for various reasons, only the output of juman can be processed by KNP in a pipeline.
In addition to the attributes provided by TokenizeWithJumandic, guarantee the following attributes on each token:
@posId, @pos1Id, @cTypeId, @cFormId
<token> may have a child element <tokenAlt> (alternative analysis), which preserves ambiguities that juman does not resolve.

CabochaChunk

Parent = (Chunk)
In addition to the attributes provided by Chunk, guarantee the following attributes: @head and @func.

KNPChunk

Parent = (Chunk)
In addition to the attributes provided by Chunk, guarantee the attribute @misc.

ChunkDependencies

Dependencies between chunks.
Add <dependencies unit="chunk">

LabeledChunkDependencies

Add @deprel for each <dependency> in <dependencies unit="chunk">.

BasicPhrase

Output of KNP

BasicPhraseDependencies

Output of KNP
Dependencies between basic phrases.
Add <dependencies unit="basicPhrase">

KNPPredArg

@pred points to some basic phrase.
@arg points to some coreference, a set of mentions.
Additional attributes on <predarg>:
@flag: see CaseRelation
@text: see CaseRelation

CaseRelation

Add <caseRelations> below <sentence>, a set of <caseRelation>.
Attributes on <caseRelation>:
@pred: basic phrase
@arg: token (unk if unspecified)
@deprel: Relation label (e.g., ガ or ヲ)
@flag: C, N, O, etc. (see http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP%2F%E6%A0%BC%E8%A7%A3%E6%9E%90%E7%B5%90%E6%9E%9C%E6%9B%B8%E5%BC%8F)
@text: 太郎, 不特定, -, etc. (text output by KNP)
