Definition of Requirement

Hiroshi Noji edited this page Feb 8, 2018 · 28 revisions

@attr means attr is an attribute rather than a tag.

Tag and attribute names (e.g., @form and @characterOffsetBegin) are determined by the following principles:

  • Use the naming convention found in Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
  • Otherwise, follow StanfordCoreNLP if available (e.g., @characterOffsetBegin).

Structure of Requirement:

  • There are two types of requirements: generic requirements and language-specific requirements.
  • Generic requirements provide the common interface across languages. Language-specific or tool-specific annotation can be handled by language-specific requirements.
  • Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes provided by Tokenize, POS, and Lemma (e.g., @form, @pos, and @lemma) are available, in addition to the annotations of TokenizeWithIPA itself.
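The parent relation described above can be sketched as a small lookup table. This is only an illustration under stated assumptions: the `PARENTS` dictionary and the helper names `implied_requirements` and `satisfies` are hypothetical and are not part of the actual implementation.

```python
# Hypothetical model of the requirement hierarchy (not the real API).
# Each requirement maps to the list of its direct parents.
PARENTS = {
    "Tokenize": [],
    "POS": [],
    "Lemma": [],
    "TokenizeWithIPA": ["Tokenize", "POS", "Lemma"],
}


def implied_requirements(requirement, parents=PARENTS):
    """Return the requirement itself plus every ancestor it implies."""
    seen = set()
    stack = [requirement]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def satisfies(provided, required, parents=PARENTS):
    """True if an annotator that provides `provided` also meets `required`."""
    return required in implied_requirements(provided, parents)


print(sorted(implied_requirements("TokenizeWithIPA")))
# ['Lemma', 'POS', 'Tokenize', 'TokenizeWithIPA']
print(satisfies("TokenizeWithIPA", "POS"))  # True
print(satisfies("POS", "TokenizeWithIPA"))  # False
```

Walking the parents transitively means that a deeper requirement, such as Juman below, would also satisfy Tokenize through TokenizeWithJumandic if it were added to the table.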

Generic requirements (universal across languages)

Ssplit

  • Provide sentence boundaries.
  • Assigned tags:
  • <sentences> in <document>
  • <sentence id="*"> in <sentences>

Tokenize

  • Segment a sentence into tokens.
  • <tokens> is provided below <sentence>. Each <tokens> has several <token>s.
  • Each <token> has the following attributes:
  • @id, @form, @characterOffsetBegin, and @characterOffsetEnd
  • Example:
  • <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
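As a rough illustration of how the offset attributes can be consumed, the sketch below checks that each token's offsets recover its @form from the input text. It assumes that offsets index into the given text and that @characterOffsetEnd is exclusive (consistent with "The" spanning 0 to 3). The second token and the helper name `mismatched_tokens` are made up for the example, and a tokenizer that normalizes surface forms would legitimately fail this check.

```python
import xml.etree.ElementTree as ET

TEXT = "The dog"

# The first token is the example above; the second one is hypothetical.
TOKENS = """
<tokens>
  <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
  <token id="s0_1" form="dog" characterOffsetBegin="4" characterOffsetEnd="7"/>
</tokens>
"""


def mismatched_tokens(text, tokens):
    """Return the ids of tokens whose offsets do not recover @form."""
    bad = []
    for tok in tokens.iter("token"):
        begin = int(tok.get("characterOffsetBegin"))
        end = int(tok.get("characterOffsetEnd"))  # assumed to be exclusive
        if text[begin:end] != tok.get("form"):
            bad.append(tok.get("id"))
    return bad


tokens = ET.fromstring(TOKENS.strip())
print(mismatched_tokens(TEXT, tokens))  # []
```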

POS

  • Assign a part-of-speech tag to each token.
  • @pos is provided on <token>.

UPOS

  • Assign a universal part-of-speech tag to each token.
  • @upos is provided on <token>.

Lemma

  • Assign a lemma to each token.
  • @lemma is provided on <token>.

UDFeatures

Parse

  • Constituent parse for a sentence.
  • Add <parse> below <sentence>, which has several <span>s.
  • @root in <parse> points to the id of the root <span>.
  • Each <span> corresponds to one rule, having the following attributes:
  • @symbol: the nonterminal symbol (e.g., NP) governing the span.
  • @children: the space-separated ids of the children (e.g., sp2 sp3); each id points to another <span> or to a <token>. Consequently, <parse> has no preterminal annotations; these are carried by the @pos of each <token> instead.
  • Example:
<sentence id="s0" characterOffsetBegin="0" characterOffsetEnd="8">
  dogs run
  <tokens>
    <token pos="NNS" characterOffsetEnd="4" characterOffsetBegin="0" id="t0" form="dogs"/>
    <token pos="VBN" characterOffsetEnd="8" characterOffsetBegin="5" id="t1" form="run"/>
  </tokens>
  <parse root="s0_berksp0">
    <span id="s0_berksp0" symbol="S" children="s0_berksp1 s0_berksp2"/> <!-- these point to other spans -->
    <span id="s0_berksp1" symbol="NP" children="t0"/> <!-- this points to the first token -->
    <span id="s0_berksp2" symbol="VP" children="t1"/>
  </parse>
</sentence>
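To show how @root and @children fit together, the following sketch turns the example above into a bracketed tree. It is a minimal reading of the format, not the project's own code: the function name `to_bracketed` is hypothetical, and preterminals are filled in from the @pos of each <token>, as the description suggests.

```python
import xml.etree.ElementTree as ET

# The example sentence from above, with the inline comments left out.
SENTENCE = """
<sentence id="s0" characterOffsetBegin="0" characterOffsetEnd="8">
  dogs run
  <tokens>
    <token pos="NNS" characterOffsetEnd="4" characterOffsetBegin="0" id="t0" form="dogs"/>
    <token pos="VBN" characterOffsetEnd="8" characterOffsetBegin="5" id="t1" form="run"/>
  </tokens>
  <parse root="s0_berksp0">
    <span id="s0_berksp0" symbol="S" children="s0_berksp1 s0_berksp2"/>
    <span id="s0_berksp1" symbol="NP" children="t0"/>
    <span id="s0_berksp2" symbol="VP" children="t1"/>
  </parse>
</sentence>
"""


def to_bracketed(sentence):
    """Render the <parse> of a <sentence> as a bracketed tree string."""
    tokens = {t.get("id"): t for t in sentence.iter("token")}
    spans = {s.get("id"): s for s in sentence.iter("span")}

    def render(node_id):
        # A child id refers either to a token (leaf) or to another span.
        if node_id in tokens:
            tok = tokens[node_id]
            return "({} {})".format(tok.get("pos"), tok.get("form"))
        span = spans[node_id]
        children = " ".join(render(c) for c in span.get("children").split())
        return "({} {})".format(span.get("symbol"), children)

    return render(sentence.find("parse").get("root"))


sentence = ET.fromstring(SENTENCE.strip())
print(to_bracketed(sentence))  # (S (NP (NNS dogs)) (VP (VBN run)))
```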

Dependencies

  • Dependency parse for a sentence.
  • Add <dependencies> below <sentence>, which has several <dependency>s.
  • Each <dependency> has the following attributes:
  • @head: @id of the head token in one dependency link. If the @dependent of this link is the root token, the special ROOT symbol is used (e.g., <dependency head="ROOT" dependent="t0" ...> means that t0 is the root token of the dependency tree).
  • @dependent: @id of the dependent token in one dependency.
  • @deprel: label on the dependency link.
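A short sketch of reading these attributes follows. The two links and their labels (`root`, `nsubj`) are illustrative values for a two-token sentence, not output from a real parser, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical dependency annotation for a sentence with tokens t0 and t1.
DEPENDENCIES = """
<dependencies>
  <dependency head="ROOT" dependent="t1" deprel="root"/>
  <dependency head="t1" dependent="t0" deprel="nsubj"/>
</dependencies>
"""


def head_map(dependencies):
    """Map each dependent token id to its (head id, deprel) pair."""
    return {
        d.get("dependent"): (d.get("head"), d.get("deprel"))
        for d in dependencies.iter("dependency")
    }


def root_tokens(dependencies):
    """Return the ids of tokens attached to the special ROOT symbol."""
    return [
        d.get("dependent")
        for d in dependencies.iter("dependency")
        if d.get("head") == "ROOT"
    ]


deps = ET.fromstring(DEPENDENCIES.strip())
print(head_map(deps)["t0"])  # ('t1', 'nsubj')
print(root_tokens(deps))     # ['t1']
```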

BasicDependencies

  • Parent = (Dependencies)
  • Add a special <dependencies> element that has the @type="basic" attribute.
  • This is intended to replicate Stanford CoreNLP's basic dependencies, but it has some problems. Specifically, this requirement says nothing about the annotation scheme of the dependencies, which may be Stanford typed dependencies (SD) or Universal Dependencies (UD). We defer this issue, as Stanford CoreNLP does not distinguish the annotation scheme either.

CollapsedDependencies

  • Parent = (Dependencies)
  • Add a special <dependencies> element that has the @type="collapsed" attribute, which corresponds to the collapsed dependencies in the Stanford parser output.

CollapsedCCProcessedDependencies

  • Parent = (Dependencies)
  • Add a special <dependencies> element that has the @type="collapsed-ccprocessed" attribute, which corresponds to the collapsed cc-processed dependencies in the Stanford parser output.
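Because the three requirements above differ only in @type, a consumer can pick the variant it needs by attribute. The sketch below assumes that several <dependencies> elements may sit side by side under one <sentence>. The sample content and the helper name `find_dependencies` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence carrying two of the three dependency variants.
SENTENCE = """
<sentence id="s0">
  <dependencies type="basic">
    <dependency head="ROOT" dependent="t1" deprel="root"/>
    <dependency head="t1" dependent="t0" deprel="nsubj"/>
  </dependencies>
  <dependencies type="collapsed-ccprocessed">
    <dependency head="ROOT" dependent="t1" deprel="root"/>
    <dependency head="t1" dependent="t0" deprel="nsubj"/>
  </dependencies>
</sentence>
"""


def find_dependencies(sentence, dep_type):
    """Return the <dependencies> child with the given @type, or None."""
    return sentence.find("dependencies[@type='{}']".format(dep_type))


sentence = ET.fromstring(SENTENCE.strip())
print(find_dependencies(sentence, "basic") is not None)  # True
print(find_dependencies(sentence, "collapsed") is None)  # True
```

Returning None for a missing variant lets the caller decide whether an absent requirement is an error.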

Chunk

  • Result of chunking or shallow parsing.
  • Add <chunks> below <sentence>, which has several <chunk>s.
  • Each <chunk> has a @tokens attribute listing the tokens that the chunk spans (e.g., @tokens="t0 t1").

NER

  • Add <NEs> below <sentence>, which has several <NE>s.
  • Each <NE> is a named entity and has the following attributes:
  • @tokens: the ids of the tokens in the entity.
  • @label: the entity label (e.g., ORGANIZATION).
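The sketch below resolves each <NE> back to its surface text. It assumes that @tokens holds space-separated token ids, as in the @tokens="t0 t1" example given for Chunk. The sample sentence and the helper name `named_entities` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; only the attributes needed here are shown.
SENTENCE = """
<sentence id="s0">
  <tokens>
    <token id="t0" form="Stanford"/>
    <token id="t1" form="University"/>
    <token id="t2" form="opened"/>
  </tokens>
  <NEs>
    <NE tokens="t0 t1" label="ORGANIZATION"/>
  </NEs>
</sentence>
"""


def named_entities(sentence):
    """Return (label, text) pairs for every <NE> in the sentence."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    return [
        (ne.get("label"), " ".join(forms[i] for i in ne.get("tokens").split()))
        for ne in sentence.iter("NE")
    ]


sentence = ET.fromstring(SENTENCE.strip())
print(named_entities(sentence))  # [('ORGANIZATION', 'Stanford University')]
```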

Coreference

  • Add <coreferences>, a set of <coreference> chains for a document, each of which consists of several mentions within sentences.
  • Attributes on <coreference>:
  • @mentions: entity ids, which may come from one sentence or from several sentences.

PredArg

  • Add <predargs> below <sentence>, a set of <predarg>s, each of which represents a predicate-argument link between a predicate in the sentence and a predicted argument (possibly in another sentence).
  • Attributes on <predarg>:
  • @pred: id of the predicate (a basic phrase in the case of KNP).
  • @arg: id of the argument (a coreference, that is, a set of mentions, in the case of KNP).
  • @deprel: relation label on this predicate-argument link.
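One possible way to consume these links is to group them by predicate, as sketched below. The ids (`p1`, `c0`, `c1`) and the relation labels are placeholders chosen for the example, and `arguments_by_predicate` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical predicate-argument links; ids and labels are placeholders.
PREDARGS = """
<predargs>
  <predarg pred="p1" arg="c0" deprel="ga"/>
  <predarg pred="p1" arg="c1" deprel="wo"/>
</predargs>
"""


def arguments_by_predicate(predargs):
    """Group (deprel, arg id) pairs under the id of their predicate."""
    grouped = defaultdict(list)
    for link in predargs.iter("predarg"):
        grouped[link.get("pred")].append((link.get("deprel"), link.get("arg")))
    return dict(grouped)


predargs = ET.fromstring(PREDARGS.strip())
print(arguments_by_predicate(predargs))
# {'p1': [('ga', 'c0'), ('wo', 'c1')]}
```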

Language-specific requirements

Japanese

TokenizeWithIPA

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the IPA dictionary.
  • In addition to the attributes provided by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @pos1, @pos2, @pos3, @cType, @cForm, @yomi, @pron

TokenizeWithJumandic

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the jumandic dictionary.
  • In addition to the attributes provided by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @pos1, @cType, @cForm, @yomi, @misc

TokenizeWithUnidic

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the unidic dictionary.
  • In addition to the attributes provided by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @pos1, @pos2, @pos3, @cType, @cForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm

Juman

  • Parent = (TokenizeWithJumandic)
  • Guarantee that morphological analysis is performed by juman itself (not by another analyzer that uses jumandic, which is covered by TokenizeWithJumandic).
  • This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
  • In addition to the attributes provided by TokenizeWithJumandic, guarantee the following attributes on each token:
  • @posId, @pos1Id, @cTypeId, @cFormId
  • <token> may have a child element <tokenAlt> (alternative analysis), which preserves ambiguities that juman does not resolve.

CabochaChunk

  • Parent = (Chunk)
  • In addition to the attributes provided by Chunk, guarantee the following attributes: @head and @func.

KNPChunk

  • Parent = (Chunk)
  • In addition to the attributes provided by Chunk, guarantee the attribute @misc.

ChunkDependencies

  • Dependencies between chunks.
  • Add <dependencies unit="chunk">

LabeledChunkDependencies

  • Add @deprel to each <dependency> in <dependencies unit="chunk">.

BasicPhrase

  • Output of KNP

BasicPhraseDependencies

  • Output of KNP
  • Dependencies between basic phrases.
  • Add <dependencies unit="basicPhrase">

KNPPredArg

  • @pred points to some basic phrase.
  • @arg points to some coreference, a set of mentions.
  • Additional attributes on <predarg>:
  • @flag: see CaseRelation
  • @text: see CaseRelation

CaseRelation