
Definition of Requirement

Hiroshi Noji edited this page Feb 28, 2016 · 28 revisions

@attr means attr is an attribute rather than a tag.

Tag and attribute names (e.g., @form and @characterOffsetBegin) are determined by the following principles:

  • Use the naming convention found in Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
  • Otherwise, follow StanfordCoreNLP if available (e.g., @characterOffsetBegin).

Structure of Requirement:

  • There are two types of requirements: generic requirements and language-specific requirements.
  • Generic requirements provide the common interface across languages. Language-specific or tool-specific annotation can be handled by language-specific requirements.
  • Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees all attributes given by Tokenize, POS, and Lemma (for instance, @form, @pos, and @lemma), in addition to the annotations of TokenizeWithIPA itself.
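
The parent relation described above can be sketched as a small lookup table. This is only an illustration, not the actual implementation: the dictionaries `PARENTS` and `OWN_ATTRS` and the functions `guaranteed_attrs` and `satisfies` are hypothetical names, while the requirement and attribute names are taken from this page.

```python
# Hypothetical model of the requirement hierarchy (illustrative only).
PARENTS = {
    "Tokenize": [],
    "POS": [],
    "Lemma": [],
    "TokenizeWithIPA": ["Tokenize", "POS", "Lemma"],
}

# Attributes each requirement adds by itself, excluding inherited ones.
OWN_ATTRS = {
    "Tokenize": {"id", "form", "characterOffsetBegin", "characterOffsetEnd"},
    "POS": {"pos"},
    "Lemma": {"lemma"},
    "TokenizeWithIPA": {"pos1", "pos2", "pos3", "cType", "cForm", "yomi", "pron"},
}


def guaranteed_attrs(requirement):
    """Collect the attributes guaranteed by a requirement and all its ancestors."""
    attrs = set(OWN_ATTRS[requirement])
    for parent in PARENTS[requirement]:
        attrs |= guaranteed_attrs(parent)
    return attrs


def satisfies(provided, required):
    """True if `provided` is `required` itself or has it as an ancestor."""
    if provided == required:
        return True
    return any(satisfies(parent, required) for parent in PARENTS[provided])
```

Under this sketch, an annotator that provides TokenizeWithIPA can be used wherever POS is required, but not the other way around.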

Generic requirements (universal across languages)

Ssplit

  • Provide sentence boundaries.
  • Assigned tags:
  • <sentences> in <document>
  • <sentence id="*"> in <sentences>

Tokenize

  • Segment a sentence into tokens.
  • <tokens> is provided below <sentence>. Each <tokens> contains several <token> elements.
  • Each <token> has the following attributes:
  • @id, @form, @characterOffsetBegin, and @characterOffsetEnd
  • Example:
  • <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
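
A minimal sketch of how a consumer might verify the Tokenize guarantees. The second token, the input text, and the function `check_tokens` are hypothetical. The check also assumes that offsets are relative to the sentence text and that the end offset is exclusive, which matches the example above but is not stated on this page.

```python
import xml.etree.ElementTree as ET

TEXT = "The cat"  # hypothetical input sentence

SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
    <token id="s0_1" form="cat" characterOffsetBegin="4" characterOffsetEnd="7"/>
  </tokens>
</sentence>
"""

# Attributes that Tokenize guarantees on every <token>.
REQUIRED = ("id", "form", "characterOffsetBegin", "characterOffsetEnd")


def check_tokens(sentence, text):
    """Return the token forms after checking attributes and offsets."""
    forms = []
    for token in sentence.find("tokens").findall("token"):
        missing = [name for name in REQUIRED if name not in token.attrib]
        if missing:
            raise ValueError(f"token lacks attributes: {missing}")
        begin = int(token.get("characterOffsetBegin"))
        end = int(token.get("characterOffsetEnd"))
        # Assumption: sentence-relative offsets with an exclusive end.
        if text[begin:end] != token.get("form"):
            raise ValueError(f"offset mismatch for {token.get('id')}")
        forms.append(token.get("form"))
    return forms


forms = check_tokens(ET.fromstring(SENTENCE_XML), TEXT)
```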

POS

  • Assign a part-of-speech tag to each token.
  • @pos is provided on <token>.

Lemma

  • Assign a lemma to each token.
  • @lemma is provided on <token>.

Parse

  • Constituent parse for a sentence.
  • <parse> below <sentence> has several <span> elements.
  • Each <span> corresponds to one rule, having the following attributes:
  • @type, @symbol, and @children

Dependencies

  • Dependency parse for a sentence.
  • <dependencies> below <sentence> has several <dependency>.
  • Each <dependency> has the following attributes:
  • @head, @dependent, and @deprel
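
As an illustration, the following sketch reads <dependency> elements into (head, relation, dependent) triples. It assumes that @head and @dependent hold token ids and that the root is marked with the head value "ROOT"; neither is specified on this page. The tokens are abbreviated to @id and @form, and `read_dependencies` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The"/>
    <token id="s0_1" form="cat"/>
    <token id="s0_2" form="sleeps"/>
  </tokens>
  <dependencies>
    <dependency head="s0_1" dependent="s0_0" deprel="det"/>
    <dependency head="s0_2" dependent="s0_1" deprel="nsubj"/>
    <dependency head="ROOT" dependent="s0_2" deprel="root"/>
  </dependencies>
</sentence>
"""


def read_dependencies(sentence):
    """Return (head form, deprel, dependent form) triples for one sentence."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    triples = []
    for dep in sentence.find("dependencies").findall("dependency"):
        # Heads that are not token ids (the assumed "ROOT") are kept as-is.
        head = forms.get(dep.get("head"), dep.get("head"))
        dependent = forms[dep.get("dependent")]
        triples.append((head, dep.get("deprel"), dependent))
    return triples


triples = read_dependencies(ET.fromstring(SENTENCE_XML))
```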

Chunk

  • Result of chunking or shallow parsing.
  • <chunks> below <sentence> has several <chunk>.
  • Each <chunk> has a @tokens attribute.
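
A small sketch of how chunks might be resolved back to token forms. It assumes that @tokens is a space-separated list of token ids, which this page does not state explicitly. The tokens are abbreviated to @id and @form, and `chunk_forms` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The"/>
    <token id="s0_1" form="cat"/>
    <token id="s0_2" form="sleeps"/>
  </tokens>
  <chunks>
    <chunk tokens="s0_0 s0_1"/>
    <chunk tokens="s0_2"/>
  </chunks>
</sentence>
"""


def chunk_forms(sentence):
    """Return, for each chunk, the list of token forms it covers."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    result = []
    for chunk in sentence.find("chunks").findall("chunk"):
        # Assumption: @tokens lists token ids separated by whitespace.
        result.append([forms[token_id] for token_id in chunk.get("tokens").split()])
    return result


chunked = chunk_forms(ET.fromstring(SENTENCE_XML))
```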

NER

  • Named entity recognition
  • This may be language-specific?

Language-specific requirements

Japanese

TokenizeWithIPA

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the IPA dictionary.
  • In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
  • @pos1, @pos2, @pos3, @cType, @cForm, @yomi, @pron

TokenizeWithJumandic

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the jumandic dictionary.
  • In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
  • @pos1, @cType, @cForm, @yomi, @misc

TokenizeWithUnidic

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the unidic dictionary.
  • In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
  • @pos1, @pos2, @pos3, @cType, @cForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm

Juman

  • Parent = (TokenizeWithJumandic)
  • Guarantee that morphological analysis is performed by juman itself, not by another analyzer that uses jumandic (the latter case corresponds to TokenizeWithJumandic).
  • This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
  • In addition to the attributes by TokenizeWithJumandic, guarantee the following attributes on each token:
  • @posId, @pos1Id, @cTypeId, @cFormId
  • <token> may have a child element <tokenAlt> (alternative analysis), which preserves ambiguities that juman does not resolve.

CabochaChunk

  • Parent = (Chunk)
  • In addition to attributes by Chunk, guarantee the following attributes: @head and @func.

KNPChunk

  • Parent = (Chunk)
  • In addition to attributes by Chunk, guarantee the attribute @misc.

ChunkDependencies

  • Dependencies between chunks.

BasicPhrase

  • Output of KNP

BasicPhraseDependencies

  • Output of KNP

Coreference

  • Add <coreferences> to a document: a set of <coreference> chains, each of which consists of several mentions in sentences.
  • Attributes on <coreference>:
  • @mentions: entity ids in some sentence (possibly multiple sentences).

PredArg

  • Add <predargs> below <sentence>, a set of <predarg>, each of which represents a predicate-argument structure: a predicate in the sentence and a predicted argument (possibly in another sentence).
  • Attributes on <predarg>:
  • @pred: id of the predicate. (some basic phrase in the case of KNP)
  • @arg: id of the argument. (some coreference, a set of mentions, in the case of KNP)
  • @deprel: Relation label on this pred-arg link.
  • @flag: is this general?
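
A minimal sketch of grouping <predarg> elements by predicate. The id values (`s0_bp1`, `c0`, `c1`) and the romanized relation labels (`ga`, `wo`) are hypothetical, as is the helper `group_by_predicate`. @flag is left out because its meaning is still open above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical ids and labels, for illustration only.
PREDARGS_XML = """
<predargs>
  <predarg pred="s0_bp1" arg="c0" deprel="ga"/>
  <predarg pred="s0_bp1" arg="c1" deprel="wo"/>
</predargs>
"""


def group_by_predicate(predargs):
    """Map each predicate id to a {deprel: argument id} dictionary."""
    frames = defaultdict(dict)
    for predarg in predargs.findall("predarg"):
        frames[predarg.get("pred")][predarg.get("deprel")] = predarg.get("arg")
    return dict(frames)


frames = group_by_predicate(ET.fromstring(PREDARGS_XML))
```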