
Definition of Requirement

Hiroshi Noji edited this page Feb 23, 2016 · 28 revisions

@attr means attr is an attribute rather than a tag.

Tag and attribute names (e.g., @form and @characterOffsetBegin) are determined by the following principles:

  • Use the naming convention found in Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
  • Otherwise, follow Stanford CoreNLP if available (e.g., @characterOffsetBegin).

Structure of Requirement:

  • There are two types of requirements: generic requirements and language-specific requirements.
  • Generic requirements provide the common interface across languages. Language-specific or tool-specific annotation can be handled by language-specific requirements.
  • Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes given by Tokenize, POS, and Lemma (e.g., @form, @postag, and @lemma) are provided, in addition to the annotations of TokenizeWithIPA itself.
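
The inheritance rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not Jigg's actual implementation: the `PARENTS` and `OWN_ATTRS` tables and both function names are hypothetical, and the attribute lists are copied from the sections of this page.

```python
# Hypothetical model of the requirement hierarchy (not Jigg's real classes).
PARENTS = {
    "Tokenize": [],
    "POS": [],
    "Lemma": [],
    "TokenizeWithIPA": ["Tokenize", "POS", "Lemma"],
}

# Attributes that each requirement contributes by itself, as listed on this page.
OWN_ATTRS = {
    "Tokenize": ["@id", "@form", "@characterOffsetBegin", "@characterOffsetEnd"],
    "POS": ["@postag"],
    "Lemma": ["@lemma"],
    "TokenizeWithIPA": ["@postag1", "@postag2", "@postag3",
                        "@conjType", "@conjForm", "@yomi", "@pron"],
}


def guaranteed_attrs(req):
    """Collect the attributes guaranteed by `req` and all of its ancestors."""
    attrs = []
    for parent in PARENTS[req]:
        for attr in guaranteed_attrs(parent):
            if attr not in attrs:
                attrs.append(attr)
    for attr in OWN_ATTRS[req]:
        if attr not in attrs:
            attrs.append(attr)
    return attrs


def satisfies(provided, required):
    """True if `provided` is `required` itself or one of its descendants."""
    return provided == required or any(
        satisfies(parent, required) for parent in PARENTS[provided]
    )


print(guaranteed_attrs("TokenizeWithIPA"))
print(satisfies("TokenizeWithIPA", "POS"))
```

Under this model, an annotator that requires POS would accept the output of one that provides TokenizeWithIPA, but not the other way around.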

Generic requirements (universal across languages)

Sentence

  • Provide sentence boundaries.
  • Assigned tags:
  • <sentences> in <document>
  • <sentence id="*"> in <sentences>

Tokenize

  • Segment a sentence into tokens.
  • A <tokens> element is provided on each <sentence>. Each <tokens> element contains several <token> elements.
  • Each <token> has the following attributes:
  • @id, @form, @characterOffsetBegin, and @characterOffsetEnd
  • Example:
  • <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
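
As a rough illustration of how these attributes fit together, the sketch below builds a <tokens> element with Python's standard `xml.etree.ElementTree`. The whitespace tokenizer and the function name `tokenize_to_xml` are hypothetical. The id scheme (sentence id, underscore, token index), the end-exclusive offsets, and the offsets being counted from the start of the sentence are assumptions inferred from the single example above, not documented behavior.

```python
import xml.etree.ElementTree as ET


def tokenize_to_xml(sentence_id, text):
    """Split `text` on whitespace and return a <tokens> element (toy tokenizer)."""
    tokens = ET.Element("tokens")
    pos = 0
    for i, form in enumerate(text.split()):
        begin = text.index(form, pos)  # locate the token in the original text
        end = begin + len(form)        # assumed to be end-exclusive
        ET.SubElement(tokens, "token", {
            "id": f"{sentence_id}_{i}",  # assumed id scheme, e.g. "s0_0"
            "form": form,
            "characterOffsetBegin": str(begin),
            "characterOffsetEnd": str(end),
        })
        pos = end
    return tokens


tokens = tokenize_to_xml("s0", "The cat sat")
for token in tokens:
    print(token.attrib)
```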

POS

  • Assign a part-of-speech tag to each token.
  • @postag is provided on a <token>.

Lemma

  • Assign a lemma to each token.
  • @lemma is provided on a <token>.

Parse

  • Constituent parse for a sentence.

Dependencies

  • Dependency parse for a sentence.
  • A <dependencies> element on a <sentence> contains several <dependency> elements.
  • Each <dependency> has the following attributes:
  • @head, @dependent, and @deprel
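
The following sketch reads such a structure with Python's standard `xml.etree.ElementTree`. The XML sample is made up for illustration. In particular, this page does not say what values @head and @dependent hold, so the use of token ids and of a `ROOT` head value are assumptions, as is the helper name `read_dependencies`.

```python
import xml.etree.ElementTree as ET

# Made-up sample: token ids as @head/@dependent and "ROOT" are assumptions.
SAMPLE = """
<sentence id="s0">
  <dependencies>
    <dependency head="s0_1" dependent="s0_0" deprel="det"/>
    <dependency head="s0_2" dependent="s0_1" deprel="nsubj"/>
    <dependency head="ROOT" dependent="s0_2" deprel="root"/>
  </dependencies>
</sentence>
"""


def read_dependencies(sentence):
    """Map each dependent id to its (head id, deprel) pair."""
    return {
        dep.get("dependent"): (dep.get("head"), dep.get("deprel"))
        for dep in sentence.find("dependencies").iter("dependency")
    }


sentence = ET.fromstring(SAMPLE.strip())
arcs = read_dependencies(sentence)
for dependent, (head, deprel) in arcs.items():
    print(dependent, "<-", head, deprel)
```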

NER

  • Named entity recognition
  • This may be language-specific?

Language-specific requirements

Japanese

TokenizeWithIPA

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the IPA dictionary.
  • In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @postag1, @postag2, @postag3, @conjType, @conjForm, @yomi, @pron
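
A simple way to check whether a token meets this requirement is to look for every guaranteed attribute, including those inherited from the parents. The sketch below does this with Python's standard `xml.etree.ElementTree`. The helper `missing_attrs` is hypothetical and the attribute values in the sample token are illustrative only.

```python
import xml.etree.ElementTree as ET

# Attribute names from this page, written without the leading "@".
TOKENIZE_WITH_IPA_ATTRS = [
    "id", "form", "characterOffsetBegin", "characterOffsetEnd",  # Tokenize
    "postag",                                                     # POS
    "lemma",                                                      # Lemma
    "postag1", "postag2", "postag3",                              # TokenizeWithIPA
    "conjType", "conjForm", "yomi", "pron",
]


def missing_attrs(token, required=TOKENIZE_WITH_IPA_ATTRS):
    """Return the required attribute names that `token` does not carry."""
    return [name for name in required if token.get(name) is None]


# Sample token that only satisfies Tokenize, POS, and Lemma.
token = ET.fromstring(
    '<token id="s0_0" form="猫" characterOffsetBegin="0" '
    'characterOffsetEnd="1" postag="名詞" lemma="猫"/>'
)
print(missing_attrs(token))
# → ['postag1', 'postag2', 'postag3', 'conjType', 'conjForm', 'yomi', 'pron']
```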

TokenizeWithJumandic

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the jumandic dictionary.
  • In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @postag1, @conjType, @conjForm, @yomi, @misc

TokenizeWithUnidic

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed based on the unidic dictionary.
  • In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @postag1, @postag2, @postag3, @conjType, @conjForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm

TokenizeByJuman

  • Parent = (Tokenize, POS, Lemma)
  • Guarantee that morphological analysis is performed by juman itself (not by other analyzers that use jumandic, which correspond to TokenizeWithJumandic).
  • This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
  • In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  • @postag1, @conjType, @conjForm, @postagId, @postag1Id, @conjTypeId, @conjFormId, @reading, @misc
  • <token> may have a child element <tokenAlt>, which preserves the ambiguities that juman does not resolve.