Definition of Requirement


Hiroshi Noji edited this page Feb 23, 2016 · 28 revisions

@attr means attr is an attribute rather than a tag.

Tag and attribute names (e.g., @form and @characterOffsetBegin) are determined by the following principles:

Use the naming convention found in Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
Otherwise, follow StanfordCoreNLP if available (e.g., @characterOffsetBegin).

Structure of Requirement:

There are two types of requirements: generic requirements and language-specific requirements.
Generic requirements provide the common interface across languages. Language-specific or tool-specific annotation can be handled by language-specific requirements.
Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes given by Tokenize, POS, and Lemma (that is, @form, @postag, and @lemma) are provided, in addition to the annotations of TokenizeWithIPA itself.
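The parent relation described above can be sketched as a small data model. This is only an illustration under stated assumptions: the Requirement class and its satisfies and all_attributes methods are hypothetical names, not the project's actual API, and the attribute sets are copied from the lists on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """Hypothetical model of a requirement with zero or more parents."""
    name: str
    attributes: frozenset = frozenset()  # attributes guaranteed by this requirement itself
    parents: tuple = ()                  # parent Requirement objects

    def all_attributes(self):
        """Return own attributes plus everything inherited from all ancestors."""
        result = set(self.attributes)
        for parent in self.parents:
            result |= parent.all_attributes()
        return result

    def satisfies(self, other):
        """True if this requirement is `other` or has `other` among its ancestors."""
        return self == other or any(p.satisfies(other) for p in self.parents)


# Generic requirements; attribute names are taken from this page.
TOKENIZE = Requirement(
    "Tokenize",
    frozenset({"@id", "@form", "@characterOffsetBegin", "@characterOffsetEnd"}),
)
POS = Requirement("POS", frozenset({"@postag"}))
LEMMA = Requirement("Lemma", frozenset({"@lemma"}))

# Language-specific requirement with three parents.
TOKENIZE_WITH_IPA = Requirement(
    "TokenizeWithIPA",
    frozenset({"@postag1", "@postag2", "@postag3",
               "@conjType", "@conjForm", "@yomi", "@pron"}),
    parents=(TOKENIZE, POS, LEMMA),
)

if __name__ == "__main__":
    print(TOKENIZE_WITH_IPA.satisfies(POS))
    print(sorted(TOKENIZE_WITH_IPA.all_attributes()))
```

With this model, an annotator that needs only POS would accept a pipeline that provides TokenizeWithIPA, but not the other way around.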

Generic requirements (universal across languages)

Sentence

Provide sentence boundaries.
Assigned tags:
<sentences> in <document>
<sentence id="*"> in <sentences>

Tokenize

Segment a sentence into tokens.
A <tokens> element is provided on each <sentence>, and each <tokens> contains several <token> elements.
Each <token> has the following attributes:
@id, @form, @characterOffsetBegin, and @characterOffsetEnd
Example:
<token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
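As a rough illustration of how a consumer might check the Tokenize guarantee, the following sketch parses a sentence fragment and verifies the four attributes. The second token, the sentence id "s0", and the helper names are assumptions made for this example. It also assumes that offsets index into the original text and that @characterOffsetEnd is exclusive, which is what the "The" example (0 to 3) suggests.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the first token comes from the example above.
SAMPLE = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
    <token id="s0_1" form="cat" characterOffsetBegin="4" characterOffsetEnd="7"/>
  </tokens>
</sentence>
"""

TOKENIZE_ATTRS = ("id", "form", "characterOffsetBegin", "characterOffsetEnd")


def missing_token_attrs(sentence_xml):
    """Return (token id, missing attribute names) for every incomplete token."""
    sentence = ET.fromstring(sentence_xml.strip())
    problems = []
    for token in sentence.iter("token"):
        missing = [a for a in TOKENIZE_ATTRS if a not in token.attrib]
        if missing:
            problems.append((token.get("id"), missing))
    return problems


def forms_from_offsets(text, sentence_xml):
    """Slice the original text with each token's offsets (end assumed exclusive)."""
    sentence = ET.fromstring(sentence_xml.strip())
    return [
        text[int(t.attrib["characterOffsetBegin"]):int(t.attrib["characterOffsetEnd"])]
        for t in sentence.iter("token")
    ]


if __name__ == "__main__":
    print(missing_token_attrs(SAMPLE))
    print(forms_from_offsets("The cat", SAMPLE))
```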

POS

Assign a part-of-speech tag to each token.
@postag is provided on a <token>.

Lemma

Assign a lemma to each token.
@lemma is provided on a <token>.

Parse

Constituent parse for a sentence.

Dependencies

Dependency parse for a sentence.
<dependencies> on <sentence> has several <dependency>.
Each <dependency> has the following attributes:
@head, @dependent, and @deprel
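A minimal sketch of reading this structure follows. The page does not state what the values of @head and @dependent look like, so the example assumes they are token ids. The sample data, the deprel labels, and the function name are illustrative only, and the representation of the root is deliberately left out because it is not specified here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: @head and @dependent are assumed to hold token ids.
SAMPLE = """
<sentence id="s0">
  <dependencies>
    <dependency head="s0_1" dependent="s0_0" deprel="det"/>
    <dependency head="s0_2" dependent="s0_1" deprel="nsubj"/>
  </dependencies>
</sentence>
"""


def children_by_head(sentence_xml):
    """Group (dependent, deprel) pairs under the id of their head."""
    sentence = ET.fromstring(sentence_xml.strip())
    children = defaultdict(list)
    for dep in sentence.iter("dependency"):
        children[dep.attrib["head"]].append(
            (dep.attrib["dependent"], dep.attrib["deprel"])
        )
    return dict(children)


if __name__ == "__main__":
    for head, deps in children_by_head(SAMPLE).items():
        print(head, deps)
```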

NER

Named entity recognition
This may be language-specific?

Language-specific requirements

Japanese

TokenizeWithIPA

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on IPA dictionary.
In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@postag1, @postag2, @postag3, @conjType, @conjForm, @yomi, @pron

TokenizeWithJumandic

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on jumandic dictionary.
In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@postag1, @conjType, @conjForm, @yomi, @misc

TokenizeWithUnidic

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on unidic dictionary.
In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@postag1, @postag2, @postag3, @conjType, @conjForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm

TokenizeByJuman

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed by juman itself (not by another analyzer that uses jumandic, which corresponds to TokenizeWithJumandic).
This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
@postag1, @conjType, @conjForm, @postagId, @postag1Id, @conjTypeId, @conjFormId, @reading, @misc
<token> may have a child element <tokenAlt>, which preserves the ambiguities that juman does not resolve.
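The attribute lists for the Japanese requirements can be collected into a lookup table and checked against a token, as in the sketch below. The table contents are copied from this page. The function names and the placeholder token (every attribute set to "_") are hypothetical and exist only to make the example self-contained. The <tokenAlt> element is not modeled because its attributes are not described here.

```python
import xml.etree.ElementTree as ET

# Attributes inherited from the parents Tokenize, POS, and Lemma.
BASE_ATTRS = {"id", "form", "characterOffsetBegin", "characterOffsetEnd",
              "postag", "lemma"}

# Additional attributes guaranteed by each Japanese requirement.
EXTRA_ATTRS = {
    "TokenizeWithIPA": {"postag1", "postag2", "postag3", "conjType",
                        "conjForm", "yomi", "pron"},
    "TokenizeWithJumandic": {"postag1", "conjType", "conjForm", "yomi", "misc"},
    "TokenizeWithUnidic": {"postag1", "postag2", "postag3", "conjType",
                           "conjForm", "lForm", "orth", "pron", "orthBase",
                           "pronBase", "goshu", "iType", "iForm", "fType",
                           "fForm"},
    "TokenizeByJuman": {"postag1", "conjType", "conjForm", "postagId",
                        "postag1Id", "conjTypeId", "conjFormId", "reading",
                        "misc"},
}


def missing_attrs(token, requirement):
    """Return the sorted attribute names that `requirement` needs but `token` lacks."""
    required = BASE_ATTRS | EXTRA_ATTRS[requirement]
    return sorted(required - set(token.attrib))


def placeholder_token(requirement):
    """Build a <token> carrying every attribute of `requirement`, with dummy values."""
    attrs = {name: "_" for name in BASE_ATTRS | EXTRA_ATTRS[requirement]}
    return ET.Element("token", attrs)


if __name__ == "__main__":
    token = placeholder_token("TokenizeWithJumandic")
    print(missing_attrs(token, "TokenizeWithJumandic"))
    print(missing_attrs(token, "TokenizeWithIPA"))
```

Note that attribute coverage alone cannot distinguish TokenizeByJuman from other jumandic-based output in general, since that requirement is about which analyzer produced the tokens.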