Definition of Requirement


Hiroshi Noji edited this page Feb 28, 2016 · 28 revisions

@attr means attr is an attribute rather than a tag.

Tag and attribute names (e.g., @form and @characterOffsetBegin) are chosen according to the following rules:

Use the naming convention found in Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
Otherwise, follow StanfordCoreNLP if available (e.g., @characterOffsetBegin).

Structure of Requirement:

There are two types of requirements: generic requirements and language-specific requirements.
Generic requirements provide the common interface across languages. Language-specific or tool-specific annotation can be handled by language-specific requirements.
Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees all attributes given by Tokenize, POS, and Lemma (including @form, @pos, and @lemma), in addition to the annotations of TokenizeWithIPA itself.
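The parent relation can be read as attribute inheritance. The following is a minimal sketch in Python, not the actual implementation: the `Requirement` class and its `guaranteed_attrs` method are hypothetical names introduced for illustration, and only attributes listed on this page are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """Hypothetical model: a requirement has its own attributes and parents."""
    name: str
    own_attrs: frozenset = frozenset()
    parents: tuple = ()

    def guaranteed_attrs(self) -> frozenset:
        """Attributes guaranteed by this requirement and all of its ancestors."""
        attrs = set(self.own_attrs)
        for parent in self.parents:
            attrs |= parent.guaranteed_attrs()
        return frozenset(attrs)


# Generic requirements, with the attributes listed on this page.
Tokenize = Requirement(
    "Tokenize",
    frozenset({"@id", "@form", "@characterOffsetBegin", "@characterOffsetEnd"}),
)
POS = Requirement("POS", frozenset({"@pos"}))
Lemma = Requirement("Lemma", frozenset({"@lemma"}))

# A language-specific requirement with three parents.
TokenizeWithIPA = Requirement(
    "TokenizeWithIPA",
    frozenset({"@pos1", "@pos2", "@pos3", "@cType", "@cForm", "@yomi", "@pron"}),
    parents=(Tokenize, POS, Lemma),
)

print(sorted(TokenizeWithIPA.guaranteed_attrs()))
```

Under this reading, an annotator that needs only POS can accept the output of any annotator satisfying TokenizeWithIPA, because @pos is among its guaranteed attributes.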

Generic requirements (universal across languages)

Ssplit

Provide sentence boundaries.
Assigned tags:
<sentences> in <document>
<sentence id="*"> in <sentences>

Tokenize

Segment a sentence into tokens.
<tokens> are provided below <sentence>. Each <tokens> has several <token>.
Each <token> has the following attributes:
@id, @form, @characterOffsetBegin, and @characterOffsetEnd
Example:
<token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3">
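As an illustration, the sketch below reads the token attributes with Python's standard `xml.etree.ElementTree`. The sample document, the second token, and the `read_tokens` helper are made up for this example. It also assumes that offsets are half-open character positions into the original text, which is consistent with the example above ("The" spans 0 to 3) but is not stated on this page.

```python
import xml.etree.ElementTree as ET

# Original text, kept separately here only to check the offsets (assumption).
TEXT = "The cat"

XML = """
<document>
  <sentences>
    <sentence id="s0">
      <tokens>
        <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
        <token id="s0_1" form="cat" characterOffsetBegin="4" characterOffsetEnd="7"/>
      </tokens>
    </sentence>
  </sentences>
</document>
"""


def read_tokens(xml_text):
    """Return (id, form, begin, end) for every <token> in the document."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for token in root.iter("token"):
        begin = int(token.get("characterOffsetBegin"))
        end = int(token.get("characterOffsetEnd"))
        tokens.append((token.get("id"), token.get("form"), begin, end))
    return tokens


tokens = read_tokens(XML)
for token_id, form, begin, end in tokens:
    # Under the half-open assumption, the slice reproduces @form.
    print(token_id, form, TEXT[begin:end] == form)
```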

POS

Assign a part-of-speech tag to each token.
@pos is provided on <token>.

Lemma

Assign a lemma to each token.
@lemma is provided on <token>.

Parse

Constituent parse for a sentence.
<parse> below <sentence> has several <span> elements.
Each <span> corresponds to one rule, having the following attributes:
@type, @symbol, and @children

Dependencies

Dependency parse for a sentence.
<dependencies> below <sentence> has several <dependency>.
Each <dependency> has the following attributes:
@head, @dependent, and @deprel
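A reader of this annotation might collect the head of each token as follows. This is only a sketch: it assumes that @head and @dependent hold token ids and that the root of the sentence is marked with the head value "ROOT", neither of which is specified on this page. The sample sentence and the `read_heads` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The"/>
    <token id="s0_1" form="cat"/>
    <token id="s0_2" form="sleeps"/>
  </tokens>
  <dependencies>
    <dependency head="s0_1" dependent="s0_0" deprel="det"/>
    <dependency head="s0_2" dependent="s0_1" deprel="nsubj"/>
    <dependency head="ROOT" dependent="s0_2" deprel="root"/>
  </dependencies>
</sentence>
"""


def read_heads(sentence_xml):
    """Map each dependent id to a (head id, relation label) pair."""
    sentence = ET.fromstring(sentence_xml.strip())
    heads = {}
    for dep in sentence.iter("dependency"):
        heads[dep.get("dependent")] = (dep.get("head"), dep.get("deprel"))
    return heads


heads = read_heads(SENTENCE_XML)
for dependent, (head, deprel) in heads.items():
    print(f"{deprel}({head}, {dependent})")
```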

Chunk

Result of chunking or shallow parsing.
<chunks> below <sentence> has several <chunk>.
Each <chunk> has @tokens attribute.
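For example, a chunk can be resolved to the surface forms of its tokens. The sketch below assumes that @tokens is a space-separated list of token ids, which this page does not specify. The sample sentence and the `chunk_forms` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The"/>
    <token id="s0_1" form="cat"/>
    <token id="s0_2" form="sleeps"/>
  </tokens>
  <chunks>
    <chunk tokens="s0_0 s0_1"/>
    <chunk tokens="s0_2"/>
  </chunks>
</sentence>
"""


def chunk_forms(sentence_xml):
    """Return, for each <chunk>, the list of @form values of its tokens."""
    sentence = ET.fromstring(sentence_xml.strip())
    form_by_id = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    result = []
    for chunk in sentence.iter("chunk"):
        # Assumption: @tokens is a whitespace-separated list of token ids.
        token_ids = chunk.get("tokens").split()
        result.append([form_by_id[token_id] for token_id in token_ids])
    return result


chunks = chunk_forms(SENTENCE_XML)
print(chunks)
```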

NER

Named entity recognition
This may be language-specific?

Language-specific requirements

Japanese

TokenizeWithIPA

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on the IPA dictionary.
In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
@pos1, @pos2, @pos3, @cType, @cForm, @yomi, @pron

TokenizeWithJumandic

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on the jumandic dictionary.
In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
@pos1, @cType, @cForm, @yomi, @misc

TokenizeWithUnidic

Parent = (Tokenize, POS, Lemma)
Guarantee that morphological analysis is performed based on the unidic dictionary.
In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
@pos1, @pos2, @pos3, @cType, @cForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm

Juman

Parent = (TokenizeWithJumandic)
Guarantee that morphological analysis is performed by juman itself (not by another analyzer that uses jumandic, which is covered by TokenizeWithJumandic).
This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
In addition to the attributes by TokenizeWithJumandic, guarantee the following attributes on each token:
@posId, @pos1Id, @cTypeId, @cFormId
<token> may have a child element <tokenAlt> (alternative analysis), which preserves ambiguities that juman does not resolve.

CabochaChunk

Parent = (Chunk)
In addition to attributes by Chunk, guarantee the following attributes: @head and @func.

KNPChunk

Parent = (Chunk)
In addition to attributes by Chunk, guarantee the attribute @misc.

ChunkDependencies

Dependencies between chunks.

BasicPhrase

Output of KNP

BasicPhraseDependencies

Output of KNP

Coreference

Add <coreferences>, a set of <coreference> chains on a document, each of which consists of several mentions on sentences.
Attributes on <coreference>:
@mentions: entity ids in some sentence (possibly multiple sentences).

PredArg

Add <predargs> below <sentence>, a set of <predarg> elements, each of which represents a predicate-argument structure: a predicate in the sentence and a predicted argument (possibly in another sentence).
Attributes on <predarg>:
@pred: id of the predicate. (some basic phrase in the case of KNP)
@arg: id of the argument. (some coreference, a set of mentions, in the case of KNP)
@deprel: Relation label on this pred-arg link.
@flag: is this general?