Definition of Requirement
Hiroshi Noji edited this page Feb 28, 2016 · 28 revisions
@attr means attr is an attribute rather than a tag.
Tag and attribute names, e.g., @form and @characterOffsetBegin, are determined by the following principles:
- Use the naming convention found in Universal Dependencies if available (e.g., @form, @lemma, and @deprel).
- Otherwise, follow StanfordCoreNLP if available (e.g., @characterOffsetBegin).
Structure of Requirement:
- There are two types of requirements: generic requirements and language-specific requirements.
- Generic requirements provide the common interface across languages. Language-specific or tool-specific annotation can be handled by language-specific requirements.
- Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used in Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes given by Tokenize, POS, and Lemma (that is, @form, @pos, and @lemma) are provided, in addition to the annotations of TokenizeWithIPA itself.
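
The hierarchy described above can be sketched as a small data model. This is only an illustration: the `Requirement` class and its `guaranteed_attrs` and `satisfies` methods are hypothetical names, not the actual implementation, and the attribute sets are copied from the lists on this page.

```python
# Hypothetical model of the requirement hierarchy; class and method names
# are illustrative and not taken from the actual implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    name: str
    attrs: frozenset = frozenset()  # attributes this requirement adds itself
    parents: tuple = ()             # zero or more parent requirements

    def guaranteed_attrs(self) -> frozenset:
        """Own attributes plus everything guaranteed by all ancestors."""
        result = set(self.attrs)
        for parent in self.parents:
            result |= parent.guaranteed_attrs()
        return frozenset(result)

    def satisfies(self, other: "Requirement") -> bool:
        """True if self is `other` or has `other` among its ancestors."""
        return self == other or any(p.satisfies(other) for p in self.parents)


TOKENIZE = Requirement(
    "Tokenize",
    frozenset({"id", "form", "characterOffsetBegin", "characterOffsetEnd"}),
)
POS = Requirement("POS", frozenset({"pos"}))
LEMMA = Requirement("Lemma", frozenset({"lemma"}))
TOKENIZE_WITH_IPA = Requirement(
    "TokenizeWithIPA",
    frozenset({"pos1", "pos2", "pos3", "cType", "cForm", "yomi", "pron"}),
    parents=(TOKENIZE, POS, LEMMA),
)

# A component that needs POS can accept an annotator providing TokenizeWithIPA.
print(TOKENIZE_WITH_IPA.satisfies(POS))
print(sorted(TOKENIZE_WITH_IPA.guaranteed_attrs()))
```

Under this sketch, a language-specific requirement can stand in wherever one of its generic ancestors is requested, which is the role generic requirements play as a common interface.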
- Provide sentence boundaries.
- Assigned tags:
  - <sentences> in <document>
  - <sentence id="*"> in <sentences>
- Segment a sentence into tokens.
- <tokens> is provided below <sentence>. Each <tokens> has several <token> elements.
- Each <token> has the following attributes:
  - @id, @form, @characterOffsetBegin, and @characterOffsetEnd
  - Example: <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3">
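
As a rough illustration of the token attributes above, the following sketch checks that every `<token>` carries the four attributes and that its offsets select its form from the text. It assumes that offsets are half-open and index into the given text, and that @form equals the surface string; neither is stated on this page. The second token, the sample text, and the `check_tokens` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated document; only the first token comes from this page.
XML = """
<document>
  <sentences>
    <sentence id="s0">
      <tokens>
        <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
        <token id="s0_1" form="cat" characterOffsetBegin="4" characterOffsetEnd="7"/>
      </tokens>
    </sentence>
  </sentences>
</document>
"""
TEXT = "The cat"

REQUIRED = ("id", "form", "characterOffsetBegin", "characterOffsetEnd")


def check_tokens(root, text):
    """Return a list of human-readable problems found in <token> elements."""
    problems = []
    for token in root.iter("token"):
        missing = [a for a in REQUIRED if a not in token.attrib]
        if missing:
            problems.append(f"{token.get('id')}: missing {missing}")
            continue
        begin = int(token.get("characterOffsetBegin"))
        end = int(token.get("characterOffsetEnd"))
        # Assumption: half-open offsets into `text`, and form == surface string.
        if text[begin:end] != token.get("form"):
            problems.append(f"{token.get('id')}: offsets do not match form")
    return problems


print(check_tokens(ET.fromstring(XML.strip()), TEXT))
```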
- Assign a part-of-speech tag to each token.
- @pos is provided on <token>.
- Assign a lemma to each token.
- @lemma is provided on <token>.
- Constituent parse for a sentence.
- <parse> below <sentence> has several <span> elements.
- Each <span> corresponds to one rule and has the following attributes:
  - @type, @symbol, and @children
- Dependency parse for a sentence.
- <dependencies> below <sentence> has several <dependency> elements.
- Each <dependency> has the following attributes:
  - @head, @dependent, and @deprel
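
To make the dependency attributes concrete, here is a minimal sketch that reads `<dependency>` elements and resolves them to word forms. It assumes that @head and @dependent hold token ids and that the root is marked with a placeholder head value (`ROOT` here); both are assumptions for illustration, as are the sample sentence and the `dependency_edges` helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; the "ROOT" head value is an assumption.
SENTENCE = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The"/>
    <token id="s0_1" form="cat"/>
    <token id="s0_2" form="sleeps"/>
  </tokens>
  <dependencies>
    <dependency head="s0_1" dependent="s0_0" deprel="det"/>
    <dependency head="s0_2" dependent="s0_1" deprel="nsubj"/>
    <dependency head="ROOT" dependent="s0_2" deprel="root"/>
  </dependencies>
</sentence>
"""


def dependency_edges(sentence):
    """Return (head form, deprel, dependent form) triples in document order."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    edges = []
    for dep in sentence.iter("dependency"):
        head = dep.get("head")
        # An unknown head id (such as the assumed ROOT marker) is kept as-is.
        edges.append((forms.get(head, head), dep.get("deprel"), forms[dep.get("dependent")]))
    return edges


for edge in dependency_edges(ET.fromstring(SENTENCE.strip())):
    print(edge)
```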
- Result of chunking or shallow parsing.
- <chunks> below <sentence> has several <chunk> elements.
- Each <chunk> has a @tokens attribute.
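
A chunk can be mapped back to its tokens roughly as follows. The sketch assumes that @tokens is a space-separated list of token ids, which this page does not specify; the sample sentence and the `chunk_forms` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; the space-separated @tokens format is an assumption.
SENTENCE = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The"/>
    <token id="s0_1" form="cat"/>
    <token id="s0_2" form="sleeps"/>
  </tokens>
  <chunks>
    <chunk tokens="s0_0 s0_1"/>
    <chunk tokens="s0_2"/>
  </chunks>
</sentence>
"""


def chunk_forms(sentence):
    """Return, for each <chunk>, the list of forms of the tokens it covers."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    return [
        [forms[token_id] for token_id in chunk.get("tokens").split()]
        for chunk in sentence.iter("chunk")
    ]


print(chunk_forms(ET.fromstring(SENTENCE.strip())))
```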
- Named entity recognition
- This may be language-specific?
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on IPA dictionary.
- In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
  - @pos1, @pos2, @pos3, @cType, @cForm, @yomi, @pron
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on jumandic dictionary.
- In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
  - @pos1, @cType, @cForm, @yomi, @misc
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on unidic dictionary.
- In addition to the attributes by Tokenize, POS, Lemma, guarantee the following attributes on each token:
  - @pos1, @pos2, @pos3, @cType, @cForm, @lForm, @orth, @pron, @orthBase, @pronBase, @goshu, @iType, @iForm, @fType, @fForm
- Parent = (TokenizeWithJumandic)
- Guarantee that morphological analysis is performed by juman itself (not by other analyzers using jumandic, which are covered by TokenizeWithJumandic).
- This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
- In addition to the attributes by TokenizeWithJumandic, guarantee the following attributes on each token:
  - @posId, @pos1Id, @cTypeId, @cFormId
- <token> may have a child element <tokenAlt> (alternative analysis), which preserves ambiguities that juman does not resolve.
- Parent = (Chunk)
- In addition to the attributes by Chunk, guarantee the following attributes: @head and @func.
- Parent = (Chunk)
- In addition to the attributes by Chunk, guarantee the attribute @misc.
- Dependencies between chunks.
- Output of KNP
- Output of KNP
- Add <coreferences>, a set of <coreference> chains over a document, each of which consists of several mentions in sentences.
- Attributes on <coreference>:
  - @mentions: entity ids in some sentence (possibly multiple sentences).
- Add <predargs> below <sentence>, a set of <predarg> elements, each of which represents a predicate-argument structure between a predicate in the sentence and a predicted argument (possibly in another sentence).
- Attributes on <predarg>:
  - @pred: id of the predicate (some basic phrase in the case of KNP).
  - @arg: id of the argument (some coreference, i.e., a set of mentions, in the case of KNP).
  - @deprel: relation label on this pred-arg link.
  - @flag: is this general?