Definition of Requirement
Hiroshi Noji edited this page Feb 23, 2016 · 28 revisions
`@attr` means that attr is an attribute rather than a tag.
Tag and attribute names, e.g., `@form` and `@characterOffsetBegin`, are determined by the following principles:
- Use the naming convention found in Universal Dependencies if available (e.g., `@form`, `@lemma`, and `@deprel`).
- Otherwise, follow StanfordCoreNLP if available (e.g., `@characterOffsetBegin`).
Structure of Requirement:
- There are two types of requirements: generic requirements and language-specific requirements.
- Generic requirements provide the common interface across languages. Language-specific or tool-specific annotations are handled by language-specific requirements.
- Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes given by Tokenize, POS, and Lemma (that is, `@form`, `@postag`, and `@lemma`) are provided, in addition to the annotations of TokenizeWithIPA itself.
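The parent relation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual implementation: the `Requirement` class and the `guaranteed_attributes` function are hypothetical names, and the attribute sets are limited to those named on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """Hypothetical model of a requirement (not the project's real class)."""
    name: str
    attributes: frozenset = frozenset()  # attributes this requirement adds itself
    parents: tuple = ()                  # zero or more parent requirements


def guaranteed_attributes(req):
    """Collect the attributes guaranteed by req and by all of its ancestors."""
    attrs = set(req.attributes)
    for parent in req.parents:
        attrs |= guaranteed_attributes(parent)
    return attrs


# Generic requirements with the token attributes listed on this page.
TOKENIZE = Requirement(
    "Tokenize",
    frozenset({"id", "form", "characterOffsetBegin", "characterOffsetEnd"}),
)
POS = Requirement("POS", frozenset({"postag"}))
LEMMA = Requirement("Lemma", frozenset({"lemma"}))

# Language-specific requirement with three parents.
TOKENIZE_WITH_IPA = Requirement(
    "TokenizeWithIPA",
    frozenset({"postag1", "postag2", "postag3",
               "conjType", "conjForm", "yomi", "pron"}),
    (TOKENIZE, POS, LEMMA),
)

ipa_attrs = guaranteed_attributes(TOKENIZE_WITH_IPA)
```

Here `ipa_attrs` contains the attributes inherited from Tokenize, POS, and Lemma together with those that TokenizeWithIPA adds itself.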
- Provide sentence boundaries.
- Assigned tags:
  - `<sentences>` in `<document>`
  - `<sentence id="*">` in `<sentences>`
- Segment a sentence into tokens.
- `<tokens>` is provided on `<sentence>`. Each `<tokens>` has several `<token>` elements.
- Each `<token>` has the following attributes:
  - `@id`, `@form`, `@characterOffsetBegin`, and `@characterOffsetEnd`
  - Example: `<token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3">`
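As a rough illustration of the token attributes above, the following Python sketch checks that each `<token>` carries the four required attributes and that its offsets recover `@form` from the input text. The sample sentence, the `check_tokens` helper, and the reading of offsets as zero-based positions into the original text with an exclusive end are assumptions; only the first token is taken from the example above.

```python
import xml.etree.ElementTree as ET

TEXT = "The cat sleeps."

# Hypothetical annotation; only the first token comes from the page's example.
SENTENCE_XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
    <token id="s0_1" form="cat" characterOffsetBegin="4" characterOffsetEnd="7"/>
    <token id="s0_2" form="sleeps" characterOffsetBegin="8" characterOffsetEnd="14"/>
    <token id="s0_3" form="." characterOffsetBegin="14" characterOffsetEnd="15"/>
  </tokens>
</sentence>
"""

REQUIRED = ("id", "form", "characterOffsetBegin", "characterOffsetEnd")


def check_tokens(sentence, text):
    """Return the token forms after validating attributes and offsets."""
    forms = []
    for token in sentence.iter("token"):
        missing = [name for name in REQUIRED if name not in token.attrib]
        if missing:
            raise ValueError(f"token lacks attributes: {missing}")
        begin = int(token.get("characterOffsetBegin"))
        end = int(token.get("characterOffsetEnd"))
        # Assumption: offsets index into the original text, end exclusive.
        if text[begin:end] != token.get("form"):
            raise ValueError(f"offsets do not match form for {token.get('id')}")
        forms.append(token.get("form"))
    return forms


forms = check_tokens(ET.fromstring(SENTENCE_XML), TEXT)
```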
- Assign a part-of-speech tag to each token.
- `@postag` is provided on a `<token>`.
- Assign a lemma to each token.
- `@lemma` is provided on a `<token>`.
- Constituent parse for a sentence.
- Dependency parse for a sentence.
- `<dependencies>` on `<sentence>` has several `<dependency>` elements.
- Each `<dependency>` has the following attributes:
  - `@head`, `@dependent`, and `@deprel`
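The dependency annotation above can be read with a short Python sketch. This page does not specify what values `@head` and `@dependent` take, so the use of token ids here is an assumption, as are the sample relations and the `read_dependencies` helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation: head and dependent are assumed to hold token ids.
SENTENCE_XML = """
<sentence id="s0">
  <dependencies>
    <dependency head="s0_1" dependent="s0_0" deprel="det"/>
    <dependency head="s0_2" dependent="s0_1" deprel="nsubj"/>
  </dependencies>
</sentence>
"""


def read_dependencies(sentence):
    """Return (head, dependent, deprel) triples from a <sentence> element."""
    triples = []
    for dep in sentence.iter("dependency"):
        triples.append((dep.get("head"), dep.get("dependent"), dep.get("deprel")))
    return triples


triples = read_dependencies(ET.fromstring(SENTENCE_XML))
```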
- Named entity recognition
- This may be language-specific?
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on the IPA dictionary.
- In addition to the attributes given by Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  - `@postag1`, `@postag2`, `@postag3`, `@conjType`, `@conjForm`, `@yomi`, `@pron`
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on the jumandic dictionary.
- A child Requirement of Tokenize, which guarantees that the following attributes are given on each token:
  - `@postag1`, `@conjType`, `@conjForm`, `@yomi`, `@misc`
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on the unidic dictionary.
- A child Requirement of Tokenize, which guarantees that the following attributes are given on each token:
  - `@postag1`, `@postag2`, `@postag3`, `@conjType`, `@conjForm`, `@lForm`, `@orth`, `@pron`, `@orthBase`, `@pronBase`, `@goshu`, `@iType`, `@iForm`, `@fType`, `@fForm`
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed by juman itself (not by other analyzers using jumandic, which correspond to TokenizeWithJumandic).
- This requirement is necessary because, for some reasons, only the output of juman can be processed by KNP in a pipeline.
- A child Requirement of Tokenize, which guarantees that the following attributes are given on each token:
  - `@postag1`, `@conjType`, `@conjForm`, `@postagId`, `@postag1Id`, `@conjTypeId`, `@conjFormId`, `@reading`, `@misc`
- `<token>` may have a child element `<tokenAlt>`, which preserves the ambiguities that juman does not resolve.
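To illustrate how a consumer might read these unresolved alternatives, here is a minimal Python sketch. The attributes carried by `<tokenAlt>` are not listed on this page, so the sketch assumes that it mirrors the attributes of `<token>`; the sample values and the `analyses` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical example: <tokenAlt> is assumed to carry the same attributes
# as <token>; this page does not specify them.
TOKEN_XML = """
<token id="s0_0" form="かえる" postag="動詞">
  <tokenAlt form="かえる" postag="名詞"/>
</token>
"""


def analyses(token):
    """Return the postag of the chosen analysis, then those of each alternative."""
    chosen = [token.get("postag")]
    alternatives = [alt.get("postag") for alt in token.findall("tokenAlt")]
    return chosen + alternatives


postags = analyses(ET.fromstring(TOKEN_XML))
```

The first element of `postags` is the analysis chosen on the `<token>` itself, and any further elements come from its `<tokenAlt>` children.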