Definition of Requirement
Hiroshi Noji edited this page Feb 8, 2018 · 28 revisions
`@attr` means that `attr` is an attribute rather than a tag.
Tag and attribute names (e.g., `@form` and `@characterOffsetBegin`) are determined by the following principles:
- Use the naming convention found in Universal Dependencies if available (e.g., `@form`, `@lemma`, and `@deprel`).
- Otherwise, follow Stanford CoreNLP if available (e.g., `@characterOffsetBegin`).
Structure of Requirement:
- There are two types of requirements: generic requirements and language-specific requirements.
- Generic requirements provide a common interface across languages. Language-specific or tool-specific annotations are handled by language-specific requirements.
- Requirements form a type hierarchy, and a requirement may have more than one parent. For example, TokenizeWithIPA, used for Japanese, has three parents: Tokenize, POS, and Lemma. This means that TokenizeWithIPA guarantees that all attributes given by Tokenize, POS, and Lemma (that is, `@form`, `@pos`, and `@lemma`) are provided, in addition to the annotations of TokenizeWithIPA itself.
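The parent relation can be pictured as a small directed graph in which a requirement covers itself and all of its ancestors. The following is a minimal Python sketch of that idea, not Jigg's actual implementation: the `PARENTS` table and the `ancestors` and `satisfies` helpers are hypothetical names, and only the TokenizeWithIPA entry is taken from the text above.

```python
# Hypothetical parent table; only the TokenizeWithIPA entry comes from the text.
PARENTS = {
    "Tokenize": [],
    "POS": [],
    "Lemma": [],
    "TokenizeWithIPA": ["Tokenize", "POS", "Lemma"],
}


def ancestors(req):
    """Return req together with all of its transitive parents."""
    seen = set()
    stack = [req]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def satisfies(provided, required):
    """True if every required requirement is covered by something provided."""
    covered = set()
    for req in provided:
        covered |= ancestors(req)
    return set(required) <= covered
```

Under these assumptions, an annotator that provides TokenizeWithIPA can feed any downstream annotator that requires Tokenize, POS, or Lemma, while one that provides only Tokenize cannot feed an annotator that requires POS.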
- Provide sentence boundaries.
- Assigned tags:
  - `<sentences>` in `<document>`
  - `<sentence id="*">` in `<sentences>`
- Segment a sentence into tokens.
- `<tokens>` is provided below `<sentence>`. Each `<tokens>` has several `<token>`s.
- Each `<token>` has the following attributes:
  - `@id`, `@form`, `@characterOffsetBegin`, and `@characterOffsetEnd`
- Example: `<token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>`
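As a rough illustration of how these attributes might be consumed, the sketch below reads `<token>` elements with Python's standard `xml.etree.ElementTree` and checks each `@form` against the source text. The sample text, the second token, and the `read_tokens` helper are made up for this example. It also assumes that offsets are half-open (`begin` inclusive, `end` exclusive) and index into the same string, which is consistent with the example above but is not stated explicitly.

```python
import xml.etree.ElementTree as ET

# Illustrative input; the text and the second token are invented for this sketch.
TEXT = "The dog"
XML = """
<sentence id="s0">
  <tokens>
    <token id="s0_0" form="The" characterOffsetBegin="0" characterOffsetEnd="3"/>
    <token id="s0_1" form="dog" characterOffsetBegin="4" characterOffsetEnd="7"/>
  </tokens>
</sentence>
"""


def read_tokens(sentence):
    """Return (id, form, begin, end) for each <token> under <tokens>."""
    return [
        (
            t.get("id"),
            t.get("form"),
            int(t.get("characterOffsetBegin")),
            int(t.get("characterOffsetEnd")),
        )
        for t in sentence.find("tokens").findall("token")
    ]


tokens = read_tokens(ET.fromstring(XML.strip()))

# Assumption: offsets are half-open and relative to TEXT.
for _id, form, begin, end in tokens:
    assert TEXT[begin:end] == form
```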
- Assign a part-of-speech tag to each token.
- `@pos` is provided on `<token>`.
- Assign a universal part-of-speech tag to each token.
- `@upos` is provided on `<token>`.
- Assign a lemma to each token.
- `@lemma` is provided on `<token>`.
- Assign universal syntactic features (http://universaldependencies.org/u/feat/index.html) to each token.
- `@feats` is provided on `<token>`.
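The page does not say how `@feats` is serialized. If it follows the Universal Dependencies convention for the FEATS column (`Name=Value` pairs joined by `|`, with `_` for an empty set), it could be unpacked as in the sketch below. Both that format and the `parse_feats` helper are assumptions made for illustration.

```python
def parse_feats(feats):
    """Split a UD-style feature string such as 'Case=Nom|Number=Plur' into a dict.

    Assumes the UD FEATS convention; '_' or an empty string means no features.
    """
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))
```

For instance, `parse_feats("Case=Nom|Number=Plur")` gives a dictionary with the keys `Case` and `Number`.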
- Constituent parse for a sentence.
- Add `<parse>` below `<sentence>`, which has several `<span>`s.
- `@root` in `<parse>` points to the id of the root `<span>`.
- Each `<span>` corresponds to one rule and has the following attributes:
  - `@symbol`: the nonterminal symbol (e.g., NP) governing the span.
  - `@children`: ids separated by spaces (e.g., `sp2 sp3`); each id points to another `<span>` or a `<token>`. This means that `<parse>` has no preterminal annotations, which are instead managed by (`@pos` of) `<token>`.
- Example:
  <sentence id="s0" characterOffsetBegin="0" characterOffsetEnd="8">
    dogs run
    <tokens>
      <token pos="NNS" characterOffsetEnd="4" characterOffsetBegin="0" id="t0" form="dogs"/>
      <token pos="VBN" characterOffsetEnd="8" characterOffsetBegin="5" id="t1" form="run"/>
    </tokens>
    <parse root="s0_berksp0">
      <span id="s0_berksp0" symbol="S" children="s0_berksp1 s0_berksp2"/> <!-- these point to other spans -->
      <span id="s0_berksp1" symbol="NP" children="t0"/> <!-- this points to the first token -->
      <span id="s0_berksp2" symbol="VP" children="t1"/>
    </parse>
  </sentence>
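Because `@children` may point to either a `<span>` or a `<token>`, rebuilding a tree means looking each id up in both tables. The sketch below turns the example above into a bracketed string, taking preterminals from each token's `@pos`. It uses Python's standard `xml.etree.ElementTree`; the `to_bracketed` helper is a hypothetical name, not part of Jigg.

```python
import xml.etree.ElementTree as ET

XML = """<sentence id="s0">
  <tokens>
    <token pos="NNS" id="t0" form="dogs"/>
    <token pos="VBN" id="t1" form="run"/>
  </tokens>
  <parse root="s0_berksp0">
    <span id="s0_berksp0" symbol="S" children="s0_berksp1 s0_berksp2"/>
    <span id="s0_berksp1" symbol="NP" children="t0"/>
    <span id="s0_berksp2" symbol="VP" children="t1"/>
  </parse>
</sentence>"""


def to_bracketed(sentence):
    """Render the <parse> of a <sentence> as a bracketed tree string."""
    tokens = {t.get("id"): t for t in sentence.iter("token")}
    spans = {s.get("id"): s for s in sentence.iter("span")}

    def render(node_id):
        # A child id refers to a token (leaf) or to another span (subtree).
        if node_id in tokens:
            tok = tokens[node_id]
            return "(%s %s)" % (tok.get("pos"), tok.get("form"))
        span = spans[node_id]
        children = " ".join(render(c) for c in span.get("children").split())
        return "(%s %s)" % (span.get("symbol"), children)

    return render(sentence.find("parse").get("root"))


tree = to_bracketed(ET.fromstring(XML))
print(tree)  # (S (NP (NNS dogs)) (VP (VBN run)))
```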
- Dependency parse for a sentence.
- Add `<dependencies>` below `<sentence>`, which has several `<dependency>`s.
- Each `<dependency>` has the following attributes:
  - `@head`: `@id` of the head token in one dependency link. If `@dependent` in this link is the root token, the special `ROOT` symbol is used (e.g., `<dependency head="ROOT" dependent="t0" ...>` means `t0` is the root token of the dependency tree).
  - `@dependent`: `@id` of the dependent token in one dependency link.
  - `@deprel`: label on the dependency link.
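A consumer will typically index the links by dependent and treat `head="ROOT"` as the marker for the root token. The sketch below does this with Python's standard `xml.etree.ElementTree`. The sample sentence, the labels `root` and `nsubj`, and the `head_map` and `root_tokens` helpers are illustrative assumptions rather than output from any particular parser.

```python
import xml.etree.ElementTree as ET

# Illustrative input; labels and token ids are invented for this sketch.
XML = """<sentence id="s0">
  <tokens>
    <token id="t0" form="dogs"/>
    <token id="t1" form="run"/>
  </tokens>
  <dependencies>
    <dependency head="ROOT" dependent="t1" deprel="root"/>
    <dependency head="t1" dependent="t0" deprel="nsubj"/>
  </dependencies>
</sentence>"""


def head_map(sentence):
    """Map each dependent token id to a (head id, deprel) pair."""
    return {
        d.get("dependent"): (d.get("head"), d.get("deprel"))
        for d in sentence.find("dependencies").findall("dependency")
    }


def root_tokens(sentence):
    """Return ids of tokens whose head is the special ROOT symbol."""
    return [dep for dep, (head, _rel) in head_map(sentence).items() if head == "ROOT"]


sentence = ET.fromstring(XML)
heads = head_map(sentence)
roots = root_tokens(sentence)
```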
- Parent = (Dependencies)
- Add a special `<dependencies>` that has the `@type="basic"` attribute.
- This is intended to replicate Stanford CoreNLP's basic dependencies, but it has a known problem: the requirement says nothing about the annotation scheme of the dependencies, which may be Stanford typed dependencies (SD) or Universal Dependencies (UD). We defer this issue because Stanford CoreNLP does not distinguish the annotation scheme either.
- Parent = (Dependencies)
- Add a special `<dependencies>` that has the `@type="collapsed"` attribute, which corresponds to the collapsed dependencies in the Stanford parser output.
- Parent = (Dependencies)
- Add a special `<dependencies>` that has the `@type="collapsed-ccprocessed"` attribute, which corresponds to the collapsed cc-processed dependencies in the Stanford parser output.
- Result of chunking or shallow parsing.
- Add `<chunks>` below `<sentence>`, which has several `<chunk>`s.
- Each `<chunk>` has a `@tokens` attribute, pointing to the span of the chunk (e.g., `@tokens="t0 t1"`).
- Add `<NEs>` below `<sentence>`, which has several `<NE>`s.
- Each `<NE>` is a named entity and has the following attributes:
  - `@tokens`: token ids
  - `@label`: ORGANIZATION, etc.
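To recover the surface string of an entity, `@tokens` has to be resolved against the sentence's `<token>` elements. The sketch below assumes that `@tokens` is a space-separated list of token ids, as in the `@tokens="t0 t1"` example given for chunks. The sample sentence and the `named_entities` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative input; the tokens and the entity are invented for this sketch.
XML = """<sentence id="s0">
  <tokens>
    <token id="t0" form="Stanford"/>
    <token id="t1" form="University"/>
    <token id="t2" form="opened"/>
  </tokens>
  <NEs>
    <NE tokens="t0 t1" label="ORGANIZATION"/>
  </NEs>
</sentence>"""


def named_entities(sentence):
    """Return (label, [token forms]) for each <NE> in the sentence."""
    forms = {t.get("id"): t.get("form") for t in sentence.iter("token")}
    return [
        # Assumption: @tokens is a space-separated list of token ids.
        (ne.get("label"), [forms[i] for i in ne.get("tokens").split()])
        for ne in sentence.find("NEs").findall("NE")
    ]


entities = named_entities(ET.fromstring(XML))
```

The same lookup applies to `<chunk>`, which uses `@tokens` in the same way.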
- Add `<coreferences>`, a set of `<coreference>` chains on a document, each of which consists of several mentions in sentences.
- Attributes on `<coreference>`:
  - `@mentions`: entity ids in some sentence (possibly multiple sentences).
- Add `<predargs>` below `<sentence>`, a set of `<predarg>`s, each of which represents a predicate-argument link between a predicate in the sentence and a predicted argument (possibly in another sentence).
- Attributes on `<predarg>`:
  - `@pred`: id of the predicate (some basic phrase in the case of KNP).
  - `@arg`: id of the argument (some coreference, i.e., a set of mentions, in the case of KNP).
  - `@deprel`: relation label on this pred-arg link.
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on the IPA dictionary.
- In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  - `@pos1`, `@pos2`, `@pos3`, `@cType`, `@cForm`, `@yomi`, `@pron`
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on the jumandic dictionary.
- In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  - `@pos1`, `@cType`, `@cForm`, `@yomi`, `@misc`
- Parent = (Tokenize, POS, Lemma)
- Guarantee that morphological analysis is performed based on the unidic dictionary.
- In addition to the attributes of Tokenize, POS, and Lemma, guarantee the following attributes on each token:
  - `@pos1`, `@pos2`, `@pos3`, `@cType`, `@cForm`, `@lForm`, `@orth`, `@pron`, `@orthBase`, `@pronBase`, `@goshu`, `@iType`, `@iForm`, `@fType`, `@fForm`
- Parent = (TokenizeWithJumandic)
- Guarantee that morphological analysis is performed by juman itself, not by another analyzer that uses jumandic (that case corresponds to TokenizeWithJumandic).
- This requirement is necessary because, for some reason, only the output of juman can be processed by KNP in a pipeline.
- In addition to the attributes of TokenizeWithJumandic, guarantee the following attributes on each token:
  - `@posId`, `@pos1Id`, `@cTypeId`, `@cFormId`
- `<token>` may have a child element `<tokenAlt>` (alternative analysis), which preserves ambiguities that juman does not resolve.
- Parent = (Chunk)
- In addition to the attributes of Chunk, guarantee the following attributes: `@head` and `@func`.
- Parent = (Chunk)
- In addition to the attributes of Chunk, guarantee the attribute `@misc`.
- Dependencies between chunks.
- Add `<dependencies unit="chunk">`
- Add `@deprel` for each `<dependency>` in `<dependencies unit="chunk">`.
- Output of KNP
- Output of KNP
- Dependencies between basic phrases.
- Add `<dependencies unit="basicPhrase">`
- `@pred` points to some basic phrase.
- `@arg` points to some coreference, a set of mentions.
- Additional attributes on `<predarg>`:
  - `@flag`: see CaseRelation
  - `@text`: see CaseRelation
- Add `<caseRelations>` below `<sentence>`, a set of `<caseRelation>`s.
- Attributes on `<caseRelation>`:
  - `@pred`: basic phrase
  - `@arg`: token (`unk` if unspecified)
  - `@deprel`: relation label (e.g., ガ or オ)
  - `@flag`: C, N, O, etc. (see http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP%2F%E6%A0%BC%E8%A7%A3%E6%9E%90%E7%B5%90%E6%9E%9C%E6%9B%B8%E5%BC%8F)
  - `@text`: 太郎, 不特定, -, etc. (text output in KNP)