[Feature] Add Ner Suffix feature #1123
base: v0.x
Codecov Report
@@ Coverage Diff @@
## master #1123 +/- ##
==========================================
- Coverage 88.34% 88.21% -0.13%
==========================================
Files 66 66
Lines 6290 6290
==========================================
- Hits 5557 5549 -8
- Misses 733 741 +8
Job PR-1123/1 is complete.
@@ -77,6 +77,8 @@ def parse_args():
                             help='Learning rate for optimization')
     arg_parser.add_argument('--warmup-ratio', type=float, default=0.1,
                             help='Warmup ratio for learning rate scheduling')
+    arg_parser.add_argument('--tagging-first-token', type=str2bool, default=True,
How about parser.add_argument('--tag-last-token', action='store_true')? It seems simpler to call finetune_bert.py --tag-last-token than finetune_bert.py --tagging-first-token=False.
In either case, please update the test case in scripts/tests/ to invoke finetune_bert.py with both options. You can parametrize the test following, for example, Haibin's recent PR: https://github.com/dmlc/gluon-nlp/pull/1121/files#diff-fa82d34d543ff657c2fe09553bd0fa34R234
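A minimal sketch of the flag suggested above, using only the standard argparse module. The flag name comes from the review comment; the build_parser and tagging_first_token helpers are hypothetical and are not part of finetune_bert.py.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the flag name is taken from the review."""
    parser = argparse.ArgumentParser(description='NER fine-tuning options (sketch)')
    # store_true defaults the flag to False and sets it to True when present,
    # so no string-to-bool conversion helper is needed.
    parser.add_argument('--tag-last-token', action='store_true',
                        help='Tag the last sub-word piece of each word '
                             'instead of the first')
    return parser


def tagging_first_token(args):
    # Derive the dataset keyword used in this PR from the boolean flag.
    return not args.tag_last_token


default_args = build_parser().parse_args([])
last_args = build_parser().parse_args(['--tag-last-token'])
print(tagging_first_token(default_args), tagging_first_token(last_args))
```

With this shape, a test could cover both modes by parametrizing over an empty argument list and ['--tag-last-token'], for instance with pytest.mark.parametrize.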
Sure, I will update it.
Have you found any performance differences?
@sxjscience I tried the scripts' default parameters on the CoNLL-2003 dataset. Performance with the suffix feature is slightly lower than with the prefix feature.
I think we can try the following:
One problem is that, since we are using self-attention, the model can tailor the attention weights to cover the first, last, and average cases. Thus, I don't think selecting the first or last token will affect performance much.
@sxjscience For classification tasks, I think it does not matter. But in sequence labeling, each word has one label. If we break a word 'w' into several subwords [sw1, sw2, ...], then only sw1 carries the label, and the labels of the others are set to NULL. I don't think that makes sense.
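To make the labeling scheme under discussion concrete, here is a small sketch of how word-level tags might be spread over sub-word pieces in the two modes. The align_tags helper, the value of NULL_TAG, and the example pieces are illustrative assumptions, not the code in this PR.

```python
NULL_TAG = 'X'  # assumption: placeholder tag for pieces that carry no label


def align_tags(word_pieces, word_tags, tagging_first_token=True):
    """Assign each word's tag to its first or last piece, NULL_TAG elsewhere.

    word_pieces: one non-empty list of sub-word strings per word.
    word_tags:   one tag per word.
    """
    piece_tags = []
    for pieces, tag in zip(word_pieces, word_tags):
        tags = [NULL_TAG] * len(pieces)
        labeled_index = 0 if tagging_first_token else len(pieces) - 1
        tags[labeled_index] = tag
        piece_tags.extend(tags)
    return piece_tags


# Hypothetical tokenization of "Jim Henson was a puppeteer".
pieces = [['Jim'], ['Hen', '##son'], ['was'], ['a'], ['puppet', '##eer']]
tags = ['B-PER', 'I-PER', 'O', 'O', 'O']

first = align_tags(pieces, tags, tagging_first_token=True)
last = align_tags(pieces, tags, tagging_first_token=False)
print(first)
print(last)
```

In both modes exactly one piece per word is labeled; the option only changes which piece that is.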
@@ -81,7 +85,7 @@ def main(config):
                          train_config.dropout_prob)

     dataset = BERTTaggingDataset(text_vocab, None, None, config.test_path,
-                                 config.seq_len, train_config.cased, tag_vocab=tag_vocab)
+                                 config.seq_len, train_config.cased, tag_vocab=tag_vocab,tagging_first_token=config.tagging_first_token)
Please add a space after the comma.
Because we are using attention, the state bound to sw1 is related to the other sub-words. The same holds for sw_n. A reasonable approach is to mask the loss for the other sub-word tokens and use only the state of the first sub-word as the contextualized word embedding.
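The masking idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code in this PR: masked_cross_entropy is a hypothetical helper, logits are given as per-position lists of scores, and mask is 1 for the labeled piece of each word and 0 for the rest.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean negative log-likelihood over positions where mask is 1."""
    total, count = 0.0, 0
    for scores, label, keep in zip(logits, labels, mask):
        if not keep:
            # Unlabeled sub-word pieces contribute nothing to the loss.
            continue
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[label]
        count += 1
    # Guard against an all-zero mask.
    return total / max(count, 1)


# Three positions, two classes; the middle position is masked out,
# so its badly wrong prediction does not affect the result.
logits = [[0.0, 0.0], [10.0, -10.0], [0.0, 0.0]]
labels = [0, 1, 1]
mask = [1, 0, 1]
loss = masked_cross_entropy(logits, labels, mask)
print(loss)
```

The masked positions still take part in attention, so their information can reach the labeled position's state; they are only excluded from the loss.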
            entries.append(PredictedToken(text=text,
                                          true_tag=true_tag, pred_tag=pred_tag))
            tmptext = ''
        else:
Can both cases be merged here? For example, if len(tmptext) == 0, you can still have text = tmptext + token_text, which is equivalent to token_text.
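One way the merged branch could look, as a sketch only. The surrounding loop is not shown in the diff, so merge_pieces, the namedtuple stand-in for PredictedToken, and the NULL_TAG value are assumptions; the example assumes last-piece tagging and omits any stripping of WordPiece '##' prefixes.

```python
from collections import namedtuple

# Stand-in for the PredictedToken type used in the script (assumed fields).
PredictedToken = namedtuple('PredictedToken', ['text', 'true_tag', 'pred_tag'])
NULL_TAG = 'X'  # assumption: placeholder tag for unlabeled pieces


def merge_pieces(pieces):
    """Rebuild word-level entries from (token_text, true_tag, pred_tag) pieces."""
    entries = []
    partial_text = ''
    for token_text, true_tag, pred_tag in pieces:
        if true_tag == NULL_TAG:
            # Unlabeled piece: buffer its text until the labeled piece arrives.
            partial_text += token_text
            continue
        # Single code path: partial_text is '' for one-piece words, so the
        # concatenation reduces to token_text and no separate branch is needed.
        entries.append(PredictedToken(text=partial_text + token_text,
                                      true_tag=true_tag, pred_tag=pred_tag))
        partial_text = ''
    return entries


entries = merge_pieces([
    ('Jim', 'B-PER', 'B-PER'),
    ('Hen', NULL_TAG, NULL_TAG),
    ('son', 'I-PER', 'I-PER'),
])
print([entry.text for entry in entries])
```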
                                          true_tag=true_tag, pred_tag=pred_tag))

        if true_tag == NULL_TAG:
            tmptext += token_text
Better to name it tmp_text. Or what about partial_text?
@sxjscience I agree, and I'll try this method.
I am confused about this part. Why is masking the loss of the other sub-word tokens reasonable? For example, in NER tasks the suffix is much more important than the prefix in words like ...
Since we are using attention, the higher-level state associated with ...
@sxjscience Do you think we should continue with this pull request?
Description
Add a parameter "tagging_first_token" that selects whether the first or the last piece of each word is tagged. The first piece captures the word's prefix feature, and the last piece captures its suffix feature.