An idea for enhancement: adding a new tab "Tokenization" or "Tokens". In this tab a user could upload a file in IOB2/BIO format and manually correct tokenization and tags. For example, a user could split or merge the tokens and modify the corresponding tags.
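For illustration, here is a made-up example of the kind of correction such a tab could support: a hyphenated term that the tokenizer over-splits gets merged back into one token, with the tags adjusted accordingly:

```text
# before correction        # after merging
COVID   B-DISEASE          COVID-19   B-DISEASE
-       I-DISEASE
19      I-DISEASE
```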
Motivation: the current tokenizer used by MedTator for BIO export has relatively low accuracy in many special cases. In those cases the exported BIO files cannot currently be used for training. Even if the tokenizer is replaced by another one (as mentioned in #7), it is unlikely to perform well for all languages and use cases. Therefore, an opportunity to manually fix the tokens and tags in BIO format would be very helpful for building a gold standard corpus.
Thank you so much for your suggestion! I think it's a very nice feature for improving the dataset quality!
My understanding is that your idea is about revising the content of IOB2/BIO-format files, such as splitting/merging/deleting tokens and tags, wherever there are errors in tokenization or tagging. If so, I think the most challenging part is locating the errors in a long IOB-format file. As you described, the errors can be caused by exceptional tokenization cases, and we don't know in advance where they are in the file, which is why manual correction is needed.
The editing, such as split/merge/delete tokens, can be done easily in any text editor (VSCode, Sublime Text, etc.).
I agree that a tool can be helpful in making both searching and editing easier. Could you share some cases related to fixing issues in IOB2/BIO format files? For example, how to locate the errors and what changes are made to the files. Then I think we can check how to improve this process.
I found one tool, neat, which may have similar functions for IOB2/BIO file editing. But I'm not sure if there is any other tool for reference. If you have any thoughts, please feel free to discuss :)
Indeed, neat looks like a good tool for tokenization correction. I'll test it.
With respect to locating errors and changes to the file: I tested MedTator on a text containing many special characters and then exported it to BIO format, and the special characters were treated as separate tokens. In such a case I would need to merge many tokens, for example.
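The merge operation described above could be sketched as follows. This is a minimal illustration (the function name and data layout are made up, not part of MedTator or neat): a BIO file is treated as a list of (token, tag) pairs, and a run of over-split rows is collapsed into a single token that keeps the tag of the first row in the run.

```python
def merge_bio_tokens(rows, start, end):
    """Merge rows[start:end] of a BIO table into one token.

    rows: list of (token, tag) pairs read from a BIO file.
    The merged token concatenates the tokens in the run and
    keeps the tag of the first row, so an over-split span like
    ("high", "B-DOSE"), ("-", "O"), ("dose", "O") collapses to
    ("high-dose", "B-DOSE").
    """
    merged_token = "".join(tok for tok, _ in rows[start:end])
    return rows[:start] + [(merged_token, rows[start][1])] + rows[end:]


rows = [
    ("Take", "O"),
    ("high", "B-DOSE"),
    ("-", "O"),
    ("dose", "O"),
    ("aspirin", "O"),
]
fixed = merge_bio_tokens(rows, 1, 4)
# fixed: [("Take", "O"), ("high-dose", "B-DOSE"), ("aspirin", "O")]
```

A split operation would be the inverse: replace one row with several rows, prompting the user for the tag of each new token.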