
Words with ' are split on tokenization step #1

Open
marlon-br opened this issue Nov 19, 2021 · 2 comments

Comments

@marlon-br

Hello, I have tested French model and in general it works great.

One issue for me is the tokenization step. Words containing ' are split in two, so l'empire turns into l' and empire, and c'était turns into c' and était. Is that expected behavior, and is there a way to join such words back into one (other than just checking for ')?

Thanks!
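A minimal sketch of the workaround mentioned above (checking for a trailing apostrophe and gluing the split pieces back together; the function name is hypothetical):

```python
def merge_apostrophe_tokens(tokens):
    """Merge a token ending in an apostrophe (e.g. "l'", "c'")
    with the token that follows it, so ["l'", "empire"]
    becomes ["l'empire"]."""
    merged = []
    for tok in tokens:
        # Handle both the ASCII apostrophe and the typographic one.
        if merged and merged[-1].endswith(("'", "\u2019")):
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged
```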

@benob
Owner

benob commented Nov 19, 2021

We lack a tokenizer that preserves offsets into the source text, which would let us insert punctuation without altering the text. Currently, a set of rules is applied for detokenization, and they don't remove the space after single quotes.

For now, you can apply your own rewriting rules as preprocessing. We hope to be able to do better in the future.
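One possible rewriting rule of that kind, as a sketch (the regex is an assumption, not the project's actual detokenization rule set): remove the stray space that the detokenizer leaves after an elision apostrophe.

```python
import re

def fix_apostrophe_spacing(text):
    """Collapse the space left after an elision apostrophe,
    e.g. "l' empire" -> "l'empire", "c' était" -> "c'était".
    Matches a single (possibly accented) letter followed by an
    apostrophe, then whitespace; quotation marks that follow a
    longer word are left untouched."""
    return re.sub(r"\b([A-Za-zÀ-ÿ][’'])\s+", r"\1", text)
```

Note this is deliberately narrow (one letter before the apostrophe) so that closing quotes after full words are not accidentally glued to the next word.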

@marlon-br
Author

Sure, thanks for the quick answer.

benob pushed a commit that referenced this issue Dec 7, 2022