
Words with ' are split on tokenization step #1

Open
marlon-br opened this issue Nov 19, 2021 · 2 comments

Comments

@marlon-br

Hello, I have tested French model and in general it works great.

One issue for me is the tokenization step. Words containing ' are split in two, so l'empire turns into l' and empire, and c'était turns into c' and était. Is that expected behavior, and is there a way to join such words back into one (other than just checking for ')?

Thanks!
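A minimal sketch of the workaround mentioned above (checking for a trailing apostrophe and gluing the split pieces back together; the function name is hypothetical):

```python
def merge_apostrophe_tokens(tokens):
    """Merge a token ending in an apostrophe (e.g. "l'", "c'")
    with the token that follows it, so ["l'", "empire"]
    becomes ["l'empire"]."""
    merged = []
    for tok in tokens:
        # Handle both the ASCII apostrophe and the typographic one.
        if merged and merged[-1].endswith(("'", "\u2019")):
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged
```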

@benob
Owner

benob commented Nov 19, 2021

We lack a tokenizer that preserves offsets into the source text, which would let us insert punctuation without altering the text. Currently, a set of rules is applied for detokenization, and they don't remove the space after single quotes.

For now, you can apply your own rewriting rules as preprocessing. We hope to be able to do better in the future.
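One possible rewriting rule of that kind, as a sketch (the regex is an assumption, not the project's actual detokenization rule set): remove the stray space that the detokenizer leaves after an elision apostrophe.

```python
import re

def fix_apostrophe_spacing(text):
    """Collapse the space left after an elision apostrophe,
    e.g. "l' empire" -> "l'empire", "c' était" -> "c'était".
    Matches a single (possibly accented) letter followed by an
    apostrophe, then whitespace; quotation marks that follow a
    longer word are left untouched."""
    return re.sub(r"\b([A-Za-zÀ-ÿ][’'])\s+", r"\1", text)
```

Note this is deliberately narrow (one letter before the apostrophe) so that closing quotes after full words are not accidentally glued to the next word.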

@marlon-br
Author

Sure, thanks for the quick answer.

benob pushed a commit that referenced this issue Dec 7, 2022