comprehensive french tokenizer without exceptions list #13378
+125
−16,097
The current French tokenizer doesn't handle hyphens and apostrophes very well. It relies on a gigantic list (15600 entries) of hyphenated words that must not be split on the hyphen. This list is not only huge (full of village names such as Minaucourt-le-Mesnil-lès-Hurlus or Beaujeu-Saint-Vallier-Pierrejux-et-Quitteur), but also very incomplete.

The list has no chance of ever becoming exhaustive, because the number of French common nouns and proper names that contain a hyphen and must not be split by the tokenizer is virtually infinite. In French the hyphen is called trait d'union (literally "union line"): it unifies, joining separate words into one semantic word (and one token). For example, the verb porter (to carry) produces nouns such as porte-clé (a thing used to carry keys) and porte-manteau, and any new word can be coined this way, with porter or with any other word. On top of that, there is inclusive language (relecteur-rice-s). And of course there are names of people and places, which often contain hyphens, combining existing names or words into new and larger names.

On the other hand, there are cases where a hyphen must split a substring into two words. These cases are easily handled with a simple regex because, unlike the unbounded set of exceptions, they are not very diverse: a) verb-subject inversion where the subject is a pronoun; b) verb-object forms where the object is a pronoun. Together they amount to 21 words (suffixes).

This pull request replaces the tokenizer exceptions with a new 're_infixes' function that easily handles each of the 15600 exceptions, and many more. It reverses the rule-exception relation: rule = keep words containing a hyphen as one token; exception = split a word containing a hyphen if the hyphen is followed by one of the registered words (a pronominalized subject or object).
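The reversed rule can be illustrated with a minimal, standalone sketch. It is not the PR's 're_infixes' implementation: the function name `split_hyphenated`, the `CLITICS` list (an illustrative subset, not the PR's 21 suffixes), and the choice to keep the hyphen attached to the pronoun are all assumptions made for this example.

```python
import re
from typing import List

# Hypothetical, illustrative subset of pronouns that may follow a verb
# after a hyphen. This is NOT the PR's exact list of 21 suffixes.
CLITICS = [
    "je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles",
    "ce", "le", "la", "les", "lui", "leur", "moi", "toi", "y", "en",
]
_ALT = "|".join(CLITICS)

# Zero-width split point: just before a hyphen, but only if everything from
# that hyphen to the end of the word is a chain of hyphen + pronoun.
# Requiring the chain to reach the end keeps names such as
# "Minaucourt-le-Mesnil-lès-Hurlus" intact even though they contain "-le-".
SPLIT_RE = re.compile(rf"(?=-(?:{_ALT})(?:-(?:{_ALT}))*$)", re.IGNORECASE)


def split_hyphenated(word: str) -> List[str]:
    """Default: keep the word whole. Exception: split off trailing pronouns."""
    # re.split accepts zero-width matches on Python 3.7+; drop empty pieces.
    return [piece for piece in SPLIT_RE.split(word) if piece]


if __name__ == "__main__":
    for w in ["porte-clé", "dit-il", "donne-le-moi",
              "Minaucourt-le-Mesnil-lès-Hurlus", "relecteur-rice-s"]:
        print(w, "->", split_hyphenated(w))
```

Under these assumptions, compounds and place names stay whole while inverted forms such as dit-il or donne-le-moi are split before each pronoun. A purely suffix-based rule like this one cannot tell the noun rendez-vous from the imperative; how the PR treats such cases is not stated in the description above.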