@fkoyer fkoyer commented Oct 8, 2025

Problems with old definitions:

  • Tries to match UTF-8 and Latin-1 characters in the same expression, e.g. <A> includes the byte sequence for "ã" in both Latin-1 (\xE3) and UTF-8 (\xC3\xA3). This seems like a good thing at first, but it can cause false positives when the text is UTF-8 and the pattern matches a Latin-1 byte: \xE3 is also the lead byte of unrelated three-byte UTF-8 sequences.
  • Contains redundant characters, e.g. \xE3 appears multiple times in <A>
  • Contains unnecessary characters, e.g. \xE3 also appears in <V> and <Y>
  • Patterns are case-insensitive, e.g. <I> is meant to match lowercase "l", but because the pattern is case-insensitive it also matches uppercase "L"
  • Some look-alike characters aren't matched, e.g. \xEA\x93\xAE = LISU LETTER A (U+A4EE)
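
The mixed-encoding and case-insensitivity problems can be reproduced with a minimal Python sketch. The pattern `OLD_A` is a hypothetical, simplified stand-in for the old <A> definition, not the actual rule text. It only assumes that the old tag listed the Latin-1 byte \xE3 next to the UTF-8 sequence \xC3\xA3 and was matched case-insensitively.

```python
import re

# Hypothetical, simplified stand-in for the old <A> tag: plain "A", the
# Latin-1 byte for "ã" (\xE3), and the UTF-8 sequence for "ã" (\xC3\xA3).
OLD_A = re.compile(rb"(?:A|\xE3|\xC3\xA3)", re.IGNORECASE)

# "あ" (U+3042) is unrelated to "A", but its UTF-8 encoding is \xE3\x81\x82,
# so the lone Latin-1 alternative \xE3 matches its first byte.
hiragana = "あ".encode("utf-8")
false_positive = OLD_A.search(hiragana) is not None

# Case-insensitive matching also accepts the other letter case.
matches_lowercase = OLD_A.search(b"a") is not None

print(false_positive, matches_lowercase)
```

Both checks succeed here, which is the kind of unintended match the list above describes.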

Changes:

  • All byte sequences are UTF-8 only (no Latin-1)
  • All patterns are case-sensitive
  • Removed redundant and unnecessary characters
  • Added additional look-alike characters
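
As a rough illustration of these changes, the sketch below builds a UTF-8-only, case-sensitive pattern. `NEW_A` is a hypothetical name and lists only two members (plain "A" and LISU LETTER A); the definitions in this pull request are not reproduced here.

```python
import re

# Hypothetical UTF-8-only, case-sensitive tag for uppercase "A" look-alikes.
# No re.IGNORECASE flag and no single-byte Latin-1 alternatives.
NEW_A = re.compile(rb"(?:A|\xEA\x93\xAE)")  # "A", LISU LETTER A (U+A4EE)

# Confirm the byte sequence quoted for U+A4EE.
lisu_a = "\uA4EE".encode("utf-8")
assert lisu_a == b"\xea\x93\xae"

matches_lisu = NEW_A.search(lisu_a) is not None
matches_lowercase = NEW_A.search(b"a") is not None
matches_hiragana = NEW_A.search("あ".encode("utf-8")) is not None

print(matches_lisu, matches_lowercase, matches_hiragana)
```

Under these assumptions the look-alike is matched, while lowercase "a" and an unrelated three-byte sequence starting with \xE3 are not.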

fkoyer commented Oct 8, 2025

Note: these definitions are based on work I did on Text::ASCII::Convert

@Fneufneu

You did a great job, but I don't think ISO-8859 can be removed. The documentation says:

When using rules with extended characters / diacritics, you should always use both ISO-8859-1 / UTF-8 encodings.
Body content can be different depending on normalize_charset setting.
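
The concern can be sketched in Python, assuming a rule is evaluated against body bytes that are still in Latin-1 rather than converted to UTF-8. `UTF8_ONLY` is a hypothetical rule fragment used only for illustration.

```python
import re

text = "ã"
latin1_body = text.encode("latin-1")  # single byte \xE3
utf8_body = text.encode("utf-8")      # two bytes \xC3\xA3

# Hypothetical rule fragment that lists only the UTF-8 sequence for "ã".
UTF8_ONLY = re.compile(rb"\xC3\xA3")

hits_utf8 = UTF8_ONLY.search(utf8_body) is not None
hits_latin1 = UTF8_ONLY.search(latin1_body) is not None

print(hits_utf8, hits_latin1)
```

In this sketch the UTF-8-only pattern finds the character in the UTF-8 body but misses it in the Latin-1 body.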
