Improve A-Z replace_tag definitions #19

fkoyer · 2025-10-08T01:10:15Z

Problems with old definitions:

Tries to match UTF-8 and Latin-1 characters in the same expression, e.g. <A> includes the byte sequence for "ã" in both Latin-1 (\xE3) and UTF-8 (\xC3\xA3). This seems like a good thing at first, but it can cause false positives when the text is UTF-8 and the pattern is looking for a Latin-1 byte.
Contains redundant characters, e.g. \xE3 appears multiple times in <A>.
Contains unnecessary characters, e.g. \xE3 also appears in <V> and <Y>.
Patterns are case-insensitive, e.g. <I> attempts to match lowercase L, but because it's case-insensitive it also matches uppercase L.
Some look-alike characters aren't matched, e.g. \xEA\x93\xAE = LISU LETTER A (U+A4EE).
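
To make the first and fourth problems concrete, here is a minimal Python sketch. The patterns `OLD_A` and `OLD_I` are hypothetical stand-ins written for illustration, not the actual old definitions. The sketch relies on one fact: the byte \xE3 is also the lead byte of many three-byte UTF-8 sequences, for example HIRAGANA LETTER A (U+3042), which encodes as \xE3\x81\x82.

```python
import re

# Hypothetical stand-in for an old-style <A>: plain "a", the Latin-1 byte for
# "ã" (\xE3), and the UTF-8 sequence for "ã" (\xC3\xA3) in one expression.
OLD_A = re.compile(rb"(?:a|\xE3|\xC3\xA3)")

# Hypothetical stand-in for an old-style <I> that targets lowercase L and the
# digit 1, compiled case-insensitively.
OLD_I = re.compile(rb"(?:l|1)", re.IGNORECASE)


def hits(pattern, text):
    """Return True if the byte pattern matches anywhere in text encoded as UTF-8."""
    return pattern.search(text.encode("utf-8")) is not None


# U+3042 contains no "a" and no "ã", yet its lead byte \xE3 satisfies the
# Latin-1 alternative, so the pattern reports a match.
print(hits(OLD_A, "\u3042"))

# The case-insensitive flag makes the lowercase-L alternative match uppercase L.
print(hits(OLD_I, "L"))
```

Both calls report a match even though neither input contains a character the pattern was meant to catch.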

Changes:

All byte sequences are UTF-8 only (no Latin-1)
All patterns are case-sensitive
Removed redundant and unnecessary characters
Added additional look-alike characters
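
As a rough illustration of what these changes imply, the sketch below builds a UTF-8-only, case-sensitive pattern in Python. `NEW_A` is a hypothetical, heavily reduced example and not the definition from this pull request; it contains only plain "A" and the LISU LETTER A sequence mentioned above.

```python
import re
import unicodedata

# Hypothetical, reduced <A>: uppercase "A" plus the UTF-8 bytes of
# LISU LETTER A (U+A4EE). No Latin-1 bytes and no case-insensitive flag.
NEW_A = re.compile(rb"(?:A|\xEA\x93\xAE)")

lisu_a = "\uA4EE"

# Confirm the byte sequence and the character name quoted in the description.
print(lisu_a.encode("utf-8") == b"\xea\x93\xae")
print(unicodedata.name(lisu_a))

# The look-alike is matched; lowercase "a" and unrelated UTF-8 text are not.
print(NEW_A.search(lisu_a.encode("utf-8")) is not None)
print(NEW_A.search(b"a") is not None)
print(NEW_A.search("\u3042".encode("utf-8")) is not None)
```

Because every alternative is a complete UTF-8 sequence, a match cannot be triggered by a single byte that happens to be part of some other multi-byte character.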

fkoyer · 2025-10-08T01:22:35Z

Note: these definitions are based on work I did on Text::ASCII::Convert

Fneufneu · 2025-10-17T12:21:06Z

You did a great job, but I think removing ISO-8859 can't be done. The documentation says:

When using rules with extended characters / diacritics, you should always use both ISO-8859-1 / UTF-8 encodings.
Body content can be different depending on normalize_charset setting.
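
The concern can be sketched in Python as follows. The two patterns are hypothetical and only model the byte-level effect; which bytes a rule actually sees depends on the message charset and the normalize_charset setting, which this sketch does not reproduce.

```python
import re

# Hypothetical patterns for the word "São": one UTF-8 only, one covering
# both the Latin-1 byte (\xE3) and the UTF-8 sequence (\xC3\xA3) for "ã".
UTF8_ONLY = re.compile(rb"S\xC3\xA3o")
BOTH = re.compile(rb"S(?:\xE3|\xC3\xA3)o")

word = "S\u00e3o"
as_utf8 = word.encode("utf-8")      # b"S\xc3\xa3o"
as_latin1 = word.encode("latin-1")  # b"S\xe3o"

# The UTF-8-only pattern matches the UTF-8 body but misses the Latin-1 body.
print(UTF8_ONLY.search(as_utf8) is not None)
print(UTF8_ONLY.search(as_latin1) is not None)

# The dual-encoding pattern matches both bodies.
print(BOTH.search(as_utf8) is not None)
print(BOTH.search(as_latin1) is not None)
```

If a body can reach the rule as raw Latin-1 bytes, a UTF-8-only definition would silently stop matching it, which is the trade-off against the false positives described at the top of this pull request.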

Improved A-Z replace_tags

8d25220

fkoyer mentioned this pull request Oct 8, 2025

fix FUZZY_SECURITY with french words #18

Closed
