Non-breaking spaces #57

Open
birkett83 opened this issue Feb 21, 2017 · 0 comments
birkett83 (Contributor) commented Feb 21, 2017

I recently received a spam message containing a non-breaking space (U+00A0, encoded as =C2=A0 in quoted-printable UTF-8, if that is relevant). When I run pyzor predigest on it, the non-breaking space is kept in the predigest output. I have no idea whether spammers actually do this, but they could randomly replace ordinary spaces with non-breaking spaces before sending each mail, producing a different fingerprint every time and evading detection.

I believe that simply changing

    ws_ptrn = re.compile(r'\s')

to

    ws_ptrn = re.compile(r'\s', flags=re.UNICODE)

would address this (and cover all the other Unicode space characters as well), but at the cost of breaking compatibility with signatures from older versions of pyzor, since any message containing such characters would then produce a different digest.
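To illustrate the difference, here is a minimal sketch rather than pyzor's actual code. It assumes the pattern is used to delete whitespace before hashing; `strip_ws` and the sample strings are hypothetical. Because Python 3 `str` patterns are already Unicode-aware, `re.ASCII` stands in for the old Python 2 default behaviour of a bare `r'\s'`.

```python
import quopri
import re

# Assumption: re.ASCII mimics the Python 2 default, where a bare r'\s'
# matches only ASCII whitespace. On Python 3, str patterns are
# Unicode-aware unless re.ASCII is given.
ascii_ws = re.compile(r'\s', flags=re.ASCII)
unicode_ws = re.compile(r'\s', flags=re.UNICODE)


def strip_ws(pattern, text):
    """Hypothetical helper: drop every character the pattern sees as whitespace."""
    return pattern.sub('', text)


# =C2=A0 is the quoted-printable form of the UTF-8 bytes for U+00A0.
nbsp = quopri.decodestring(b'=C2=A0').decode('utf-8')

plain = 'buy cheap pills now'
evasive = 'buy' + nbsp + 'cheap pills' + nbsp + 'now'  # two spaces swapped

# ASCII-only matching leaves the non-breaking spaces in place,
# so the two variants normalise to different strings.
ascii_plain = strip_ws(ascii_ws, plain)
ascii_evasive = strip_ws(ascii_ws, evasive)

# Unicode-aware matching removes them, so both variants collapse
# to the same normalised text.
uni_plain = strip_ws(unicode_ws, plain)
uni_evasive = strip_ws(unicode_ws, evasive)

print(ascii_plain == ascii_evasive)  # False
print(uni_plain == uni_evasive)      # True
```

If the normalised text feeds a hash, the first comparison is why the two variants would get different fingerprints, and the second is why the proposed flag would make them match.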
