Non-breaking spaces #57

birkett83 · 2017-02-21T08:02:10Z

I recently received a spam message containing a non-breaking space (encoded as =C2=A0 in quoted-printable UTF-8, if that is relevant). When running pyzor predigest, the non-breaking space is kept in the predigest output. I have no idea whether spammers actually do this, but they could randomly replace spaces with non-breaking spaces before sending mail, generating a different fingerprint each time and evading detection.

I believe that simply changing

    ws_ptrn = re.compile(r'\s')

to

    ws_ptrn = re.compile(r'\s', flags=re.UNICODE)

would address this (and cover all the other Unicode space characters), but at the cost of breaking compatibility with signatures from older versions of pyzor.
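To illustrate the difference, here is a minimal sketch and not pyzor's actual code. It assumes the message body has already been decoded to a Unicode string, and the sample text and variable names are made up for the example. Because Python 3 `str` patterns are Unicode-aware by default, `re.ASCII` is used to emulate the ASCII-only matching that a plain `r'\s'` gives on Python 2.

```python
import quopri
import re

# re.ASCII emulates the ASCII-only behaviour of a plain r'\s' on Python 2;
# on Python 3, str patterns already match Unicode whitespace by default.
ascii_ws = re.compile(r'\s', flags=re.ASCII)
unicode_ws = re.compile(r'\s', flags=re.UNICODE)

# Hypothetical message bodies: identical except that one space is sent
# as a quoted-printable non-breaking space (=C2=A0).
plain = "Buy now and save"
variant = quopri.decodestring(b"Buy=C2=A0now and save").decode("utf-8")

# ASCII-only matching leaves U+00A0 in place, so the two bodies
# normalise to different strings (and would hash differently).
ascii_plain = ascii_ws.sub('', plain)
ascii_variant = ascii_ws.sub('', variant)

# Unicode-aware matching strips U+00A0 as well, so both bodies
# normalise to the same string.
unicode_plain = unicode_ws.sub('', plain)
unicode_variant = unicode_ws.sub('', variant)

print(ascii_plain == ascii_variant)      # False
print(unicode_plain == unicode_variant)  # True
```

Whether this matches what happens inside pyzor depends on whether the text is a byte string or a Unicode string at the point where `ws_ptrn` is applied, which I have not checked.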

alexkiro added the enhancement label Feb 21, 2017
