[2.1]: Long multi-byte words dropped in log_search_words #8312

Open
sbulen opened this issue Sep 16, 2024 · 1 comment

Comments

@sbulen
Contributor

sbulen commented Sep 16, 2024

Basic Information

The problem here is hard to see: long words containing multi-byte characters never make it into log_search_words; they are silently dropped.

There are many subtleties here, but the core issue is that a substring is taken with a function that is not multi-byte safe.

The sequence of events:

  • Start with a long multi-byte word in a new topic subject, like this: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
  • It is passed through text2words, Subs.php, line 5354
  • From there, it is passed to truncate, Load.php, line 225
  • On line 225, a (non-mb) substr is taken, which leaves a corrupt, invalid UTF-8 final character: 三藩市道德委�
  • The now-invalid UTF-8 string is passed to $smcFunc['strlen'], Load.php, line 182
  • The preg_match on line 182 fails, passing null to PHP's strlen(), also on line 182
  • strlen issues the warning "Passing null to parameter #1 ($string) of type string is deprecated", which is suppressed
  • The word is never stored.
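
The failure in the steps above can be reproduced outside SMF with a few lines of PHP. This is only a minimal sketch: the 20-byte limit and the `~.~su` pattern are assumptions chosen for illustration, not the actual values used in Load.php, and `preg_replace` stands in for the regex call on line 182. It needs the mbstring extension for the validity check.

```php
<?php
// Subject from the report. Each CJK character here takes 3 bytes in UTF-8.
$subject = '三藩市道德委員會收到投訴:針對政治捐款「打包」組織';

// Assumption: a 20-byte limit, picked only because it lands mid-character.
$limit = 20;

// Byte-based cut: 6 whole characters (18 bytes) plus 2 stray bytes of the 7th.
$byteCut = substr($subject, 0, $limit);

// The result is no longer valid UTF-8.
$isValid = mb_check_encoding($byteCut, 'UTF-8');

// A regex with the /u modifier rejects invalid UTF-8 subjects;
// preg_replace signals that error by returning null.
$replaced = preg_replace('~.~su', '_', $byteCut);
$regexError = preg_last_error();

var_dump($isValid);                              // bool(false)
var_dump($replaced);                             // NULL
var_dump($regexError === PREG_BAD_UTF8_ERROR);   // bool(true)
```

Anything that then feeds `$replaced` to `strlen()` is measuring null rather than a string, which matches the deprecation message quoted in the report.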

Note that if text2words is called during a background task, an error is logged:
Cron error: 8192: strlen(): Passing null to parameter #1 ($string) of type string is deprecated (load.php, line 182)

This error is suppressed in the app, because deprecation errors are still suppressed in index.php, but it is not suppressed in cron.php.
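
To illustrate why the same call is silent in one entry point and logged in another, here is a hedged sketch. The number 8192 in the cron log is the value of PHP's `E_DEPRECATED` constant. The helper `deprecations_seen()` is hypothetical, and masking via `error_reporting()` is an assumption about how the suppression might work; the actual code in index.php and cron.php is not shown in this report. It assumes PHP 8.1 or later in the 8.x series, where passing null to `strlen()` is deprecated.

```php
<?php
// Hypothetical helper: run strlen(null) under a given error_reporting
// level and return the error numbers that would have been reported.
function deprecations_seen(int $level): array
{
    $seen = [];
    $previous = error_reporting($level);

    set_error_handler(function (int $errno, string $message) use (&$seen): bool {
        // Custom handlers are called regardless of error_reporting,
        // so honor the mask by hand.
        if (error_reporting() & $errno) {
            $seen[] = $errno;
        }
        return true; // treat the error as handled
    });

    // Stand-in for the null that reaches strlen() after the regex fails.
    $value = null;
    $length = strlen($value);

    restore_error_handler();
    error_reporting($previous);

    return ['length' => $length, 'errors' => $seen];
}

// Assumed index.php-like setup: deprecations masked out.
$masked = deprecations_seen(E_ALL & ~E_DEPRECATED);

// Assumed cron.php-like setup: nothing masked.
$unmasked = deprecations_seen(E_ALL);
```

With the mask applied, `$masked['errors']` stays empty; without it, `$unmasked['errors']` contains `E_DEPRECATED` (8192). In both cases the length comes back as 0, so the caller sees an empty word either way.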

Similar (but different) report: #6405

A bigger issue? The term above isn't actually a word; it's a sentence...

This issue exists in both 2.1 and 3.0. Even after cutting over to UTF8MB4 in 3.0, it may still exist, depending on whether and how the SMF truncate function is rewritten.
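
As a rough sketch of what a multi-byte-safe, byte-limited truncation could look like, the hypothetical function below backs the cut point up until it sits on a character boundary. `truncate_utf8_bytes()` is an illustrative name, not SMF's truncate function or a proposed patch, and it assumes the input is valid UTF-8 and the limit is not negative. PHP's mbstring extension offers `mb_strcut()` for the same purpose; this version avoids the extension for the cut itself.

```php
<?php
// Hypothetical helper: keep at most $maxBytes bytes of valid UTF-8 text
// without splitting a multi-byte character.
function truncate_utf8_bytes(string $text, int $maxBytes): string
{
    if (strlen($text) <= $maxBytes) {
        return $text;
    }

    $end = $maxBytes;

    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    // If the first dropped byte is one, the cut is inside a character,
    // so step back until the cut falls on a character start.
    while ($end > 0 && (ord($text[$end]) & 0xC0) === 0x80) {
        $end--;
    }

    return substr($text, 0, $end);
}

$subject = '三藩市道德委員會收到投訴:針對政治捐款「打包」組織';

// 20 bytes would split the 7th character, so only 6 characters (18 bytes) survive.
$safeCut = truncate_utf8_bytes($subject, 20);
echo $safeCut, PHP_EOL; // 三藩市道德委
```

The trade-off is that the result may be up to three bytes shorter than the limit, but it always remains valid input for a `/u` regex.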

Steps to reproduce

  1. Create a new post with this in the subject: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
  2. Post it

Expected result

A word in log_search_words

Actual result

No words in log_search_words

Version/Git revision

3.0 alpha 2 & 2.1.4

Database Engine

All

Database Version

8.4

PHP Version

8.3.8

Logs

No response

Additional Information

No response

@sbulen
Contributor Author

sbulen commented Oct 6, 2024

I can no longer reproduce this with 3.0. I believe @Sesquipedalian fixed the issue in 3.0 with #8298.

In fact, I think #8298 also fixed my broader concern above, that we weren't properly breaking on words. E.g., 3.0 now properly recognizes that 委員會 = "committee", and places a single entry into log_search_words for that portion of the test string above.

Very cool.

Issue still exists with 2.1.

sbulen changed the title from "[2.1] & [3.0]: Long multi-byte words dropped in log_search_words" to "[2.1]: Long multi-byte words dropped in log_search_words" on Oct 6, 2024