Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better sanitiziation of analysis results #1162

Closed
wants to merge 1 commit into from

Conversation

jstucke
Copy link
Collaborator

@jstucke jstucke commented Nov 2, 2023

resolves #1161

@jstucke jstucke added the bug label Nov 2, 2023
@jstucke jstucke requested review from dorpvom and maringuu November 2, 2023 16:30
@jstucke jstucke self-assigned this Nov 2, 2023
Copy link
Collaborator

@maringuu maringuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question: Isn't this sanitation in the wrong place?
Should it be needed at all?

As far as I understand plugin results can contain invalid things, so they are sanitized.
But shouldn't this sanitation be done by the plugin?
How does the sanitation make sure that the plugin results stay valid?

Regarding #1161 this fix would make the result invalid, wouldn't it?

# saved in the PostgreSQL database
json_string = json.dumps(string)
if JSON_UNICODE_REGEX.search(json_string):
logging.warning(f'Sanitizing unicode characters in string {json_string[100:]}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the [100:] here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string could be very long (there are some big analysis results which are stored as strings) and I thought not printing a few MBs worth of string would generally be a good idea

@jstucke
Copy link
Collaborator Author

jstucke commented Nov 7, 2023

Isn't this sanitation in the wrong place?
Should it be needed at all?
As far as I understand plugin results can contain invalid things, so they are sanitized.
But shouldn't this sanitation be done by the plugin?

I generally agree but it isn't that easy to sanitize the string (it is a regular string but with unicode characters, so it is in fact valid JSON -- just the database can't handle it) and I don't think this should be handled by each plugin individually either. Also we have the potential problem of custom plugins which we can't control. Maybe it could be handled in the analysis plugin base class but I don't know if that is a better place than the module which explicitly handles data conversion for the database

@jstucke
Copy link
Collaborator Author

jstucke commented Nov 7, 2023

This seems to only be an issue if the database encoding is not set to "UTF8". My database has this encoding (you can test it with the command SHOW SERVER_ENCODING; in psql), so only null bytes are a problem (which should already be handled). We need to determine in which cases and why the database encoding is not set to "UTF8". Until then I will mark this PR as draft

@dorpvom
Copy link
Collaborator

dorpvom commented Nov 25, 2024

Has not become an issue. Will close.

@dorpvom dorpvom closed this Nov 25, 2024
@wuzhongjie1997
Copy link

@dorpvom please see my reply in original issue.
I'd also like to ask, if there is a way to collect all files against which a certain analysis plugin failed to operate? Say i frequently experience 'unable to write database' issues with the crypto_hints plugin, I wonder if pulling all the extracted files which caused this error would aid in debugging the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Database error when trying to write to the database
4 participants