clp-s: Add option to fail ingestion on invalid utf-8 sequence. #655

Open
gibber9809 opened this issue Jan 8, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@gibber9809
Contributor

Request

The current ingestion implementation replaces invalid UTF-8 sequences with a placeholder UTF-8 sequence (this follows the behaviour of get_string(true) in simdjson). This lets us handle invalid UTF-8 automatically and ensures that archives always contain valid UTF-8 data.

However, some users may prefer that ingestion fail when it encounters invalid UTF-8 so that they can fix the data upstream. An alternative would be to keep replacing the invalid sequence with a placeholder and let ingestion succeed, but notify the user in some way.

Possible implementation

  1. Allow users to pass a flag to ingestion indicating that they want ingestion to fail on invalid UTF-8.
  2. Fail ingestion and notify the user where the invalid UTF-8 was encountered.
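
As a rough illustration of the two steps above, here is a minimal, self-contained sketch of a strict check that reports the byte offset of the first invalid UTF-8 sequence, plus a policy switch that decides whether ingestion should fail. It is not taken from clp-s or simdjson: the names InvalidUtf8Policy, find_invalid_utf8, CheckResult, and check_record are hypothetical, and a real implementation would more likely hook into the existing parsing path than rescan each record.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical policy that an ingestion flag could map to (not an existing clp-s option).
enum class InvalidUtf8Policy { Replace, Fail };

// Returns the byte offset of the first invalid UTF-8 sequence, or std::nullopt
// if the whole input is well-formed UTF-8.
std::optional<std::size_t> find_invalid_utf8(std::string_view input) {
    std::size_t i = 0;
    while (i < input.size()) {
        auto const lead = static_cast<std::uint8_t>(input[i]);
        if (lead < 0x80) {  // ASCII
            ++i;
            continue;
        }
        std::size_t len = 0;
        std::uint32_t code_point = 0;
        if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
            code_point = lead & 0x1F;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            code_point = lead & 0x0F;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            code_point = lead & 0x07;
        } else {
            // Stray continuation byte, overlong 2-byte lead (0xC0, 0xC1), or lead > 0xF4.
            return i;
        }
        if (i + len > input.size()) {
            return i;  // Sequence is truncated by the end of the input.
        }
        for (std::size_t k = 1; k < len; ++k) {
            auto const next = static_cast<std::uint8_t>(input[i + k]);
            if ((next & 0xC0) != 0x80) {
                return i;  // Expected a continuation byte (10xxxxxx).
            }
            code_point = (code_point << 6) | (next & 0x3F);
        }
        bool const overlong = (3 == len && code_point < 0x800)
                              || (4 == len && code_point < 0x10000);
        bool const surrogate = code_point >= 0xD800 && code_point <= 0xDFFF;
        if (overlong || surrogate || code_point > 0x10FFFF) {
            return i;
        }
        i += len;
    }
    return std::nullopt;
}

// Hypothetical result type: whether ingestion may proceed, and where the
// first invalid sequence (if any) was found.
struct CheckResult {
    bool ok;
    std::optional<std::size_t> invalid_offset;
};

// Applies the policy to a single record. Under Replace, the offset is still
// returned so that the caller could log a warning while letting ingestion succeed.
CheckResult check_record(std::string_view record, InvalidUtf8Policy policy) {
    auto const offset = find_invalid_utf8(record);
    if (offset.has_value() && InvalidUtf8Policy::Fail == policy) {
        return {false, offset};
    }
    return {true, offset};
}
```

Returning the offset in both modes would also cover the "replace but notify" variant mentioned in the request, since the caller can report the location without aborting.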