Added LookupError handling during encoding detection. #412

Open
wants to merge 17 commits into
base: main

@ggcr ggcr commented Dec 6, 2024

Please check Issue #411, where I go into detail on how to reproduce the problem and the reasoning behind my simple fix.


In the current implementation of parse_file, the exception handling only catches UnicodeDecodeError.

# Open the file and read its content. Determine the encoding using cchardet. Skip over binary files.
with zip_ref.open(file_info, "r") as file:
    content = file.read()
    # Determine the encoding of the file
    encoding = chardet.detect(content)["encoding"]
    if not encoding:
        return None
    try:
        content = content.decode(encoding)
    except UnicodeDecodeError:
        # If the file cannot be decoded, return None
        return None

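This can be updated to also catch LookupError, which content.decode(encoding) raises when the detected encoding name has no codec registered in the running Python interpreter.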

except (UnicodeDecodeError, LookupError):
    return None
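As a minimal, self-contained sketch of the two failure modes the widened except clause covers: the helper name `decode_or_none` is hypothetical and not part of the repository, and the encoding is passed in directly instead of coming from chardet so the example has no third-party dependency.

```python
from typing import Optional


def decode_or_none(content: bytes, encoding: Optional[str]) -> Optional[str]:
    """Hypothetical helper: decode bytes, or return None if that is not possible."""
    if not encoding:
        # Detection produced no encoding (e.g. a binary file).
        return None
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid for this codec.
        # LookupError: the codec name is unknown to this Python runtime.
        return None


if __name__ == "__main__":
    print(decode_or_none(b"hello", "utf-8"))
    print(decode_or_none(b"\xff\xfe\xfa", "utf-8"))
    print(decode_or_none(b"hello", "not-a-real-codec"))
```

Without LookupError in the tuple, the third call would propagate an exception instead of returning None.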

While this works, the decode still fails whenever the detected encoding is not available in the runtime system; the error is only caught and the content is dropped. It would be very nice to parse line by line instead, so that we only skip the offending line and do not ditch the whole GitHub repo from the curation process. However, parsing the file line by line might introduce a lot of overhead for big repos.
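The line-by-line idea could look roughly like the sketch below. This is not what the PR implements: the helper name `decode_lines_lenient` is hypothetical, and splitting raw bytes on newlines assumes an ASCII-compatible encoding (it would not be correct for UTF-16, for example). Note that a missing codec affects every line equally, so per-line skipping only helps with UnicodeDecodeError; the LookupError case is checked once up front.

```python
import codecs
from typing import Optional


def decode_lines_lenient(content: bytes, encoding: str) -> Optional[str]:
    """Hypothetical helper: decode line by line, dropping lines that fail to decode."""
    try:
        # Raises LookupError if the codec is unknown to this runtime.
        codecs.lookup(encoding)
    except LookupError:
        return None

    kept = []
    # Assumption: the encoding is ASCII-compatible, so byte-level newlines are real newlines.
    for raw_line in content.splitlines(keepends=True):
        try:
            kept.append(raw_line.decode(encoding))
        except UnicodeDecodeError:
            # Skip only the undecodable line, keep the rest of the file.
            continue
    return "".join(kept)


if __name__ == "__main__":
    data = b"ok\n\xff\xfe bad\nfine\n"
    print(repr(decode_lines_lenient(data, "utf-8")))
```

The per-line try/except is where the overhead mentioned above would come from, since every line pays for a separate decode call.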
