Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: proper class for field info #1730

Draft
wants to merge 17 commits into
base: fix/parallel-datapackage
Choose a base branch
from

Conversation

pierrecamilleri
Copy link
Collaborator

@pierrecamilleri pierrecamilleri commented Jan 24, 2025

This is a refactoring PR.

Currently, a complex private object field_info is created in "Table.__open_row_stream" (resources/table.py) and used in the (non-public) Row __init__ method.

In addition, taking into account the schema_sync option leads to a lot of changes at many places, and it is very intricate and error-prone.

This PR introduces the same functionality without the field info, and with a proper implementation for schema_sync.

Sorry for the complicated review ! I try to provide a clear explanation to help with it.

Details

  • As a reminder, the schema_sync option allows to change the order of columns in the data, to drop columns (except if required) and to add extra columns. Even if it will soon be probably deprecated (better way to control this in the v2 spec), the changes introduced here will help to implement the v2 changes.

  • First, the schema_sync option would mess with the schema itself, which is a bad idea because 1. it deceives the expectation to find the schema as provided and not modified, 2. some schema fields need to be kept on hand, e.g. missing required columns, to be able to properly raise appropriate errors. This would in very intricate code, where these fields would be kept for the header validation and dropped for the row validation.

  • Taking into account the schema_sync option needs to happen with both schema and labels on hand : so all schema_sync specific code has been moved from the "detector.py" to the "header.py" file.

  • The Header class can now directly deal with identifying missing required columns (_get_missing_fields method) or extra labels (_get_extra_labels). Before this change, they were determined by comparing the (possibly modified) schema and the data. The header now also provides schema fields associated to the columns expected in the data, in a single step (get_expected_fields method). These methods deal with schema_sync, but will be able to deal with a large range of expectations as with the v2 spec fieldsMatch property in the future.

  • These expected fields are provided to the Row, and serve the same role as the former FieldInfo.

Changes orthogonal to the refactoring

  • Row.__str__ and Row.__repr__ were having side effects - the row was processed if it was observed. This is error-prone, and bit me as setting breakpoints for a debugger would change the behavior because of this.

WIP notes

Next steps / investigation: 

  • Explore the reason why there is a create_cell_reader function, instead of a more direct read_cell, which at first glance would simplify the logic.
    1. Some constraints parsing happens in create_cell_reader (maybe to reuse the value_reader). This does not seem the right place.
    2. for creating the value_reader once and for all (but same question, why create a value_reader instead of a read_row method.
  • Can (should?) the perf be improved by not find the field number with .index each time it is needed.
  • do not mess with schema fields when schema_sync=True, instead, create a separate list or mapping of the actual data fields.

@pierrecamilleri pierrecamilleri marked this pull request as draft January 24, 2025 14:47
Test passes, surprisingly.
No special effort has been made to support `header_case` option, or
"required" columns with `schema_sync`
@pierrecamilleri pierrecamilleri changed the base branch from main to fix/parallel-datapackage January 29, 2025 14:18
TODO still some tidy up :

- Remove FieldsInfo, use header instead
- Less error-prone way for `_normalize`
@@ -509,12 +509,6 @@ def test_resource_validate_detector_sync_schema():
)
report = resource.validate()
assert report.valid
assert resource.schema.to_descriptor() == {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is removed as the schema is not modified anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant