-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: proper class for field info #1730
Draft
pierrecamilleri
wants to merge
17
commits into
fix/parallel-datapackage
Choose a base branch
from
refactor/field_info
base: fix/parallel-datapackage
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Test passes, surprisingly. No special effort has been made to support `header_case` option, or "required" columns with `schema_sync`
996f64d
to
8b72ae8
Compare
TODO still some tidy up : - Remove FieldsInfo, use header instead - Less error-prone way for `_normalize`
pierrecamilleri
commented
Feb 7, 2025
@@ -509,12 +509,6 @@ def test_resource_validate_detector_sync_schema(): | |||
) | |||
report = resource.validate() | |||
assert report.valid | |||
assert resource.schema.to_descriptor() == { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is removed as the schema is not modified anymore.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a refactoring PR.
Currently, a complex private object
field_info
is created in "Table.__open_row_stream" (resources/table.py) and used in the (non-public) Row__init__
method.In addition, taking into account the
schema_sync
option leads to a lot of changes at many places, and it is very intricate and error-prone.This PR introduces the same functionality without the field info, and with a proper implementation for
schema_sync
.Sorry for the complicated review ! I try to provide a clear explanation to help with it.
Details
As a reminder, the
schema_sync
option allows to change the order of columns in the data, to drop columns (except if required) and to add extra columns. Even if it will soon be probably deprecated (better way to control this in the v2 spec), the changes introduced here will help to implement the v2 changes.First, the
schema_sync
option would mess with the schema itself, which is a bad idea because 1. it deceives the expectation to find the schema as provided and not modified, 2. some schema fields need to be kept on hand, e.g. missing required columns, to be able to properly raise appropriate errors. This would in very intricate code, where these fields would be kept for the header validation and dropped for the row validation.Taking into account the
schema_sync
option needs to happen with both schema and labels on hand : so allschema_sync
specific code has been moved from the "detector.py" to the "header.py" file.The
Header
class can now directly deal with identifying missing required columns (_get_missing_fields
method) or extra labels (_get_extra_labels
). Before this change, they were determined by comparing the (possibly modified) schema and the data. The header now also provides schema fields associated to the columns expected in the data, in a single step (get_expected_fields
method). These methods deal withschema_sync
, but will be able to deal with a large range of expectations as with the v2 specfieldsMatch
property in the future.These expected fields are provided to the Row, and serve the same role as the former
FieldInfo
.Changes orthogonal to the refactoring
Row.__str__
andRow.__repr__
were having side effects - the row was processed if it was observed. This is error-prone, and bit me as setting breakpoints for a debugger would change the behavior because of this.WIP notes
Next steps / investigation:
create_cell_reader
function, instead of a more directread_cell
, which at first glance would simplify the logic.create_cell_reader
(maybe to reuse thevalue_reader
). This does not seem the right place.value_reader
once and for all (but same question, why create avalue_reader
instead of aread_row
method..index
each time it is needed.schema_sync=True
, instead, create a separate list or mapping of the actual data fields.