refactor: proper class for field info #1730

pierrecamilleri · 2025-01-24T14:47:02Z

This is a refactoring PR.

Currently, a complex private object field_info is created in "Table.__open_row_stream" (resources/table.py) and used in the (non-public) Row __init__ method.

In addition, taking the schema_sync option into account requires changes in many places, which makes the code very intricate and error-prone.

This PR provides the same functionality without the field info, and with a proper implementation of schema_sync.

Sorry for the complicated review! I have tried to provide a clear explanation to help with it.

Details

As a reminder, the schema_sync option allows the columns in the data to be reordered, dropped (unless required), or extended with extra columns. Even though it will probably be deprecated soon (the v2 spec offers a better way to control this), the changes introduced here will help to implement the v2 changes.
First, the schema_sync option used to modify the schema itself, which is a bad idea because 1. it breaks the expectation that the schema is found as provided, unmodified, and 2. some schema fields need to be kept on hand, e.g. missing required columns, in order to raise the appropriate errors. This resulted in very intricate code, where these fields were kept for the header validation and dropped for the row validation.
Handling the schema_sync option requires both the schema and the labels on hand, so all schema_sync-specific code has been moved from the "detector.py" file to the "header.py" file.
The Header class can now directly identify missing required columns (_get_missing_fields method) and extra labels (_get_extra_labels method). Before this change, they were determined by comparing the (possibly modified) schema with the data. The header now also provides, in a single step, the schema fields associated with the columns expected in the data (get_expected_fields method). These methods deal with schema_sync, but will be able to handle a wider range of expectations in the future, such as the v2 spec fieldsMatch property.
These expected fields are provided to the Row, and serve the same role as the former FieldInfo.
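To make the idea above easier to review, here is a minimal, self-contained sketch of computing missing fields, extra labels and expected fields from the schema and the labels without ever modifying the schema. It is not the code of this PR: the `Field` dataclass, the free functions and their signatures are hypothetical stand-ins, and giving extra labels a default field under schema_sync is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema field; not the frictionless Field class.
    name: str
    required: bool = False


def get_missing_fields(schema_fields, labels):
    """Required schema fields that have no matching label in the data."""
    present = set(labels)
    return [f for f in schema_fields if f.required and f.name not in present]


def get_extra_labels(schema_fields, labels):
    """Labels found in the data that no schema field declares."""
    known = {f.name for f in schema_fields}
    return [label for label in labels if label not in known]


def get_expected_fields(schema_fields, labels, schema_sync):
    """Fields to apply to the data columns, in data order when syncing."""
    if not schema_sync:
        # Without syncing, the data is expected to follow the schema as is.
        return list(schema_fields)
    by_name = {f.name: f for f in schema_fields}
    # Assumption: an extra label gets a default, non-required field.
    return [by_name.get(label, Field(name=label)) for label in labels]
```

Since the three functions only read `schema_fields`, the schema stays as provided, and a missing required field can still be reported during header validation.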

Changes orthogonal to the refactoring

Row.__str__ and Row.__repr__ had side effects: the row was processed whenever it was displayed. This is error-prone, and it bit me, because setting breakpoints in a debugger would change the behavior.
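As an illustration of the intent, here is a minimal sketch of a lazily processed object whose `__str__` reports its state instead of triggering processing. `LazyRow`, `to_list` and the `strip` "processing" are hypothetical and only stand in for the real Row logic.

```python
class LazyRow:
    """Hypothetical stand-in for a lazily processed row, not frictionless.Row."""

    def __init__(self, cells):
        self._cells = list(cells)
        self._processed = None  # filled on first explicit access only

    @property
    def processed(self):
        return self._processed is not None

    def to_list(self):
        # Explicit access is the only place where processing happens.
        if self._processed is None:
            self._processed = [cell.strip() for cell in self._cells]
        return self._processed

    def __str__(self):
        # Displaying the row never processes it; it only reports the state.
        if self._processed is None:
            return f"LazyRow(unprocessed, {len(self._cells)} cells)"
        return f"LazyRow({self._processed})"
```

With this shape, inspecting the object in a debugger leaves its state untouched.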

WIP notes

Next steps / investigation:

Explore the reason why there is a create_cell_reader function instead of a more direct read_cell, which at first glance would simplify the logic. Possible reasons:
1. Some constraints parsing happens in create_cell_reader (maybe to reuse the value_reader). This does not seem to be the right place.
2. To create the value_reader once and for all (but same question: why create a value_reader instead of a read_row method?).
Can (should?) the performance be improved by not looking up the field number with .index each time it is needed?
Do not mess with schema fields when schema_sync=True; instead, create a separate list or mapping of the actual data fields.
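For the `.index` question above, one possible direction is to build the name-to-position mapping once and reuse it. This is only a sketch: `build_position_index` is a hypothetical helper, and whether it brings a measurable gain depends on the number of fields and would need profiling.

```python
def build_position_index(field_names):
    """Map each field name to its position; first occurrence wins, like list.index."""
    positions = {}
    for position, name in enumerate(field_names):
        # setdefault keeps the first position when a name appears twice.
        positions.setdefault(name, position)
    return positions
```

A dictionary lookup then replaces the linear scan done by `list.index` on every access, and the same mapping could serve as the separate structure for the actual data fields.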

Test passes, surprisingly. No special effort has been made to support `header_case` option, or "required" columns with `schema_sync`

TODO still some tidy up : - Remove FieldsInfo, use header instead - Less error-prone way for `_normalize`

pierrecamilleri · 2025-02-07T15:48:03Z

frictionless/resource/__spec__/test_validate.py

@@ -509,12 +509,6 @@ def test_resource_validate_detector_sync_schema():
    )
    report = resource.validate()
    assert report.valid
-    assert resource.schema.to_descriptor() == {


This test is removed as the schema is not modified anymore.

pierrecamilleri marked this pull request as draft January 24, 2025 14:47

pierrecamilleri added 6 commits January 27, 2025 15:23

first attempt

ee4ad2f

squash! first attempt

39aa0f2

🔵 Mv to resources/table

70aadac

🔵 rename

add8dac

Schema sync functionnality inside FieldsInfo

e26393b

🔵 remove empty / unused file

8b72ae8

pierrecamilleri force-pushed the refactor/field_info branch from 996f64d to 8b72ae8 Compare January 29, 2025 14:18

pierrecamilleri changed the base branch from main to fix/parallel-datapackage January 29, 2025 14:18

pierrecamilleri added 2 commits January 29, 2025 18:10

🟢 Test passes

c7479c6

🟢 get rid of FieldInfo

9b7d0ad

pierrecamilleri commented Feb 7, 2025

pierrecamilleri added 9 commits February 7, 2025 16:51

remove unnecessary review noise

a80e79a

remove unused function

b7f693d

remove unused FieldInfo

77a2c9f

linting

305d2b9

Information on processing for Row.__str__ and Row.__repr__

91d3ffb

typo

575885b

unintended rename

c41590f

Remove __repr__ change as it is used for tests

37fd709

fix: oopsie

c6c90a2
