Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: revisit column ID assignment #17340

Open
xxchan opened this issue Jun 19, 2024 · 3 comments
Open

refactor: revisit column ID assignment #17340

xxchan opened this issue Jun 19, 2024 · 3 comments

Comments

@xxchan
Copy link
Member

xxchan commented Jun 19, 2024

During work in #17293, we can see it's a mess. It's error-prone and we cannot understand how it works precisely.

  • In some places, we use ColumnId::placeholder(), and use col_id_gen to fill it at the end

  • In other places, we create column id ad-hoc. AND MAY OR MAY NOT use col_id_gen again to assign the col id.

  • For TableVersion we have next_column_id, but not for SourceVersion. This means DROP COLUMN for source might be problematic (although we don't support it).

    However, it's also possible that it can work. As fix(source): fix panic for ALTER SOURCE with schema registry #17293 (comment), perhaps (fixed) column id for source is not relied on, but we have no clue yet.

A little more background:

@github-actions github-actions bot added this to the release-1.10 milestone Jun 19, 2024
@xxchan xxchan removed this from the release-1.10 milestone Jul 10, 2024
@xxchan
Copy link
Member Author

xxchan commented Sep 6, 2024

Another factor to consider: In iceberg it's required to have a field_id, including nested types' fields (should be unique in the table schema), which will be used in schema evolution and column projection.

@xxchan
Copy link
Member Author

xxchan commented Dec 11, 2024

As discussed with @BugenZhao, the col id assignment for source is not very wrong for now:

Things that matter

  1. Storage uses col id. So:
    1.1 existing column's id should not change
    1.2 we cannot reuse col id which was used before. (after DROP COLUMN)
  2. downstream needs to pick correct cols after schema change. (via col_index_mapping)

Table's implementation

  • We re-plan the whole table
  • Column ids are generated via col_id_gen (ColumnIdGenerator::new_alter(old_catalog))
    • col_id_gen looks up by col_name, and preserve their ID.

So both the properties are ensured.

Source's implementation

Source doesn't persist data, so property 1 is not needed at all. (It seems safe to reuse used ID.)

  • For user-specified schema
    • We only bind new columns, and then append it to the original columns.
  • For schema registry
    • We rebind all columns (bind_columns_from_source), then calculate column diff with columns_minus (by name) to get added columns.
  • In both cases, col IDs are not generated via a col_id_gen, but randomly according to each schema's implementation.
    • But we ensure existing cols are unchanged.
    • For new column, we use max_column_id + 1 to assign.

So it looks correct now.

Struct?

Currently we didn't handle struct field's col id carefully. It might have problem in the future when we have alter struct column. Not sure now.

@BugenZhao
Copy link
Member

BugenZhao commented Dec 11, 2024

(It seems safe to reuse used ID.)

As long as we don't drop a column and add a new column reusing that ID in a single request. 🤣

It might have problem in the future when we have alter struct column. Not sure now.

We effectively have once we support REFRESH SCHEMA. See #19736 #19755.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants