refactor: revisit column ID assignment #17340

xxchan · 2024-06-19T06:23:06Z

While working on #17293, we found that column ID assignment is a mess: it is error-prone, and it is hard to understand precisely how it works.

- In some places, we use ColumnId::placeholder() and fill it in with col_id_gen at the end.
- In other places, we create column IDs ad hoc, and may or may not use col_id_gen again to assign the column ID.
- TableVersion has next_column_id, but SourceVersion does not. This means DROP COLUMN for sources might be problematic (although we don't support it).
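To make the hazard concrete, here is a minimal sketch of mixing the two styles. All types below are simplified, hypothetical stand-ins (not the actual RisingWave definitions), and the placeholder value is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a column ID; not the real RisingWave type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ColumnId(i32);

impl ColumnId {
    /// Sentinel meaning "not assigned yet". The concrete value is an assumption.
    const PLACEHOLDER: ColumnId = ColumnId(i32::MAX - 1);

    fn is_placeholder(self) -> bool {
        self == Self::PLACEHOLDER
    }
}

#[derive(Debug, Clone)]
struct Column {
    name: String,
    id: ColumnId,
}

/// Hands out monotonically increasing IDs, in the spirit of col_id_gen.
struct ColumnIdGen {
    next: i32,
}

impl ColumnIdGen {
    fn new(start: i32) -> Self {
        Self { next: start }
    }

    fn generate(&mut self) -> ColumnId {
        let id = ColumnId(self.next);
        self.next += 1;
        id
    }
}

/// Replaces every placeholder with a generated ID; already-assigned IDs are left alone.
fn fill_placeholders(columns: &mut [Column], id_gen: &mut ColumnIdGen) {
    for col in columns.iter_mut() {
        if col.id.is_placeholder() {
            col.id = id_gen.generate();
        }
    }
}

/// Returns true if two columns ended up with the same ID.
fn has_duplicate_ids(columns: &[Column]) -> bool {
    let mut seen = HashSet::new();
    columns.iter().any(|c| !seen.insert(c.id))
}
```

If one column carries an ad-hoc ID that the generator does not know about, filling the remaining placeholders can hand out the same ID again, and nothing in this flow detects it unless a check like has_duplicate_ids is run afterwards.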

However, it may also work as is. As noted in a comment on fix(source): fix panic for ALTER SOURCE with schema registry #17293, perhaps nothing relies on a source's column IDs being fixed, but we don't know yet.

A little more background:

- ColumnIdGenerator was introduced in feat(frontend): introduce the skeleton of schema change #7083.
- The refactor in refactor(binder): bind create table #10307 introduced ColumnId::placeholder(). This makes coding easier, but also more error-prone.
- frontend: refactor source schema resolution #9828 also mentioned:

avoid manipulating columns manually, which bypasses ColumnIdGenerator and can be problematic if we support ALTER TABLE with connector


xxchan · 2024-09-06T14:32:31Z

Another factor to consider: Iceberg requires every field to have a field_id, including the fields of nested types. These IDs must be unique within the table schema and are used for schema evolution and column projection.
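As a rough illustration of what "unique across nested fields" means, the sketch below walks a schema depth-first and numbers every field, including struct children. The DataType and Field types are hypothetical simplifications, and the traversal order is an assumption; the only requirement taken from the comment above is uniqueness within the table schema.

```rust
/// Hypothetical, simplified data type; only what the sketch needs.
#[derive(Debug)]
enum DataType {
    Int,
    Varchar,
    Struct(Vec<Field>),
}

#[derive(Debug)]
struct Field {
    name: String,
    /// None until an ID has been assigned.
    field_id: Option<i32>,
    ty: DataType,
}

/// Assigns a fresh ID to every field, recursing into struct children,
/// so that no two fields in the whole schema share an ID.
fn assign_field_ids(fields: &mut [Field], next_id: &mut i32) {
    for field in fields.iter_mut() {
        field.field_id = Some(*next_id);
        *next_id += 1;
        if let DataType::Struct(children) = &mut field.ty {
            assign_field_ids(children, next_id);
        }
    }
}

/// Collects all assigned IDs in traversal order, e.g. to check uniqueness.
fn collect_ids(fields: &[Field], out: &mut Vec<i32>) {
    for field in fields {
        if let Some(id) = field.field_id {
            out.push(id);
        }
        if let DataType::Struct(children) = &field.ty {
            collect_ids(children, out);
        }
    }
}
```

A per-column counter that only numbers top-level columns would not be enough here, since struct children need IDs from the same ID space.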

xxchan · 2024-12-11T07:50:08Z

As discussed with @BugenZhao, the column ID assignment for sources is not really wrong for now:

Things that matter

1. Storage uses column IDs. So:
   1.1 An existing column's ID should not change.
   1.2 We cannot reuse a column ID that was used before (after DROP COLUMN).
2. Downstream needs to pick the correct columns after a schema change (via col_index_mapping).

Table's implementation

- We re-plan the whole table.
- Column IDs are generated via col_id_gen (ColumnIdGenerator::new_alter(old_catalog)).
  - col_id_gen looks up existing columns by name and preserves their IDs.

So both properties are ensured.
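The table path described above can be sketched roughly as follows. This is a simplified model under stated assumptions: OldCatalog and AlterIdGen are hypothetical names, IDs are plain i32, and only the two behaviours mentioned above (look up by name, continue from next_column_id) are modelled.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of a table's persisted catalog.
struct OldCatalog {
    /// (column name, column ID) pairs of the current table version.
    columns: Vec<(String, i32)>,
    /// Next unused ID recorded in the table version; never moves backwards.
    next_column_id: i32,
}

/// Simplified stand-in for a generator created for ALTER TABLE.
struct AlterIdGen {
    existing: HashMap<String, i32>,
    next_column_id: i32,
}

impl AlterIdGen {
    fn new_alter(old: &OldCatalog) -> Self {
        Self {
            existing: old.columns.iter().cloned().collect(),
            next_column_id: old.next_column_id,
        }
    }

    /// Existing columns keep their ID (property 1.1); new columns take the
    /// next unused ID, so IDs of dropped columns are not handed out again
    /// (property 1.2).
    fn generate(&mut self, name: &str) -> i32 {
        if let Some(&id) = self.existing.get(name) {
            return id;
        }
        let id = self.next_column_id;
        self.next_column_id += 1;
        id
    }
}
```

For example, if the old table had columns with IDs 1, 2, 3 and the column with ID 2 was dropped, next_column_id is still 4, so a newly added column gets 4 rather than 2.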

Source's implementation

A source doesn't persist data, so property 1 is not needed at all. (It seems safe to reuse used ID.)

For a user-specified schema:
- We only bind the new columns, and then append them to the original columns.
For a schema registry:
- We rebind all columns (bind_columns_from_source), then compute the column diff with columns_minus (by name) to get the added columns.
In both cases, column IDs are not generated via a col_id_gen, but arbitrarily according to each schema's implementation.
- But we ensure existing columns are unchanged.
- For a new column, we assign max_column_id + 1.

So it looks correct now.
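A minimal sketch of the schema-registry path described above, under stated assumptions: Column is a simplified stand-in, columns_minus here is a re-implementation of the by-name diff rather than the actual helper, append_added_columns is a hypothetical name, and giving several new columns consecutive IDs is an assumption.

```rust
/// Simplified stand-in for a source column.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    id: i32,
}

/// Columns of `new` whose names do not appear in `old` (diff by name).
fn columns_minus(new: &[Column], old: &[Column]) -> Vec<Column> {
    new.iter()
        .filter(|n| !old.iter().any(|o| o.name == n.name))
        .cloned()
        .collect()
}

/// Keeps the old columns untouched and appends the added ones, numbering
/// them from max_column_id + 1. Whatever IDs the rebind produced are ignored.
fn append_added_columns(old: &[Column], rebound: &[Column]) -> Vec<Column> {
    // Assumption: an empty source starts numbering at 1.
    let mut next_id = old.iter().map(|c| c.id).max().map_or(1, |max| max + 1);
    let mut result = old.to_vec();
    for mut col in columns_minus(rebound, old) {
        col.id = next_id;
        next_id += 1;
        result.push(col);
    }
    result
}
```

Because the next ID is derived from the current maximum rather than from a persisted counter, dropping the column with the highest ID and then adding a new one would hand the same ID out again. This is the reuse that the comment above considers acceptable for sources.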

Struct?

Currently we don't handle struct fields' column IDs carefully. This might cause problems in the future when we support altering struct columns. Not sure now.

BugenZhao · 2024-12-11T07:54:58Z

(It seems safe to reuse used ID.)

As long as we don't drop a column and add a new column reusing that ID in a single request. 🤣

This might cause problems in the future when we support altering struct columns. Not sure now.

We effectively will have that once we support REFRESH SCHEMA. See #19736 and #19755.

xxchan added the type/feature label Jun 19, 2024

github-actions bot added this to the release-1.10 milestone Jun 19, 2024

xxchan mentioned this issue Jun 28, 2024

fix: remove column id check #17494

Merged

9 tasks

xxchan removed this from the release-1.10 milestone Jul 10, 2024

This was referenced Dec 11, 2024

feat(frontend): support alter add column for shared source #19649

Merged

Support ALTER SOURCE ADD COLUMN for shared source #19063

Closed
