feat(batch): support system column _rw_timestamp for tables #19232

Open
wants to merge 25 commits into
base: main

Conversation

chenzl25
Contributor

@chenzl25 chenzl25 commented Nov 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

  • Resolves #11629 (Add system column rw_timestamp for any tables).
  • Support system column _rw_timestamp (datatype is timestamptz) for tables. We will add a hidden column _rw_timestamp to every table catalog when loading it from a proto, but we never persist this column info.
  • Only support selecting _rw_timestamp in a batch query. Using it in a streaming query will cause an error.
  • For batch queries, we add a field epoch_idx to the storage table, which indicates the position in the output row where the epoch should be placed.
  • Since the state store get_row interface doesn't expose any epoch information, we will use the iter interface to support point get if _rw_timestamp is selected.
  • Lots of change is caused by planner tests, the core change is about 200+ LOC.
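
The epoch_idx idea from the list above can be illustrated with a minimal sketch. It is not the PR's code: the `Row` alias and the `with_epoch` helper are hypothetical, and the conversion from an epoch to a timestamptz value is left out, so the epoch is treated as a plain integer.

```rust
/// Toy row: each cell is an optional 64-bit value.
/// (Hypothetical stand-in, not RisingWave's row type.)
type Row = Vec<Option<i64>>;

/// Place `epoch_value` at position `epoch_idx` of the output row,
/// shifting any later columns one slot to the right.
/// When `epoch_idx` is `None`, the query did not select `_rw_timestamp`
/// and the row is returned unchanged.
/// Panics if `epoch_idx` is greater than the row length (`Vec::insert` contract).
fn with_epoch(mut row: Row, epoch_idx: Option<usize>, epoch_value: i64) -> Row {
    if let Some(idx) = epoch_idx {
        row.insert(idx, Some(epoch_value));
    }
    row
}
```

With `epoch_idx = Some(2)`, a two-column row `(id, a)` becomes `(id, a, epoch)`, which matches the column order of the example query below.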

Example:

dev=> select *, _rw_timestamp from t;
 id | a |         _rw_timestamp
----+---+-------------------------------
  2 | 2 | 2024-11-05 07:05:24.488+00:00
  1 | 1 | 2024-11-05 07:05:19.487+00:00
(2 rows)

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

  • Support the _rw_timestamp system column for tables. Users can select this column in a batch query to check the internal epoch/timestamp of each row, which is useful for finding out when a row was last updated.

@chenzl25 chenzl25 added the user-facing-changes Contains changes that are visible to users label Nov 1, 2024
@graphite-app graphite-app bot requested a review from a team November 1, 2024 10:16

@@ -219,7 +219,8 @@ impl Binder {
             .iter()
             .enumerate()
             .filter_map(|(i, c)| {
-                (!c.is_generated()).then_some(InputRef::new(i, c.data_type().clone()).into())
+                c.can_dml()

Member

Does it mean that we (previously) also allowed DML on the _row_id column? Shall we ban this as well?

Contributor Author

We allow DML such as deletes on the _row_id column. I don't think we need to ban it.

Comment on lines +373 to +374
// `get_row` doesn't support select `_rw_timestamp` yet.
assert!(self.epoch_idx.is_none());

Member

Can we extend the get interface to also return the user key (thus epoch)?

Contributor Author

@chenzl25 chenzl25 Nov 4, 2024

Yes. We do have a plan to add an extended interface (get_with_epoch) to return the epoch from the storage. Let's leave it for a later PR.
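
To make the proposal concrete, here is a toy sketch of a point get that also returns the epoch. It is only an illustration under loud assumptions: `ToyStore`, its `(key, epoch)` layout, and the `get_with_epoch` signature are hypothetical and do not reflect the actual state store interface.

```rust
use std::collections::BTreeMap;

/// Toy versioned store keyed by (user key, epoch). Hypothetical, for illustration only.
struct ToyStore {
    data: BTreeMap<(u64, u64), String>,
}

impl ToyStore {
    /// Plain point get: newest version of `key` visible at `read_epoch`.
    /// The epoch of that version is discarded, like a `get_row`-style call.
    fn get(&self, key: u64, read_epoch: u64) -> Option<&String> {
        self.get_with_epoch(key, read_epoch).map(|(value, _)| value)
    }

    /// Extended point get: same lookup, but the write epoch is returned too.
    /// It scans the key's versions up to `read_epoch` and takes the newest one.
    fn get_with_epoch(&self, key: u64, read_epoch: u64) -> Option<(&String, u64)> {
        self.data
            .range((key, 0)..=(key, read_epoch))
            .next_back()
            .map(|(&(_, epoch), value)| (value, epoch))
    }
}
```

In this sketch the plain `get` is just the extended call with the epoch dropped, which is why an extended interface would remove the need to fall back to an iterator for point gets.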

Comment on lines 580 to 583
if !columns.contains(&rw_timestamp_column) {
    // Add system column `_rw_timestamp` to every table, but notice that this column is never persisted.
    columns.push(rw_timestamp_column);
}

Member

This looks a bit hacky to me. Since it's never persisted, can we use a separate field to avoid confusion with the real persisted columns?

Contributor Author

I don't want to persist this column because we already have many persisted tables in prod. Users won't want to rebuild their tables to use this feature.

Member

Yes, I agree with that. I'm just wondering if it's possible to not have this column in the runtime TableCatalog at all, since all other columns are persisted.

Contributor Author

I admit it is a bit hacky. If you take a look at how LogicalScan constructs the schema, it uses table catalog columns directly. If we want to keep the table catalog the same as before, we need to add complexity to the table scan and change how it represents the schema, and I think TableScan schema construction is already quite complex. So how about adding an additional field to ColumnDesc to distinguish the system column from others?

Contributor

@st1page st1page Nov 4, 2024

> So how about adding an additional field to ColumnDesc to distinguish the system column from others

+1
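
A rough sketch of what such a marker field could look like follows. The `SystemColumn` enum, the `system_column` field name, and the helper functions are assumptions made for illustration; the thread does not specify the final shape.

```rust
/// Hypothetical marker for columns that exist only at runtime.
#[derive(Debug, Clone, PartialEq)]
enum SystemColumn {
    RwTimestamp,
}

/// Simplified column descriptor; the real ColumnDesc has more fields.
#[derive(Debug, Clone, PartialEq)]
struct ColumnDesc {
    name: String,
    /// `None` for ordinary persisted columns.
    system_column: Option<SystemColumn>,
}

impl ColumnDesc {
    fn user(name: &str) -> Self {
        ColumnDesc { name: name.to_string(), system_column: None }
    }

    fn rw_timestamp() -> Self {
        ColumnDesc {
            name: "_rw_timestamp".to_string(),
            system_column: Some(SystemColumn::RwTimestamp),
        }
    }

    fn is_system(&self) -> bool {
        self.system_column.is_some()
    }
}

/// Append the system column when loading a catalog, unless it is already there.
fn load_columns(mut columns: Vec<ColumnDesc>) -> Vec<ColumnDesc> {
    let ts = ColumnDesc::rw_timestamp();
    if !columns.contains(&ts) {
        columns.push(ts);
    }
    columns
}

/// Columns that would be written back: system columns are filtered out.
fn persisted_columns(columns: &[ColumnDesc]) -> Vec<ColumnDesc> {
    columns.iter().filter(|c| !c.is_system()).cloned().collect()
}
```

Because the marker lives on the descriptor itself, code that builds a schema from the catalog columns can stay unchanged, while persistence and DML checks can filter on `is_system()`.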

Comment on lines 426 to 428
if let Some(timer) = timer {
    timer.observe_duration()
}

Contributor

I think the timer should be moved to the outer scope, below the if-else block, to capture the time it takes to get a row, whether from get_row or from the iter interface.
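
The suggested placement could look roughly like the sketch below. The `Timer` type is a hypothetical stand-in built on `std::time::Instant` rather than the metrics timer used in the PR, and `lookup` returns dummy rows so that only the control flow is shown.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a metrics timer.
struct Timer {
    start: Instant,
}

impl Timer {
    fn start() -> Self {
        Timer { start: Instant::now() }
    }

    /// Consume the timer and report how long it has been running.
    fn observe_duration(self) -> Duration {
        self.start.elapsed()
    }
}

/// Dummy point lookup with two code paths, mirroring the if-else in the diff.
fn lookup(select_rw_timestamp: bool, timer: Option<Timer>) -> (Option<u64>, Option<Duration>) {
    let row = if select_rw_timestamp {
        // Iterator-based path (placeholder result).
        Some(2)
    } else {
        // Plain `get_row`-style path (placeholder result).
        Some(1)
    };
    // Observed once, after the if-else, so both paths are measured.
    let elapsed = timer.map(Timer::observe_duration);
    (row, elapsed)
}
```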

Comment on lines +405 to +421
    .batch_chunk_iter_with_pk_bounds(
        epoch.into(),
        &pk_prefix,
        range_bounds,
        false,
        1,
        PrefetchOptions::new(false, false),
    )
    .await?;
pin_mut!(iter);
let chunk = iter.next().await.transpose().map_err(BatchError::from)?;
if let Some(chunk) = chunk {
    let row = chunk.row_at(0).0.to_owned_row();
    Ok(Some(row))
} else {
    Ok(None)
}

Contributor

@kwannoel kwannoel Nov 4, 2024

Refactoring this into a separate method, such as get_row_with_rw_timestamp, and calling it here would be more readable.

Contributor

@st1page st1page left a comment

> Lots of change is caused by planner tests, the core change is about 200+ LOC.

Why _rw_timestamp must appear in the LogicalScan. Can we prune it in the LogicalScan?

@chenzl25
Contributor Author

chenzl25 commented Nov 4, 2024

> > Lots of change is caused by planner tests, the core change is about 200+ LOC.
>
> Why _rw_timestamp must appear in the LogicalScan. Can we prune it in the LogicalScan?

After column pruning, this column will be pruned.

@st1page
Contributor

st1page commented Nov 4, 2024

> > > Lots of change is caused by planner tests, the core change is about 200+ LOC.
> >
> > Why _rw_timestamp must appear in the LogicalScan. Can we prune it in the LogicalScan?
>
> After column pruning, this column will be pruned.

Ohhh SORRY for misreading

Labels
type/feature user-facing-changes Contains changes that are visible to users
Development

Successfully merging this pull request may close these issues.

Add system column rw_timestamp for any tables
4 participants