Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignore nulls in over window function #17601

Open
lmatz opened this issue Jul 8, 2024 · 5 comments · May be fixed by #19847
Open

ignore nulls in over window function #17601

lmatz opened this issue Jul 8, 2024 · 5 comments · May be fixed by #19847
Assignees

Comments

@lmatz
Copy link
Contributor

lmatz commented Jul 8, 2024

Is your feature request related to a problem? Please describe.

For example, LAST_VALUE (order_id IGNORE NULLS) is supported in popular DBs such as MySQL, DuckDB, Redshift, Snowflake and Starrocks.

It has appeared in four users' use cases that ignore nulls is a must for expressing their workload. Otherwise, not only do they need to impose awkward limitations on the application semantics they want to convey, but they also have much worse performance.

The use case is similar to the one described in the question: https://stackoverflow.com/questions/37470931/how-to-ignore-nulls-in-postgresql-window-functions-or-return-the-next-non-null:

I need another column indicating the next non-null COL1 value for each row

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@github-actions github-actions bot added this to the release-1.10 milestone Jul 8, 2024
@lmatz lmatz removed this from the release-1.10 milestone Jul 10, 2024
@lmatz
Copy link
Contributor Author

lmatz commented Jul 11, 2024

Just document an example of finding the closet index event of the each trade event.

For each row in the table merge, it is either a index event or trade event. One of them must be NULL and the other one must be non-NULL.

CREATE TABLE merge (
    index varchar,
    trades varchar,
    event_time INT
);

insert into merge values ('a', NULL, 1);
insert into merge values ('b', NULL, 5);
insert into merge values ('c', NULL, 10);
insert into merge values ('d', NULL, 20);

insert into merge values (NULL, 'X', 2);
insert into merge values (NULL, 'Y', 8);
insert into merge values (NULL, 'W', 10);
insert into merge values (NULL, 'Z', 20);


select * from merge;
           Result
--------------------------
index,trades,event_time
null,X,2
null,Y,8
null,W,10
null,Z,20
a,null,1
b,null,5
c,null,10
d,null,20

Then we construct the following query:

select t2.index, t2.trades, t2.event_time, t2.after, t2.before, t2.after_time, merge.event_time as before_time FROM
(
select t1.index, t1.trades, t1.event_time, t1.after, t1.before, merge.event_time as after_time from
(
    SELECT
    index,
    trades,
    event_time,
    first_value(index ignore nulls) over (order by event_time rows between CURRENT ROW AND unbounded following) as after,
    last_value(index ignore nulls) over (order by event_time rows between unbounded preceding AND CURRENT ROW) as before
    from merge
) t1
inner join 
merge 
on
t1.after = merge.index
) t2
inner JOIN
merge
on
t2.before = merge.index
where t2.index is NULL

            Result
-------------------------
index,trades,event_time,after,before,after_time,before_time
null,X,2,b,a,5,1
null,Y,8,c,b,10,5
null,W,10,c,b,10,5
null,Z,20,d,c,20,10

Copy link
Contributor

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@lmatz
Copy link
Contributor Author

lmatz commented Oct 18, 2024

Since now we have supported ASOF Join, ignore nulls can be paused if there is no real use case.

@stdrc
Copy link
Member

stdrc commented Nov 29, 2024

Recently a user asked about the following use case:

create table raw_data (
    ts timestamp,
    foo int,
    bar int
);
insert into raw_data values (now(), null, 1);
insert into raw_data values (now(), null, 10);
insert into raw_data values (now(), 7, null);
insert into raw_data values (now(), null, 8);

create materialized view mv1 as
select
    ts,
    last_value(foo ignore nulls) over (order by ts) as foo,
    last_value(bar ignore nulls) over (order by ts) as bar
from raw_data;

create materialized view mv2 as
select
    ts,
    last_value(foo) filter (where foo is not null) over (order by ts) as foo,
    last_value(bar) filter (where bar is not null) over (order by ts) as bar
from raw_data;

-- Desired output:
ts  |  foo  |  bar
--------------------
... |  null |  1
... |  null |  10
... |  7    |  10
... |  7    |  8

Each new row inserted into raw_data only updates one data column, with null on other columns. The two MVs, which are equivalent, are what the user want. However we don't support neither of the two syntaxes yet.

But I found an interesting workaround using SINK INTO TABLE and CHANGELOG, for this specific case:

create table raw_data (
    ts timestamp,
    foo int,
    bar int
) append only;

create table latest (
    id int,
    ts timestamp,
    foo int,
    bar int,
    primary key (id)
) on conflict do update if not null with version column (ts);

create sink s into latest as
select 1 as id, ts, foo, bar from t;

create materialized view changes as
with cl as changelog from latest
select ts, foo, bar from cl where id = 1 and changelog_op = 1 or changelog_op = 3;

The changes MV will have what the user desired.

@tabVersion
Copy link
Contributor

Recently a user asked about the following use case:

create table raw_data (
    ts timestamp,
    foo int,
    bar int
);
insert into raw_data values (now(), null, 1);
insert into raw_data values (now(), null, 10);
insert into raw_data values (now(), 7, null);
insert into raw_data values (now(), null, 8);

create materialized view mv1 as
select
    ts,
    last_value(foo ignore nulls) over (order by ts) as foo,
    last_value(bar ignore nulls) over (order by ts) as bar
from raw_data;

create materialized view mv2 as
select
    ts,
    last_value(foo) filter (where foo is not null) over (order by ts) as foo,
    last_value(bar) filter (where bar is not null) over (order by ts) as bar
from raw_data;

-- Desired output:
ts  |  foo  |  bar
--------------------
... |  null |  1
... |  null |  10
... |  7    |  10
... |  7    |  8

Each new row inserted into raw_data only updates one data column, with null on other columns. The two MVs, which are equivalent, are what the user want. However we don't support neither of the two syntaxes yet.

But I found an interesting workaround using SINK INTO TABLE and CHANGELOG, for this specific case:

create table raw_data (
    ts timestamp,
    foo int,
    bar int
) append only;

create table latest (
    id int,
    ts timestamp,
    foo int,
    bar int,
    primary key (id)
) on conflict do update if not null with version column (ts);

create sink s into latest as
select 1 as id, ts, foo, bar from t;

create materialized view changes as
with cl as changelog from latest
select ts, foo, bar from cl where id = 1 and changelog_op = 1 or changelog_op = 3;

The changes MV will have what the user desired.

worth a blog 🤣 #HackOnRisingWave

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants