Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exp-weighted functions to DuckDB #5139

Closed
mauropagano opened this issue Oct 30, 2022 · 13 comments
Closed

Add exp-weighted functions to DuckDB #5139

mauropagano opened this issue Oct 30, 2022 · 13 comments
Labels

Comments

@mauropagano
Copy link
Contributor

Exponentially weighted functions are used frequently in the finance world but they are tricky to implement in SQL.
In the pandas they are ewm().<<function>> (sum, mean, stdev, var) and they are implemented in cython

>>> import pandas as pd
>>> df = pd.DataFrame(dict(a=[1,2,3,4]))
>>> df.assign(s=df.a.ewm(alpha=0.5).sum(), m=df.a.ewm(alpha=0.5).mean())
   a      s         m
0  1  1.000  1.000000
1  2  2.500  1.666667
2  3  4.250  2.428571
3  4  6.125  3.266667

The logic to implement them is not too complicated but it requires carrying state across rows.
A user could implement them using with recursive but that's slow or using a UDAF but that isn't a thing in DuckDB AFAIK.

I asked on Discord if people felt they would be useful to have them in DuckDB and Alex seemed supportive (not sure why tagging him doesn't work), filing this issue as per @hawkfish suggestion.

Is that something you feel should go into DuckDB core (asking here as per Contributing Guidelines)? My biased sense is yes, not many others are offering these functionalities and it seems kind of low-hanging fruit.
Do you feel this is within the level of complexity an outsider can implement? I'll for sure need some help, I only contributed #4259

If yes, I'd volunteer to do the work to add sum_w and mean_w (name tbd), starting with exposing only alpha parameter.
Most other parameters are just ways to implement the same thing, adjust should probably be True most of the times anyway and ignore_na is something a user can deal with using filter conditions.
Also std and var could be in another PR.

PS: I didn't use the standard template as this is not a bug, happy to switch

@hawkfish
Copy link
Contributor

hawkfish commented Nov 3, 2022

The pandas documentation is here.

This is a right royal pain to do in straight SQL - you have to compute the weights using RANK() and then use them twice in another window - with the same OVER clause - which we won't optimise right now (dependencies between window functions are not supported) so we will resort.

To implement this as a window function using segment trees, you would need the RANK value, which would require passing in the partition mask and somehow hooking that into the aggregation infrastructure. Either that or provide a separate holistic window function that computes it on the fly (not even sure that would work).

I think this will have to wait until we get a chance to refactor the segment trees to be on a function-by-function basis, with a common default for most. This would also facilitate the new merge sort trees approach that Adrian presented at SIGMOD 2022.

@mauropagano
Copy link
Contributor Author

Thanks for taking a look at it Richard!

I'm not sure I follow how the double RANK() solves it though.
For adjust=True the exponent of (1 - alpha) for any row needs to change (increase, lowering the weight) the more you walk down the rows in the window, I'm struggling to see how the double pass enables that.

My naive idea was to have a dedicated aggregate function that carries state across rows.
When used as straight aggregation it would the final row value and when used as window it would report the so-far computed value.
Once is a loop the scaling becomes trivial (e.g. for sum value(x) = (value(x-1) * (1-alpha)) + x).
Same approach already works with with recursive but that's quite slow.

@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 29, 2023
@mauropagano
Copy link
Contributor Author

Unless you guys feel there is a good reason to close this, I'd keep it open.

There is currently still no way to do this efficiently in duckdb, UDAF might be an option when it comes available and assuming performance is good there.

Just to provide a bit more justification to keep this open, the same functionalities exists / has been requested in polars, spark, cudf, dask

@github-actions github-actions bot removed the stale label Aug 1, 2023
@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Oct 31, 2023
@mauropagano
Copy link
Contributor Author

I can't remove the tag myself so I guess I'll comment

@jakkes
Copy link

jakkes commented Jan 8, 2024

Definitely waiting for this feature..!

Copy link

github-actions bot commented Apr 8, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Apr 8, 2024
@jakkes
Copy link

jakkes commented Apr 8, 2024

Bump

@hawkfish
Copy link
Contributor

hawkfish commented Apr 8, 2024

This is a feature, not a bug, so please convert this to a discussion if you want it kept alive. Thx.

@mauropagano
Copy link
Contributor Author

I filed this one, how do I convert it to a discussion?

@hawkfish
Copy link
Contributor

I filed this one, how do I convert it to a discussion?

I'm not sure! I did find this document but maybe you need to do it?

@mauropagano
Copy link
Contributor Author

I can't seem to find the button either (it's not on the right side like the doc suggests, probably I don't have the right privs), is it ok if I keep this one open for now or anybody with the right privilege can move it to discussion if needed?

@carlopi carlopi converted this issue into discussion #12591 Jun 18, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
Projects
None yet
Development

No branches or pull requests

5 participants