[FEA] need official EWM function #1263
Comments
@yidong72 @randerzander @beckernick What are the specific aggregations needed to implement this on top of the new rolling window functionality? |
Hello, is there any plan to merge this or another implementation of the EWM function? Or is there a temporary fix that I could use for now? |
@kkraus14 @yidong72 @beckernick need help understanding what is needed from libcudf. |
We haven't scoped this function as of yet from the cuDF Python side so we can't guide libcudf as of yet. I don't think this is currently a high priority for us. |
@harrism. The implementation I have just adds a weight term to the time series items in the rolling window function, so it should be a straightforward implementation on top of the rolling window function. |
We have seen some folks at FSI who are interested in an official EWM function. Check this issue we got from the gQuant project. I can fix it in a hacky way, but it would be nice to have official support from cudf. Please increase the priority of this issue. |
In pandas, EWM provides various exponential weighted functions including mean, variance, standard deviation, and more. I'm going to update the issue to include a task-list of the various functions. Exponential weighted mean is the canonical usage, which makes it a good starting point for the next release. |
I've scoped this out and there are a couple of design caveats I would like to discuss before proceeding with an implementation. TL;DR: I am not sure how to do this in a way that is actually performant.

This function in pandas behaves more like a single large window over the entire data than a rolling window function as normally envisioned. That is to say that by default, each element of the result is the weighted average of itself and all of the elements that come before it in the sequence. The formula for a single result is quite clear from the pandas documentation (default `adjust=True`):

$$y_t = \frac{x_t + (1-\alpha)\,x_{t-1} + (1-\alpha)^2\,x_{t-2} + \cdots + (1-\alpha)^t\,x_0}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots + (1-\alpha)^t}$$

There are really two straightforward ways of computing this sequence and neither of them seems to help us very much.
In the case where we really are using this within a window function, this problem goes away, as long as the window size is small relative to the data size (each thread applies the above sequential algorithm for its window). We could thus implement this on top of rolling technically, but we can't just wrap that functionality with
It seems like what is needed here is a truly parallel algorithm that properly balances the work each computing element is doing across the moving average calculation. |
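For reference, here is a minimal sketch of two serial ways to compute the sequence, assuming the pandas `adjust=True` weighting with decay factor `1 - alpha`. The function names are hypothetical and are not part of pandas or cuDF. The first evaluates the weighted average from scratch at every position; the second carries a running numerator and denominator forward, so each output depends on the previous one.

```python
def ewm_mean_direct(xs, alpha):
    """O(n^2): re-sum the weighted numerator and denominator for every position."""
    beta = 1.0 - alpha
    out = []
    for t in range(len(xs)):
        num = sum(beta ** i * xs[t - i] for i in range(t + 1))
        den = sum(beta ** i for i in range(t + 1))
        out.append(num / den)
    return out


def ewm_mean_recurrence(xs, alpha):
    """O(n) but inherently serial: each step reuses the previous sums."""
    beta = 1.0 - alpha
    out, num, den = [], 0.0, 0.0
    for x in xs:
        num = beta * num + x      # running weighted numerator
        den = beta * den + 1.0    # running sum of weights
        out.append(num / den)
    return out
```

Both functions should agree up to floating-point rounding. The second is cheap on a single thread, but its loop-carried dependency is what makes it awkward to spread across many GPU threads.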
This can be computed efficiently in parallel using two scans (one for the numerator, one for the denominator) and a binop (divide). |
Unless I am misunderstanding, that works for getting any single one of the datapoints we need, but not the entire sequence. Each element of the result is the quotient of two sums, but those sums run over different sequences for each element in question. Consider the first few denominators, writing $\beta = 1 - \alpha$ for the decay factor (with $\alpha$ the pandas smoothing factor):

$$d_0 = 1, \qquad d_1 = 1 + \beta, \qquad d_2 = 1 + \beta + \beta^2$$

In general

$$d_t = \sum_{i=0}^{t} \beta^i$$

Meaning each successive term is related to the last by

$$d_t = \beta\, d_{t-1} + 1$$

Which makes for an efficient serial algorithm for computing these terms without having to actually sum over an entirely new set of numbers. Unfortunately this doesn't seem to help us towards a thrust implementation, because if we were trying to do an inclusive scan, we'd have this as our binary_op:

def f(d_previous, d_this):
    return (beta * d_previous) + 1

beta = 0.1
f(1, f(2,3))
# 1.1
f(f(1,2), 3)
# 1.11

I believe this breaks the associativity needed for an inclusive scan. |
Is this the naive implementation, or is this totally wrong?
|
Solving recurrence equations is covered in Guy Blelloch's classic paper "Prefix Sums and Their Applications". http://www.cs.cmu.edu/~blelloch/papers/Ble93.pdf (Section 1.4) The trick is to maintain the intermediates as pairs, rather than as individual values. Let
Test input demonstrates associativity:
To get the numerator out of the scan, after performing the inclusive scan, just extract all the second elements of the pairs. Intuitively, we are propagating the product of the This paper is required reading, IMO. You will see scans everywhere once you start seeing them. :) (Note, implementation with Thrust is pretty simple -- just use a zip iterator with a constant iterator (1-alpha) and the input iterator and use a lambda that returns the modified pair as in the Python |
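The pair trick described above can be sketched in plain Python as follows. This is an illustrative reconstruction under stated assumptions, not the code from the original comment: `combine` and `ewm_mean_scan` are hypothetical names, each pair `(a, b)` is taken to represent the update `s -> a*s + b`, and `itertools.accumulate` stands in for a parallel inclusive scan. It runs serially here, but because the operator is associative a parallel scan could apply it in any grouping.

```python
from itertools import accumulate


def combine(left, right):
    """Compose two updates s -> a*s + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, b1 * a2 + b2)


def ewm_mean_scan(xs, alpha):
    """Adjusted EWM mean from two inclusive scans and an elementwise divide."""
    beta = 1.0 - alpha
    # Numerator: scan over (beta, x_t) pairs, keep the second element of each result.
    num = [b for _, b in accumulate(((beta, x) for x in xs), combine)]
    # Denominator: same scan with every x_t replaced by 1.
    den = [b for _, b in accumulate(((beta, 1.0) for _ in xs), combine)]
    return [n / d for n, d in zip(num, den)]


# Associativity check on integer pairs: both groupings give the same pair.
p, q, r = (2, 3), (5, 7), (11, 13)
assert combine(combine(p, q), r) == combine(p, combine(q, r))
```

The first element of each pair accumulates the product of the decay factors seen so far, which is what lets a partial result from one block of the input be rescaled correctly when it is merged with the next block.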
I stumbled on this paper this morning while Googling "Prefix sums recursion relations" after a few of us met to discuss this problem yesterday. It's so elegant how separating the current power of the prefactor makes the recursion operator associative! Thanks for pointing us in the right direction. |
Thanks @harrism, this works perfectly using thrust in my experiments. It's a little hard for me to tell if this really belongs as a rolling aggregation. Should that still be the plan, or is there a more appropriate place for this to live inside of libcudf? |
My pleasure. I don't know the answer to your question. Is it different from a rolling aggregation in some way? Does it have finite window extents, or does every element depend on all preceding elements over the entire series? CC @jrhemstad |
The particular pandas API is the version where every element depends on all the previous ones. pandas does support a windowed version of this via a different API. But I am not sure our version, were we ever to support it, would need to actually parallelize within the windows. At least for window sizes that are small relative to the data, the normal recurrence relation might perform fine on its own within the windows. |
For the one where every element depends on all previous ones, it may be best to add this as an operator to our existing scan API. The windowed version sounds like rolling. Or could be done as an operator to the segmented scan API. |
Is it supported in 23.02? |
Hi @Haidow , |
@brandon-b-miller Yes plz! |
Thanks @brandon-b-miller, any update on this? |
#9027 adds the |
Congrats on finally merging this @brandon-b-miller! Scan FTW! |
Is your feature request related to a problem? Please describe.
EWM is a very popular method used in time series analysis, especially in the domain of FSI. cuIndicator uses EWM heavily to compute the technical indicators. It would be good to have official support in cuDF.
Describe the solution you'd like
DataFrame.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0). The same interface as the Pandas EWM function
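In the pandas interface, `com`, `span`, `halflife`, and `alpha` are alternative ways of specifying the same smoothing factor. A minimal sketch of the conversions given in the pandas documentation follows; the helper name `alpha_from` is hypothetical and is not part of pandas or cuDF, and it omits the range checks pandas applies to each parameter.

```python
import math


def alpha_from(com=None, span=None, halflife=None, alpha=None):
    """Return the smoothing factor alpha from exactly one decay parameter."""
    given = [v for v in (com, span, halflife, alpha) if v is not None]
    if len(given) != 1:
        raise ValueError("specify exactly one of com, span, halflife, alpha")
    if com is not None:
        return 1.0 / (1.0 + com)           # center of mass
    if span is not None:
        return 2.0 / (span + 1.0)          # span
    if halflife is not None:
        return 1.0 - math.exp(-math.log(2.0) / halflife)  # half-life
    return alpha
```

For example, `span=3`, `com=1`, and `halflife=1` all correspond to a smoothing factor of 0.5.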
Describe alternatives you've considered
cuIndicator has an implementation that is based on rolling window methods: cuIndicator EWM.
Additional context
EWM can be implemented with a prefix-sum method if we weight the past carefully. I have an example implementation for it.