Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loess 0.6 fails (never completes) while Loess 0.5.4 completes in a few seconds #73

Open
palday opened this issue Aug 8, 2023 · 1 comment

Comments

@palday
Copy link
Member

palday commented Aug 8, 2023

Data for MWE (compressed because GitHub doesn't like .arrow as attachments)

loess.arrow.zip

using Arrow
using Loess

tbl = Arrow.Table("loess.arrow")
# runs quickly on 0.5.4; never completes on 0.6.1
@time loess(tbl.x, tbl.y)

cc @dmbates

@palday palday changed the title Loess 0.6 fails (never completes) while Loess 0.5.4 completes in a few second Loess 0.6 fails (never completes) while Loess 0.5.4 completes in a few seconds Aug 8, 2023
@andreasnoack
Copy link
Member

andreasnoack commented Aug 13, 2023

I've taken a closer look at this. There seems to be two things going on here. The first thing is that the old version used a different rule for deciding on splits when building the KD tree. The rule was different from the one in the Loess papers. The old approach just used the median even when it wasn't unique. The papers search for the index where a change happens. It took some work to figure out exactly how they did it so I wrote a comment. However, this is a linear search for where the x value changes and there are a lot of ties in your data. Specifically, it searches forward and backwards for x == 8 and

julia> countmap(tbl.x)[8]
431154

and it has to do a bit of sorting for each iteration. The second issue is that the partial sort seems to be slower than it should and if I add a function barrier then it's much faster so it seem that the closure is causing some overhead, but even after that change, building the tree for your dataset is prohibitively slow. However, I tried the loess in R which is based on the original Fortran version and it also seems to be really slow for this dataset so I think it's a consequence of the algorithm. Maybe it is possible use a bisection strategy instead of the linear search. I guess the assumption in the original paper was that the number of ties would be small such that the linear search would be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants