[ENH] Add BORF transformer #2062
Conversation
Thank you for contributing to
Great, thanks for that.
Hi @fspinna, sorry for the delay on this; we will definitely review this week. I'm afraid there are a couple of conflicts, which should be easy to resolve.
Thanks again, but my, this is huge! A couple of questions:
- You use the sparse package. Elsewhere we use from scipy.sparse import csr_matrix. scipy is a core dependency, so it would be preferable to use it if it is fit for purpose.
- It would also be good to have a constructor option to return the whole sparse array rather than the compressed version
- There are two arguments for n_jobs:
      n_jobs=self.n_jobs,
      n_jobs_numba=self.n_jobs_numba,
  How come?
- You have reimplemented SAX. We have a version of SAX already; would it be possible to use that? If not, see 5 below.
- There are a lot of file-specific public functions which will bloat the docs; could you make them private? Things like are_window_size_and_dilation_compatible_with_signal_length. Same with all the small sklearn transformers, better off private imo.
- You have implemented a hash table?
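For context on the first question, here is a minimal sketch of what a scipy-based sparse output could look like. Note that scipy.sparse matrices are two-dimensional, while the pydata sparse package supports n-dimensional arrays, which may be why the PR uses it; the example below assumes the counts fit in two dimensions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Bag-of-patterns style output is naturally a sparse count matrix:
# rows are time series, columns are ids of discretized "words".
counts = np.array([[0, 3, 0, 1],
                   [2, 0, 0, 0]])
X_sparse = csr_matrix(counts)

print(X_sparse.nnz)           # number of stored nonzeros
X_dense = X_sparse.toarray()  # densify back to a numpy array when needed
```

Since csr_matrix is accepted by most sklearn estimators, this representation would not block downstream pipelines.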
Thank you for taking the time to review the PR. I know it's a lot of code. One of the contributions of the paper was to build a very fast transform, so I had to re-implement everything in numba and sacrifice modularity for efficiency.
The sparse package is needed in this case, because the output of
You mean an option to densify the sparse array to numpy?
One of the contributions of the paper was a better/faster way of computing window-wise SAX. Also, this version has dilation, stride, and other parameters that are not usually part of SAX, so I don't believe I can use your implementation.
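To illustrate what windowing with dilation and stride means here, a small sketch of dilated sliding-window extraction (function and parameter names are illustrative, not the PR's API):

```python
import numpy as np

def dilated_windows(series, window=4, dilation=1, stride=1):
    """Extract subsequences where each window takes `window` points
    spaced `dilation` apart, with window starts advancing by `stride`."""
    series = np.asarray(series)
    span = (window - 1) * dilation + 1  # signal length one window covers
    starts = range(0, len(series) - span + 1, stride)
    return np.array([series[s:s + span:dilation] for s in starts])

# With dilation=2, each window skips every other point.
print(dilated_windows(np.arange(8), window=3, dilation=2))
```

With dilation=1 this reduces to the ordinary sliding window; larger dilations let a fixed-length word cover a wider receptive field.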
I'll do that. So should I make all the functions besides the
I modified this implementation https://github.com/lhprojects/blog/blob/master/_posts/JupyterNotebooks/HashUnique.ipynb a little to make it fully compatible with numba, as numba dicts do not work well in my experience.
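The idea in the linked notebook, an open-addressing hash table with linear probing for collecting unique values, can be sketched in plain Python (the PR's version is numba-compiled; this toy assumes non-negative integer inputs, since -1 marks empty slots):

```python
import numpy as np

def hash_unique(values):
    """Collect unique values via an open-addressing hash table.
    Pure-Python sketch of the notebook's approach; assumes values >= 0."""
    n = len(values)
    table_size = 1
    while table_size < 2 * max(n, 1):  # keep the load factor below 0.5
        table_size *= 2
    table = np.full(table_size, -1, dtype=np.int64)  # -1 marks empty slots
    mask = table_size - 1  # power-of-two size: cheap modulo via bitmask
    uniques = []
    for v in values:
        slot = hash(int(v)) & mask
        while table[slot] != -1 and table[slot] != v:
            slot = (slot + 1) & mask  # linear probing on collision
        if table[slot] == -1:
            table[slot] = v
            uniques.append(int(v))
    return np.array(uniques)

print(hash_unique(np.array([3, 1, 3, 2, 1])))  # first-seen order
```

Keeping the table a flat int64 array (rather than a dict) is what makes this pattern straightforward to compile with numba's nopython mode.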
It might be a bit ambitious for this PR, but it would be wonderful to include a fast sliding-window SAX transform with dilation if possible.
Do we need a lower bound on the sparse package?
Re SAX, we could investigate using this version as our standard and extending it as you say @patrickzib, but not in this PR :) It would be good to make an issue on improving SAX.
Also, it's not a perfect replacement for standard SAX, as in my case the window size has to be divisible by the word length.
I like the idea of having a speed-optimised constrained version and a more general one, but it's a project in its own right really. This looks ready to go in to me, thanks for the contribution. There are a couple of minor conflicts to resolve caused by changes elsewhere, sorry about that.
Could we add your paper to this? https://www.aeon-toolkit.org/en/stable/papers_using_aeon.html Even better, can you add your paper to this yourself (in a separate PR)? :)
I am using the latest (0.15.4)
Sure!
Thank you! No problem, let me know if there is anything else to do!
Hi @fspinna, if you resolve the conflicts I think this is ready to go in.
Fixed unused comments
# Conflicts:
#   .all-contributorsrc
#   docs/api_reference/transformations.rst
#   pyproject.toml
@TonyBagnall I fixed a few things (tags, doctest) but I am still failing on a few OSes. Not really sure what the issue is here.
Hi @fspinna, not too up to date with this PR, but it's only failing on some workflows because we split the tests up when running on PRs to save time. I can take a look at what's failing at some point if no one else finds the time.
Alternatively, adding it temporarily to the skipped-test list is also an option, and revisiting in another PR.
Ah ok, I was wondering why I was not failing all of them actually!
So I probably should use the tag "cant_pickle" for this one. *edit: also this one depends on the same issue:
So, the "non_deterministic" tag is also needed. This one puzzles me as I don't get it locally:
I believe it's a problem with some sklearn estimators I created. I'll try to look into it.
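For reference, aeon-style estimators typically declare behaviours like these through a class-level tag dictionary; a hedged sketch, where only the two tag names come from the discussion above and the class is an illustrative stand-in, not the PR's actual code:

```python
# Illustrative stand-in class, not the PR's BORF transformer.
class BORFDemo:
    _tags = {
        "cant_pickle": True,        # numba-jitted internals may not pickle
        "non_deterministic": True,  # output can vary across runs/threads
    }

print(BORFDemo._tags)
```

The test framework can then read these tags and skip, for example, pickling round-trip checks for estimators that declare they cannot support them.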
…ith csr matrices, add n_feature_names_in_ and n_features_in_ attribute for sklearn compatibility
…RF for sklearn compatibility
LGTM, thanks for the contribution
What does this implement/fix? Explain your changes.
I am trying to add a Bag-Of-Receptive-Fields transformer as seen in
Spinnato, Francesco, et al. "A Bag of Receptive Fields for Time Series Extrinsic Predictions." arXiv preprint arXiv:2311.18029 (2023). It is basically a Bag-of-Patterns transform, where a time series is transformed into a sparse matrix of counts of discretized subsequences.

Does your contribution introduce a new dependency? If yes, which one?
This contribution introduces a dependency to the pydata sparse library (https://sparse.pydata.org/en/stable/).
Any other comments?
I have a couple of issues/doubts:
This is because the output of the transformer is a sparse scipy matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix). I don't know if this is a problem. In practice, the sparse matrix is compatible with downstream sklearn estimators/transformers.
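As a toy illustration of this bag-of-patterns idea (not the paper's BORF algorithm; the windowing and binning choices below are made up for the example), counting discretized subsequences into a scipy csr matrix might look like:

```python
import numpy as np
from scipy.sparse import csr_matrix

def bag_of_patterns(X, window=4, n_bins=3):
    """Toy bag-of-patterns sketch: discretize each sliding-window
    subsequence into a symbolic 'word' and count word occurrences
    per series into a sparse csr count matrix."""
    vocab, rows, cols, data = {}, [], [], []
    for i, series in enumerate(X):
        counts = {}
        for start in range(len(series) - window + 1):
            sub = np.asarray(series[start:start + window], dtype=float)
            sub = sub - sub.mean()  # mean-centre the subsequence
            # equal-width binning into n_bins symbols -> a hashable word
            edges = np.linspace(sub.min(), sub.max() + 1e-9, n_bins + 1)
            word = tuple(np.digitize(sub, edges[1:-1]))
            counts[word] = counts.get(word, 0) + 1
        for word, c in counts.items():
            j = vocab.setdefault(word, len(vocab))  # grow the vocabulary
            rows.append(i)
            cols.append(j)
            data.append(c)
    return csr_matrix((data, (rows, cols)), shape=(len(X), len(vocab)))

X = [[0, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1, 0]]
B = bag_of_patterns(X, window=4, n_bins=3)
```

Because the result is a standard csr_matrix, it can be passed directly to most downstream sklearn estimators and transformers, which is the compatibility point made above.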
IndividualBORF, which inherits from sklearn's BaseEstimator and TransformerMixin, and not BaseCollectionTransformer. Is this a problem?

PR checklist
For all contributions
For new estimators and functions
__maintainer__
at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.