
[ENH] Add BORF transformer #2062

Merged
merged 25 commits into from
Oct 21, 2024

Conversation

fspinna
Contributor

@fspinna fspinna commented Sep 16, 2024

What does this implement/fix? Explain your changes.

I am trying to add a Bag-Of-Receptive-Fields transformer as described in Spinnato, Francesco, et al. "A Bag of Receptive Fields for Time Series Extrinsic Predictions." arXiv preprint arXiv:2311.18029 (2023). It is basically a Bag-Of-Patterns transform, where a time series is transformed into a sparse matrix of counts of discretized subsequences.
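For intuition, a bag-of-patterns style count matrix can be sketched in a few lines. This is a toy illustration with made-up window/alphabet choices, not the BORF algorithm itself:

```python
import numpy as np
from scipy.sparse import csr_matrix

def toy_bag_of_patterns(X, window=8):
    """Toy bag-of-patterns: slide a window over each series, discretize
    each window into a symbolic word, and count word occurrences.
    Illustrative only, not the BORF algorithm."""
    vocab = {}
    rows, cols, data = [], [], []
    for i, x in enumerate(X):
        counts = {}
        for start in range(len(x) - window + 1):
            w = x[start:start + window]
            w = (w - w.mean()) / (w.std() + 1e-8)             # z-normalize the window
            word = tuple(np.digitize(w, [-0.67, 0.0, 0.67]))  # 4-symbol alphabet
            counts[word] = counts.get(word, 0) + 1
        for word, c in counts.items():
            j = vocab.setdefault(word, len(vocab))
            rows.append(i); cols.append(j); data.append(c)
    return csr_matrix((data, (rows, cols)), shape=(len(X), len(vocab)))

X = [np.sin(np.linspace(0, 6, 50)), np.cos(np.linspace(0, 6, 50))]
bop = toy_bag_of_patterns(X)
print(bop.shape)  # (2, vocabulary size)
```

Each row is a time series, each column a discretized subsequence ("word"), and each entry a count; BORF generalizes this idea with receptive-field parameters such as dilation and stride.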

Does your contribution introduce a new dependency? If yes, which one?

This contribution introduces a dependency on the pydata sparse library (https://sparse.pydata.org/en/stable/).

Any other comments?

I have a couple of issues/doubts:

  1. the transformer fails 2 tests:
'test_persistence_via_pickle(estimator=BORF(),datatype=EqualLengthUnivariate-Classification-numpy3D)': "FAILED: 'csr_matrix' object has no attribute 'ravel'",
'check_fit_deterministic(estimator=BORF(),datatype=EqualLengthUnivariate-Classification-numpy3D)': "FAILED: 'csr_matrix' object has no attribute 'ravel'"

This is because the output of the transformer is a sparse scipy matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix). I don't know if this is a problem. In practice, the sparse matrix is compatible with downstream sklearn estimators/transformers.
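The failures come from test code calling `.ravel()`, which ndarrays have but scipy sparse matrices do not; downstream sklearn estimators, on the other hand, accept sparse input directly. A minimal illustration with toy data (not the PR's actual tests):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

X = csr_matrix(np.array([[0, 1, 0], [2, 0, 0], [0, 0, 3], [1, 1, 0]]))
y = np.array([0, 1, 1, 0])

# sklearn estimators consume scipy sparse input directly...
clf = LogisticRegression().fit(X, y)

# ...but csr_matrix is not a drop-in ndarray replacement:
print(hasattr(X, "ravel"))   # False -- the source of the test failures above
dense = X.toarray().ravel()  # an explicit densify round-trip where needed
```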

  2. the sparse package is used as a soft dependency, which is imported in the transform of IndividualBORF; that class inherits from sklearn's BaseEstimator and TransformerMixin, not from BaseCollectionTransformer. Is this a problem?

PR checklist

For all contributions
  • I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you.
  • The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • I've added the estimator to the online API documentation.
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.

@aeon-actions-bot aeon-actions-bot bot added enhancement New feature, improvement request or other non-bug code enhancement transformations Transformations package labels Sep 16, 2024
@aeon-actions-bot
Contributor

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ $\color{#FEF1BE}{\textsf{enhancement}}$ ].
I have added the following labels to this PR based on the changes made: [ $\color{#41A8F6}{\textsf{transformations}}$ ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Push an empty commit to re-run CI checks

@TonyBagnall
Contributor

great, thanks for that

@fspinna fspinna marked this pull request as ready for review October 1, 2024 10:01
@TonyBagnall
Contributor

hi @fspinna, sorry for the delay on this; we will definitely review this week. I'm afraid there are a couple of conflicts, which should be easy to resolve.

Contributor

@TonyBagnall TonyBagnall left a comment


thanks again, but my, this is huge! A couple of questions:

  1. You use the sparse package. Elsewhere we use from scipy.sparse import csr_matrix. scipy is a core dependency, so it would be preferable to use this if it is fit for purpose.
  2. It would also be good to have a constructor option to return the whole sparse array rather than the compressed version
  3. there are two arguments for n_jobs
    n_jobs=self.n_jobs,
    n_jobs_numba=self.n_jobs_numba,
    how come?
  4. You have reimplemented SAX. We have a version of SAX already, would it be possible to use that? If not, see 5 below.
  5. There are a lot of specific public functions which will bloat the docs; could you make the file-specific functions private? Things like are_window_size_and_dilation_compatible_with_signal_length. Same with all the small sklearn transformers; better off private imo.
  6. You have implemented a hash table?

@fspinna
Contributor Author

fspinna commented Oct 15, 2024

thanks again, but my, this is huge! A couple of questions:

Thank you for taking the time to review the PR. I know it's a lot of code. One of the contributions of the paper was to build a very fast transform, so I had to re-implement everything in numba and sacrifice modularity for efficiency.

  1. You use the sparse package. Elsewhere we use from scipy.sparse import csr_matrix. scipy is a core dependency, so it would be preferable to use this if it is fit for purpose.

Sparse in this case is needed, because the output of transform_sax_patterns is a 3D tensor, which is not supported by scipy.
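To see why, here is a toy flattening of a 3D count tensor into the 2D matrix scipy can represent; pydata sparse's `COO` does this N-dimensional bookkeeping in one object instead. The axis names and shapes are illustrative, not BORF's actual layout:

```python
import numpy as np
from scipy.sparse import csr_matrix

# scipy.sparse only supports 2D, so a 3D tensor of counts needs the
# pydata `sparse` package -- or a manual flattening like this toy
# COO-style representation:
shape = (2, 5, 4)                  # hypothetical (series, pattern, position)
coords = np.array([[0, 1, 1],      # series index
                   [0, 2, 4],      # pattern index
                   [3, 0, 1]])     # position index
data = np.array([5, 2, 7])

# Collapse the last two axes to get a 2D matrix sklearn can consume,
# mimicking what an N-D sparse reshape would do in one call:
flat_cols = coords[1] * shape[2] + coords[2]
mat = csr_matrix((data, (coords[0], flat_cols)),
                 shape=(shape[0], shape[1] * shape[2]))
print(mat.shape)  # (2, 20)
```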

  2. It would also be good to have a constructor option to return the whole sparse array rather than the compressed version

You mean an option to densify the sparse array to numpy?

  3. there are two arguments for n_jobs
    n_jobs=self.n_jobs,
    n_jobs_numba=self.n_jobs_numba,
    how come?

n_jobs is passed to the sklearn FeatureUnion; n_jobs_numba is passed to IndividualBORF. They are two separate ways of parallelizing the code, and combinations of the two are possible. Maybe a different name for n_jobs_numba would be better? Setting n_jobs_numba=1 would also be fine, but from experience, sometimes it's faster to parallelize with one or the other.
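The two levels can be pictured like this (a sketch with stand-in transformers; the numba side is shown only in comments, since it lives inside the compiled kernels):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Level 1: FeatureUnion parallelizes across its member transformers via
# joblib -- this is where a transformer-level `n_jobs` would be forwarded.
union = FeatureUnion(
    [("scale", StandardScaler()), ("minmax", MinMaxScaler())],
    n_jobs=2,
)

# Level 2 (comments only): inside each member, numba-compiled kernels can
# run their own thread pool, e.g.
#   numba.set_num_threads(n_jobs_numba)
#   @numba.njit(parallel=True) loops using numba.prange
# which is the knob `n_jobs_numba` would control independently.

X = np.random.RandomState(0).randn(10, 4)
Xt = union.fit_transform(X)
print(Xt.shape)  # (10, 8): 4 standardized + 4 min-max scaled columns
```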

  4. You have reimplemented SAX. We have a version of SAX already, would it be possible to use that? If not, see 5 below.

One of the contributions of the paper was a better/faster way of computing window-wise SAX. Also, this version has dilation, stride, and other parameters that are not usually in SAX, so I don't believe I can use your implementation.
I decided to keep it as a function (and not make it a class) because I had to enclose everything in numba. Unfortunately, this means less modularity.
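For reference, a toy version of window-wise SAX with dilation and stride might look like this (standard Gaussian breakpoints; the parameter choices are illustrative, and this is not the numba implementation from the PR):

```python
import numpy as np

def sax_word(window, word_length=4, alphabet_size=4):
    """Toy SAX for one window: z-normalize, piecewise-aggregate into
    `word_length` segments, then discretize with Gaussian breakpoints.
    Assumes len(window) is divisible by word_length (the same constraint
    noted for the BORF variant in this thread)."""
    w = (window - window.mean()) / (window.std() + 1e-8)
    paa = w.reshape(word_length, -1).mean(axis=1)          # PAA segments
    breakpoints = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}[alphabet_size]
    return np.digitize(paa, breakpoints)                   # symbols 0..a-1

def windows_with_dilation(x, window_size, stride=1, dilation=1):
    """Extract dilated subsequences: indices i, i+d, i+2d, ..."""
    span = (window_size - 1) * dilation + 1
    return [x[i:i + span:dilation] for i in range(0, len(x) - span + 1, stride)]

x = np.sin(np.linspace(0, 10, 64))
words = [tuple(sax_word(w)) for w in windows_with_dilation(x, 8, stride=2, dilation=2)]
print(len(words), words[0])
```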

  5. There are a lot of specific public functions which will bloat the docs, could you make the file specific functions private? things like are_window_size_and_dilation_compatible_with_signal_length Same with all the small sklearn transformers, better off private imo.

I'll do that. So should I make all the functions besides the BORF class private?

  6. You have implemented a hash table?

I slightly modified this implementation https://github.com/lhprojects/blog/blob/master/_posts/JupyterNotebooks/HashUnique.ipynb to make it fully compatible with numba, as numba dicts do not work well in my experience.
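The appeal of that approach is that an open-addressing table needs only flat preallocated arrays, which numba's nopython mode compiles well, unlike typed dicts. A pure-numpy toy version of the idea (not the PR's code):

```python
import numpy as np

def count_unique_open_addressing(keys, table_size=64):
    """Count occurrences of integer keys with linear-probing open
    addressing, using only flat preallocated arrays. Toy sketch."""
    slots = np.full(table_size, -1, dtype=np.int64)   # stored keys (-1 = empty)
    counts = np.zeros(table_size, dtype=np.int64)
    for k in keys:
        idx = (k * 2654435761) % table_size           # multiplicative hash
        while slots[idx] != -1 and slots[idx] != k:
            idx = (idx + 1) % table_size              # linear probing
        slots[idx] = k
        counts[idx] += 1
    return slots, counts

slots, counts = count_unique_open_addressing(np.array([7, 7, 3, 9, 3, 7]))
print(counts[slots == 7])  # [3]
```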

@patrickzib
Contributor

One of the contributions of the paper was a better/faster way of computing window-wise SAX. Also, this version has dilation, stride and other parameters that are not usually in SAX. So I don't believe I can use your implementation.

It might be a bit ambitious for this PR, but it would be wonderful to include a fast sliding window SAX-transform with dilation if possible.

@TonyBagnall
Contributor

do we need a lower bound on the sparse package?

@TonyBagnall
Contributor

re SAX, we could investigate using this version as our standard and extending it as you say @patrickzib, but not in this PR :) It would be good to make an issue on improving SAX

@fspinna
Contributor Author

fspinna commented Oct 17, 2024

It might be a bit ambitious for this PR, but it would be wonderful to include a fast sliding window SAX-transform with dilation if possible.

re SAX, we could investigate using this version as our standard and extending it as you say @patrickzib, but not in this PR :) It would be good to make an issue on improving SAX

Also, it's not a perfect replacement for standard SAX, as the window size has to be divisible by the word length in my case.

@TonyBagnall
Contributor

It might be a bit ambitious for this PR, but it would be wonderful to include a fast sliding window SAX-transform with dilation if possible.

re SAX, we could investigate using this version as our standard and extending it as you say @patrickzib, but not in this PR :) It would be good to make an issue on improving SAX

Also, it's not a perfect replacement for standard SAX, as the window size has to be divisible by the word length in my case.

I like the idea of having a speed-optimised constrained version and a more general one, but it's a project in its own right really. This looks ready to go in, to me; thanks for the contribution. There are a couple of minor conflicts to resolve caused by changes elsewhere, sorry about that.

@TonyBagnall
Contributor

could we add your paper to this? https://www.aeon-toolkit.org/en/stable/papers_using_aeon.html

Even better, can you add your paper to this (in a separate PR)? :)

@fspinna
Contributor Author

fspinna commented Oct 18, 2024

do we need a lower bound on the sparse package?

I am using the latest (0.15.4)

could we add your paper to this? https://www.aeon-toolkit.org/en/stable/papers_using_aeon.html

Even better, can you add your paper to this (in a separate PR)? :)

Sure!

I like the idea of having a speed-optimised constrained version and a more general one, but it's a project in its own right really. This looks ready to go in, to me; thanks for the contribution. There are a couple of minor conflicts to resolve caused by changes elsewhere, sorry about that.

Thank you! No problem, let me know if there is anything else to do!

@TonyBagnall
Contributor

hi @fspinna if you resolve the conflicts I think this is ready to go in

Contributor Author

@fspinna fspinna left a comment


Fixed unused comments

# Conflicts:
#	.all-contributorsrc
#	docs/api_reference/transformations.rst
#	pyproject.toml
@fspinna
Contributor Author

fspinna commented Oct 18, 2024

hi @fspinna if you resolve the conflicts I think this is ready to go in

@TonyBagnall I fixed a few things (tags, doctest) but I am still failing on a few OSes. Not really sure what the issue is here.

@MatthewMiddlehurst
Member

Hi @fspinna, I'm not too up to date with this PR, but it's only failing on some workflows because we split the tests up when running on PRs to save time.

I can take a look at what's failing at some point if no one else finds the time.

@MatthewMiddlehurst
Member

Alternatively, there are non_deterministic and cant_pickle tags which will skip these particular tests, but we try to only use them if it is not an easily fixable issue.

Adding it temporarily to the skipped test list is also an option, and revisiting in another PR.

@fspinna
Contributor Author

fspinna commented Oct 21, 2024

Hi @fspinna, not too up to date with this PR but its only failing on some workflows because we split the tests up when running on PRs to save time.

I can take a look at whats failing at some point if no one else finds the time.

Ah ok, that explains why I was not failing all of them!
This one is expected, as the output of the transform is a csr matrix, which is not currently supported, I believe (but it practically works with most sklearn estimators):

test_all_estimators[check_persistence_via_pickle(estimator=BORF(),datatype=EqualLengthUnivariate-Classification-numpy3D)] - AttributeError: 'csr_matrix' object has no attribute 'ravel'

So I probably should use the tag "cant_pickle" for this one.

*edit: this one also depends on the same issue:

test_all_estimators[check_fit_deterministic(estimator=BORF(),datatype=EqualLengthUnivariate-Classification-numpy3D)] - AttributeError: 'csr_matrix' object has no attribute 'ravel'

So, the "non_deterministic" tag is also needed.
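In aeon, such exclusions are declared through the estimator tag system; a simplified sketch of what that might look like (base class and tag plumbing omitted, so this is illustrative rather than the PR's actual code):

```python
# Hypothetical sketch: in aeon, tags are declared in a `_tags` class
# attribute on the estimator (the real BORF inherits from
# BaseCollectionTransformer; the base class is omitted here).
class BORF:
    _tags = {
        "non_deterministic": True,  # skips check_fit_deterministic
        "cant_pickle": True,        # skips check_persistence_via_pickle
    }

print(BORF._tags)
```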

This one puzzles me as I don't get it locally:

test_all_estimators[check_non_state_changing_method(estimator=BORF(densify=True),datatype=EqualLengthUnivariate-Classification-numpy3D)] - AttributeError: 'IndividualBORF' object has no attribute 'feature_names_in_'

I believe it's a problem with some sklearn estimators I created. I'll try to look into it.

Contributor

@TonyBagnall TonyBagnall left a comment


LGTM, thanks for the contribution

@TonyBagnall TonyBagnall merged commit 4ec60da into aeon-toolkit:main Oct 21, 2024
15 checks passed