WIP - Design document for polars support in skpro #34
Linking sktime/skpro#342 as the coordination PR for implementation of these ideas.
@fkiraly I have written down notes for a few of the sections (see sections 2 and 3) and a couple of discussion items (see section 9). Discussion can happen inside the md file or stay in the comment section here so that pings work, etc.
Replies to the questions below.
Question 2.1: yes, I anticipate that for some interfaced estimators there will never be internal `polars` support, e.g., the `statsmodels` based ones. On the other hand, the framework should support internal `polars` in native implementations, for direct interfaces to packages that support or will support `polars`, or for mixtures like composites where part of the logic may be in `skpro` and part of the logic may be external.
Question 2.2: what do you mean by this? Every estimator should accept all data containers, if the boilerplate layer works, and which types are supported internally you can see in the tags `X_inner_mtype` and `y_inner_mtype`.
Discussion items:
Item 1 idea 1 - sounds good, do you have a link or ref on how index is handled?
Item 2 idea 1 - I would strongly recommend not to focus on support for `polars.Series`. This is based on experience in `sktime`, where supporting both `Series` and `DataFrame` in `pandas` has caused maintenance and development overhead that was out of proportion compared to the user benefit it brought (and one might argue it's actually net negative due to user confusion).
Discussion item 2: I would suggest avoiding `predict_proba` for the moment, as that would open up the can of worms of supporting `polars` internally in distributions, most of which are internally `numpy` and `scipy` based.
Discussion item 3: my (not very strong) opinion is to use `set_config` as a backend point, and possibly provide syntactic sugar through `set_output`.
Apologies for the poorly worded question. Right now every estimator has the tags
The link to the polars function
We can probably leverage this potential second conversion function to handle these scenarios, and pass them back to the users if they specify that they want them via our own bool parameter.
Sounds good, then in that case I'll leave the corresponding sections in #TODO for now.
If section 3 is sufficiently written, I can go ahead and start building a PR to implement these changes. If there are more areas where you think we need to implement
I'm not sure if you are referring to the current state or have some unpushed changes. The document does not say what the target state is. You are showcasing some column index handling, but it would be great if you could very explicitly discuss:
Other question, is the tests PR ready?
Yes - I would like a review to see where changes are needed for the current tests I wrote, and as I implement new changes I will add more tests inside the test file as well.
I've added another item (see discussion item 4) and am working to incorporate target states for predict methods. @fkiraly does the one I wrote so far for
See above
Still thinking about this item. Edit: see below.
Still working on the planning stages for this, maybe discussion on item 4 would help figure this one out?
I believe that as long as pred_int is returned in pandas DataFrame format the default method
I see, that's a good solution, and it's good that the defaults are in the private methods, so it works.
For the contributor workflow, in fact both modes are supported, plus an unmentioned third which allows the user to implement only a
The way this works is via the conversion logic described here:
- If any of the mtypes is on the list, and there's only one, any input is converted to it internally, and back at output.
- If multiple mtypes are on the list, any mtype on the list is passed on unconverted (but possibly after minor coercions). In this case, it is expected that the internal code can handle both types. This is typically a much larger cognitive burden on the implementer though, as there is no standard dispatch mechanism, so they have to do if/else or dispatch to private methods.
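The conversion rule described above can be sketched as a small decision function. This is a minimal illustration, not the skpro implementation: `resolve_inner_mtype` is a hypothetical helper, the mtype strings are used only as example values, and the fallback for "several inner mtypes, input not among them" (convert to the first listed one) is an assumption that the comment above does not spell out.

```python
from typing import List, Sequence, Tuple, Union


def resolve_inner_mtype(
    input_mtype: str, inner_mtypes: Union[str, Sequence[str]]
) -> Tuple[str, bool]:
    """Decide which mtype the private method receives.

    Returns (mtype passed to the private method, whether a conversion is needed).
    Hypothetical helper for illustration only.
    """
    # The tag value may be a single string or a list of strings.
    if isinstance(inner_mtypes, str):
        supported: List[str] = [inner_mtypes]
    else:
        supported = list(inner_mtypes)
    if not supported:
        raise ValueError("at least one inner mtype is required")

    # Input already in a supported mtype: passed on unconverted.
    if input_mtype in supported:
        return input_mtype, False

    # Otherwise convert. With one supported mtype the target is unambiguous;
    # with several, picking the first listed one is an ASSUMPTION of this sketch.
    return supported[0], True
```

For example, `resolve_inner_mtype("polars_eager_table", "pd_DataFrame_Table")` reports that a conversion to the pandas mtype is needed, while listing both mtypes in the tag lets the polars input through unconverted, which is the case where the implementer has to handle both types themselves.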
@fkiraly I've made some additions to section 3 detailing: how the
Upper/lower interval end are equivalent to
quantile predictions at alpha = 0.5 - c/2, 0.5 + c/2 for c in coverage.

For pl.DataFrame: Column is in single level following similar convention to pd.DataFrame
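The interval/quantile equivalence quoted above can be checked numerically. The sketch below only implements the stated formula; `coverage_to_alphas` is a hypothetical name and not a function from skpro.

```python
from typing import Dict, Iterable, Tuple, Union


def coverage_to_alphas(
    coverage: Union[float, Iterable[float]],
) -> Dict[float, Tuple[float, float]]:
    """Map each nominal coverage c to its (lower, upper) quantile levels.

    Lower end: alpha = 0.5 - c/2, upper end: alpha = 0.5 + c/2.
    """
    # Accept a single float as well as an iterable of floats.
    if isinstance(coverage, (int, float)):
        coverage = [float(coverage)]

    alphas = {}
    for c in coverage:
        if not 0 <= c <= 1:
            raise ValueError(f"coverage must lie in [0, 1], got {c}")
        alphas[c] = (0.5 - c / 2, 0.5 + c / 2)
    return alphas
```

A coverage of 0.9 maps to quantile levels of approximately 0.05 and 0.95, so a 90% prediction interval corresponds to the 5% and 95% quantile predictions.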
that would work - although in the `pd.DataFrame` convention the second level is float. Do we foresee any issues in coercing everything to string? This no longer allows us, for instance, to distinguish variable names that are strings from those that are integers. I suppose the variable name is only an issue with mixed input/output types; if they are the same, the name would always be str.
I don't think column names in polars are allowed to be anything but strings. I've tried setting 0 or 0.9 as a column name but I got an error saying `TypeError: Series name must be a string`
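Since column names have to end up as strings, a multi-level pandas-style column key has to be flattened into one string, and the coercion can make an integer name and a string name collide. The sketch below is one possible way to do this in plain Python. The `__` separator, the function names, and the example keys are assumptions for illustration, not the convention adopted in skpro.

```python
from typing import Hashable, Iterable, List


def flatten_column_key(key: Hashable, sep: str = "__") -> str:
    """Coerce a (possibly multi-level) column key to a single string."""
    # A single-level key is treated as a one-element tuple.
    levels = key if isinstance(key, tuple) else (key,)
    return sep.join(str(level) for level in levels)


def flatten_columns(keys: Iterable[Hashable], sep: str = "__") -> List[str]:
    """Flatten all column keys, refusing silently ambiguous results."""
    flat = [flatten_column_key(key, sep) for key in keys]
    # After coercion, e.g. the names 0 and "0" become indistinguishable.
    if len(set(flat)) != len(flat):
        raise ValueError("column names collide after coercion to str")
    return flat
```

With this sketch, the keys `("y", 0.9, "lower")` and `("y", 0.9, "upper")` become `"y__0.9__lower"` and `"y__0.9__upper"`, while a frame holding both the integer name `0` and the string name `"0"` raises instead of producing duplicate columns.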
Opinions about the questions:
Yes, I believe it should be - are you testing it? If not, let's. Another related comment: what happens if the internal estimator produces predict outputs in
I think the default behaviour should be the round trip, yes. However, I think if the user uses
When I first began working on this, I tried passing in a polars DataFrame with the
Maybe another discussion item on how the interaction between
Correct - this falls under the set of combinations 'polars to pandas'. We can handle this when creating the datatype/adapters module inside #392. My current proposal is that there should be custom conversion methods from polars to pandas (and vice versa) via the
For example's sake, let's consider
The polars to pandas conversion should be simple, as the table design of the polars DataFrame is quite trivial (fingers crossed). Just calling it as a pandas DataFrame should be enough. Edit: I just re-read this question and realized that I may have misinterpreted the question.
The
Agreed - unless the user specifies what they want, it's best to keep the current convention. There is a method with a boolean check that I have designed to ensure that the user has specified a
@fkiraly I've added questions 3.3 and 3.4, and made some minor updates to section 3 as well.
~~Question 3.2~~) What should we do with the `predict` function. Currently it automatically converts whatever is passed into the `predict` function back into the mtype that was seen in fit. Do we need to refactor this as well?

Question 3.3) Currently there is no way to allow the user to specify whether they want or do not want to pass back the index when converting from pandas to polars. The user currently only can specify what transform they want via `set_output` and there is no other parameter they can set to specify what configuration they want as the internal code handles the rest of the conversion. How should we tackle this problem?
indeed, there are no "options" in back-conversions, and one particular pathway is assumed.
If information needs to be remembered, it is via the `converter_store` argument that all conversion functions have.
I'm not sure whether this is a problem that needs tackling, can you explain some different use cases, and why a user may want this?
It goes back earlier to Discussion Item 0, potentially giving the user the option to have the index returned via a column named `__index__`. That way they can keep track of what indices were used during testing.
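The two options discussed here, remembering the index in a store versus returning it as an `__index__` column, can be contrasted in a small sketch. It uses plain Python stand-ins (an index list plus a dict of columns) rather than pandas or polars objects. The function names and the `keep_index` flag are hypothetical; only the idea of a store passed to conversion functions and the `__index__` column name come from the discussion above.

```python
from typing import Dict, List, Optional, Tuple

Columns = Dict[str, list]


def to_flat_table(
    index: list,
    columns: Columns,
    store: Optional[dict] = None,
    keep_index: bool = False,
    index_name: str = "__index__",
) -> Columns:
    """Convert an indexed table to an index-free table (polars-like stand-in)."""
    # Option A: remember the index out of band, as a converter store would.
    if store is not None:
        store["index"] = list(index)

    flat: Columns = {}
    # Option B: hand the index back to the user as an ordinary column.
    if keep_index:
        if index_name in columns:
            raise ValueError(f"column name {index_name!r} is reserved")
        flat[index_name] = list(index)
    flat.update({name: list(values) for name, values in columns.items()})
    return flat


def from_flat_table(
    flat: Columns, store: Optional[dict] = None, index_name: str = "__index__"
) -> Tuple[list, Columns]:
    """Invert to_flat_table, recovering the index from column, store, or default."""
    columns = {n: list(v) for n, v in flat.items() if n != index_name}
    if index_name in flat:
        index = list(flat[index_name])
    elif store is not None and "index" in store:
        index = list(store["index"])
    else:
        # Nothing remembered: fall back to a default integer range index.
        n_rows = len(next(iter(columns.values()), []))
        index = list(range(n_rows))
    return index, columns
```

In this sketch the store keeps the round trip lossless without changing what the user sees, whereas `keep_index=True` makes the index visible in the flat output at the cost of reserving a column name.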
There is currently no way to allow the user to specify whether they want to include the index or not when converting the dataframe from pandas to polars.
Question 3.4) Is there a way to check via the tags or configs as to which data container the interface is developed in regards to the private `_predict_*` methods? As an example, the `GLMRegressor` has tag 'coded_in_pandas' so we know that all the private `_predict_*` functions are written using pandas
yes, the tag is `X_inner_mtype` or `y_inner_mtype`, see https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.registry._tags.x_inner_mtype.html
Currently, most implementations assume that there is only a single internal type, or multiple related ones (e.g., multiple scitypes but all pandas based).
Compositors, however, may assume they support all mtypes if they only carry out abstract operations such as passing on.
Opening a PR to discuss various methods and ideas to implement polars support for estimators in skpro. All design-related information will be consolidated in this PR.
Open to any contributions or ideas!