WIP - Design document for polars support in skpro #34

julian-fong · 2024-05-31T15:18:27Z

Opening a pr to discuss various methods and ideas to implement polars support for estimators in skpro. All the information design wise will be consolidated in this pr.

Open to any contributions or ideas!

julian-fong · 2024-05-31T15:20:10Z

Linking sktime/skpro#342 as the coordination pr for implementation of these ideas

julian-fong · 2024-06-03T17:29:00Z

@fkiraly I have written down notes for a few of the sections (see sections 2 and 3) and a couple discussion items (see section 9) Discussions can happen via inside the md or can stay in the comment sections here so that pings work etc..

fkiraly

Replies to the questions below.

Question 2.1: yes, I anticipate that for some interfaced estimators there will never be internal polars support, e.g., the statsmodels based ones. On the other hand, the framework should support internal polars in native implementations, for direct interfavces to packages that support or will support polars, or for mixtures like composites where part of the logic may be in skpro and part of the logic may be external.

Question 2.2: what do you mean by this? Every estimator should accept all data containers, if the boilerplate layer works, and which types are supported interanlly you can see in the tag X_inner_mtype and y_inner_mtype.

Discussion items:

Item 1 idea 1 - sounds good, do you have a link or ref on how index is handled?

Item 2 idea 1 - I would strongly recommend not to focus on support for polars.Series. This is based on experience in sktime where supporting both Series and DataFrame in pandas has caused maintenance and development overhead that was out of proportion compared to the user benefit it brought (and one might argue it's actually net negative due to user confusion)

Discussion item 2: I would suggest to avoid predict_proba for the moment, as that would open up the can of works of supporting polars internally in distributions, most of which are internally numpy and scipy based.

Discussion item 3: my (not very strong) opinion is to use set_config as a backend point, and possibly provide syntactic sugar through set_output.

julian-fong · 2024-06-04T16:05:01Z

Question 2.2: what do you mean by this? Every estimator should accept all data containers, if the boilerplate layer works, and which types are supported interanlly you can see in the tag X_inner_mtype and y_inner_mtype.

Apologies for the poorly worded question. Right now every estimator has the tags X_inner_mtype and y_inner_mtype with with pandas DataFrame tables - and we know there are two sets: one set which contains estimators for which polars is supported and another set where polars is not supported eg statsmodels. My question is if an idea here is to individually check each interfaced estimator to see if it supports polars dataframes and update the tags to include polars_eager_table if it belongs in the first set.

Discussion items:

Item 1 idea 1 - sounds good, do you have a link or ref on how index is handled?

The link to the polars function from_pandas is here: https://docs.pola.rs/py-polars/html/reference/api/polars.from_pandas.html . The bool parameter include_index includes any index from the pandas Dataframe as a column. From what I've tested using some dummy data the functionality is as follows:

If include_index is True and the index in the pandas Dataframe is a single level index that is unnamed - then from_pandas returns a polars dataframe with a column named "None" containing the corresponding indices.
If include_index is True and the index in the pandas Dataframe is a multi-level index for which all index levels are named - then from_pandas returns a polars dataframe with columns corresponding to the index level name and their corresponding indices
If include_index is True and the index in the pandas Dataframe is a multi-level index for which some of the index levels are unnamed (2 or more) I believe it throws an error.

We can probably leverage this potential second conversion function to handle these scenarios - and pass them back to the users if they specify want them via our own bool parameter return_index

julian-fong · 2024-06-04T16:07:20Z

Discussion item 2: I would suggest to avoid predict_proba for the moment, as that would open up the can of works of supporting polars internally in distributions, most of which are internally numpy and scipy based.

Sounds good, then in that case I'll leave the corresponding sections in #TODO for now

julian-fong · 2024-06-04T16:11:52Z

If section 3 is sufficiently written - I can go ahead and start building a pr to implement these changes. If there are more areas where you think we need to implement polars support, I am happy to continue to add to that section. I mainly used the BaseProbaRegressor class and the regression extension template in order to consolidate what enhancements I believe that were needed to be made

fkiraly · 2024-06-05T17:38:46Z

If section 3 is sufficiently written - I can go ahead and start building a pr to implement these changes

I'm not sure if you are referring to the current state or have some unpushed changes. The document does not say what the target state is. You are showcasing some column index handling, but it would be great if you could very explicitly discuss:

proposed signature of predict_interval, predict_quantiles, predict_var
how you are planning to handle predict_proba - leaving it unchanged for now is where we arrived, but that may cause some unpleasant case handling in the default implementation of predict_quantiles (although it might be caught by the boilerplate)
in the target state, the relevant cases and flow logic internally if polars or pandas is passed. We have round trip conversion, or passing unconverted internally.
where the converters or logic will be located, in relation to predict_interval etc.

Other question, is the tests PR ready?

julian-fong · 2024-06-05T19:25:31Z

Other question, is the tests PR ready?

Yes - I would like a review to see where changes are needed for the current tests I wrote, and as I implement new changes I will add more tests inside the test file as well

julian-fong · 2024-06-10T16:29:03Z

I've added another item (see discussion item 4) and am working to incorporate target states for predict methods. @fkiraly does the one I wrote so far for predict_interval suffice?

If section 3 is sufficiently written - I can go ahead and start building a pr to implement these changes

I'm not sure if you are referring to the current state or have some unpushed changes. The document does not say what the target state is. You are showcasing some column index handling, but it would be great if you could very explicitly discuss:

proposed signature of predict_interval, predict_quantiles, predict_var

See above

how you are planning to handle predict_proba - leaving it unchanged for now is where we arrived, but that may cause some unpleasant case handling in the default implementation of predict_quantiles (although it might be caught by the boilerplate)

Still thinking about this item

edit: see below

in the target state, the relevant cases and flow logic internally if polars or pandas is passed. We have round trip conversion, or passing unconverted internally.

Still working on the planning stages for this, maybe discussion on item 4 would help figure this one out?

julian-fong · 2024-06-11T03:15:10Z

how you are planning to handle predict_proba - leaving it unchanged for now is where we arrived, but that may cause some unpleasant case handling in the default implementation of predict_quantiles (although it might be caught by the boilerplate)

I believe that as long as pred_int is returned in pandas Dataframe format the default method _predict_quantile shouldn't matter. My aim when altering the code base for _base.py is to alter the handling of the data containers predict_interval, predict_proba, predict_var, predict_quantile, the default corresponding private methods shouldn't be touched imo.

fkiraly · 2024-06-13T14:45:37Z

the default corresponding private methods shouldn't be touched imo.

I see, that's a good solution, and it's good that the defaults are in the private methods, so it works.

fkiraly · 2024-06-13T14:49:40Z

Still working on the planning stages for this, maybe discussion on item 4 would help figure this one out?

For the contributor workflow, in fact both modes are supported, plus an unmentioned third which allows the user to implement only a polars option (in this case polars becomes a softdep of the estimator).

The way this works is via the conversion logic described here:
https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.registry._tags.x_inner_mtype.html#sktime.registry._tags.x_inner_mtype

If any of the mtypes is on the list, and there's only one, any input is converted to it internally, and back at output. If multiple mtypes are on the list, any mtype on the list is passed on unconverted (but possibly after minor coercions). In this case, it is expected that the internal code can handle both types. This is typically a much larger cognitiver burden on the implementer though, as there is no standard dispatch mechanism, so they have to do if/else or dispatch to private methods.

julian-fong · 2024-06-25T18:18:20Z

@fkiraly I've made some additions to section 3 detailing: how the predict_* functions will incorporate set_output like logic, and have written down the technical implementation of how polars frames would look like in various predict_* methods

fkiraly · 2024-06-25T23:39:09Z

steps/24_skpro_polars_support_integration/step.md

+            Upper/lower interval end are equivalent to
+            quantile predictions at alpha = 0.5 - c/2, 0.5 + c/2 for c in coverage.
+
+	    For pl.DataFrame: Column is in single level following similar convention to pd.DataFrame


that would work - although in the pd.DataFrame convention the second level is float. Do we forsee any issues in coercing everything to string? This no longer allows, for instance, to distinguish variable names that are string and those that are integer. I suppose the variable name is only an issue in mixed input/output type, if same, then that would always be str.

I don't think column names in polars are allowed to be anything but strings. I've tried setting 0 or 0.9 as a column name but I got an error saying TypeError: Series name must be a string

fkiraly · 2024-06-25T23:43:04Z

Opinions about the questions:

Question 3.1) is this already handled? in the scenario where a polars dataframe is passed and skpro automatically converts it into a pandas DataFrame to calculate the predictions?

Yes, I believe it should be - are you testing it? If not, let's.

Another related comment: what happens if the internal estimator produces predict outputs in polars format, but the user wants pandas.DataFrame? We would then need to convert the format the other way round.

Question 3.2) What should we do with the predict function. Currently it automatically converts whatever is passed into the predict function back into the mtype that was seen in fit. Do we need to refactor this as well?

I think the default behaviour should be the round trip, yes.

However, I think if the user uses set_output, it should be possible to override this - if not too complicated. It may just be as simple to change the to_type in the output conversion to the transform string.

julian-fong · 2024-06-26T03:20:05Z

Opinions about the questions:

Question 3.1) is this already handled? in the scenario where a polars dataframe is passed and skpro automatically converts it into a pandas DataFrame to calculate the predictions?

Yes, I believe it should be - are you testing it? If not, let's.

When I first began working on this, I tried passing in a polars DataFrame with the GLMRegressor and the library statsmodels complained that it does not support polars DataFrames. Hmm, but I guess that was the fit function and not the predict function.

Maybe another discussion item on how the interaction between fit and predict should work with polars DataFrames?

julian-fong · 2024-06-26T03:30:47Z

Another related comment: what happens if the internal estimator produces predict outputs in polars format, but the user wants pandas.DataFrame? We would then need to convert the format the other way round.

Correct - this falls under the set of combinations 'polars to pandas'. We can handle this when creating the datatype/adapters module inside #392. My current proposal is that there should be custom conversion methods from polars to pandas (and vice versa) via the create_container method. The create_container method will take in X_input and the user given transform, and make the corresponding conversions.

For examples sake lets consider polars, and lets say for example there is a total of 3 datatype classes created using #392's design: ["pandas", "polars", "numpy"]. Then in the polars datatype class / adapter class, there will be a create_container method that takes in the datatype from X_input (Ideally X_input should fall into one of those three categories). The create_container method should then identify what type of mtype X_input is and use the correct conversion method (either numpy to polars or pandas to polars) in order to generate the correct output.

The polars to pandas conversion should be simple, as the table design of the polars DataFrame is quite trivial (fingers crossed). Just calling it as a pandas DataFrame should be enough.

edit: I just re-read this question and realized that I may have mis-interpreted the question.

Another related comment: what happens if the internal estimator produces predict outputs in polars format, but the user wants pandas.DataFrame? We would then need to convert the format the other way round.

The set_output would still theorically handle this scenario as per above, but I assume this would be a rare case since the x_inner_mtype would have to be polars Table? and the developer who created the interface for this estimator would have had to have written all the private methods using polars DataFrames? i am not sure

julian-fong · 2024-06-26T03:38:00Z

Question 3.2) What should we do with the predict function. Currently it automatically converts whatever is passed into the predict function back into the mtype that was seen in fit. Do we need to refactor this as well?

I think the default behaviour should be the round trip, yes.

However, I think if the user uses set_output, it should be possible to override this - if not too complicated. It may just be as simple to change the to_type in the output conversion to the transform string.

Agreed - unless the user specifies what they want, its best to keep the current convention. There is a method with a boolean check that I have designed to ensure that the user has specified a transform in set_output (see skpro #399 check_transform_config). If that boolean is True, then we can code it to override the current implementation. If not, then we proceed as if the code as usual.

julian-fong · 2024-06-27T01:14:19Z

@fkiraly I've added questions 3.3 and 3.4, and made some minor updates to section 3 as well

fkiraly · 2024-07-02T07:58:43Z

steps/24_skpro_polars_support_integration/step.md

+
+~~Question 3.2~~) What should we do with the `predict` function. Currently it automatically converts whatever is passed into the `predict` function back into the mtype that was seen in fit. Do we need to refactor this as well?
+
+Question 3.3) Currently there is no way to allow the user to specify whether they want or do not want to pass back the index when converting from pandas to polars. The user currently only can specify what transform they want via `set_output` and there is no other parameter they can set to specify what configuration they want as the internal code handles the rest of the conversion. How should we tackle this problem?


indeed, there are no "options" in back-conversions, and on particular pathway is assumed.

If information needs to be remembered, it is via the converter_store argument that all conversion functions have.

I'm not sure whether this is a problem that needs tackling, can you explain some different use cases, and why a user may want this?

It goes back earlier to Discussion Item 0, potentially giving the user the option to have the index returned via a column named __index__. That way they can keep track of what indices were used during testing.

There currently is no way to allow the user to specify whether they want to include the index or not when converting the dataframe from pandas to polars

fkiraly · 2024-07-02T08:00:08Z

steps/24_skpro_polars_support_integration/step.md

+
+Question 3.3) Currently there is no way to allow the user to specify whether they want or do not want to pass back the index when converting from pandas to polars. The user currently only can specify what transform they want via `set_output` and there is no other parameter they can set to specify what configuration they want as the internal code handles the rest of the conversion. How should we tackle this problem?
+
+Question 3.4) Is there a way to check via the tags or configs as to which data container the interface is developed in regards to the private `_predict_*` methods? As an example, the `GLMRegressor` has tag 'coded_in_pandas' so we know that all the private `_predict_*` functions are written using pandas


yes, the tag is X_inner_mtype or y_inner_mtype, see https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.registry._tags.x_inner_mtype.html

Currently, most implementations assume that there is only a single internal type, or multiple related ones (e.g., multiple scitypes but all pandas based).

Compositors however may assume they support all mtypes, if they only carry out abstract operations such sa passing on.

initial commit

08123cf

julian-fong changed the title ~~initial commit~~ WIP - Design document for polars support in skpro May 31, 2024

fixing step file

c588d48

fkiraly assigned julian-fong May 31, 2024

julian-fong added 6 commits June 2, 2024 18:21

created table of content

325b138

filled out section 1 and 2

3ada2af

write ideas for section 3

1ca0192

fixed typo in discussion items

0d7d312

updated a couple discussion items

d3bea6c

updates

8c754ea

fkiraly reviewed Jun 4, 2024

View reviewed changes

julian-fong added 2 commits June 6, 2024 21:45

updated section 8 TODO include target state

5588c6f

updates

b7e5cbe

julian-fong mentioned this pull request Jun 21, 2024

[ENH] Enhancing polars support by introducing set_output sktime/skpro#399

Open

julian-fong added 3 commits June 25, 2024 08:21

updates

3afee44

updates

d08f6e7

updates

776a682

updates

21f2e6e

fkiraly reviewed Jun 25, 2024

View reviewed changes

julian-fong added 2 commits June 26, 2024 19:06

updates to section 3

53d01f4

updates to section 3 and questions 3.3 and 3.4

1108e48

fkiraly reviewed Jul 2, 2024

View reviewed changes

julian-fong added 5 commits July 2, 2024 20:30

updates

f662db4

updates

05ae7ae

removed unused chapters

1083d3f

updates

43565b9

updates

f942264

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP - Design document for polars support in skpro #34

WIP - Design document for polars support in skpro #34

julian-fong commented May 31, 2024

julian-fong commented May 31, 2024

julian-fong commented Jun 3, 2024

fkiraly left a comment

julian-fong commented Jun 4, 2024

julian-fong commented Jun 4, 2024

julian-fong commented Jun 4, 2024

fkiraly commented Jun 5, 2024 •

edited

Loading

julian-fong commented Jun 5, 2024

julian-fong commented Jun 10, 2024 •

edited

Loading

julian-fong commented Jun 11, 2024

fkiraly commented Jun 13, 2024

fkiraly commented Jun 13, 2024

julian-fong commented Jun 25, 2024

fkiraly Jun 25, 2024

julian-fong Jun 26, 2024

fkiraly commented Jun 25, 2024

julian-fong commented Jun 26, 2024 •

edited

Loading

julian-fong commented Jun 26, 2024 •

edited

Loading

julian-fong commented Jun 26, 2024 •

edited

Loading

julian-fong commented Jun 27, 2024

fkiraly Jul 2, 2024

julian-fong Jul 2, 2024

julian-fong Jul 2, 2024 •

edited

Loading

fkiraly Jul 2, 2024


		~~Question 3.2~~) What should we do with the `predict` function. Currently it automatically converts whatever is passed into the `predict` function back into the mtype that was seen in fit. Do we need to refactor this as well?

		Question 3.3) Currently there is no way to allow the user to specify whether they want or do not want to pass back the index when converting from pandas to polars. The user currently only can specify what transform they want via `set_output` and there is no other parameter they can set to specify what configuration they want as the internal code handles the rest of the conversion. How should we tackle this problem?


		Question 3.3) Currently there is no way to allow the user to specify whether they want or do not want to pass back the index when converting from pandas to polars. The user currently only can specify what transform they want via `set_output` and there is no other parameter they can set to specify what configuration they want as the internal code handles the rest of the conversion. How should we tackle this problem?

		Question 3.4) Is there a way to check via the tags or configs as to which data container the interface is developed in regards to the private `_predict_` methods? As an example, the `GLMRegressor` has tag 'coded_in_pandas' so we know that all the private `_predict_` functions are written using pandas

WIP - Design document for polars support in skpro #34

Are you sure you want to change the base?

WIP - Design document for polars support in skpro #34

Conversation

julian-fong commented May 31, 2024

julian-fong commented May 31, 2024

julian-fong commented Jun 3, 2024

fkiraly left a comment

Choose a reason for hiding this comment

julian-fong commented Jun 4, 2024

julian-fong commented Jun 4, 2024

julian-fong commented Jun 4, 2024

fkiraly commented Jun 5, 2024 • edited Loading

julian-fong commented Jun 5, 2024

julian-fong commented Jun 10, 2024 • edited Loading

julian-fong commented Jun 11, 2024

fkiraly commented Jun 13, 2024

fkiraly commented Jun 13, 2024

julian-fong commented Jun 25, 2024

fkiraly Jun 25, 2024

Choose a reason for hiding this comment

julian-fong Jun 26, 2024

Choose a reason for hiding this comment

fkiraly commented Jun 25, 2024

julian-fong commented Jun 26, 2024 • edited Loading

julian-fong commented Jun 26, 2024 • edited Loading

julian-fong commented Jun 26, 2024 • edited Loading

julian-fong commented Jun 27, 2024

fkiraly Jul 2, 2024

Choose a reason for hiding this comment

julian-fong Jul 2, 2024

Choose a reason for hiding this comment

julian-fong Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

fkiraly Jul 2, 2024

Choose a reason for hiding this comment

fkiraly commented Jun 5, 2024 •

edited

Loading

julian-fong commented Jun 10, 2024 •

edited

Loading

julian-fong commented Jun 26, 2024 •

edited

Loading

julian-fong commented Jun 26, 2024 •

edited

Loading

julian-fong commented Jun 26, 2024 •

edited

Loading

julian-fong Jul 2, 2024 •

edited

Loading