Implement `Index.map` functionality #2136

Conversation
@xinrong-databricks @ueshin Hi both, please see my PR to add `Index.map` functionality. Just a note: the build is failing due to
I see other PR branches are failing with this issue too. Are you aware?

It is an ongoing build issue. Thanks for letting us know! We will look into this.

FYI, the issue is resolved; please pull master and then retrigger the tests.

GitHub Actions seems unstable now.
databricks/koalas/indexes/base.py
Outdated
    def map(
        self,
        mapper: Union[dict, Callable[[Any], Any], pd.Series],
        return_type: ks.typedef.Dtype = str,
pandas doesn't accept a `return_type` argument: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html
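For reference, pandas infers the result dtype of `Index.map` on its own; the method takes only a `mapper` and an optional `na_action`. A minimal illustration:

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# pandas infers the result dtype itself; there is no return_type parameter
mapped = idx.map(lambda x: x + 1)
assert list(mapped) == [2, 3, 4]
```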
Btw, shall we import `Dtype` in the header if we still need it?
Also, `str` is not a `Dtype`:

Dtype = Union[np.dtype, ExtensionDtype]
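To illustrate the reviewer's point: the builtin `str` is a Python type, not an instance of `np.dtype` (or an `ExtensionDtype`), so it does not satisfy the `Dtype` union. A small sketch:

```python
import numpy as np

# the Python builtin `str` is a type object, not a numpy dtype instance
assert not isinstance(str, np.dtype)

# np.dtype(...) produces an actual dtype object that fits the Dtype union
obj_dtype = np.dtype("object")
assert isinstance(obj_dtype, np.dtype)
```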
I'll fix this. I added the `return_type` parameter to be leveraged by the `pandas_udf` that is used to execute the transformation. I will think of a way to make this discoverable at run time, removing the `return_type` parameter.
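One possible direction (a hypothetical sketch, not part of this PR): apply the mapper to a small pandas sample and let pandas infer the resulting dtype, instead of requiring the caller to pass an explicit `return_type`:

```python
import pandas as pd

def infer_return_dtype(mapper, sample_values):
    # Hypothetical helper: run the mapper on a small sample and let
    # pandas infer the resulting dtype, avoiding an explicit return_type.
    mapped = pd.Series(sample_values).map(mapper)
    return mapped.dtype

# mapping ints to strings yields an object dtype
assert infer_return_dtype(str, [1, 2, 3]) == object
```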
self.assert_eq(
    kser.index.map(lambda id: id + DateOffset(days=1), return_type=datetime),
    ks.Series([1, 2, 3, 4], index=pd.date_range("2018-04-10", periods=4, freq="2D")).index,
)
Could you also add tests with `CategoricalIndex`? You can put the tests in `test_categorical.py`.
@ueshin I am looking at adding tests for `CategoricalIndex` but see some unexpected behaviour. I wonder whether you can help explain.

The current implementation of `Index.map` leverages `_with_new_scol` to avoid collecting anything to the driver. When you have a `CategoricalIndex` such as `ks.CategoricalIndex(["a", "b", "c"])`, the returned `spark_frame` is
+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
| 0| 0|
| 1| 8589934592|
| 2| 17179869184|
+-----------------+-----------------+
I was expecting
+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
| a| 0|
| b| 8589934592|
| c| 17179869184|
+-----------------+-----------------+
Is my expectation incorrect?
This seems to be caused by https://github.com/databricks/koalas/blob/master/databricks/koalas/frame.py#L510

If you have `pdf = pd.DataFrame(index=pd.CategoricalIndex(["a", "b", "c"]))`, then `InternalFrame.from_pandas(pdf).spark_frame` returns
+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
| 0| 0|
| 1| 8589934592|
| 2| 17179869184|
+-----------------+-----------------+
Whereas if you have `pdf = pd.DataFrame(index=pd.Index(["a", "b", "c"]))`, then `InternalFrame.from_pandas(pdf).spark_frame` returns
+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
| a| 0|
| b| 8589934592|
| c| 17179869184|
+-----------------+-----------------+
Apologies if this is a silly question. Thanks in advance.
The behavior is expected: so far, Koalas manages only a `Categorical`'s `codes` in Spark, and its `categories` are kept in the metadata, `index_dtypes` or `data_dtypes`.
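This mirrors pandas' own split between `codes` and `categories`: the integers in `__index_level_0__` above are the categorical codes, not the category labels. For example:

```python
import pandas as pd

idx = pd.CategoricalIndex(["a", "b", "c"])

# the codes are small integers; the labels live separately in `categories`
assert list(idx.codes) == [0, 1, 2]
assert list(idx.categories) == ["a", "b", "c"]
```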
Codecov Report
@@ Coverage Diff @@
## master #2136 +/- ##
==========================================
- Coverage 95.37% 95.34% -0.03%
==========================================
Files 60 62 +2
Lines 13694 13780 +86
==========================================
+ Hits 13060 13139 +79
- Misses 634 641 +7
Continue to review full report at Codecov.
databricks/koalas/indexes/base.py
Outdated
        na_action: Any = None,
    ):
        """
        Use to change Index values
How about leveraging the pandas docstring? It would be great to maintain compatibility between Koalas APIs and pandas APIs.

Would you add a new entry?
@@ -28,6 +28,7 @@ Properties
   Index.inferred_type
   Index.is_all_dates
   Index.shape
+  Index.map
Shall we also add `CategoricalIndex.map`?

Following https://pandas.pydata.org/docs/reference/indexing.html#id1, we'd better add a "Modifying and computations" section under "CategoricalIndex".
Hi @awdavidson, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket: https://issues.apache.org/jira/browse/SPARK-36394. Otherwise, I may do that for you next week.

Hi @xinrong-databricks, apologies for the delay, it has been crazily busy the last few months! Yes, if you don't mind, please migrate. I'll look at addressing your last comment as soon as possible.

:) That's totally fine. I will migrate it and keep you updated then.
### What changes were proposed in this pull request?

Implement `Index.map`. The PR is based on databricks/koalas#2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?

Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html). We shall also support that.

### Does this PR introduce _any_ user-facing change?

Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])
>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')
>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')
>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?

Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
Hi @awdavidson, I would like to close this PR since it has been migrated to Spark. Thanks!
Please see change to implement Index.map functionality similar to that available in pandas. Requirement raised in issue: #1929