Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Index.map functionality #2136

Closed
wants to merge 33 commits into from

Conversation

awdavidson
Copy link
Contributor

Please see change to implement Index.map functionality similar to that available in pandas. Requirement raised in issue: #1929

@awdavidson
Copy link
Contributor Author

@xinrong-databricks @ueshin Hi both, please see my PR to add Index.map functionality. Just a note build is failing due to

Warning, treated as error:
1655 /home/runner/work/koalas/koalas/docs/source/getting_started/10min.ipynb::Line 2731 exceeds the line-length-limit.
1656 make: *** [Makefile:21: html] Error 2
1657 Error: Process completed with exit code 2.

I see other PR branches are failing with this issue too. Are you aware?

@xinrong-meng
Copy link
Contributor

It is an ongoing build issue. Thanks for letting us know! We will look into this.

@xinrong-meng
Copy link
Contributor

xinrong-meng commented Apr 6, 2021

FYI the issue is resolved, please pull master and then retrigger tests.

@ueshin
Copy link
Collaborator

ueshin commented Apr 7, 2021

GitHub Actions seems unstable now.

GitHub Actions has encountered an internal error when running your job.

databricks/koalas/indexes/base.py Outdated Show resolved Hide resolved
databricks/koalas/indexes/extension.py Outdated Show resolved Hide resolved
databricks/koalas/indexes/extension.py Outdated Show resolved Hide resolved
databricks/koalas/indexes/extension.py Outdated Show resolved Hide resolved
databricks/koalas/indexes/extension.py Outdated Show resolved Hide resolved
def map(
self,
mapper: Union[dict, Callable[[Any], Any], pd.Series],
return_type: ks.typedef.Dtype = str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, shall we import Dtype in the header if we still need it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, str is not Dtype.

Dtype = Union[np.dtype, ExtensionDtype]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix, I added the return_type parameter to be leveraged by the pandas_udf that is used to execute the transformation. Will think of a way to make this discoverable at run time removing the return_type parameter.

self.assert_eq(
kser.index.map(lambda id: id + DateOffset(days=1), return_type=datetime),
ks.Series([1, 2, 3, 4], index=pd.date_range("2018-04-10", periods=4, freq="2D")).index,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add tests with CategoricalIndex?
You can put the tests in test_categorical.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin I am looking at adding tests to CategoricalIndex but have some unexpected behaviour. I wonder whether you can help explain.

The current implementation for Index.map leverages _with_new_scol to avoid collect anything to the driver. When you have a CategoricalIndex such as ks.CategoricalIndex(["a", "b", "c"]) the returned spark_frame is

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                0|                0|
|                1|       8589934592|
|                2|      17179869184|
+-----------------+-----------------+

I was expecting

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                a|                0|
|                b|       8589934592|
|                c|      17179869184|
+-----------------+-----------------+

Is my expectation incorrect?

This seems to be caused by https://github.com/databricks/koalas/blob/master/databricks/koalas/frame.py#L510

If you have a pdf = pd.DataFrame(index=pd.CategoricalIndex(["a", "b", "c"])) InternalFrame.from_pandas(pdf).spark_frame returns

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                0|                0|
|                1|       8589934592|
|                2|      17179869184|
+-----------------+-----------------+

Where as if you have pdf = pd.DataFrame(index=pd.Index(["a", "b", "c"]) InternalFrame.from_pandas(pdf).spark_frame returns

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                a|                0|
|                b|       8589934592|
|                c|      17179869184|
+-----------------+-----------------+

Apologies if this is a silly question. Thanks in advance

Copy link
Collaborator

@ueshin ueshin Apr 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is expected since so far Koalas manages only Categorical 's codes in Spark and its categories are in the metadata, index_dtypes or data_dtypes.

@codecov-io
Copy link

codecov-io commented Apr 8, 2021

Codecov Report

Merging #2136 (f845001) into master (0fd088e) will decrease coverage by 0.02%.
The diff coverage is 92.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2136      +/-   ##
==========================================
- Coverage   95.37%   95.34%   -0.03%     
==========================================
  Files          60       62       +2     
  Lines       13694    13780      +86     
==========================================
+ Hits        13060    13139      +79     
- Misses        634      641       +7     
Impacted Files Coverage Δ
databricks/koalas/missing/indexes.py 100.00% <ø> (ø)
databricks/koalas/indexes/extension.py 83.33% <83.33%> (ø)
databricks/koalas/indexes/base.py 97.33% <100.00%> (+0.01%) ⬆️
databricks/koalas/tests/indexes/test_base.py 100.00% <100.00%> (ø)
databricks/koalas/tests/indexes/test_extension.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0fd088e...f845001. Read the comment docs.

na_action: Any = None,
):
"""
Use to change Index values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about leveraging pandas docstring? It would be great to maintain the compatibility between Koalas APIs and pandas APIs.

@xinrong-meng
Copy link
Contributor

Would you add a new entry Index.map under the Conversion section of https://github.com/databricks/koalas/blob/master/docs/source/reference/indexing.rst?

@@ -28,6 +28,7 @@ Properties
Index.inferred_type
Index.is_all_dates
Index.shape
Index.map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also add CategoricalIndex.map?

Following https://pandas.pydata.org/docs/reference/indexing.html#id1, we'd better add "Modifying and computations" section under "CategoricalIndex".

@xinrong-meng
Copy link
Contributor

xinrong-meng commented Aug 3, 2021

Hi @awdavidson, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36394. Otherwise, I may do that for you next week.

@awdavidson
Copy link
Contributor Author

Hi @awdavidson, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36394. Otherwise, I may do that for you next week.

Hi @xinrong-databricks, apologies for the delay it has been crazily busy the last few months! Yes, if you don’t mind please migrate. I’ll look at addressing you last comment as soon as possible.

@xinrong-meng
Copy link
Contributor

:) That's totally fine. I will migrate it and keep you updated then.

ueshin pushed a commit to apache/spark that referenced this pull request Aug 16, 2021
### What changes were proposed in this pull request?
Implement `Index.map`.

The PR is based on databricks/koalas#2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We shall also support hat.

### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
ueshin pushed a commit to apache/spark that referenced this pull request Aug 16, 2021
### What changes were proposed in this pull request?
Implement `Index.map`.

The PR is based on databricks/koalas#2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We shall also support hat.

### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
(cherry picked from commit 4dcd746)
Signed-off-by: Takuya UESHIN <[email protected]>
@xinrong-meng
Copy link
Contributor

Hi @awdavidson, I would like to close this PR since it has been migrated to Spark. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants