Implement `Index.map` functionality #2136

awdavidson · 2021-04-05T19:44:43Z

This change implements `Index.map`, similar to the functionality available in pandas. The requirement was raised in issue #1929.

awdavidson · 2021-04-06T08:57:39Z

@xinrong-databricks @ueshin Hi both, please see my PR to add `Index.map` functionality. Note that the build is currently failing with:

Warning, treated as error:
/home/runner/work/koalas/koalas/docs/source/getting_started/10min.ipynb::Line 2731 exceeds the line-length-limit.
make: *** [Makefile:21: html] Error 2
Error: Process completed with exit code 2.

I see other PR branches are failing with this issue too. Are you aware?

xinrong-meng · 2021-04-06T18:57:28Z

It is an ongoing build issue. Thanks for letting us know! We will look into this.

xinrong-meng · 2021-04-06T23:51:27Z

FYI the issue is resolved, please pull master and then retrigger tests.

ueshin · 2021-04-07T23:25:57Z

GitHub Actions seems unstable now.

GitHub Actions has encountered an internal error when running your job.

databricks/koalas/indexes/base.py

databricks/koalas/indexes/extension.py

ueshin · 2021-04-08T01:23:33Z

databricks/koalas/indexes/base.py

+    def map(
+        self,
+        mapper: Union[dict, Callable[[Any], Any], pd.Series],
+        return_type: ks.typedef.Dtype = str,


pandas doesn't accept a `return_type` argument: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html

Btw, shall we import `Dtype` in the header if we still need it?

Also, `str` is not a `Dtype`:

koalas/databricks/koalas/typedef/typehints.py

Line 75 in f845001

Dtype = Union[np.dtype, ExtensionDtype]
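
The point about `str` can be checked with a few lines. This is only an illustrative sketch: it rebuilds the alias locally from NumPy and the public pandas `ExtensionDtype` class rather than importing it from Koalas, and the variable names are made up for the example.

```python
from typing import Union

import numpy as np
from pandas.api.extensions import ExtensionDtype

# Local copy of the alias quoted above (assumption: same members as in typehints.py).
Dtype = Union[np.dtype, ExtensionDtype]

# The built-in `str` is a Python class, not an instance of either member of the alias.
str_is_dtype = isinstance(str, (np.dtype, ExtensionDtype))

# NumPy can convert it to an actual dtype object, which does satisfy the alias.
converted = np.dtype(str)
converted_is_dtype = isinstance(converted, (np.dtype, ExtensionDtype))
```

So a default such as `np.dtype(str)` would at least match the annotation, although the review suggests dropping the parameter altogether.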

I'll fix this. I added the `return_type` parameter so that the pandas_udf used to execute the transformation could use it. I'll think of a way to discover the return type at run time so that the `return_type` parameter can be removed.
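
One possible way to discover the type at run time is to apply the mapper to a small sample and read the dtype of the result. The sketch below shows the idea in plain pandas only. `infer_mapped_dtype` and its `sample_size` parameter are hypothetical names, not part of Koalas or of this PR, and a real implementation would still need to translate the inferred dtype into a Spark return type for the pandas_udf.

```python
import pandas as pd


def infer_mapped_dtype(index: pd.Index, mapper, sample_size: int = 1000):
    """Hypothetical helper: map only the first rows and report the resulting dtype."""
    sample = index[:sample_size]
    # pandas.Index.map accepts a dict, a Series, or a callable.
    return sample.map(mapper).dtype


idx = pd.Index([1, 2, 3])

# Integer in, integer out.
int_dtype = infer_mapped_dtype(idx, lambda x: x + 1)

# Integer in, string labels out.
label_dtype = infer_mapped_dtype(idx, {1: "one", 2: "two", 3: "three"})
```

A sample can be misleading, for example when later rows hit keys that are missing from a dict mapper, so this is a starting point rather than a complete answer.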

ueshin · 2021-04-08T01:27:20Z

databricks/koalas/tests/indexes/test_base.py

+        self.assert_eq(
+            kser.index.map(lambda id: id + DateOffset(days=1), return_type=datetime),
+            ks.Series([1, 2, 3, 4], index=pd.date_range("2018-04-10", periods=4, freq="2D")).index,
+        )


Could you also add tests with CategoricalIndex?
You can put the tests in test_categorical.py.

@ueshin I am looking at adding tests for CategoricalIndex but I am seeing some unexpected behaviour. Could you help explain it?

The current implementation of `Index.map` leverages `_with_new_scol` to avoid collecting anything to the driver. When you have a CategoricalIndex such as `ks.CategoricalIndex(["a", "b", "c"])`, the returned `spark_frame` is

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                0|                0|
|                1|       8589934592|
|                2|      17179869184|
+-----------------+-----------------+

I was expecting

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                a|                0|
|                b|       8589934592|
|                c|      17179869184|
+-----------------+-----------------+

Is my expectation incorrect?

This seems to be caused by https://github.com/databricks/koalas/blob/master/databricks/koalas/frame.py#L510

If you have `pdf = pd.DataFrame(index=pd.CategoricalIndex(["a", "b", "c"]))`, then `InternalFrame.from_pandas(pdf).spark_frame` returns

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                0|                0|
|                1|       8589934592|
|                2|      17179869184|
+-----------------+-----------------+

Whereas if you have `pdf = pd.DataFrame(index=pd.Index(["a", "b", "c"]))`, then `InternalFrame.from_pandas(pdf).spark_frame` returns

+-----------------+-----------------+
|__index_level_0__|__natural_order__|
+-----------------+-----------------+
|                a|                0|
|                b|       8589934592|
|                c|      17179869184|
+-----------------+-----------------+

Apologies if this is a silly question. Thanks in advance

This behavior is expected: so far, Koalas manages only the Categorical's codes in Spark, while its categories are kept in the metadata (`index_dtypes` or `data_dtypes`).
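
The same split between codes and categories exists in pandas itself, which may help when reading the Spark frames above. The snippet is a small pandas-only illustration; it does not touch Koalas internals such as `InternalFrame`, and the variable names are chosen for the example.

```python
import pandas as pd

cidx = pd.CategoricalIndex(["a", "b", "c"])

# Integer codes: the kind of values that show up in __index_level_0__ above.
codes = cidx.codes.tolist()

# The labels live separately, in the categories attached to the dtype.
categories = list(cidx.dtype.categories)

# Recombining the codes with the dtype restores the original labels.
restored = list(pd.Categorical.from_codes(cidx.codes, dtype=cidx.dtype))
```

A `map` over a CategoricalIndex therefore has to work with the labels in the dtype rather than with the values stored in the Spark column.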

codecov-io · 2021-04-08T01:31:08Z

Codecov Report

Merging #2136 (f845001) into master (0fd088e) will decrease coverage by 0.02%.
The diff coverage is 92.22%.

@@            Coverage Diff             @@
##           master    #2136      +/-   ##
==========================================
- Coverage   95.37%   95.34%   -0.03%     
==========================================
  Files          60       62       +2     
  Lines       13694    13780      +86     
==========================================
+ Hits        13060    13139      +79     
- Misses        634      641       +7

| Impacted Files | Coverage Δ | |
|---|---|---|
| databricks/koalas/missing/indexes.py | `100.00% <ø> (ø)` | |
| databricks/koalas/indexes/extension.py | `83.33% <83.33%> (ø)` | |
| databricks/koalas/indexes/base.py | `97.33% <100.00%> (+0.01%)` | ⬆️ |
| databricks/koalas/tests/indexes/test_base.py | `100.00% <100.00%> (ø)` | |
| databricks/koalas/tests/indexes/test_extension.py | `100.00% <100.00%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

xinrong-meng · 2021-04-08T01:55:47Z

databricks/koalas/indexes/base.py

+        na_action: Any = None,
+    ):
+        """
+        Use to change Index values


How about leveraging pandas docstring? It would be great to maintain the compatibility between Koalas APIs and pandas APIs.

xinrong-meng · 2021-04-08T01:59:48Z

Would you add a new entry Index.map under the Conversion section of https://github.com/databricks/koalas/blob/master/docs/source/reference/indexing.rst?

xinrong-meng · 2021-04-29T17:21:02Z

docs/source/reference/indexing.rst

@@ -28,6 +28,7 @@ Properties
   Index.inferred_type
   Index.is_all_dates
   Index.shape
+   Index.map


Shall we also add CategoricalIndex.map?

Following https://pandas.pydata.org/docs/reference/indexing.html#id1, we'd better add "Modifying and computations" section under "CategoricalIndex".

xinrong-meng · 2021-08-03T20:40:38Z

Hi @awdavidson, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36394. Otherwise, I may do that for you next week.

awdavidson · 2021-08-03T22:07:36Z

> Hi @awdavidson, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36394. Otherwise, I may do that for you next week.

Hi @xinrong-databricks, apologies for the delay; it has been crazily busy the last few months! Yes, if you don’t mind, please migrate it. I’ll look at addressing your last comment as soon as possible.

xinrong-meng · 2021-08-03T23:05:56Z

:) That's totally fine. I will migrate it and keep you updated then.

### What changes were proposed in this pull request?

Implement `Index.map`.

The PR is based on databricks/koalas#2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?

Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html). We shall also support that.

### Does this PR introduce _any_ user-facing change?

Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?

Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>

The same commit message, with the additional trailer:

(cherry picked from commit 4dcd746)
Signed-off-by: Takuya UESHIN <[email protected]>
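
For comparison, the pandas behaviour that the `Index.map` examples in the commit message mirror can be checked directly. The snippet below uses pandas only (not pandas API on Spark) and adds one case the commit message does not cover: in pandas, keys missing from a dict mapper become NaN. Whether pandas API on Spark matches that edge case is not stated in the commit message, so treat it as something to verify.

```python
import pandas as pd

pidx = pd.Index([1, 2, 3])

# The three mapper kinds from the commit message: dict, function, Series.
by_dict = pidx.map({1: "one", 2: "two", 3: "three"})
by_func = pidx.map(lambda id: "{id} + 1".format(id=id))
by_series = pidx.map(pd.Series(["one", "two", "three"], index=[1, 2, 3]))

# A key that is absent from the dict maps to a missing value in pandas.
partial = pidx.map({1: "one"})
```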

xinrong-meng · 2021-09-01T21:54:59Z

Hi @awdavidson, I would like to close this PR since it has been migrated to Spark. Thanks!

awdavidson added 14 commits April 1, 2021 13:53

Initial Index.map impl

5ed61f2

Initial Index.map impl

e2ac4f0

Reformat

ae9c3f3

Add pd.Series compatability

1a55284

Avoid collects

13706cc

Update impl

f48da2e

Clean up impl and add docs

1f794f7

reformat

18e37c0

reformat

3499aa0

reformat

a949400

Reformat

af72e24

Reformat

97e03d3

Fix comment

7c1c678

Remove unused import

ced7d97

Update

694650a

Merge branch 'master' of github.com:databricks/koalas into feature/im…

f845001

…pl-index_map

ueshin reviewed Apr 8, 2021

xinrong-meng reviewed Apr 8, 2021

awdavidson added 6 commits April 10, 2021 19:38

Add categorical mapping

136bd4c

Reformat

06b250a

Remove print statement

0c74b95

Remove unused import

0bd6b3e

Final tweaks

e52a4ca

Remove unused import

289b573

awdavidson and others added 11 commits April 11, 2021 21:19

minor cast tweaks

f0697ee

Reformat

8d206d9

Fix docstring

ea8fc7f

Fix docstring

7f78833

Fix docstring

a6cd83e

Fix docstring

b57224d

Fix docstring

fe338c8

reformat

491a57a

reformat

3568d0a

fix docstring

e7957bd

Fix docstring

751e77d

awdavidson requested review from ueshin and xinrong-meng April 13, 2021 17:27

xinrong-meng reviewed Apr 29, 2021

xinrong-meng mentioned this pull request Aug 10, 2021

[SPARK-36469][PYTHON] Implement Index.map apache/spark#33694

Closed

xinrong-meng closed this Sep 1, 2021
