Implemented GroupBy.tail #1949

itholic · 2020-12-03T05:52:16Z

This PR proposes GroupBy.tail() for DataFrameGroupBy and SeriesGroupBy.

>>> df = ks.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
...                    'b': [2, 3, 1, 4, 6, 9, 8, 10, 7, 5],
...                    'c': [3, 5, 2, 5, 1, 2, 6, 4, 3, 6]},
...                   columns=['a', 'b', 'c'],
...                   index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> df
    a   b  c
7   1   2  3
2   1   3  5
4   1   1  2
1   1   4  5
3   2   6  1
4   2   9  2
9   2   8  6
10  3  10  4
5   3   7  3
6   3   5  6

>>> df.groupby('a').tail(2).sort_index()
   a  b  c
1  1  4  5
4  1  1  2
4  2  9  2
5  3  7  3
6  3  5  6
9  2  8  6

>>> df.groupby('a')['b'].tail(2).sort_index()
1    4
4    1
4    9
5    7
6    5
9    8
Name: b, dtype: int64

itholic · 2020-12-03T05:53:12Z

databricks/koalas/groupby.py

+
+        sdf = kdf._internal.spark_frame
+        tmp_col = verify_temp_column_name(sdf, "__row_number__")
+        window = Window.partitionBy(groupkey_scols).orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).desc())


This implementation is basically the same as GroupBy.head(), except that this line uses descending order.

ueshin · 2020-12-08T21:00:53Z

Then, shall we combine those two? Like:

def _limit(n, asc: bool):
    ...
    window = ... orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).asc() if asc else F.col(NATURAL_ORDER_COLUMN_NAME).desc())
    ...

def head(self, n):
    return self._limit(n, asc=True)

def tail(self, n):
    return self._limit(n, asc=False)

itholic · 2020-12-09

Cool! Let me address it. Thanks for the suggestion :)
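The combined helper discussed in this thread can be illustrated without Spark. The sketch below is plain Python and is not the Koalas code: `limit_per_group` is a hypothetical stand-in for `_limit`, the position from `enumerate` plays the role of `NATURAL_ORDER_COLUMN_NAME`, and sorting each group's rows by that position stands in for the row number computed over the window. Rejecting negative `n` is a simplification of this sketch only.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def limit_per_group(rows: Sequence[Row], key_index: int, n: int, asc: bool) -> List[Row]:
    """Keep up to n rows per group, counted from the front (asc) or the back."""
    if n < 0:
        # Simplification for this sketch; negative n is not modeled here.
        raise ValueError("n must be non-negative in this sketch")

    # Tag every row with its position, a stand-in for the natural-order column.
    groups = defaultdict(list)
    for position, row in enumerate(rows):
        groups[row[key_index]].append((position, row))

    kept = []
    for members in groups.values():
        # Rank rows inside the group in ascending or descending natural order,
        # then keep the first n of that ranking.
        ranked = sorted(members, key=lambda m: m[0], reverse=not asc)
        kept.extend(ranked[:n])

    # Return the surviving rows in their original relative order.
    kept.sort(key=lambda m: m[0])
    return [row for _, row in kept]


def head(rows: Sequence[Row], key_index: int, n: int) -> List[Row]:
    return limit_per_group(rows, key_index, n, asc=True)


def tail(rows: Sequence[Row], key_index: int, n: int) -> List[Row]:
    return limit_per_group(rows, key_index, n, asc=False)


# Columns 'a' and 'b' from the example in the PR description.
a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
b = [2, 3, 1, 4, 6, 9, 8, 10, 7, 5]
rows = list(zip(a, b))

print(tail(rows, key_index=0, n=2))
# [(1, 1), (1, 4), (2, 9), (2, 8), (3, 7), (3, 5)]
print(head(rows, key_index=0, n=2))
# [(1, 2), (1, 3), (2, 6), (2, 9), (3, 10), (3, 7)]
```

The only difference between the two wrappers is the sort direction, which is the point of sharing one helper.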

codecov-io · 2020-12-03T08:19:30Z

Codecov Report

Merging #1949 (22f5e74) into master (138c7b8) will decrease coverage by 0.90%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1949      +/-   ##
==========================================
- Coverage   94.64%   93.74%   -0.91%     
==========================================
  Files          49       49              
  Lines       10818    10839      +21     
==========================================
- Hits        10239    10161      -78     
- Misses        579      678      +99

Impacted Files	Coverage Δ
databricks/koalas/missing/groupby.py	`100.00% <ø> (ø)`
databricks/koalas/groupby.py	`91.50% <100.00%> (+0.08%)`	⬆️
databricks/koalas/usage_logging/__init__.py	`24.78% <0.00%> (-67.53%)`	⬇️
databricks/koalas/usage_logging/usage_logger.py	`47.82% <0.00%> (-52.18%)`	⬇️
databricks/koalas/__init__.py	`85.93% <0.00%> (-4.69%)`	⬇️
databricks/conftest.py	`97.10% <0.00%> (-2.90%)`	⬇️
databricks/koalas/series.py	`96.85% <0.00%> (-0.19%)`	⬇️
databricks/koalas/namespace.py	`84.19% <0.00%> (-0.04%)`	⬇️
databricks/koalas/frame.py	`96.72% <0.00%> (-0.03%)`	⬇️
databricks/koalas/missing/frame.py	`100.00% <0.00%> (ø)`
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ueshin

Otherwise, LGTM.

ueshin · 2020-12-08T21:00:53Z

databricks/koalas/groupby.py

+
+        sdf = kdf._internal.spark_frame
+        tmp_col = verify_temp_column_name(sdf, "__row_number__")
+        window = Window.partitionBy(groupkey_scols).orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).desc())


Then, shall we combine those two? Like:

def _limit(n, asc: bool):
    ...
    window = ... orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).asc() if asc else F.col(NATURAL_ORDER_COLUMN_NAME).desc())
    ...

def head(self, n):
    return self._limit(n, asc=True)

def tail(self, n):
    return self._limit(n, asc=False)

xinrong-meng · 2020-12-09T23:27:23Z

ref #1929

xinrong-meng · 2020-12-09T23:43:42Z

Great 👍 !

itholic · 2020-12-10T02:09:45Z

Thanks @ueshin @xinrong-databricks, I'll merge this now.

Implemented GroupBy.tail

9ab0f07

itholic commented Dec 3, 2020

Fix doctest

bb95baf

Fix head -> tail doc

3a5a22d

ueshin reviewed Dec 8, 2020

Add _limit

22f5e74

xinrong-meng self-requested a review December 9, 2020 23:42

xinrong-meng approved these changes Dec 9, 2020

itholic merged commit ba02fa7 into databricks:master Dec 10, 2020
