Implement `DataFrame.last` and `Series.last` functionality #2121

awdavidson · 2021-03-26T11:38:16Z

This change implements `DataFrame.last` and `Series.last`, similar to the functionality available in pandas. The requirement was raised in issue #1929.

>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
dtype: int64

>>> ks_series.last('3D')
2018-04-13  3
2018-04-15  4
dtype: int64

>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> pdf = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)
>>> kdf = ks.from_pandas(pdf)
>>> kdf
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

>>> kdf.last('3D')
            A
2018-04-13  3
2018-04-15  4
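
The behaviour in the examples above can be sketched with plain pandas, without a Spark session. This is only an illustration of the approach discussed in this PR (subtract the offset from the maximum index value, then slice from that date); the helper name `last_rows` is hypothetical and is not part of Koalas or pandas.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def last_rows(obj, offset):
    """Hypothetical helper: keep the rows whose index lies in the final `offset` period."""
    if not isinstance(obj.index, pd.DatetimeIndex):
        raise TypeError("'last' only supports a DatetimeIndex")
    # Start of the window: the latest index value minus the offset.
    from_date = obj.index.max() - to_offset(offset)
    # Label-based slice from the window start to the end of the index.
    return obj.loc[from_date:]


index = pd.date_range("2018-04-09", periods=4, freq="2D")
pser = pd.Series([1, 2, 3, 4], index=index)
result = last_rows(pser, "3D")
print(result)
```

Here the window starts at 2018-04-12, so only the rows for 2018-04-13 and 2018-04-15 remain.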

awdavidson · 2021-03-27T11:19:36Z

@xinrong-databricks please can I get a review on this PR?

xinrong-meng · 2021-03-29T15:06:26Z

Certainly! Thank you!

databricks/koalas/series.py

databricks/koalas/frame.py

xinrong-meng · 2021-03-29T16:22:45Z

databricks/koalas/frame.py

+        kdf = self.copy()
+        kdf.index.name = verify_temp_column_name(kdf, "__index_name__")
+
+        def pandas_loc(pdf):
+            return pdf.loc[from_date:index_max].reset_index()
+
+        # apply_batch will remove the index of the Koalas DataFrame and attach a default index,
+        # which will never be used. So use "distributed" index as a dummy to avoid overhead.
+        with option_context("compute.default_index_type", "distributed"):
+            kdf = kdf.koalas.apply_batch(pandas_loc)
+
+        return DataFrame(
+            self._internal.copy(
+                spark_frame=kdf._internal.spark_frame,
+                index_spark_columns=kdf._internal.data_spark_columns[:1],
+                data_spark_columns=kdf._internal.data_spark_columns[1:],
+            )
+        )


Did you intend not to use `return self[from_date:index_max]`?

My original implementation did something similar, `return self.loc[from_date:index_max]`; however, the build failed with `Incompatible return value type (got "Union[Series[Any], DataFrame[Any]]", expected "DataFrame[Any]")`.

I changed the implementation so that the return type matched the expected one and avoided the error `Slice index must be an integer or None`. I definitely prefer the original implementation, but I struggled to find a way to meet the build requirement with minimal changes.

Apologies, I've just pushed a change that follows an implementation similar to the original and to what you suggested, and that avoids the mypy build errors.

As we only require `from_date`, we can slice with `[from_date:]` rather than filtering over the range `[from_date:index_max]`.

Note: `self[from_date:index_max]` throws the mypy error `Slice index must be an integer or None`; our index is a Timestamp, and I would want to avoid type conversions.

Thanks for sharing your investigation!

It is quite helpful to know about the mypy slice check. Your current approach is smart. :)
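
The point about dropping the upper bound can be checked with a small pandas sketch. It assumes a sorted `DatetimeIndex` and uses pandas rather than Koalas so that it runs without Spark; the variable names are illustrative only.

```python
import pandas as pd

index = pd.date_range("2018-04-09", periods=4, freq="2D")
pdf = pd.DataFrame({"A": [1, 2, 3, 4]}, index=index)

index_max = pdf.index.max()
from_date = index_max - pd.Timedelta(days=3)

# Slice with an explicit upper bound equal to the largest index value...
bounded = pdf.loc[from_date:index_max]
# ...and with an open-ended slice that only names the lower bound.
open_ended = pdf.loc[from_date:]

print(bounded.equals(open_ended))
```

Because `index_max` is already the largest label, the upper bound excludes nothing, so both slices select the same rows.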

codecov-io · 2021-03-29T20:13:18Z

Codecov Report

Merging #2121 (4843000) into master (9e361e3) will decrease coverage by 0.94%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master    #2121      +/-   ##
==========================================
- Coverage   95.37%   94.43%   -0.95%     
==========================================
  Files          60       60              
  Lines       13581    13600      +19     
==========================================
- Hits        12953    12843     -110     
- Misses        628      757     +129

| Impacted Files | Coverage Δ | |
|---|---|---|
| databricks/koalas/missing/frame.py | `100.00% <ø> (ø)` | |
| databricks/koalas/missing/series.py | `100.00% <ø> (ø)` | |
| databricks/koalas/frame.py | `96.53% <88.88%> (+0.01%)` | ⬆️ |
| databricks/koalas/series.py | `96.93% <100.00%> (+<0.01%)` | ⬆️ |
| databricks/koalas/usage_logging/__init__.py | `27.27% <0.00%> (-65.29%)` | ⬇️ |
| databricks/koalas/usage_logging/usage_logger.py | `47.82% <0.00%> (-52.18%)` | ⬇️ |
| databricks/conftest.py | `92.18% <0.00%> (-7.82%)` | ⬇️ |
| databricks/koalas/typedef/typehints.py | `89.28% <0.00%> (-6.06%)` | ⬇️ |
| databricks/koalas/__init__.py | `88.15% <0.00%> (-3.95%)` | ⬇️ |
| databricks/koalas/tests/indexes/test_datetime.py | `97.82% <0.00%> (-2.18%)` | ⬇️ |
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

xinrong-meng · 2021-03-30T16:32:49Z

databricks/koalas/tests/test_series.py

+        with self.assertRaises(TypeError):
+            self.kser.last("1D")
+        self.assert_eq(ks_input.last("1D"), pd_input.last("1D"))
+


Shall we add a negative test case here? For example,

with self.assertRaisesRegex(TypeError, "'last' only supports a DatetimeIndex"):
    ks.DataFrame([1, 2, 3, 4]).last("1D")

Definitely!
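
As a rough illustration of why `assertRaisesRegex` is suggested here, the sketch below checks both the exception type and its message using only the standard library. The guard function `check_datetime_index` is a hypothetical stand-in for the validation inside `last`; it is not Koalas code.

```python
import unittest
from datetime import datetime


def check_datetime_index(index):
    """Hypothetical guard: reject any index that is not made of datetimes."""
    if not all(isinstance(value, datetime) for value in index):
        raise TypeError("'last' only supports a DatetimeIndex")


class LastNegativeTest(unittest.TestCase):
    def test_non_datetime_index(self):
        # Passes only if a TypeError is raised AND its message matches the pattern.
        with self.assertRaisesRegex(TypeError, "'last' only supports a DatetimeIndex"):
            check_datetime_index([0, 1, 2, 3])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(LastNegativeTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Compared with a bare `assertRaises(TypeError)`, matching the message guards against an unrelated `TypeError` making the test pass by accident.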

xinrong-meng · 2021-03-30T16:34:24Z

Except for one more test case, looks good to me! Thank you!

ueshin

Otherwise, LGTM.
@awdavidson Thanks for working on this!

ueshin · 2021-03-30T17:13:44Z

databricks/koalas/frame.py

+        offset = to_offset(offset)
+        from_date = self.index.max() - offset
+
+        return self[from_date:]


Shall we explicitly use loc indexer, just in case?

return cast(DataFrame, self.loc[from_date:])

I wasn't aware of the `cast` functionality. The last commit implements your suggestion :)
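
For readers unfamiliar with it, here is a minimal sketch of how `typing.cast` narrows a `Union` return type for the type checker. The classes `Frame` and `Column` are hypothetical stand-ins, not the Koalas `DataFrame` and `Series`.

```python
from typing import Union, cast


class Column:
    """Hypothetical stand-in for a Series-like result."""


class Frame:
    """Hypothetical stand-in for a DataFrame-like result."""

    def select(self, key: object) -> Union["Column", "Frame"]:
        # Declared as a Union, like an indexer that may return either type.
        return Frame() if isinstance(key, slice) else Column()

    def tail_from(self, start: object) -> "Frame":
        # cast() only informs the type checker; it performs no runtime check.
        return cast(Frame, self.select(slice(start, None)))


result = Frame().tail_from("2018-04-12")
print(type(result).__name__)
```

Since `cast` returns its argument unchanged at runtime, it is only safe when the caller already knows which branch of the `Union` will be produced, as is the case for a slice here.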

ueshin · 2021-03-30T17:15:01Z

databricks/koalas/tests/test_dataframe.py

@@ -5202,6 +5202,15 @@ def test_last_valid_index(self):
        kdf = ks.Series([]).to_frame()
        self.assert_eq(pdf.last_valid_index(), kdf.last_valid_index())

+    def test_last(self):
+        from pandas.tseries.offsets import DateOffset


You can import it at the import block at the header of this file.

awdavidson · 2021-03-30T17:34:14Z

> Except for one more test case, looks good to me! Thank you!

@xinrong-databricks sure will add the additional test case.

@ueshin happy to explicitly use `loc`. Although the functionality is the same, I suspect it's easier for people to read when the use of `loc` is explicit? Please let me know what you would prefer!

ueshin · 2021-03-30T17:59:43Z

@awdavidson I prefer to explicitly use loc since __getitem__ has additional logic in it.

awdavidson · 2021-03-30T19:59:51Z

> @awdavidson I prefer to explicitly use loc since __getitem__ has additional logic in it.

Sounds good - it's already been implemented. I've requested your review.

ueshin · 2021-03-30T21:03:30Z

Thanks! merging.

awdavidson · 2021-03-30T21:32:56Z

> Thanks! merging.

Thank you for taking the time to review!

awdavidson added 2 commits March 26, 2021 11:30

Implement last functionality

f003469

Reformat files

163f39a

awdavidson closed this Mar 26, 2021

awdavidson reopened this Mar 26, 2021

awdavidson changed the title ~~Implement DataFrame.last and Series.last functionality~~ [WIP]Implement DataFrame.last and Series.last functionality Mar 26, 2021

Ensure impl meets build reqs

8974de3

awdavidson changed the title ~~[WIP]Implement DataFrame.last and Series.last functionality~~ Implement DataFrame.last and Series.last functionality Mar 27, 2021

awdavidson added 3 commits March 27, 2021 09:46

Reformat

63d9b48

Reformat

f257330

Fix docstring

cb0a184

awdavidson and others added 4 commits March 27, 2021 19:20

Remove offset import and access directly via pd

3ac0cc0

Fix to_offset import

3b4a4d7

Fix docstring

0184627

Fix docstring format

39f56c3

xinrong-meng reviewed Mar 29, 2021

databricks/koalas/series.py (outdated)

xinrong-meng reviewed Mar 29, 2021

databricks/koalas/series.py (outdated)

xinrong-meng reviewed Mar 29, 2021

databricks/koalas/frame.py (outdated)

xinrong-meng reviewed Mar 29, 2021

awdavidson and others added 5 commits March 29, 2021 20:19

Address PR comments

675dbc2

Merge branch 'master' into feature/impl-last

59fcc58

Simplify implementation

3e8d7e9

Simplify implementation

a196e2e

Merge remote-tracking branch 'origin/feature/impl-last' into feature/…

00f3c7f

…impl-last

awdavidson added 2 commits March 29, 2021 21:19

reformat

d89f442

Address PR comments

da8bbed

awdavidson requested review from ueshin and xinrong-meng March 30, 2021 07:52

xinrong-meng reviewed Mar 30, 2021

ueshin reviewed Mar 30, 2021

Address PR comments

4843000

awdavidson requested review from ueshin and xinrong-meng March 30, 2021 19:59

xinrong-meng approved these changes Mar 30, 2021

ueshin approved these changes Mar 30, 2021

ueshin merged commit 2dce0d1 into databricks:master Mar 30, 2021

awdavidson deleted the feature/impl-last branch March 30, 2021 21:48
