Implement DataFrame.last and Series.last functionality #2121
@xinrong-databricks please can I get a review on this PR?
Certainly! Thank you!
databricks/koalas/frame.py
Outdated
kdf = self.copy()
kdf.index.name = verify_temp_column_name(kdf, "__index_name__")

def pandas_loc(pdf):
    return pdf.loc[from_date:index_max].reset_index()

# apply_batch will remove the index of the Koalas DataFrame and attach a default index,
# which will never be used. So use "distributed" index as a dummy to avoid overhead.
with option_context("compute.default_index_type", "distributed"):
    kdf = kdf.koalas.apply_batch(pandas_loc)

return DataFrame(
    self._internal.copy(
        spark_frame=kdf._internal.spark_frame,
        index_spark_columns=kdf._internal.data_spark_columns[:1],
        data_spark_columns=kdf._internal.data_spark_columns[1:],
    )
)
Did you intend not to use return self[from_date:index_max]?
My original implementation did something similar, return self.loc[from_date:index_max]; however, the build failed with Incompatible return value type (got "Union[Series[Any], DataFrame[Any]]", expected "DataFrame[Any]").
I changed the implementation so that the return type matched the expected one and the error Slice index must be an integer or None was avoided. I definitely prefer the original implementation, but I struggled to find a way to meet the build requirement with minimal changes.
Apologies, I've just pushed a change that follows an implementation similar to the original and to what you suggested, and that avoids the mypy build errors.
As we only require from_date, we can filter with [from_date:] rather than filtering over a range [from_date:index_max].
Note: self[from_date:index_max] throws the mypy error Slice index must be an integer or None. Our index is a Timestamp, and I would want to avoid type conversions.
Thanks for sharing your investigation!
It is quite helpful to know about the mypy slice check. Your current approach is smart. :)
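For readers following the thread, here is a minimal sketch of the open-ended slice discussed above, written against plain pandas rather than Koalas. The helper name last_rows and the sample data are hypothetical and exist only for illustration; the sketch assumes a sorted DatetimeIndex and is not the code merged in this PR.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def last_rows(df: pd.DataFrame, offset: str) -> pd.DataFrame:
    """Hypothetical helper: keep rows from (index.max() - offset) onwards."""
    # Only the lower bound is computed; the slice is left open-ended,
    # so there is no need to pass index.max() as an upper bound.
    from_date = df.index.max() - to_offset(offset)
    return df.loc[from_date:]


# Sample data (assumed for illustration): one row every two days.
index = pd.date_range("2018-04-09", periods=4, freq="2D")
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=index)

result = last_rows(df, "3D")  # keeps the rows dated 2018-04-13 and 2018-04-15
```

Because the maximum index value is always the upper end of the data, the open-ended slice selects the same rows as slicing up to index.max().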
Codecov Report
@@ Coverage Diff @@
## master #2121 +/- ##
==========================================
- Coverage 95.37% 94.43% -0.95%
==========================================
Files 60 60
Lines 13581 13600 +19
==========================================
- Hits 12953 12843 -110
- Misses 628 757 +129
with self.assertRaises(TypeError):
    self.kser.last("1D")
self.assert_eq(ks_input.last("1D"), pd_input.last("1D"))
Shall we add a negative test case here? For example,
with self.assertRaisesRegex(TypeError, "'last' only supports a DatetimeIndex"):
    ks.DataFrame([1, 2, 3, 4]).last("1D")
Definitely!
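As a rough illustration of the suggested negative test, the sketch below uses the standard unittest module with plain pandas. The guard function require_datetime_index and the class name LastNegativeTest are hypothetical stand-ins; only the error message is taken from the suggestion above.

```python
import unittest

import pandas as pd


def require_datetime_index(df: pd.DataFrame) -> None:
    """Hypothetical guard standing in for the index check inside last()."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("'last' only supports a DatetimeIndex")


class LastNegativeTest(unittest.TestCase):
    def test_non_datetime_index_raises(self):
        # A default RangeIndex is not a DatetimeIndex, so the guard must raise
        # and the message must match the given regular expression.
        with self.assertRaisesRegex(TypeError, "'last' only supports a DatetimeIndex"):
            require_datetime_index(pd.DataFrame([1, 2, 3, 4]))

    def test_datetime_index_passes(self):
        index = pd.date_range("2018-04-09", periods=4, freq="2D")
        require_datetime_index(pd.DataFrame({"A": [1, 2, 3, 4]}, index=index))
```

Compared with assertRaises, assertRaisesRegex also checks the message, so an unrelated TypeError raised elsewhere would not let the test pass by accident.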
Except for one more test case, looks good to me! Thank you!
Otherwise, LGTM.
@awdavidson Thanks for working on this!
databricks/koalas/frame.py
Outdated
offset = to_offset(offset)
from_date = self.index.max() - offset

return self[from_date:]
Shall we explicitly use the loc indexer, just in case?
return cast(DataFrame, self.loc[from_date:])
I wasn't aware of the cast functionality. The last commit implements your suggestion :)
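For context, a small sketch of how typing.cast narrows a Union return type for the type checker. The Series and DataFrame classes here are hypothetical stand-ins, not the Koalas classes, and the select method is invented purely to produce a Union-typed result.

```python
from typing import Union, cast


class Series:
    """Hypothetical stand-in for a one-dimensional result."""


class DataFrame:
    """Hypothetical stand-in for a two-dimensional container."""

    def select(self, key: object) -> Union["Series", "DataFrame"]:
        # Declared with a broad return type, similar in spirit to an indexer
        # that may hand back either kind of object.
        return self if isinstance(key, slice) else Series()

    def last_rows(self, start: object) -> "DataFrame":
        # cast() only informs the type checker. At runtime it returns its
        # second argument unchanged and performs no conversion or check.
        return cast(DataFrame, self.select(slice(start, None)))


frame = DataFrame()
narrowed = frame.last_rows("2018-04-12")
```

Since cast does nothing at runtime, it is only safe when the caller already knows which branch of the Union is produced, as is the case for a slice in this sketch.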
@@ -5202,6 +5202,15 @@ def test_last_valid_index(self):
        kdf = ks.Series([]).to_frame()
        self.assert_eq(pdf.last_valid_index(), kdf.last_valid_index())

    def test_last(self):
        from pandas.tseries.offsets import DateOffset
You can import it at the import block at the header of this file.
@xinrong-databricks sure, will add the additional test case. @ueshin happy to explicitly use
@awdavidson I prefer to explicitly use
Sounds good - it's already been implemented. I've requested your review.
Thanks! Merging.
Thank you for taking the time to review!
Please see the change to implement DataFrame.last and Series.last functionality similar to that available in pandas. Requirement raised in issue: #1929