Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrame.last and Series.last functionality #2121

Merged
merged 18 commits into from
Mar 30, 2021

Conversation

awdavidson
Copy link
Contributor

Please see change to implement DataFrame.last and Series.last functionality similar to that available in pandas. Requirement raised in issue: #1929

>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
dtype: int64

>>> ks_series.last('3D')
2018-04-13  3
2018-04-15  4
dtype: int64
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> pdf = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> kdf = fs.from_pandas(pdf)
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

 >>> kdf.last('3D')
            A
2018-04-13  3
2018-04-15  4      

@awdavidson awdavidson closed this Mar 26, 2021
@awdavidson awdavidson reopened this Mar 26, 2021
@awdavidson awdavidson changed the title Implement DataFrame.last and Series.last functionality [WIP]Implement DataFrame.last and Series.last functionality Mar 26, 2021
@awdavidson awdavidson changed the title [WIP]Implement DataFrame.last and Series.last functionality Implement DataFrame.last and Series.last functionality Mar 27, 2021
@awdavidson
Copy link
Contributor Author

@xinrong-databricks please can I get a review on this PR?

@xinrong-meng
Copy link
Contributor

Certainly! Thank you!

Comment on lines 5691 to 5708
kdf = self.copy()
kdf.index.name = verify_temp_column_name(kdf, "__index_name__")

def pandas_loc(pdf):
return pdf.loc[from_date:index_max].reset_index()

# apply_batch will remove the index of the Koalas DataFrame and attach a default index,
# which will never be used. So use "distributed" index as a dummy to avoid overhead.
with option_context("compute.default_index_type", "distributed"):
kdf = kdf.koalas.apply_batch(pandas_loc)

return DataFrame(
self._internal.copy(
spark_frame=kdf._internal.spark_frame,
index_spark_columns=kdf._internal.data_spark_columns[:1],
data_spark_columns=kdf._internal.data_spark_columns[1:],
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend not to use return self[from_date:index_max]?

Copy link
Contributor Author

@awdavidson awdavidson Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original implementation did something similar return self.loc[from_date:index_max], however, build failed due to Incompatible return value type (got "Union[Series[Any], DataFrame[Any]]", expected "DataFrame[Any]")

I changed implementation to ensure return type matched the expected and avoided the error Slice index must be an integer or None. I definitely prefer the original implementation, however, struggled finding a way to meet the build requirement with minimal changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies I've just pushed a change that follows similar implementation to the original and what you suggested and avoids mypy build errors.

As we only require the from_date we can filter from the [from_date:] rather than filtering between a range [from_date:index_max]

Note: self[from_date:index_max] throws mypy error Slice index must be an integer or None our index is a Timestamp and would want to avoid type conversions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing your investigation!

It is quite helpful to know about the mypy slice check. Your current approach is smart. :)

@codecov-io
Copy link

codecov-io commented Mar 29, 2021

Codecov Report

Merging #2121 (4843000) into master (9e361e3) will decrease coverage by 0.94%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2121      +/-   ##
==========================================
- Coverage   95.37%   94.43%   -0.95%     
==========================================
  Files          60       60              
  Lines       13581    13600      +19     
==========================================
- Hits        12953    12843     -110     
- Misses        628      757     +129     
Impacted Files Coverage Δ
databricks/koalas/missing/frame.py 100.00% <ø> (ø)
databricks/koalas/missing/series.py 100.00% <ø> (ø)
databricks/koalas/frame.py 96.53% <88.88%> (+0.01%) ⬆️
databricks/koalas/series.py 96.93% <100.00%> (+<0.01%) ⬆️
databricks/koalas/usage_logging/__init__.py 27.27% <0.00%> (-65.29%) ⬇️
databricks/koalas/usage_logging/usage_logger.py 47.82% <0.00%> (-52.18%) ⬇️
databricks/conftest.py 92.18% <0.00%> (-7.82%) ⬇️
databricks/koalas/typedef/typehints.py 89.28% <0.00%> (-6.06%) ⬇️
databricks/koalas/__init__.py 88.15% <0.00%> (-3.95%) ⬇️
databricks/koalas/tests/indexes/test_datetime.py 97.82% <0.00%> (-2.18%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e361e3...4843000. Read the comment docs.

with self.assertRaises(TypeError):
self.kser.last("1D")
self.assert_eq(ks_input.last("1D"), pd_input.last("1D"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a negative test case here? For example,

        with self.assertRaisesRegex(TypeError, "'last' only supports a DatetimeIndex"):
            ks.DataFrame([1, 2, 3, 4]).last("1D")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely!

@xinrong-meng
Copy link
Contributor

Except for one more test case, looks good to me! Thank you!

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM.
@awdavidson Thanks for working on this!

offset = to_offset(offset)
from_date = self.index.max() - offset

return self[from_date:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we explicitly use loc indexer, just in case?

return cast(DataFrame, self.loc[from_date:])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware of the cast functionality. Last commit implements you suggestion :)

@@ -5202,6 +5202,15 @@ def test_last_valid_index(self):
kdf = ks.Series([]).to_frame()
self.assert_eq(pdf.last_valid_index(), kdf.last_valid_index())

def test_last(self):
from pandas.tseries.offsets import DateOffset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can import it at the import block at the header of this file.

@awdavidson
Copy link
Contributor Author

Except for one more test case, looks good to me! Thank you!

@xinrong-databricks sure will add the additional test case.

@ueshin happy to explicitly use loc although same functionality I suspect it's easier for people to read and explicitly showing the use of loc? Please let me know what you would prefer!

@ueshin
Copy link
Collaborator

ueshin commented Mar 30, 2021

@awdavidson I prefer to explicitly use loc since __getitem__ has additional logic in it.

@awdavidson
Copy link
Contributor Author

@awdavidson I prefer to explicitly use loc since __getitem__ has additional logic in it.

Sounds good - it's already been implemented. Ive requested your review

@ueshin
Copy link
Collaborator

ueshin commented Mar 30, 2021

Thanks! merging.

@ueshin ueshin merged commit 2dce0d1 into databricks:master Mar 30, 2021
@awdavidson
Copy link
Contributor Author

Thanks! merging.

Thank you for taking the time to review!

@awdavidson awdavidson deleted the feature/impl-last branch March 30, 2021 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants