Implement Koalas Missing APIs #1929
Comments
Hi, I would like to help. I was planning on picking
Please go ahead @AishwaryaKalloli!
Certainly, thank you @AishwaryaKalloli!
Just finished the set up in my local, hopefully will have some updates soon!
I have committed the code, let me know if it is in the right direction. If it is I'll add the test cases and docs.
ref #1929

```
>>> df = ks.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2

>>> for row in df.itertuples():
...     print(row)
...
Koalas(Index='dog', num_legs=4, num_wings=0)
Koalas(Index='hawk', num_legs=2, num_wings=2)
```
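The rows printed by `itertuples()` in this comment have the shape of named tuples: the index label first, then one field per column. As a rough illustration of that shape, and not the code from the PR, the plain-Python sketch below builds equivalent rows with `collections.namedtuple`. The helper name `iter_named_rows` and the dict-of-columns input are assumptions made for the example.

```python
from collections import namedtuple


def iter_named_rows(index, columns, name="Koalas"):
    """Yield one named tuple per row: the index label, then each column value.

    Hypothetical helper for illustration only; `columns` maps a column
    name to a list of values, one per index label.
    """
    row_type = namedtuple(name, ["Index", *columns.keys()])
    for i, label in enumerate(index):
        yield row_type(label, *(values[i] for values in columns.values()))


rows = list(iter_named_rows(["dog", "hawk"],
                            {"num_legs": [4, 2], "num_wings": [0, 2]}))
for row in rows:
    print(row)
```

Fields can then be read by name, for example `rows[0].num_legs`.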
This PR proposes `GroupBy.median()`.

Note: the result can differ slightly from pandas. We use an approximate median based on approximate percentile computation, because computing an exact median across a large dataset is extremely expensive.

```python
>>> kdf = ks.DataFrame({'a': [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.],
...                     'b': [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.],
...                     'c': [3., 5., 2., 5., 1., 2., 6., 4., 3., 6.]},
...                    columns=['a', 'b', 'c'],
...                    index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> kdf
      a     b    c
7   1.0   2.0  3.0
2   1.0   3.0  5.0
4   1.0   1.0  2.0
1   1.0   4.0  5.0
3   2.0   6.0  1.0
4   2.0   9.0  2.0
9   2.0   8.0  6.0
10  3.0  10.0  4.0
5   3.0   7.0  3.0
6   3.0   5.0  6.0

>>> kdf.groupby('a').median().sort_index()  # doctest: +NORMALIZE_WHITESPACE
       b    c
a
1.0  2.0  3.0
2.0  8.0  2.0
3.0  7.0  4.0

>>> kdf.groupby('a')['b'].median().sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64
```

ref #1929
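One place the difference from pandas shows up in this example is group `a == 1.0`, which has four `b` values (1, 2, 3, 4): the output reports 2.0, whereas an exact median that averages the two middle values would give 2.5. The sketch below reproduces both numbers in plain Python. It assumes the approximate result is always an element of the data and uses `statistics.median_low` as a stand-in for that behaviour; it is not the computation Koalas performs, and `group_medians` is a hypothetical helper.

```python
import statistics
from collections import defaultdict


def group_medians(keys, values, median=statistics.median):
    """Group `values` by `keys` and apply `median` to each group."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: median(group) for key, group in sorted(groups.items())}


a = [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.]
b = [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.]

# Exact median: averages the two middle values of an even-sized group.
exact = group_medians(a, b)
# Stand-in for an element-valued median: picks the lower middle value.
element_valued = group_medians(a, b, median=statistics.median_low)

print(exact)
print(element_valued)
```

The two results agree for the odd-sized groups and differ only for the even-sized one.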
Hi @ueshin, @HyukjinKwon, can I proceed with the following DataFrame APIs -
I'll start with the dev once you give me the approval.
@shril sure, please go ahead! Thanks!
@ueshin I was going through this blog post of yours - https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html, and it suggested using Do you suggest to proceed with using the Edit:
@shril I don't have a strong opinion on it. If you can implement it without
Hi @ueshin, I am slightly confused. We don't have
This is the small implementation I tried. Do you think that
ref #1929

```
>>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = kser.factorize()
>>> codes
0    1
1   -1
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c'], dtype='object')

>>> codes, uniques = kser.factorize(na_sentinel=None)
>>> codes
0    1
1    3
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c', None], dtype='object')

>>> codes, uniques = kser.factorize(na_sentinel=-2)
>>> codes
0    1
1   -2
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
```
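To make the `na_sentinel` behaviour in these outputs easier to follow, here is a minimal plain-Python sketch that reproduces the same codes and uniques for this input. It assumes, based only on the outputs shown, that uniques are sorted and that a missing value is appended to the uniques when `na_sentinel=None`. The name `factorize_sketch` is hypothetical and this is not the implementation from the PR.

```python
def factorize_sketch(values, na_sentinel=-1):
    """Encode `values` as integer codes plus the list of unique values.

    Illustration only: uniques are sorted, and None is treated as missing.
    """
    uniques = sorted({v for v in values if v is not None})
    code_of = {value: code for code, value in enumerate(uniques)}

    if na_sentinel is None:
        # Missing values get their own code, placed after the real uniques.
        missing_code = len(uniques)
        if any(v is None for v in values):
            uniques = uniques + [None]
    else:
        # Missing values are marked with the sentinel and left out of uniques.
        missing_code = na_sentinel

    codes = [missing_code if v is None else code_of[v] for v in values]
    return codes, uniques


data = ['b', None, 'a', 'c', 'b']
print(factorize_sketch(data))
print(factorize_sketch(data, na_sentinel=None))
print(factorize_sketch(data, na_sentinel=-2))
```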
ref #1929

Insert column into DataFrame at a specified location.

```
>>> kdf = ks.DataFrame([1, 2, 3])
>>> kdf.insert(0, 'x', 4)
>>> kdf.sort_index()
   x  0
0  4  1
1  4  2
2  4  3

>>> from databricks.koalas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)

>>> kdf.insert(1, 'y', [5, 6, 7])
>>> kdf.sort_index()
   x  y  0
0  4  5  1
1  4  6  2
2  4  7  3

>>> kdf.insert(2, 'z', ks.Series([8, 9, 10]))
>>> kdf.sort_index()
   x  y   z  0
0  4  5   8  1
1  4  6   9  2
2  4  7  10  3

>>> reset_option("compute.ops_on_diff_frames")
```
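The positional semantics of `insert(loc, column, value)` can be sketched with an ordinary dict of columns. This is an illustration under assumptions, not the PR's code: `insert_column` is a hypothetical helper, it returns a new dict rather than mutating in place as `kdf.insert` does above, and it broadcasts a scalar to the existing row count.

```python
def insert_column(columns, loc, name, value):
    """Return a new column mapping with `name` placed at position `loc`.

    Illustration only; `columns` maps a column name to a list of values.
    """
    if name in columns:
        raise ValueError(f"cannot insert {name!r}, already exists")
    if not 0 <= loc <= len(columns):
        raise IndexError(f"loc {loc} is out of bounds")

    n_rows = len(next(iter(columns.values()))) if columns else 0
    if not isinstance(value, (list, tuple)):
        value = [value] * n_rows  # broadcast a scalar to every row

    items = list(columns.items())
    items.insert(loc, (name, list(value)))
    return dict(items)


cols = {0: [1, 2, 3]}
cols = insert_column(cols, 0, 'x', 4)
cols = insert_column(cols, 1, 'y', [5, 6, 7])
print(list(cols))
print(cols)
```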
Is iterating through groups on the roadmap for API coverage? I would find that helpful.
@chogg Thanks for the suggestion! We'll look into this and keep you updated.
ref #1929

Implement `DataFrame.between_time`

```py
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> kts = ks.from_pandas(ts)
>>> kts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4

>>> kts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3

You get the times that are *not* between two times by setting
``start_time`` later than ``end_time``:

>>> kts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
```
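The rule that a `start_time` later than `end_time` selects the times outside the window can be sketched in plain Python. The example assumes both endpoints are inclusive and uses hypothetical helper names (`in_time_window`, `between_time_sketch`); it illustrates the selection rule only and is not how the PR implements it on Spark.

```python
from datetime import datetime, time


def in_time_window(t, start, end):
    """Return True if time-of-day `t` falls in [start, end], inclusive.

    When `start` is later than `end`, the window wraps around midnight,
    which is the same as selecting times NOT strictly between end and start.
    """
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end


def between_time_sketch(timestamps, start, end):
    """Keep the timestamps whose time-of-day is inside the window."""
    return [ts for ts in timestamps if in_time_window(ts.time(), start, end)]


stamps = [datetime(2018, 4, 9, 0, 0), datetime(2018, 4, 10, 0, 20),
          datetime(2018, 4, 11, 0, 40), datetime(2018, 4, 12, 1, 0)]

inside = between_time_sketch(stamps, time(0, 15), time(0, 45))
outside = between_time_sketch(stamps, time(0, 45), time(0, 15))
print(inside)
print(outside)
```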
Hi all, I was going to look at implementing functionality for
@awdavidson Certainly, please feel free to do so! Your PR seemed to be closed.
@xinrong-databricks I'll reopen it, I'm still working on it. I opened the PR to check the build etc., as I had issues running a few things locally, and didn't want to clutter your PR tab. My local environment is now working, so I should be able to test completely :)
@awdavidson Thanks! You might mark it as a draft PR until it's ready for review. Let us know if you have any questions :)
Please see change to implement `DataFrame.last` and `Series.last` functionality similar to that available in pandas. Requirement raised in issue: #1929

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09    1
2018-04-11    2
2018-04-13    3
2018-04-15    4
dtype: int64

>>> ks_series.last('3D')
2018-04-13    3
2018-04-15    4
dtype: int64
```

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> pdf = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)
>>> kdf = ks.from_pandas(pdf)
>>> kdf
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

>>> kdf.last('3D')
            A
2018-04-13  3
2018-04-15  4
```
Please see change to implement `DataFrame.first` and `Series.first` functionality similar to that available in pandas. Requirement raised in issue: #1929

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09    1
2018-04-11    2
2018-04-13    3
2018-04-15    4
dtype: int64

>>> ks_series.first('3D')
2018-04-09    1
2018-04-11    2
dtype: int64
```
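The `first('3D')` and `last('3D')` results in this thread are consistent with a simple cutoff rule: `first` keeps rows strictly before `min(index) + offset`, and `last` keeps rows strictly after `max(index) - offset`. The sketch below shows that rule in plain Python for fixed-length offsets only. The helper names are hypothetical, `timedelta(days=3)` stands in for `'3D'`, and this is not the code from the PRs.

```python
from datetime import date, timedelta


def first_sketch(dates, offset):
    """Keep dates strictly before (earliest date + offset)."""
    if not dates:
        return []
    cutoff = min(dates) + offset
    return [d for d in sorted(dates) if d < cutoff]


def last_sketch(dates, offset):
    """Keep dates strictly after (latest date - offset)."""
    if not dates:
        return []
    cutoff = max(dates) - offset
    return [d for d in sorted(dates) if d > cutoff]


index = [date(2018, 4, 9), date(2018, 4, 11),
         date(2018, 4, 13), date(2018, 4, 15)]

print(first_sketch(index, timedelta(days=3)))
print(last_sketch(index, timedelta(days=3)))
```

With a two-day spacing, a three-day offset covers two rows at either end, which matches the outputs shown in the examples.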
@ueshin @xinrong-databricks has there been any discussion around how to implement Note: example implementation can be found here master...awdavidson:feature/impl-index_map
@awdavidson As we have not been working on it, you can go ahead. One thing on the example implementation, using Thanks!
Help wanted! A few popular pandas APIs are missing in Koalas. We are going to implement them!
Please use this thread to comment on which function you will be working on, so we don't duplicate work. Please mention this issue in your PR so that the list below can be updated.
DataFrame
Series
Index
SeriesGroupBy and DataFrameGroupBy