@rishic3 rishic3 commented Dec 9, 2025

Description

Implements a monotonically increasing ID expression. This closely follows the Spark implementation: https://github.com/apache/spark/blob/9bbdc0743034b40a904ca87a08da4e0bf2b1386c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala
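For context, Spark documents its scheme as placing the partition ID in the upper 31 bits of a 64-bit integer and the record number within the partition in the lower 33 bits. The following is a minimal sketch of that layout in plain Python. It is not the code from this PR: the helper names `monotonic_id` and `ids_for_partition` are hypothetical, and it is an assumption that the Ray Data expression uses the same bit split.

```python
# Illustrative only: mirrors the bit layout Spark documents for
# monotonically_increasing_id. Names here are hypothetical, not Ray APIs.
PARTITION_BITS = 31  # upper bits: partition (block) index
ROW_BITS = 33        # lower bits: row offset within the partition


def monotonic_id(partition_index: int, row_offset: int) -> int:
    """Pack a partition index and a row offset into one 64-bit ID."""
    if not 0 <= partition_index < (1 << PARTITION_BITS):
        raise ValueError("partition_index does not fit in 31 bits")
    if not 0 <= row_offset < (1 << ROW_BITS):
        raise ValueError("row_offset does not fit in 33 bits")
    return (partition_index << ROW_BITS) | row_offset


def ids_for_partition(partition_index: int, num_rows: int) -> list[int]:
    """Return the IDs for every row of one partition, in row order."""
    return [monotonic_id(partition_index, i) for i in range(num_rows)]
```

IDs produced this way are unique and increasing within each partition, and increasing across partitions, but they are not consecutive: each new partition starts at a multiple of 2^33.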

Example usage:

import ray
from ray.data.expressions import monotonically_increasing_id

ds = ray.data.range(100)
ds = ds.with_column("uid", monotonically_increasing_id())
train, test = ds.streaming_train_test_split(test_size=0.25, split_type="hash", hash_column="uid")
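
To illustrate why a unique ID column is useful for a hash-based split, here is a rough sketch of the general idea in plain Python: each row is assigned to a split by hashing its ID, so the assignment is deterministic and independent of row order. This is an assumption about the general technique, not a description of how Ray Data implements `split_type="hash"`; the function `assign_split` and its `buckets` parameter are hypothetical.

```python
import hashlib


def assign_split(uid: int, test_size: float = 0.25, buckets: int = 10_000) -> str:
    """Deterministically map a unique ID to "train" or "test".

    Hypothetical helper for illustration; not part of Ray Data.
    """
    # Use a stable hash (not Python's built-in hash()) so results are
    # reproducible across processes and runs.
    digest = hashlib.sha256(str(uid).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return "test" if bucket < test_size * buckets else "train"


# The same ID always lands in the same split.
splits = [assign_split(uid) for uid in range(100)]
```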

Related issues

Closes #57806

Signed-off-by: Rishi Chandra <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces add_unique_id to add a monotonically increasing ID column to a dataset, which is a valuable addition. The implementation is clear and follows the described logic from Spark. However, I've identified a significant correctness issue in the handling of pyarrow.Table objects, which results in creating a nested list column instead of a flat integer column. I've also provided suggestions to improve efficiency, remove unreachable code, and enhance the clarity of the documentation. The tests are well-structured, but they may not be catching the pyarrow bug I've pointed out.

Signed-off-by: Rishi Chandra <[email protected]>
@ray-gardener ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on Dec 9, 2025
@richardliaw
Contributor

awesome! thanks a bunch for this contribution. will let @gvspraveen find someone to shepherd this.

@gvspraveen
Contributor

Thanks for the contribution. @bveeramani to shepherd this.

@richardliaw
Contributor

hey @rishic3, thanks a bunch for the contribution! if you're interested in chatting more with the contributors, feel free to join our community sync -- https://docs.google.com/forms/d/e/1FAIpQLSeYWjNExnr6gbhO5rpM0i6wm4TBTdsm3y5S0LR8Syzk_2gelQ/viewform

Signed-off-by: Rishi Chandra <[email protected]>
@rishic3
Author

rishic3 commented Dec 10, 2025

The previous commit implemented this as a dataset method (see e0fc987). I pushed an update that instead implements it as an expression, as the original issue intended.

Signed-off-by: Rishi Chandra <[email protected]>
Development

Successfully merging this pull request may close these issues.

[data] Support expression for monotonically increasing id

3 participants