feat: add i32 return type for length functions #606
base: main
Conversation
Does having the same set of input arguments but different output arguments cause an issue? The name of a function is usually defined by the input arguments, so that would be one potential source of conflict.
From a Substrait point of view, defining the type to always be i64 is fairly interoperable, as anything implementing the interface could just upcast an i32 to i64 if that's what it has.
David is right. We cannot overload functions based only on return type.
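A minimal sketch of the conflict (the impl layout below is hypothetical, following the scalar_functions schema these YAMLs use): both impls reduce to the same compound name, e.g. char_length:str, because the compound name encodes only the argument types, so a plan reference could not distinguish them.

```yaml
scalar_functions:
  - name: char_length
    impls:
      - args:
          - name: input
            value: string
        return: i64
      # Differs only in return type: this impl has the same
      # compound name (char_length:str) as the one above, so a
      # plan that references char_length:str is ambiguous.
      - args:
          - name: input
            value: string
        return: i32
```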
I'm afraid you will need to create a new function. Perhaps char_length_32? I'm not sure we've had to tackle this yet, so we don't have a convention.
That works for me... but to @EpsilonPrime's point, I'm also wondering whether it's too much of a problem to just have the producer/consumer handle any casting to i64 if needed in the situation we have today.
I'm good with only i64. It's more work on the consumer either way. Either they have to support two functions (one of them probably with a cast) or they need to recognize and add an appropriate cast so they can consume a non-matching function. On the producer end of things, I think, in most situations, the user doesn't care about this detail. If they do, they can insert a cast. A savvy consumer will hopefully be able to optimize the resulting internal plan of ...
If we go the i64-only route, it might be nice to document this somewhere. It would be nice if there were a ...
```yaml
- name: char_length_32
  description: >-
    Return the number of characters in the input string. The length
    includes trailing spaces.
```
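Presumably the full entry also declares the i32 return; a sketch of what that would look like (the impl details here are my assumption, based on the PR title):

```yaml
- name: char_length_32
  description: >-
    Return the number of characters in the input string. The length
    includes trailing spaces.
  impls:
    - args:
        - name: input
          value: string
      return: i32
```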
Does "characters" mean code points?
Ping @richtia for clarity on what "character" means. In some systems this means bytes; in others it means UTF-8 code points.
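If the answer turns out to be UTF-8 code points (an assumption on my part, pending that clarification), the description could say so explicitly, e.g.:

```yaml
- name: char_length_32
  description: >-
    Return the number of UTF-8 code points (not bytes) in the input
    string. The length includes trailing spaces.
```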
I was thinking about these return types recently, given that Spark and DataFusion disagree on the return types in many cases. I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed? Given that: I don't know about the other consumers, but DataFusion doesn't use the extension's return type (well, it doesn't use the extensions at all 😅), and the Substrait Java lib has return-type validation as a no-op since it's too hard to do properly.
Isn't this how most languages work? Even those that allow overloading do not allow overloading based purely on return type.
This is true even if Substrait doesn't exist. If a producer and a consumer do not agree on how a function should behave, then there is a problem.
Yes, a consumer is free to act this way today. The return type in the YAML does not prevent this behavior. I assume most consumers will never consult the YAML. They will look at the fully qualified function name, the input types, and the output type, and then figure out if they have a kernel (usually a lookup in some kind of registry) that can satisfy the request.
I think angst around return type "width" (e.g. i32 vs i64 vs u32 vs u64) is a symptom of a slightly different problem. In a perfect world there would only be one integer type (integer) and the width of that integer would be an implementation detail (I think bkietz sold me on this idea long ago). That's part of the reason I think always-i64 is a perfectly fine approach. I think there is still value in having the return type in the YAML. The main value is that the return type tells me the returned value is an integer and not a float / string / timestamp / etc.
I think there's a difference in how we view Substrait's role (and maybe my view is incorrect 😅). Am I correct in understanding that you see the Substrait proto (and simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition of that query engine written in that language? (Or at least something like this.) The difference being, I've seen Substrait rather as a language for describing how a query engine (or a plan, really) does work, with the YAML files defining the different options. Put in other words, the two options would be:
1. The YAMLs aim to define one target function behavior that engines should converge on.
2. The YAMLs aim to describe all the behaviors that existing engines actually implement.
I think both approaches would be valid goals. I would expect it to take a long time for engines like Spark to move towards compatibility with Substrait, so I see (2) as the more useful approach. However, the good thing is that with the simple and advanced extensions, (2) is quite doable for users (e.g. me) even if the project's main goal is (1), so that's cool.
That's a different question. Even if the producer and consumer agree on the output type, if the Substrait YAML specifies a different output type, then there's nothing really that the producer can do to produce valid plans according to that YAML (note: the producer can choose not to use the Substrait YAML and use its own instead, which solves this problem).
There are differences apart from int width; for example, Spark's floor/ceil return ints while Substrait specifies floats. But yeah, would you agree the right answer here is to create new YAMLs for Spark, rather than even trying to fit them into the YAMLs in the Substrait repo?
I'm not entirely sure I follow. I view the proto files / Substrait as defining operator behavior (in the world of relational algebra) and the YAML as defining function behavior (I don't know what this world is called 😆).
Yes, I think creating Spark-specific YAMLs would be a perfectly fine approach.
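As a sketch, a Spark-specific entry for floor might look like the following (the file name and details are hypothetical; it assumes Spark's floor takes fp64 and returns an integral type, per the comment above):

```yaml
# Hypothetical spark_functions_rounding.yaml (not a file in the Substrait repo)
scalar_functions:
  - name: floor
    description: >-
      Round the input down to the nearest integral value. Unlike the
      definition in the Substrait repo (which returns fp64), this
      Spark-flavored variant returns an integer.
    impls:
      - args:
          - name: x
            value: fp64
        return: i64
```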
Yes, these differences seem like they could cause trouble. For example, imagine the plan is ...
I think the question is: do the YAMLs aim to define one target function behavior, or do they aim to define all existing behaviors? With the former, it absolutely makes sense to have the return type (and to have only a single return type).
The way you asked this is unclear to me. Each leaf argument/name combination in a YAML is trying to define the exact semantics of a function (which may exist in any of zero to many systems). If two different functions have different behavior, they should be distinguished in one of the following ways. Note that for some kinds of variation in behavior you can choose from multiple options; for others, the choices are limited.
These look good to me other than the question about code points vs. bytes.
This PR adds an i32 return type for a few length functions based on existing engine support.
Postgres, DataFusion, and Acero all have i32 return types for these functions; DuckDB has i64.