feat: add i32 return type for length functions #606
base: main
Conversation
Does having the same set of input arguments but different output arguments cause an issue? The name of a function is usually defined by the input arguments, so that would be one potential source of conflict.
From a Substrait point of view, defining the type to always be i64 is fairly interoperable, as anything implementing the interface could just upcast an i32 to i64 if that's what it has.
David is right. We cannot overload functions based only on return type.
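A minimal sketch of the conflict (the impl layout below is hypothetical, following the scalar_functions schema these YAMLs use): both impls reduce to the same compound name, e.g. char_length:str, because the compound name encodes only the argument types, so a plan reference could not distinguish them.

```yaml
scalar_functions:
  - name: char_length
    impls:
      - args:
          - name: input
            value: string
        return: i64
      # Differs only in return type: this impl has the same
      # compound name (char_length:str) as the one above, so a
      # plan that references char_length:str is ambiguous.
      - args:
          - name: input
            value: string
        return: i32
```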
I'm afraid you will need to create a new function. Perhaps char_length_32? I'm not sure we've had to tackle this yet, so we don't have a convention.
That works for me... but to @EpsilonPrime's point, I'm also wondering whether it's too much of a problem to just have the producer/consumer handle any casting to i64 if needed in the situation we have today.
I'm good with only i64. It's more work on the consumer either way. Either they have to support two functions (one of them probably with a cast) or they need to recognize and add an appropriate cast so they can consume a non-matching function. On the producer end of things, I think, in most situations, the user doesn't care about this detail. If they do, they can insert a cast. A savvy consumer will hopefully be able to optimize the resulting internal plan of ...
If we go the i64-only route, it might be nice to document this somewhere. It would be nice if there were a ...
```yaml
- name: char_length_32
  description: >-
    Return the number of characters in the input string. The length
    includes trailing spaces.
```
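Presumably the full entry also declares the i32 return; a sketch of what that would look like (the impl details here are my assumption, based on the PR title):

```yaml
- name: char_length_32
  description: >-
    Return the number of characters in the input string. The length
    includes trailing spaces.
  impls:
    - args:
        - name: input
          value: string
      return: i32
```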
Does "characters" mean code points?
Ping @richtia for clarity on what "character" means. In some systems this means bytes; in others it means UTF-8 code points.
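If the answer turns out to be UTF-8 code points (an assumption on my part, pending that clarification), the description could say so explicitly, e.g.:

```yaml
- name: char_length_32
  description: >-
    Return the number of UTF-8 code points (not bytes) in the input
    string. The length includes trailing spaces.
```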
I was thinking about these return types recently, given that Spark and DataFusion disagree on the return types in many cases. I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed? Given that: I don't know about the other consumers, but DataFusion doesn't use the extension's return type (well, it doesn't use the extensions at all 😅), and the Substrait Java lib has return-type validation as a no-op since it's too hard to do properly.
Isn't this how most languages work? Even those that allow overloading do not allow overloading based purely on return type.
This is true even if Substrait doesn't exist. If a producer and a consumer do not agree on how a function should behave, then there is a problem.
Yes, a consumer is free to act this way today. The return type in the YAML does not prevent this behavior. I assume most consumers will never consult the YAML. They will look at the fully qualified function name, the input types, and the output type, and then figure out if they have a kernel (usually a lookup in some kind of registry) that can satisfy the request.
I think angst around return type "width" (e.g. i32 vs i64 vs u32 vs u64) is a symptom of a slightly different problem. In a perfect world there would only be one integer type (integer) and the width of that integer would be an implementation detail (I think bkietz sold me on this idea long ago). That's part of the reason I think always-i64 is a perfectly fine approach. I think there is still value in having the return type in the YAML. The main value is that the return type tells me the returned value is an integer and not a float / string / timestamp / etc.
I think there's a difference in how we view Substrait's role (and maybe my view is incorrect 😅). Am I correct in understanding that you see the Substrait proto (and simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition of that query engine written in that language? (Or at least something like this.) The difference being, I've seen Substrait rather as a language for describing how a query engine (or a plan, really) does work, with the YAML files defining the different options. Put in other words, the two options would be:
1. The YAMLs aim to define one target function behavior that engines should converge on.
2. The YAMLs aim to describe all the behaviors that existing engines actually implement.
I think both approaches would be valid goals. I would expect it to take a long time for engines like Spark to move towards compatibility with Substrait, so I see (2) as the more useful approach. However, the good thing is that with the simple and advanced extensions, (2) is quite doable for users (e.g. me) even if the project's main goal is (1), so that's cool.
That's a different question. Even if the producer and consumer agree on the output type, if the Substrait YAML specifies a different output type, then there's nothing really that the producer can do to produce valid plans according to that YAML (note: the producer can choose not to use the Substrait YAML and use its own instead, which solves this problem).
There are differences apart from int width; for example, Spark's floor/ceil return ints while Substrait specifies floats. But yeah, would you agree the right answer here is to create new YAMLs for Spark, rather than even trying to fit them into the YAMLs in the Substrait repo?
I'm not entirely sure I follow. I view the proto files / Substrait as defining operator behavior (in the world of relational algebra) and the YAML as defining function behavior (I don't know what this world is called 😆).
Yes, I think creating Spark-specific YAMLs would be a perfectly fine approach.
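As a sketch, a Spark-specific entry for floor might look like the following (the file name and details are hypothetical; it assumes Spark's floor takes fp64 and returns an integral type, per the comment above):

```yaml
# Hypothetical spark_functions_rounding.yaml (not a file in the Substrait repo)
scalar_functions:
  - name: floor
    description: >-
      Round the input down to the nearest integral value. Unlike the
      definition in the Substrait repo (which returns fp64), this
      Spark-flavored variant returns an integer.
    impls:
      - args:
          - name: x
            value: fp64
        return: i64
```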
Yes, these differences seem like they could cause trouble. For example, imagine the plan is ...
I think the question is: do the YAMLs aim to define one target function behavior, or do they aim to define all existing behaviors? With the former, it absolutely makes sense to have the return type (and to have only a single return type).
The way you asked this is unclear to me. Each leaf argument/name combination in a YAML is trying to define the exact semantics of a function (which may exist in any of zero to many systems). If two different functions have different behavior, they should be distinguished in one of the following ways. Note that for some kinds of variation in behavior you can choose from multiple options; for others, the choices are limited.
These look good to me other than the question about code points vs. bytes.
This PR adds an i32 return type for a few length functions based on existing engine support.
Postgres, DataFusion, and Acero all have i32 return types for these functions; DuckDB has i64.