-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate ScalarUDFImpl::return_type #13717
base: main
Are you sure you want to change the base?
Conversation
Do you call If you rely on |
internal_err!("return_type_from_exprs shoud be called instead") | ||
// return type could be Float32 or Int32 depending on the input arguments | ||
// return_type_from_exprs should be called instead | ||
Ok(DataType::Float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate exception more than successfully retrieving a false value.
the false value may easily cause some hard to track down problem, while exception is so much easier to find
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted this change now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this is a breaking change and is currently blocking Comet from upgrading its DataFusion version. We do not have access to the logical Expr
to call the newer return_type_from_exprs
because we are only working with DataFusion's physical plans. I think I would prefer the non-breaking behavior of maintaining the functionality from DataFusion 43 and marking this as deprecated until we can find a better solution.
Maybe @alamb has thoughts on how we can best handle this to avoid causing pain for downstream projects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a breaking change and is currently blocking Comet from upgrading its DataFusion version.
i agree this is breaking and sorry to hear it affects Comet
i wish i knew how to make it non-breaking!
it seems this all boils down to the problem that expressions don't know their types -- #12604
Both of these functions have been part of the public API for a while now. The newer
We call this for all functions.
We have generic handling for all scalar UDFs so I'd prefer not to start adding special rules for specific funct ions. |
i believe you
date_part's return_type used to be correct in all cases the implementation was removed because it was no longer correct in all cases if someone calls for example (pseudo-code) let func_return_value : ArrayRef = ... ;
match func.return_type() {
case Float64 => {
as_float64_array(func_return_value) ... or let wanted_type = ...;
let ret func.return_type();
if wanted_type == ret {
return func
} else {
Expr::Cast( func, wanted_type )
} if we need to return some incorrect value for some time, this should be behind an opt-in config toggle. |
@@ -174,6 +174,7 @@ impl ScalarUDF { | |||
/// | |||
/// See [`ScalarUDFImpl::return_type`] for more details. | |||
pub fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | |||
#[allow(deprecated)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we also deprecate this (ScalarUDF::return_type
) method if we deprecate underlying ScalarUDFImpl
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
@@ -174,6 +174,7 @@ impl ScalarUDF { | |||
/// | |||
/// See [`ScalarUDFImpl::return_type`] for more details. | |||
pub fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | |||
#[allow(deprecated)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
@@ -450,6 +451,7 @@ pub trait ScalarUDFImpl: Debug + Send + Sync { | |||
/// is recommended to return [`DataFusionError::Internal`]. | |||
/// | |||
/// [`DataFusionError::Internal`]: datafusion_common::DataFusionError::Internal | |||
#[deprecated(since = "44.0.0", note = "Use `return_type_from_exprs` instead")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are going to deprecate this API I think we should add an example to return_type_from_exprs
to help the migration effort.
I can help if we proceed
@@ -406,6 +406,7 @@ fn get_udf_args_and_return_types( | |||
.into_iter() | |||
.map(|arg_types| { | |||
// only handle the function which implemented [`ScalarUDFImpl::return_type`] method | |||
#[allow(deprecated)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @goldmedal for ideas how (if) we will be able to resolve this deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reminding this. I filed #13735 to trace it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reading #13735 and thinking about the churn we had recently with invoke_with_batch
and invoke_with_args
, etc
What if we made a new API that could accomodate both usecases. Something like
struct ReturnTypeArgs {
// Argments, if available. This may not be available in some contexts
// such as information_table schema queries
args: Option<&[Expr]>
// schema, if available.
schema: Option<&dyn ExprSchema>,
arg_types: &[DataType],
}
impl ScalarUdfImpl {
/// returns the result type, given `args` if possible
/// if not possible, returns `Ok(None)`
fn return_type_with_args(&self, args: &ReturnTypeArgs) -> Result<Option<DataType>>
}
🤔
This would also let us add other fields to the ReturnTypeArgs
structure over time with less API churn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change the signature to return_type(ReturnTypeArgs)
and remove return_type_with_args
at all
I was thinking about this too but not sure this huge breaking change is acceptable or not, but I think this will be a better change in long term
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change the signature to
return_type(ReturnTypeArgs)
and removereturn_type_with_args
at allI was thinking about this too but not sure this huge breaking change is acceptable or not, but I think this will be a better change in long term
I agree return_type(ReturnTypeArgs)
would be the best in the long term
However, in the short term (next 6 months) I think it would be nicer to downstream crates / users to add a new function return_type_with_args
and leave #deprecation
notices on return_type
and return_type_from_exprs
directing people to return_type_with_args
Then once the deprecation period has passed we could rename / introduce a new function called return_type
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The information_schema seems like another independent problem. We can even just list the possible return types for it, because it doesn't care the decision of the return type, it cares about the return_type
itself only. The trivial solution is introduce another trait implementation like you said.
pub fn possible_return_types(&self) -> Vec<DataType>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's turn the question around. What problems would return_type_with_args be solving?
In my mind there are two usecases for return_type
:
- LogicalPlanning (aka the analyzer and type coercion, etc) when you have
Exprs
- PhysicalPlanning (aka physical optimizers)
As I understood the problem is that some things in the LogicalPlanner do things with the exprs (like determining the type of the arguments)
I agree with the "it is not clear when you get arguments in what contexts"
Maybe we could make this explicit with an enum like the following 🤔
/// Information about arguments when calling `return_type`
enum ArgumentInfo {
LogicalPlanning {
args: &[Expr],
schema: &dyn ExprSchema
},
PhysicalExecution,
...
}
struct ReturnTypeArgs {
// Argument information
arg_info: ArgumentInfo,
// argument DataTypes
arg_types: &[DataType],
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. PhysicalPlanning (aka physical optimizers)
For physical planning we should simply do equivalent of #13825 (but for physical exprs). Otherwise there is no good way to find out the return type. And complicating the call signature will only add insult to injury
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one of the challenge of #13825 and #12604 is that the properties of Expr
including data type is determined based on schema
. Unless the schema
is the same all the way of the processing otherwise we need to recompute the properties based on given schema
so we can't compute once and store the information within Expr
. Introduce Map
for schema -> properties mapping seems not practically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if we have column c
, we need schema
to know it's data type is Int32
. With another schema
, it may becomes Int64
. How do we ensure the schema is not changed and we can happily reuse the result we had before?
Which issue does this PR close?
Closes #13716
Rationale for this change
While upgrading the DataFusion version used by Comet, I saw test failures due to a breaking API change where
ScalarUDFImpl.return_type
returns an internal error forDatePartFunc
.What changes are included in this PR?
Return Float64 instead of throwing an error to preserve previous (incorrect) behavior. Deprecate
return_type
and recommendreturn_type_from_exprs
instead.Are these changes tested?
Are there any user-facing changes?