
Deprecate ScalarUDFImpl::return_type #13717

Open
andygrove wants to merge 4 commits into main
Conversation

andygrove
Member

Which issue does this PR close?

Closes #13716

Rationale for this change

While upgrading the DataFusion version used by Comet, I saw test failures due to a breaking API change where ScalarUDFImpl.return_type returns an internal error for DatePartFunc.

What changes are included in this PR?

Return Float64 instead of throwing an error to preserve previous (incorrect) behavior. Deprecate return_type and recommend return_type_from_exprs instead.

Are these changes tested?

Are there any user-facing changes?

@jayzhan211
Contributor

jayzhan211 commented Dec 11, 2024

return_type_from_exprs was introduced to avoid a downstream breaking change, but if it still causes downstream breaking changes, we should consider combining these two functions as a cleanup.

Do you call ScalarUDFImpl.return_type for all the functions or just date_part?

If you rely on ScalarUDFImpl.return_type for all functions, then I think we should deprecate return_type_from_exprs and fix the signature of return_type. If you can switch to return_type_from_exprs for date_part only, that is the better solution, since it minimizes the breaking change.

- internal_err!("return_type_from_exprs should be called instead")
+ // return type could be Float32 or Int32 depending on the input arguments
+ // return_type_from_exprs should be called instead
+ Ok(DataType::Float64)
Member

I would appreciate an exception more than successfully retrieving a false value. A false value may easily cause hard-to-track-down problems, while an exception is much easier to find.

Member Author

I have reverted this change now.

Member Author

However, this is a breaking change and is currently blocking Comet from upgrading its DataFusion version. We do not have access to the logical Expr to call the newer return_type_from_exprs because we are only working with DataFusion's physical plans. I think I would prefer the non-breaking behavior of maintaining the functionality from DataFusion 43 and marking this as deprecated until we can find a better solution.

Maybe @alamb has thoughts on how we can best handle this to avoid causing pain for downstream projects?

Member

this is a breaking change and is currently blocking Comet from upgrading its DataFusion version.

I agree this is breaking, and I'm sorry to hear it affects Comet. I wish I knew how to make it non-breaking!

It seems this all boils down to the problem that expressions don't know their types (#12604).

@andygrove
Member Author

return_type_from_exprs was introduced to avoid a downstream breaking change, but if it still causes downstream breaking changes, we should consider combining these two functions as a cleanup.

Both of these functions have been part of the public API for a while now. The newer return_type_from_exprs works correctly and the older return_type is incorrect in some cases. It seems better to me to deprecate the older incorrect version.

Do you call ScalarUDFImpl.return_type for all the functions or just date_part?

We call this for all functions.

If you rely on ScalarUDFImpl.return_type for all functions, then I think we should deprecate return_type_from_exprs and fix the signature of return_type. If you can switch to return_type_from_exprs for date_part only, that is the better solution, since it minimizes the breaking change.

We have generic handling for all scalar UDFs, so I'd prefer not to start adding special rules for specific functions.

@findepi
Member

findepi commented Dec 11, 2024

Both of these functions have been part of the public API for a while now.

I believe you.

The newer return_type_from_exprs works correctly and the older return_type is incorrect in some cases. It seems better to me to deprecate the older incorrect version.

date_part's return_type used to be correct in all cases.

The implementation was removed because it was no longer correct in all cases.

If someone calls func.return_type() just for display purposes in logs, they may be OK with getting wrong information. If someone calls func.return_type() and trusts the return value, they will be disappointed if we continue to return something but the value is no longer correct.

for example (pseudo-code)

let func_return_value: ArrayRef = ...;
match func.return_type()? {
    DataType::Float64 => {
        as_float64_array(&func_return_value)?
        // ...
    }
    // ...
}

or

let wanted_type = ...;
let ret = func.return_type()?;
if wanted_type == ret {
    func
} else {
    Expr::Cast(func, wanted_type)
}

If we need to return some incorrect value for some time, it should be behind an opt-in config toggle.

@@ -174,6 +174,7 @@ impl ScalarUDF {
///
/// See [`ScalarUDFImpl::return_type`] for more details.
pub fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
#[allow(deprecated)]
Member

Shouldn't we also deprecate this (ScalarUDF::return_type) method if we deprecate underlying ScalarUDFImpl method?

Contributor

Yes


@@ -450,6 +451,7 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
/// is recommended to return [`DataFusionError::Internal`].
///
/// [`DataFusionError::Internal`]: datafusion_common::DataFusionError::Internal
#[deprecated(since = "44.0.0", note = "Use `return_type_from_exprs` instead")]
Contributor

If we are going to deprecate this API, I think we should add an example to return_type_from_exprs to help the migration effort.

I can help if we proceed

@github-actions github-actions bot added the core Core DataFusion crate label Dec 11, 2024
@andygrove andygrove changed the title Deprecate ScalarUDFImpl::return_type and revert a breaking API change Deprecate ScalarUDFImpl::return_type Dec 11, 2024
@@ -406,6 +406,7 @@ fn get_udf_args_and_return_types(
.into_iter()
.map(|arg_types| {
// only handle the function which implemented [`ScalarUDFImpl::return_type`] method
#[allow(deprecated)]
Member

cc @goldmedal for ideas on how (or whether) we will be able to resolve this deprecation.

Contributor

Thanks for the reminder. I filed #13735 to track it.

Contributor

alamb commented Dec 13, 2024

While reading #13735 and thinking about the churn we had recently with invoke_with_batch and invoke_with_args, etc.:

What if we made a new API that could accommodate both use cases? Something like

struct ReturnTypeArgs<'a> {
    /// Arguments, if available. These may not be available in some
    /// contexts, such as information_schema queries.
    args: Option<&'a [Expr]>,
    /// Schema, if available.
    schema: Option<&'a dyn ExprSchema>,
    arg_types: &'a [DataType],
}

impl ScalarUDFImpl {
    /// Returns the result type, given `args`, if possible;
    /// if not possible, returns `Ok(None)`.
    fn return_type_with_args(&self, args: &ReturnTypeArgs) -> Result<Option<DataType>>;
}

🤔

This would also let us add other fields to the ReturnTypeArgs structure over time with less API churn

Contributor

Can we change the signature to return_type(ReturnTypeArgs) and remove return_type_with_args altogether?

I was thinking about this too, but I'm not sure whether this huge breaking change is acceptable. I do think it would be the better change in the long term.

Contributor

Can we change the signature to return_type(ReturnTypeArgs) and remove return_type_with_args altogether?

I was thinking about this too, but I'm not sure whether this huge breaking change is acceptable. I do think it would be the better change in the long term.

I agree return_type(ReturnTypeArgs) would be the best in the long term

However, in the short term (the next 6 months) I think it would be nicer to downstream crates / users to add a new function return_type_with_args and leave `#[deprecated]` notices on return_type and return_type_from_exprs directing people to return_type_with_args.

Then, once the deprecation period has passed, we could rename / introduce a new function called return_type 🤔

Contributor

The information_schema seems like another independent problem. We could even just list the possible return types for it, because it doesn't care how the return type is decided; it only cares about the return type itself. The trivial solution is to introduce another trait method, like you said.

pub fn possible_return_types(&self) -> Vec<DataType>;

Contributor

Let's turn the question around. What problems would return_type_with_args be solving?

In my mind there are two usecases for return_type:

  1. LogicalPlanning (aka the analyzer and type coercion, etc) when you have Exprs
  2. PhysicalPlanning (aka physical optimizers)

As I understand it, the problem is that some things in the LogicalPlanner do things with the exprs (like determining the types of the arguments).

I agree with the "it is not clear when you get arguments in what contexts"

Maybe we could make this explicit with an enum like the following 🤔

/// Information about arguments when calling `return_type`
enum ArgumentInfo<'a> {
    LogicalPlanning {
        args: &'a [Expr],
        schema: &'a dyn ExprSchema,
    },
    PhysicalExecution,
    // ...
}

struct ReturnTypeArgs<'a> {
    /// Argument information
    arg_info: ArgumentInfo<'a>,
    /// Argument DataTypes
    arg_types: &'a [DataType],
}

Member

2. PhysicalPlanning (aka physical optimizers)

For physical planning we should simply do the equivalent of #13825 (but for physical exprs). Otherwise there is no good way to find out the return type, and complicating the call signature will only add insult to injury.

Contributor

I think one of the challenges of #13825 and #12604 is that the properties of an Expr, including its data type, are determined based on a schema. Unless the schema is the same throughout processing, we need to recompute the properties for each given schema, so we can't compute them once and store the information within the Expr. Introducing a map from schema to properties does not seem practical.

Contributor

For example, if we have column c, we need the schema to know that its data type is Int32. With another schema, it may become Int64. How do we ensure the schema has not changed so that we can safely reuse the result we computed before?

Labels: core Core DataFusion crate · functions · logical-expr Logical plan and expressions

Successfully merging this pull request may close these issues:

Public API ScalarUDFImpl.return_type returns internal error in some cases

6 participants