feat(function): add greatest function #12474

Merged on Nov 23, 2024 (21 commits)

rluvaton
Contributor

Which issue does this PR close?

Closes #12472

Rationale for this change

Support more operators

What changes are included in this PR?

Implement `greatest`, following the Spark implementation:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html

Are these changes tested?

I was not able to find where the function tests live. `coalesce`, for example, has only one test, which checks the return type. I must be missing something.

mod test {
    use arrow::datatypes::DataType;
    use datafusion_expr::ScalarUDFImpl;
    use crate::core;

    #[test]
    fn test_coalesce_return_types() {
        let coalesce = core::coalesce::CoalesceFunc::new();
        let return_type = coalesce
            .return_type(&[DataType::Date32, DataType::Date32])
            .unwrap();
        assert_eq!(return_type, DataType::Date32);
    }
}

Are there any user-facing changes?

Yes. Documentation still needs to be added, but I first want to check whether this feature is desired.

@Weijun-H
Member

FYI #12357 (comment)

@comphead
Contributor

Related to #6531

@rluvaton rluvaton closed this Sep 23, 2024
@rluvaton rluvaton deleted the add-greatest branch September 23, 2024 12:22
@rluvaton
Contributor Author

rluvaton commented Sep 23, 2024

> I was not able to find where the function tests live. `coalesce`, for example, has only one test, which checks the return type. I must be missing something.

Hey @alamb, we talked about the test location at the beginning of the CMU talk you gave. Where are the tests located?

@alamb
Contributor

alamb commented Sep 25, 2024

I think the tests I was talking about are described here:

https://github.com/apache/datafusion/tree/4a3c09a9316fb8940aeb1c5b9b48bc3b7259d5d4/datafusion/sqllogictest#readme

If you want to test the type of a function, you can do it like this:

select arrow_typeof(my_func(foo));

@rluvaton
Contributor Author

I see, thank you. How is your experience debugging this kind of test?

@alamb
Contributor

alamb commented Sep 25, 2024

It is great!

Member

@findepi findepi left a comment

nice!

datafusion/functions/src/core/greatest.rs (outdated)

let cmp = make_comparator(lhs, rhs, SORT_OPTIONS)?;

let len = lhs.len().min(rhs.len());
Member

The array lengths should be equal (otherwise we would be losing data).

Contributor Author

done
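
For readers following this thread, here is a minimal std-only sketch of the point about lengths. It models columns as slices of `Option<i32>` (with `None` standing in for SQL NULL) rather than Arrow arrays, and `keep_larger_checked` is a hypothetical name, not the code from this PR. It also assumes Spark-style semantics where a NULL on one side is skipped in favour of the other side.

```rust
/// Hypothetical stand-in for an element-wise "keep the larger value" step.
/// Columns are modelled as slices of Option<i32> (None = SQL NULL); the
/// actual PR operates on Arrow arrays instead.
fn keep_larger_checked(
    lhs: &[Option<i32>],
    rhs: &[Option<i32>],
) -> Result<Vec<Option<i32>>, String> {
    // Reject mismatched lengths instead of truncating to the shorter input
    // (as `lhs.len().min(rhs.len())` would), which silently drops rows.
    if lhs.len() != rhs.len() {
        return Err(format!(
            "length mismatch: lhs has {} rows, rhs has {} rows",
            lhs.len(),
            rhs.len()
        ));
    }
    Ok(lhs
        .iter()
        .zip(rhs)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some((*a).max(*b)),
            // Assumption: a NULL on one side yields the other side.
            (Some(a), None) => Some(*a),
            // Covers (None, Some(b)) and (None, None).
            (None, b) => *b,
        })
        .collect())
}
```

With this shape, calling the function with a two-row and a one-row input returns an `Err` rather than a one-row result.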

// If both arrays are not nested, have the same length and no nulls, we can use the faster vectorised kernel
// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined.
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 {
Member

This should probably use the logical null count.

Contributor Author

done
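
The null handling described in the code comments above can be illustrated with a small std-only sketch. It is only a model: values are `Option<i32>` with `None` as NULL, the function names are hypothetical, and it assumes that a NULL is treated as smaller than any non-null value when building the selection mask. It shows why a three-valued `>=` is only safe as a fast path when neither input contains nulls.

```rust
/// SQL-style three-valued `>=`: the result is NULL (None) if either input
/// is NULL. This models the behaviour the code comment attributes to
/// `cmp::gt_eq`.
fn gt_eq_three_valued(lhs: Option<i32>, rhs: Option<i32>) -> Option<bool> {
    match (lhs, rhs) {
        (Some(l), Some(r)) => Some(l >= r),
        _ => None,
    }
}

/// Hypothetical "take lhs?" decision that always yields a definite boolean.
/// Assumption: NULL sorts below every non-null value.
fn take_lhs(lhs: Option<i32>, rhs: Option<i32>) -> bool {
    match (lhs, rhs) {
        (Some(l), Some(r)) => l >= r,
        (Some(_), None) => true,
        (None, _) => false,
    }
}

/// True if the modelled column contains no NULLs.
fn null_free(col: &[Option<i32>]) -> bool {
    col.iter().all(Option::is_some)
}

/// Builds the selection mask, using the three-valued comparison only when
/// it cannot produce NULL.
fn selection_mask(lhs: &[Option<i32>], rhs: &[Option<i32>]) -> Vec<bool> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal lengths");
    if null_free(lhs) && null_free(rhs) {
        // Fast path: every comparison is Some(_), so the fallback is unused.
        lhs.iter()
            .zip(rhs)
            .map(|(l, r)| gt_eq_three_valued(*l, *r).unwrap_or(false))
            .collect()
    } else {
        // Slow path: resolve NULLs explicitly for each row.
        lhs.iter().zip(rhs).map(|(l, r)| take_lhs(*l, *r)).collect()
    }
}
```

In this model the null check inspects the values themselves, which is loosely analogous to asking for a logical rather than a physical null count.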

// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined.
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 {
return cmp::gt_eq(&lhs, &rhs).map_err(|e| e.into());
Member

Please add a test with float NaN values.

Contributor Author

done
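
To illustrate why a NaN test is worth having, here is a std-only sketch that contrasts a naive `>=` comparison with IEEE 754 total ordering (`f64::total_cmp`). Both function names are hypothetical, and whether the kernels used in this PR behave exactly like `total_cmp` is an assumption, not something stated in the thread.

```rust
/// Element-wise greatest using the partial-order `>=` operator.
/// Any comparison involving NaN is false, so the result depends on
/// which side the NaN is on.
fn greatest_naive(lhs: &[f64], rhs: &[f64]) -> Vec<f64> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal lengths");
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| if l >= r { *l } else { *r })
        .collect()
}

/// Element-wise greatest using IEEE 754 total ordering, under which a
/// positive NaN (such as `f64::NAN`) sorts above every other value.
fn greatest_total_order(lhs: &[f64], rhs: &[f64]) -> Vec<f64> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal lengths");
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| if l.total_cmp(r).is_ge() { *l } else { *r })
        .collect()
}
```

For the inputs `[1.0, NaN]` and `[NaN, 2.0]`, the naive version returns NaN in the first row but `2.0` in the second, while the total-order version returns NaN in both rows regardless of argument order.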

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 16, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 16, 2024
…eatest

# Conflicts:
#	datafusion/sqllogictest/test_files/functions.slt
@rluvaton rluvaton marked this pull request as ready for review November 16, 2024 19:14
@Weijun-H
Member

Please add function documentation for it.

@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Nov 18, 2024
@rluvaton rluvaton requested a review from Weijun-H November 18, 2024 21:38
@waynexia waynexia self-requested a review November 21, 2024 16:31
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 21, 2024
Contributor

@jayzhan211 jayzhan211 left a comment

👍

Member

@Weijun-H Weijun-H left a comment

LGTM! Thanks @rluvaton

Comment on lines +212 to +214
for array in arrays_iter {
largest = keep_larger(Arc::clone(array), largest)?;
}
Member

keep_larger constructs an intermediate array for each call. I'd prefer to reduce the materialization to just the last one. It's okay to do this in a follow-up PR.

Contributor Author

I'll do it in a separate PR; this PR is large enough.

Contributor Author

I tried using `interleave` as we discussed, but then I would not be able to use the kernel functions, so I did not do it.
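
The trade-off in this thread can be sketched with std-only Rust. The model below uses `Vec<Option<i32>>` columns instead of Arrow arrays, and all function names are hypothetical. The pairwise fold mirrors the reviewed loop and allocates one intermediate column per step; the single-pass variant materializes only the final column, at the cost of row-at-a-time work rather than reuse of a vectorised two-input kernel. It assumes all columns have the same length and that NULLs are skipped.

```rust
/// Element-wise maximum of two modelled columns, allocating a new column.
/// `Option<i32>` orders `None` below every `Some(_)`, so `max` skips NULLs.
fn keep_larger(lhs: &[Option<i32>], rhs: &[Option<i32>]) -> Vec<Option<i32>> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal lengths");
    lhs.iter().zip(rhs).map(|(l, r)| (*l).max(*r)).collect()
}

/// Pairwise fold: one intermediate column is materialized per input column.
fn greatest_pairwise(columns: &[Vec<Option<i32>>]) -> Vec<Option<i32>> {
    match columns.split_first() {
        None => Vec::new(),
        Some((first, rest)) => {
            let mut largest = first.clone();
            for column in rest {
                largest = keep_larger(column, &largest);
            }
            largest
        }
    }
}

/// Single pass: scan all columns per row and materialize only the result.
/// Assumes every column has the same length as the first one.
fn greatest_single_pass(columns: &[Vec<Option<i32>>]) -> Vec<Option<i32>> {
    let rows = columns.first().map_or(0, Vec::len);
    (0..rows)
        .map(|row| columns.iter().filter_map(|column| column[row]).max())
        .collect()
}
```

Both functions return the same result for equal-length inputs; they differ only in how many intermediate columns are allocated along the way.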

Member

@waynexia waynexia left a comment

We also need to add proto-related support for this new function, but from my perspective it is not a blocker for merging this PR. Thank you @rluvaton

@rluvaton
Contributor Author

What is proto related support?

@waynexia
Member

What is proto related support?

Sorry, forget it 🙈 They are wiped out in #10173

@rluvaton
Contributor Author

rluvaton commented Nov 22, 2024

Thank you @waynexia, see you on Monday at your CMU talk. I'll join 10 minutes early so we can talk about DataFusion, if you're interested 😊

@rluvaton
Contributor Author

Can we merge this?

@waynexia waynexia merged commit fe2da2b into apache:main Nov 23, 2024
26 checks passed
@waynexia
Member

Thank you again for working on this!

@rluvaton
Contributor Author

Thank you, I'll try to contribute again!

@rluvaton rluvaton deleted the add-greatest branch November 23, 2024 17:24
Labels
documentation Improvements or additions to documentation functions sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Add greatest function
7 participants