Refactor signatures for lpad, rpad, left, and right #13420

Draft
wants to merge 6 commits into
base: main

jiashenC
Contributor

Which issue does this PR close?

Refactor signatures for lpad, rpad, left, and right. They share very similar signatures.

Closes some tasks in #13301.

What changes are included in this PR?

Signature changes.

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No

@Omega359
Contributor

You can run

cargo test --test sqllogictests

locally to reproduce the test failures.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 15, 2024
@@ -1864,10 +1864,10 @@ query TT
EXPLAIN SELECT letter, letter = LEFT(letter2, 1) FROM simple_string;
----
logical_plan
-01)Projection: simple_string.letter, simple_string.letter = left(simple_string.letter2, Int64(1))
+01)Projection: simple_string.letter, simple_string.letter = left(CAST(simple_string.letter2 AS Utf8View), Int64(1))

Contributor

I'm thinking maybe we should avoid casting when the two types have the same logical type, as with utf8 -> utf8view. 🤔

Contributor

I would agree. If the signature allows for a string and receives a Utf8, it should accept it as is unless it needs to be coerced to a common type for some other reason. The less casting the better, imho.

Contributor

100% -- casts can often require non-trivial computation during a query. In this particular case, casting letter2 to a Utf8View means copying at least an additional 16 bytes for each row (each view is 128 bits).
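
The rule discussed in this thread (skip the cast when the argument and the signature share a logical type) can be sketched as below. This is only an illustration: `PhysicalType`, `LogicalType`, `logical_type`, and `coerce` are hypothetical stand-ins, not DataFusion's or Arrow's actual types or coercion code.

```rust
/// Hypothetical stand-in for a physical (in-memory) type; not Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Int64,
}

/// Hypothetical stand-in for a logical type shared by several physical types.
#[derive(Debug, PartialEq)]
enum LogicalType {
    String,
    Int64,
}

/// Maps each physical type to the logical type it represents.
fn logical_type(t: &PhysicalType) -> LogicalType {
    match t {
        PhysicalType::Utf8 | PhysicalType::LargeUtf8 | PhysicalType::Utf8View => {
            LogicalType::String
        }
        PhysicalType::Int64 => LogicalType::Int64,
    }
}

/// Returns the type to cast to, or `None` when the argument can be used as is
/// because it already has the logical type the signature asks for.
fn coerce(actual: &PhysicalType, wanted: &PhysicalType) -> Option<PhysicalType> {
    if logical_type(actual) == logical_type(wanted) {
        None
    } else {
        Some(wanted.clone())
    }
}
```

Under this sketch, a Utf8 argument passed where the signature names Utf8View needs no cast, while an Int64 argument still gets one.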

Contributor Author

Thanks for the suggestion. I implemented skipping the cast when the logical types are the same. However, it fails in some cases where Dictionary is in the signature. In those cases the logical type is the same, but the native type is Dictionary, which causes a type mismatch. The failure can be reproduced with:

cargo test --test sqllogictests -- jctest

Any suggestions?

Contributor

We might need to implement the kernel for the dictionary type.

Contributor Author

I see. Can you elaborate on how to implement the kernel for the dictionary type? I can give it a try.

Contributor

Check the invoke method of each function.

For example, lpad:

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
    match args[0].data_type() {
        Utf8 | Utf8View => make_scalar_function(lpad::<i32>, vec![])(args),
        LargeUtf8 => make_scalar_function(lpad::<i64>, vec![])(args),
        other => exec_err!("Unsupported data type {other:?} for function lpad"),
    }
}

It supports utf8/utf8view/largeutf8, but not dictionary. You can rewrite it like this:

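fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
    invoke_inner(args, &args[0].data_type())
}

// Picks the lpad kernel from the string type, looking through dictionary encoding.
fn invoke_inner(args: &[ColumnarValue], data_type: &DataType) -> Result<ColumnarValue> {
    match data_type {
        // Dispatch on the dictionary's value type rather than on the dictionary itself.
        // Note: args are passed through unchanged, so the kernel presumably still
        // has to handle or unpack dictionary-encoded arrays.
        DataType::Dictionary(_, v) => invoke_inner(args, v.as_ref()),
        Utf8 | Utf8View => make_scalar_function(lpad::<i32>, vec![])(args),
        LargeUtf8 => make_scalar_function(lpad::<i64>, vec![])(args),
        other => exec_err!("Unsupported data type {other:?} for function lpad"),
    }
}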
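
The recursion in the suggestion above can be tried in isolation with a self-contained sketch. It assumes a hypothetical stand-in `DataType` enum (not Arrow's) and a hypothetical `kernel_offset_type` function that only reports which offset width would be chosen, instead of calling `make_scalar_function`.

```rust
/// Hypothetical stand-in for Arrow's `DataType`, reduced to what the sketch needs.
#[derive(Debug)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Int32,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Reports which offset width the lpad kernel would be instantiated with.
/// A dictionary is resolved by recursing into its value type.
fn kernel_offset_type(data_type: &DataType) -> Result<&'static str, String> {
    match data_type {
        DataType::Dictionary(_, value_type) => kernel_offset_type(value_type.as_ref()),
        DataType::Utf8 | DataType::Utf8View => Ok("i32"),
        DataType::LargeUtf8 => Ok("i64"),
        other => Err(format!("Unsupported data type {other:?} for function lpad")),
    }
}
```

With this sketch, a dictionary whose values are LargeUtf8 resolves to the i64 variant, and a non-string type still produces the "Unsupported data type" error.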

Contributor Author

Thanks for the pointer! I added dictionary support for the left function. Could you take a look and let me know whether that is a good design?

Contributor Author

Hey @jayzhan211, I have added dictionary support for those four functions and for repeat, because skipping the cast when the logical types match causes some related tests to fail. Please let me know if you have any thoughts on the current implementation.

I also have a question: when I read the values out of a Dictionary to get the actual strings, the NULL values are dropped, which causes some tests to fail. Is there a helper method I can use to get the values from a Dictionary while preserving NULLs?
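
One way to picture the NULL problem: in a dictionary-encoded column, a row can be NULL because its key is NULL, so reading only the values buffer loses those rows. The sketch below models this with a hypothetical `DictColumn` struct (not an Arrow type) and unpacks it row by row through the keys. In arrow-rs, casting the dictionary array to its value type with the cast kernel is one possible way to get a similar result, though whether that fits this PR is an assumption.

```rust
/// Hypothetical model of a dictionary-encoded string column; not an Arrow type.
struct DictColumn {
    /// One entry per row: an index into `values`, or `None` for a NULL row.
    keys: Vec<Option<usize>>,
    /// The distinct strings; an entry may itself be NULL.
    values: Vec<Option<String>>,
}

impl DictColumn {
    /// Expands the column to one `Option<String>` per row.
    /// A row is NULL if its key is NULL, if the key is out of range,
    /// or if the referenced value is NULL.
    fn unpack(&self) -> Vec<Option<String>> {
        self.keys
            .iter()
            .map(|key| key.and_then(|k| self.values.get(k).cloned().flatten()))
            .collect()
    }
}
```

Iterating over `keys` keeps the row count and the NULL positions intact, whereas iterating over `values` alone would yield only the distinct strings.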

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Nov 15, 2024
@jiashenC jiashenC marked this pull request as draft November 15, 2024 19:57
Labels
functions logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
4 participants