Feature scalar regexp match benchmark #13789

zhuliquan · 2024-12-15T16:50:01Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

This PR adds a benchmark for scalar regex matching. I picked six cases to benchmark:
1. email
2. IP address
3. phone number
4. HTML tag
5. URL
6. formatted date string ('yyyy-MM-dd')
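
To make the six cases concrete, here is a minimal sketch of what timing one scalar pattern per case could look like. The PR description does not list the actual regexes or inputs, so every pattern, sample string, and helper name below (`CASES`, `check_cases`, `bench`) is an illustrative assumption, and Python's `re` module stands in for the Rust regex engine the real benchmark would exercise.

```python
import re
import timeit

# Hypothetical patterns and inputs: the PR does not list its regexes,
# so each entry is (pattern, string expected to match, string expected not to match).
CASES = {
    "email": (r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$",
              "user@example.com", "user@@example"),
    "ip": (r"^(?:\d{1,3}\.){3}\d{1,3}$",
           "192.168.0.1", "192.168.0"),
    "phone number": (r"^\+?\d{1,3}[- ]?\d{3}[- ]?\d{3}[- ]?\d{4}$",
                     "+1 555-123-4567", "555-12"),
    "html tag": (r"^</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?>$",
                 '<a href="x">', "<1bad>"),
    "url": (r"^https?://[^\s/]+(?:/[^\s]*)?$",
            "https://datafusion.apache.org/", "ftp//nope"),
    "date (yyyy-MM-dd)": (r"^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$",
                          "2024-12-15", "2024-13-01"),
}


def check_cases():
    """Return {case: (matched_hit, matched_miss)} as a sanity check on the patterns."""
    results = {}
    for name, (pattern, hit, miss) in CASES.items():
        compiled = re.compile(pattern)
        results[name] = (compiled.search(hit) is not None,
                         compiled.search(miss) is not None)
    return results


def bench(pattern, text, number=10_000):
    """Time `number` matches of one pre-compiled (scalar) pattern against `text`."""
    compiled = re.compile(pattern)  # compile once, as a scalar pattern would be
    return timeit.timeit(lambda: compiled.search(text), number=number)


if __name__ == "__main__":
    for name, (pattern, hit, _miss) in CASES.items():
        print(f"{name}: {bench(pattern, hit):.4f}s for 10,000 matches")
```

Compiling the pattern once outside the timed loop mirrors the point of a scalar-pattern benchmark: the pattern is constant, so only the matching cost is measured. Absolute timings from this sketch say nothing about DataFusion's performance.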

Are these changes tested?

Are there any user-facing changes?

… owned `ColumnReference`) (apache#13637) * Improve documentation * Pass owned args to ScalarFunctionArgs * Update advanced_udf with example of reusing arrays * clarify rationale for cloning * clarify comments * fix expected output

Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version. - [Release notes](https://github.com/tokio-rs/prost/releases) - [Changelog](https://github.com/tokio-rs/prost/blob/master/CHANGELOG.md) - [Commits](tokio-rs/prost@v0.13.3...v0.13.4) --- updated-dependencies: - dependency-name: prost-build dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…rojectionMapping` and `EquivalenceGroup` (apache#13675) * refactor: replace Vec with IndexMap for expression mappings in ProjectionMapping and EquivalenceGroup * chore * chore: Fix CI * chore: comment * chore: simplify

* fix: Fix parse_sql_expr not handling alias * cargo fmt * fix parse_sql_expr example(remove alias) * add testing * add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function * revert change on example `parse_sql_expr`

Omega359 · 2024-12-22T16:09:46Z

I ran your benchmark and it looks good. However, I am unsure of the benefit of this benchmark over the existing 'regx' benchmark, which targets the UDF versions of the PostgreSQL symbols; most of the logic is in the regexp::regexp_is_match function for both.

zhuliquan and others added 27 commits December 11, 2024 23:22

bench: scalar regex match benchmark

ecd4793

refactor: migrate LinearSearch to HashTable (apache#13658)

c3e0951

For apache#13433.

Minor: Comment temporary function for documentation migration (apache…

1e507ad

…#13669) * Minor: Comment temporary function for documentation migration * Minor: Comment temporary function for documentation migration

Minor: Rephrase MSRV policy to be more explanatory (apache#13668)

61fd077

* Minor: Rephrase MSRV policy to be more explanatory Co-authored-by: Andrew Lamb <[email protected]> * MSRV policy update --------- Co-authored-by: Andrew Lamb <[email protected]>

fix: repartitioned reads of CSV with custom line terminator (apache#1…

67260a0

…3677)

chore: macros crate cleanup (apache#13685)

3618cfe

* Remove unused dependencies from macros crate * rename macro lib to user_doc

Refactor regexplike signature (apache#13394)

d3e0860

* update * update * update * clean up errors * fix flags types * fix failed example

Temporary fix for CI (apache#13689)

d39852d

refactor: use LazyLock in the user_doc macro (apache#13684)

98372cc

* refactor: use `LazyLock` in the `user_doc` macro * Fix cargo doc * Update datafusion/macros/src/lib.rs * Fix doc comment --------- Co-authored-by: Oleks V <[email protected]>

Unlock lexical-write-integer version. (apache#13693)

e8226f5

Issue was patched as of lexical release 1.0.5. Reverts apache#13689 Closes apache#13686

Minor: Use div_ceil

bd91271

Fix hash join with sort push down (apache#13560)

45926ab

* fix: join with sort push down * chore: insert some value * apply suggestion * recover handle_costom_pushdown change * apply suggestion * add more test * add partition

Improve substr() performance by avoiding using owned string (apache#1…

16d2ab1

…3688) Co-authored-by: zhangli20 <[email protected]>

reinstate down_cast_any_ref (apache#13705)

d8c9cfb

Optimize performance of character_length function (apache#13696)

f8c0efe

* Optimize performance of function Signed-off-by: Tai Le Manh <[email protected]> * Add pre-check array is null * Fix clippy warnings --------- Signed-off-by: Tai Le Manh <[email protected]>

Minor: Output elapsed time for sql logic test (apache#13718)

5dc6e42

* Minor: Output elapsed time for sql logic test

refactor: simplify the make_udf_function macro (apache#13712)

4fb9d2a

Improve documentation for TableProvider (apache#13724)

ddfc9e5

Reveal implementing type and return type in simple UDF implementations (

b494157

apache#13730) Debug trait is useful for understanding what something is and how it's configured, especially if the implementation is behind dyn trait.

minor: Extract tests for EXTRACT AND date_part to their own file (a…

3b5daa2

…pache#13731)

Support unparsing UNNEST plan to UNNEST table factor SQL (apache#…

50ce883

…13660) * add `unnest_as_table_factor` and `UnnestRelationBuilder` * unparse unnest as table factor * fix typo * add tests for the default configs * add a static const for unnest_placeholder * fix tests * fix tests

Merge branch 'apache:main' into feature-scalar_regexp_match_benchmark

13b581a

Merge branch 'apache:main' into feature-scalar_regexp_match_benchmark

065eb47

github-actions bot added the core Core DataFusion crate label Dec 15, 2024

zhuliquan mentioned this pull request Dec 15, 2024

feat: scalar regex match physical expr #12270

Open

fix: take taplo formatter suggestion

c697bb0
