Improve parsing performance by reducing token cloning #1587

davisp · 2024-12-11T18:10:59Z

This basically just takes the approach proposed by @alamb in #1561 and runs with it.

Part of Improve performance by not copying Tokens as much #1558

The main changes involved:

Introduce a number of token parsing methods that return a `&TokenWithSpan` instead of a cloned `TokenWithSpan`.
Update callers to make use of these new reference-returning functions.
Refactor some of the core peek/expect/consume methods to avoid unnecessary clones.
Replace all but one use of `expect_keyword` with `expect_keyword_is`.
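
As a rough illustration of the difference between the cloning and the reference-returning style, here is a minimal sketch. The `Token`, `TokenWithSpan`, and `Parser` types below are simplified stand-ins, not sqlparser's real definitions, and `consume_comma` plus the trailing-EOF handling in `new` are assumptions made only to keep the example self-contained.

```rust
// Simplified stand-ins for illustration; not sqlparser's real definitions.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Comma,
    Eof,
}

#[derive(Debug, Clone, PartialEq)]
struct TokenWithSpan {
    token: Token,
    line: u64, // the real span type is richer; one field is enough here
}

struct Parser {
    tokens: Vec<TokenWithSpan>, // always ends with an Eof token (see `new`)
    index: usize,
}

impl Parser {
    fn new(mut tokens: Vec<TokenWithSpan>) -> Self {
        let line = tokens.last().map_or(0, |t| t.line);
        tokens.push(TokenWithSpan { token: Token::Eof, line });
        Parser { tokens, index: 0 }
    }

    /// Borrowing peek: returns a reference into the token buffer.
    fn peek_token_ref(&self) -> &TokenWithSpan {
        // Clamp so that peeking past the end keeps returning the trailing Eof.
        &self.tokens[self.index.min(self.tokens.len() - 1)]
    }

    /// Cloning peek: copies the token, including any `String` it owns.
    fn peek_token(&self) -> TokenWithSpan {
        self.peek_token_ref().clone()
    }

    /// Consume the next token if it is a comma. The token is only
    /// inspected, so the borrowing peek is sufficient.
    fn consume_comma(&mut self) -> bool {
        let is_comma = matches!(self.peek_token_ref().token, Token::Comma);
        if is_comma {
            self.index += 1;
        }
        is_comma
    }
}
```

Callers that only inspect the token, like `consume_comma` here, never pay for a `String` clone; callers that need ownership can still call the cloning variant.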

Results on my M1 MacBook Pro:

# main:00abaf218

❯ cargo bench -- --save-baseline main
    Finished `bench` profile [optimized] target(s) in 0.11s
     Running benches/sqlparser_bench.rs (target/release/deps/sqlparser_bench-9f7a6e6e193d8f5c)
sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [4.2036 µs 4.2197 µs 4.2372 µs]
                        change: [-2.4618% -2.1421% -1.8071%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
Benchmarking sqlparser-rs parsing benchmark/sqlparser::with_select: Collecting 100 samples in estimated 5.0715 s (278k iteratio
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [18.360 µs 18.425 µs 18.486 µs]
                        change: [-5.3763% -2.5663% -0.8662%] (p = 0.02 < 0.05)
                        Change within noise threshold.

# reduce-token-cloning:1ffa2aa4

❯ cargo bench -- --baseline main
   Compiling sqlparser v0.52.0 (/Users/davisp/github/davisp/datafusion-sqlparser-rs)
   Compiling sqlparser_bench v0.1.0 (/Users/davisp/github/davisp/datafusion-sqlparser-rs/sqlparser_bench)
    Finished `bench` profile [optimized] target(s) in 15.76s
     Running benches/sqlparser_bench.rs (target/release/deps/sqlparser_bench-9f7a6e6e193d8f5c)
sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [2.9006 µs 2.9020 µs 2.9034 µs]
                        change: [-31.254% -31.013% -30.769%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking sqlparser-rs parsing benchmark/sqlparser::with_select: Collecting 100 samples in estimated 5.0415 s (364k iteratio
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [13.785 µs 13.813 µs 13.843 µs]
                        change: [-25.054% -24.778% -24.494%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

I've been sitting around a 10% improvement, with each added commit not moving the needle all that much. Then, after poking a bit harder at Instruments, I realized that some super hot paths are in the token manipulation methods themselves. This adds an 18% improvement over my previous commit, which gives a grand total of about 28% on both benchmarks.
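
For reference, the relative change can be recomputed by hand from the point estimates (the middle value of each `time:` line) in the Criterion output above. This is only a sketch: `relative_change` and `report` are hypothetical helper names, and the result will not exactly match Criterion's own `change:` line, which Criterion derives from its own statistics rather than from two point estimates.

```rust
/// Relative change from `before` to `after`, as a fraction.
/// Negative means `after` is faster. Hypothetical helper name.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before
}

// Point estimates in microseconds, copied from the `time:` lines above.
const SELECT_MAIN: f64 = 4.2197;
const SELECT_BRANCH: f64 = 2.9020;
const WITH_SELECT_MAIN: f64 = 18.425;
const WITH_SELECT_BRANCH: f64 = 13.813;

/// Format both relative changes as signed percentages.
fn report() -> String {
    format!(
        "select: {:+.1}%, with_select: {:+.1}%",
        100.0 * relative_change(SELECT_MAIN, SELECT_BRANCH),
        100.0 * relative_change(WITH_SELECT_MAIN, WITH_SELECT_BRANCH),
    )
}
```

With these inputs the two benchmarks come out at roughly 31% and 25% faster, which brackets the "about 28%" figure.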

Except for a single instance, every use of expect_keyword was ignoring the returned token. This adds a new `expect_keyword_is` that avoids that unnecessary clone. I nearly added a `#[must_use]` attribute to the `expect_keyword` method, but decided against it as that feels like a breaking API change even if it would nudge folks toward the correct method.
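
The shape of the two methods can be sketched as follows. This uses simplified stand-in types (`Keyword`, `TokenWithSpan`, `Parser`, and a `String` error instead of `ParserError`), and the method bodies and the helper `peek_keyword` are assumptions for illustration, not the code in this PR.

```rust
// Simplified stand-ins for illustration; not sqlparser's real definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    Select,
    From,
}

#[derive(Debug, Clone, PartialEq)]
struct TokenWithSpan {
    keyword: Option<Keyword>,
    text: String,
}

struct Parser {
    tokens: Vec<TokenWithSpan>,
    index: usize,
}

impl Parser {
    /// Hypothetical helper: borrow the next token if it is `expected`.
    fn peek_keyword(&self, expected: Keyword) -> Option<&TokenWithSpan> {
        self.tokens
            .get(self.index)
            .filter(|t| t.keyword == Some(expected))
    }

    /// Consume the keyword and return an owned copy of the matched token.
    /// The clone is wasted work whenever the caller ignores the result.
    fn expect_keyword(&mut self, expected: Keyword) -> Result<TokenWithSpan, String> {
        match self.peek_keyword(expected).cloned() {
            Some(tok) => {
                self.index += 1;
                Ok(tok)
            }
            None => Err(format!("Expected: {:?}", expected)),
        }
    }

    /// Same check, but nothing is returned on success, so nothing is cloned.
    fn expect_keyword_is(&mut self, expected: Keyword) -> Result<(), String> {
        if self.peek_keyword(expected).is_some() {
            self.index += 1;
            Ok(())
        } else {
            Err(format!("Expected: {:?}", expected))
        }
    }
}
```

A call site written as `self.expect_keyword(Keyword::From)?;` discards the token, so it can switch to `expect_keyword_is` without any change in behavior.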

alamb

Thank you very much @davisp -- this is a very nice PR ❤️

I ran the benchmarks on a Linux machine as well and see the same improvements you did. Very nice.

++ critcmp main reduce-token-cloning
group                                                    main                                   reduce-token-cloning
-----                                                    ----                                   --------------------
sqlparser-rs parsing benchmark/sqlparser::select         1.46      6.4±0.03µs        ? ?/sec    1.00      4.4±0.02µs        ? ?/sec
sqlparser-rs parsing benchmark/sqlparser::with_select    1.30     30.7±0.09µs        ? ?/sec    1.00     23.5±0.10µs        ? ?/sec
++ popd
~/datafusion-benchmarking

FYI @iffyio and @Dandandan

alamb · 2024-12-12T15:15:53Z

src/parser/mod.rs

@@ -186,6 +186,15 @@ impl std::error::Error for ParserError {}
 // By default, allow expressions up to this deep before erroring
 const DEFAULT_REMAINING_DEPTH: usize = 50;

+// A constant EOF token that can be referenced.


💯

This might be useful enough to make `pub const` as well (so users of the crate can do the same thing).
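
To illustrate why a referenceable EOF constant helps, here is a minimal sketch with simplified stand-in types (the `line`/`column` fields and this `Parser` are assumptions, not the crate's real definitions). A method that returns `&TokenWithSpan` needs something to point at once the input is exhausted, and a constant gives it a value that outlives the call.

```rust
// Simplified stand-ins for illustration; not sqlparser's real definitions.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(String),
    Eof,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TokenWithSpan {
    pub token: Token,
    pub line: u64,
    pub column: u64,
}

/// A constant EOF token that can be referenced.
pub const EOF_TOKEN: TokenWithSpan = TokenWithSpan {
    token: Token::Eof,
    line: 0,
    column: 0,
};

pub struct Parser {
    tokens: Vec<TokenWithSpan>,
    index: usize,
}

impl Parser {
    /// Borrow the current token, or the shared EOF token past the end.
    pub fn peek_token_ref(&self) -> &TokenWithSpan {
        self.tokens.get(self.index).unwrap_or(&EOF_TOKEN)
    }
}
```

Making the constant `pub` would let downstream code that wraps the parser use the same fallback in its own reference-returning helpers.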

alamb · 2024-12-12T15:20:28Z

src/parser/mod.rs

@@ -3376,22 +3407,26 @@ impl<'a> Parser<'a> {
        matched
    }

+    pub fn next_token(&mut self) -> TokenWithSpan {


Can we please add doc comments to these methods as well?

Maybe we can also mention "see next_token_ref for a faster, non-copying API"

alamb · 2024-12-12T15:21:41Z

src/parser/mod.rs

+    /// If the current token is the `expected` keyword, consume the token.
+    /// Otherwise, return an error.
+    ///
+    /// This differs from expect_keyword only in that the matched keyword


It might be really nice to add a link in the docs too:

Suggested change

-    /// This differs from expect_keyword only in that the matched keyword
+    /// This differs from [`Self::expect_keyword`] only in that the matched keyword

alamb · 2024-12-12T15:23:17Z

src/parser/mod.rs

                &format!("one of {}", keywords.join(" or ")),
-                self.peek_token(),
+                self.peek_token_ref(),
            )
        }
    }

    /// If the current token is the `expected` keyword, consume the token.
    /// Otherwise, return an error.
    pub fn expect_keyword(&mut self, expected: Keyword) -> Result<TokenWithSpan, ParserError> {


Maybe as a follow-on PR we could mark this API as deprecated, both to locate other uses of it and to help downstream consumers upgrade 🤔

datafusion-sqlparser-rs/src/tokenizer.rs

Line 604 in 5de5312

#[deprecated(since = "0.53.0", note = "please use `TokenWithSpan` instead")]
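
Applied to `expect_keyword`, such a deprecation might look like the sketch below. The `since` version, the note text, the `&str`-based stand-in `Parser`, and `legacy_caller` are all placeholders chosen for illustration; nothing here is taken from the crate.

```rust
// Stand-in parser over plain keyword strings; not sqlparser's real type.
pub struct Parser {
    keywords: Vec<&'static str>,
    index: usize,
}

impl Parser {
    /// Hypothetical deprecation; the version and note are placeholders.
    #[deprecated(
        since = "0.0.0",
        note = "use `expect_keyword_is` unless the matched token is needed"
    )]
    pub fn expect_keyword(&mut self, expected: &str) -> Result<String, String> {
        self.expect_keyword_is(expected)?;
        // The keyword just consumed sits one position behind the cursor.
        Ok(self.keywords[self.index - 1].to_string())
    }

    /// Consume `expected` if it is next; return nothing on success.
    pub fn expect_keyword_is(&mut self, expected: &str) -> Result<(), String> {
        match self.keywords.get(self.index) {
            Some(kw) if *kw == expected => {
                self.index += 1;
                Ok(())
            }
            _ => Err(format!("Expected: {expected}")),
        }
    }
}

/// Existing callers keep compiling; each call site gets a deprecation
/// warning unless it opts out explicitly, as done here.
#[allow(deprecated)]
pub fn legacy_caller(parser: &mut Parser) -> Result<String, String> {
    parser.expect_keyword("SELECT")
}
```

Because `#[deprecated]` produces a warning rather than an error by default, downstream crates keep building while the compiler points at every remaining call site.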

iffyio

LGTM! Thanks @davisp!

alamb · 2024-12-14T11:31:37Z

I merged this PR up to main to resolve a conflict

alamb · 2024-12-19T11:37:51Z

This PR has another conflict now

What I am hoping / planning to do is make the suggested doc improvements above and get it merged in. I also think it would be fine to merge it in and make the doc comment improvements as a follow-on PR.

I'll try to do this if no one else beats me to it

davisp added 8 commits December 11, 2024 11:59

Avoid cloning tokens in parse_prefix

8e5bf49

Avoid cloning tokens in parse_infix

52432ff

Avoid cloning tokens in parse_prefix_with_*

86467ce

Avoid cloning tokens in parse_data_type_helper

3d7169f

Avoid cloning tokens in parse_identifier

7922db0

Document save-baesline and baseline arguments

1ffa2aa

This was referenced Dec 11, 2024

Optimize Token::make_word #1588

Open

POC to show performance improvements of not copying token #1561

Draft

Fix error reporting

e6dcc38

alamb approved these changes Dec 12, 2024


alamb changed the title ~~Reduce token cloning~~ Improve parsing performance by reducing token cloning Dec 12, 2024

alamb mentioned this pull request Dec 12, 2024

Implement Spanned to retrieve source locations on AST nodes #1435

Merged

iffyio approved these changes Dec 13, 2024


Dandandan approved these changes Dec 13, 2024


Merge remote-tracking branch 'apache/main' into reduce-token-cloning

56c070b
