Improve parsing performance by reducing token cloning #1587

Open · wants to merge 10 commits into main from reduce-token-cloning

Conversation

@davisp (Member) commented Dec 11, 2024

This basically just takes the approach proposed by @alamb in #1561 and runs with it.

The main changes involved (a rough, self-contained sketch of the core idea follows this list):

  1. New token parsing methods that return a `&TokenWithSpan` instead of a cloned `TokenWithSpan`.
  2. Updates throughout the parser to use these new ref-returning functions.
  3. Refactoring some of the core peek/expect/consume methods to avoid unnecessary clones.
  4. Replacing all but one use of `expect_keyword` with `expect_keyword_is`.
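
To make the ref-returning idea concrete, here is a minimal toy sketch. The types below (`Token`, `TokenWithSpan`, `Parser`, `EOF_TOKEN`) are simplified stand-ins, not the crate's actual definitions:

// Toy sketch of the ref-returning approach; simplified stand-in types.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Eof, // mirrors the crate's Token::EOF
}

#[derive(Debug, Clone, PartialEq)]
struct TokenWithSpan {
    token: Token,
}

// A constant EOF token that can be handed out by reference when the
// parser runs past the end of its token buffer.
const EOF_TOKEN: TokenWithSpan = TokenWithSpan { token: Token::Eof };

struct Parser {
    tokens: Vec<TokenWithSpan>,
    index: usize,
}

impl Parser {
    // Old style: every peek clones the token, including any heap data
    // such as the String inside Token::Word.
    fn peek_token(&self) -> TokenWithSpan {
        self.peek_token_ref().clone()
    }

    // New style: borrow from the buffer, so hot paths never allocate.
    fn peek_token_ref(&self) -> &TokenWithSpan {
        self.tokens.get(self.index).unwrap_or(&EOF_TOKEN)
    }
}

fn main() {
    let parser = Parser {
        tokens: vec![TokenWithSpan { token: Token::Word("SELECT".into()) }],
        index: 0,
    };
    // Peeking by reference avoids cloning the underlying String.
    assert_eq!(parser.peek_token_ref().token, Token::Word("SELECT".into()));
    // An owned token is still available where the caller really needs one.
    let _owned = parser.peek_token();

    // Past the end of the buffer, the shared EOF constant is returned by reference.
    let exhausted = Parser { tokens: vec![], index: 0 };
    assert_eq!(exhausted.peek_token_ref(), &EOF_TOKEN);
}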

Results on my M1 MacBook Pro:

# main:00abaf218

❯ cargo bench -- --save-baseline main
    Finished `bench` profile [optimized] target(s) in 0.11s
     Running benches/sqlparser_bench.rs (target/release/deps/sqlparser_bench-9f7a6e6e193d8f5c)
sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [4.2036 µs 4.2197 µs 4.2372 µs]
                        change: [-2.4618% -2.1421% -1.8071%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
Benchmarking sqlparser-rs parsing benchmark/sqlparser::with_select: Collecting 100 samples in estimated 5.0715 s (278k iteratio
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [18.360 µs 18.425 µs 18.486 µs]
                        change: [-5.3763% -2.5663% -0.8662%] (p = 0.02 < 0.05)
                        Change within noise threshold.

# reduce-token-cloning:1ffa2aa4

❯ cargo bench -- --baseline main
   Compiling sqlparser v0.52.0 (/Users/davisp/github/davisp/datafusion-sqlparser-rs)
   Compiling sqlparser_bench v0.1.0 (/Users/davisp/github/davisp/datafusion-sqlparser-rs/sqlparser_bench)
    Finished `bench` profile [optimized] target(s) in 15.76s
     Running benches/sqlparser_bench.rs (target/release/deps/sqlparser_bench-9f7a6e6e193d8f5c)
sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [2.9006 µs 2.9020 µs 2.9034 µs]
                        change: [-31.254% -31.013% -30.769%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking sqlparser-rs parsing benchmark/sqlparser::with_select: Collecting 100 samples in estimated 5.0415 s (364k iteratio
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [13.785 µs 13.813 µs 13.843 µs]
                        change: [-25.054% -24.778% -24.494%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

I'd been sitting at around a 10% improvement, with each added commit not moving the needle all that much. Then, after poking a bit harder at Instruments, I realized that some of the super hot paths are in the token manipulation methods themselves. This adds an 18% improvement over my previous commit, which gives a grand total of about 28% on both benchmarks.

Except for a single instance, every use of expect_keyword was ignoring
the returned token. This adds a new `expect_keyword_is` that avoids that
unnecessary clone.

I nearly added a `#[must_use]` attribute to the `expect_keyword` method,
but decided against it as that feels like a breaking API change even if
it would nudge folks toward the correct method.
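
As a rough illustration of why the non-cloning variant helps, here is a toy sketch. The types and method bodies are hypothetical stand-ins; only the method names `expect_keyword` and `expect_keyword_is` come from this PR:

// Toy stand-ins; not the crate's actual types or implementations.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    Select,
    From,
}

#[derive(Debug, Clone, PartialEq)]
struct TokenWithSpan {
    keyword: Keyword,
}

#[derive(Debug)]
struct ParserError(String);

struct Parser {
    tokens: Vec<TokenWithSpan>,
    index: usize,
}

impl Parser {
    // Checks the current token against `expected` and consumes it,
    // returning nothing, so no clone is made.
    fn expect_keyword_is(&mut self, expected: Keyword) -> Result<(), ParserError> {
        match self.tokens.get(self.index) {
            Some(t) if t.keyword == expected => {
                self.index += 1;
                Ok(())
            }
            other => Err(ParserError(format!("expected {expected:?}, found {other:?}"))),
        }
    }

    // Same check, but hands back an owned copy of the matched token;
    // wasteful when the caller only cares about the side effect.
    fn expect_keyword(&mut self, expected: Keyword) -> Result<TokenWithSpan, ParserError> {
        self.expect_keyword_is(expected)?;
        Ok(self.tokens[self.index - 1].clone())
    }
}

fn main() -> Result<(), ParserError> {
    let mut parser = Parser {
        tokens: vec![
            TokenWithSpan { keyword: Keyword::Select },
            TokenWithSpan { keyword: Keyword::From },
        ],
        index: 0,
    };

    // Call sites that ignore the token use the non-cloning variant...
    parser.expect_keyword_is(Keyword::Select)?;

    // ...while the rare call site that needs the token still gets an owned copy.
    let tok = parser.expect_keyword(Keyword::From)?;
    assert_eq!(tok.keyword, Keyword::From);
    Ok(())
}
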
@alamb (Contributor) left a comment

Thank you very much @davisp -- this is a very nice PR ❤️

I ran the benchmarks as well on a linux machine and see the same improvements you did. Very nice :bowtie:

++ critcmp main reduce-token-cloning
group                                                    main                                   reduce-token-cloning
-----                                                    ----                                   --------------------
sqlparser-rs parsing benchmark/sqlparser::select         1.46      6.4±0.03µs        ? ?/sec    1.00      4.4±0.02µs        ? ?/sec
sqlparser-rs parsing benchmark/sqlparser::with_select    1.30     30.7±0.09µs        ? ?/sec    1.00     23.5±0.10µs        ? ?/sec
++ popd
~/datafusion-benchmarking

FYI @iffyio and @Dandandan

@@ -186,6 +186,15 @@ impl std::error::Error for ParserError {}
// By default, allow expressions up to this deep before erroring
const DEFAULT_REMAINING_DEPTH: usize = 50;

// A constant EOF token that can be referenced.

💯

This might be useful enough to make pub const as well (so users of the crate can do the same thing)

@@ -3376,22 +3407,26 @@ impl<'a> Parser<'a> {
matched
}

pub fn next_token(&mut self) -> TokenWithSpan {

Can we please add doc comments to these methods as well?

Maybe we can also mention "see next_token_ref for a faster, non-copying API"
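
For illustration, one possible shape for such a doc comment; the wording and the body shown are a hypothetical sketch, not the crate's actual code:

/// Consumes the current token and returns an owned (cloned) copy of it.
///
/// See [`Self::next_token_ref`] for a faster variant that returns a
/// reference to the token instead of cloning it.
pub fn next_token(&mut self) -> TokenWithSpan {
    // One plausible implementation: delegate to the ref-returning method
    // and clone only at this boundary (sketch, not verified against the PR).
    self.next_token_ref().clone()
}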

/// If the current token is the `expected` keyword, consume the token.
/// Otherwise, return an error.
///
/// This differs from expect_keyword only in that the matched keyword

It might be really nice to add a link in the docs too:

Suggested change:
- /// This differs from expect_keyword only in that the matched keyword
+ /// This differs from [`Self::expect_keyword`] only in that the matched keyword

&format!("one of {}", keywords.join(" or ")),
- self.peek_token(),
+ self.peek_token_ref(),
)
}
}

/// If the current token is the `expected` keyword, consume the token.
/// Otherwise, return an error.
pub fn expect_keyword(&mut self, expected: Keyword) -> Result<TokenWithSpan, ParserError> {

Maybe as a follow-on PR we could mark this API as deprecated to locate other uses of this API as well as help downstream consumers upgrade 🤔

#[deprecated(since = "0.53.0", note = "please use `TokenWithSpan` instead")]

@alamb changed the title from "Reduce token cloning" to "Improve parsing performance by reducing token cloning" on Dec 12, 2024
@iffyio (Contributor) left a comment

LGTM! Thanks @davisp!

@alamb (Contributor) commented Dec 14, 2024

I merged this PR up to main to resolve a conflict

@alamb (Contributor) commented Dec 19, 2024

This PR has another conflict now

What I am hoping/planning to do is make the suggested doc improvements above and get it merged in. I also think it would be fine to merge it in and make the doc comment improvements as a follow-on PR

I'll try to do this if no one else beats me to it
