Add streaming 'jsonl' parser #3831

Merged
merged 7 commits into from
Nov 21, 2024

Conversation

asgerf
Contributor

@asgerf asgerf commented Nov 19, 2024

Replaces the jsonl parser with a streaming version that is at least as fast as the sync version. Also see original PR against the hackathon branch.

Thanks to @esbena for the initial streaming version used for the hackathon. That version used the readline library to split lines, which turned out to be a bottleneck and made the streaming parser slower than the original sync version. So I wrote one that doesn't use readline.
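
A minimal sketch of the general idea, not the merged code: read the file as a stream of text chunks, look for newline characters directly with `indexOf`, and carry the unfinished tail of each chunk over to the next one. The names `LineSplitter` and `readJsonlStreaming` are hypothetical, and the sketch assumes one JSON value per line, with blank lines skipped.

```typescript
import { createReadStream } from "fs";

/** Accumulates text chunks and hands back the lines completed so far. */
class LineSplitter {
  private tail = "";

  /** Feed one chunk; returns every line that this chunk completed. */
  push(chunk: string): string[] {
    const lines: string[] = [];
    let start = 0;
    let end = chunk.indexOf("\n");
    while (end !== -1) {
      // Prepend whatever was left over from earlier chunks.
      lines.push(this.tail + chunk.slice(start, end));
      this.tail = "";
      start = end + 1;
      end = chunk.indexOf("\n", start);
    }
    // Keep the unfinished remainder for the next chunk.
    this.tail += chunk.slice(start);
    return lines;
  }

  /** Returns the final line if the input did not end with a newline. */
  flush(): string[] {
    const rest = this.tail;
    this.tail = "";
    return rest.length > 0 ? [rest] : [];
  }
}

/** Streams a JSONL file and awaits `handler` for each parsed value. */
async function readJsonlStreaming<T>(
  path: string,
  handler: (value: T) => Promise<void>,
): Promise<void> {
  const splitter = new LineSplitter();
  // With an encoding set, the stream yields strings rather than Buffers.
  const stream = createReadStream(path, { encoding: "utf8" });
  for await (const chunk of stream) {
    for (const line of splitter.push(chunk as string)) {
      if (line.trim().length > 0) await handler(JSON.parse(line) as T);
    }
  }
  for (const line of splitter.flush()) {
    if (line.trim().length > 0) await handler(JSON.parse(line) as T);
  }
}
```

Only one chunk plus the current partial line is held in memory at a time, so memory use stays bounded even when the file is very large.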

Running the benchmark on a 21 MB logfile:

  • readJsonlReferenceImpl: 172.4 ms (original non-streaming version)
  • readJsonlFile: 283.3 ms (streaming version based on readline)
  • readJsonlFile2: 151.3 ms (new version without readline)
  • justReadline: 187.5 ms (consumes the file with readline and nothing else)

On a 520 MB logfile:

  • readJsonlReferenceImpl: out of memory
  • readJsonlFile: 6439.4 ms
  • readJsonlFile2: 3538.4 ms
  • justReadline: 3664.3 ms

I've added the benchmark script, although the project doesn't seem to have much infrastructure for benchmark scripts. At the moment you'll have to run it with something like ts-node, and there are no tests to ensure the benchmark script keeps working (but it will be checked for compilation errors). I'm on the fence about whether it should be committed.
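
For illustration only, a timing harness for this kind of comparison could look like the sketch below. The PR does not say how the actual script measures or averages its runs, so the `benchmark` and `mean` helpers and the choice of five iterations are assumptions.

```typescript
import { performance } from "perf_hooks";

/** Arithmetic mean of a non-empty list of numbers. */
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Runs `run` several times and returns the mean wall-clock time in ms. */
async function benchmark(
  name: string,
  run: () => Promise<void>,
  iterations = 5, // assumed; not taken from the PR
): Promise<number> {
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    times.push(performance.now() - start);
  }
  const result = mean(times);
  console.log(`${name}: ${result.toFixed(1)} ms`);
  return result;
}
```

Each candidate reader would be wrapped in a `run` callback that consumes the same log file, so the printed numbers are comparable.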

The current build setup doesn't seem to have a concept for benchmark
scripts, so for now you'll have to run it with something like ts-node.
@esbena

esbena commented Nov 19, 2024

Perhaps add a minor comment about why readFileSync and readline are insufficient alternatives.
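
For context, a non-streaming reader built on `readFileSync` typically has the shape sketched below. This is not the project's `readJsonlReferenceImpl`; the names `parseJsonlText` and `readJsonlSync` are hypothetical, and one JSON value per line is assumed. The whole file has to fit in a single string, and every parsed record is held in an array at once, which is consistent with the out-of-memory result reported for the 520 MB log.

```typescript
import { readFileSync } from "fs";

/** Parses JSONL text that is already fully in memory. */
function parseJsonlText<T>(text: string): T[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line) as T);
}

/** Reads the entire file into one string before parsing anything. */
function readJsonlSync<T>(path: string): T[] {
  return parseJsonlText<T>(readFileSync(path, "utf8"));
}
```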

@asgerf asgerf marked this pull request as ready for review November 19, 2024 13:02
@asgerf asgerf requested a review from a team as a code owner November 19, 2024 13:02

@esbena esbena left a comment

LGTM, but maybe the owning team wants a say as well.

Contributor

@aeisenberg aeisenberg left a comment

Looks great. Some minor comments.

extensions/ql-vscode/src/common/jsonl-reader.ts
    await handler(JSON.parse(buffer));
  } catch (e) {
    reject(e);
    return;
Contributor

Minor: If you move the logger.log statement and the resolve() call into the try block, you won't need a return here.
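
A sketch of what this suggestion might look like, assuming the quoted snippet sits inside a promise executor that has `logger`, `resolve`, and `reject` in scope. The function name `finishLastRecord`, the `Logger` type, and the log message are hypothetical; only the `try`/`catch` shape comes from the quoted code.

```typescript
type Logger = { log(message: string): void };

/** Handles the final buffered record, then settles the outer promise. */
async function finishLastRecord<T>(
  buffer: string,
  handler: (value: T) => Promise<void>,
  logger: Logger,
  resolve: () => void,
  reject: (reason: unknown) => void,
): Promise<void> {
  try {
    await handler(JSON.parse(buffer) as T);
    // Success-only steps live inside the try block, so the catch
    // branch no longer needs an early return to skip them.
    logger.log("Finished parsing"); // message text is made up
    resolve();
  } catch (e) {
    reject(e);
  }
}
```

One behavioral difference to keep in mind: with this layout, an exception thrown by `logger.log` is also routed to `reject` instead of escaping the function.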

@aeisenberg
Contributor

I think it's fine to check the benchmark script in.

Contributor

Can you add a file comment, or a README describing what the benchmark does and that it's not being run on a regular basis? Also, it would be nice to include the results that you have in the PR description.

Contributor

@aeisenberg aeisenberg left a comment

Nice.

@asgerf asgerf merged commit b840c38 into github:main Nov 21, 2024
16 checks passed
3 participants