
Optimize binlog event deserialization #11

Draft · wants to merge 20 commits into main

Conversation

@shirolimit (Collaborator) commented Jan 22, 2025

The PR aims to speed up the MySQL binary log client by changing how events are deserialized.

Instead of reading data byte-by-byte from a ByteArrayInputStream, it fully buffers each binlog event in memory as a RawBinaryLogEvent and then uses a BinaryLogEventDataReader for deserialization. The interface of BinaryLogEventDataReader is kept similar to the library's custom ByteArrayInputStream to simplify migration.

In my experiments, this improves binlog extraction speed by up to 300% (in the case where we do nothing with the extracted events).
I also tried buffering the data in a byte[] and wrapping it in a ByteArrayInputStream; that was faster than the existing approach but still about two times slower than the proposed one.

The PR also introduces some minor optimizations to row parsing (e.g. datetime) and a bunch of unit tests for the existing deserializers.
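The buffering idea described above can be sketched roughly as follows. The class name mirrors the one mentioned in the PR, but this is an illustrative assumption about its shape, not the PR's actual code; the real BinaryLogEventDataReader API may differ.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: the real BinaryLogEventDataReader API may differ.
class EventDataReader {
    private final ByteBuffer buffer;

    EventDataReader(byte[] eventBytes) {
        // The whole event is buffered up front, so individual reads
        // are plain array accesses instead of per-byte stream calls.
        this.buffer = ByteBuffer.wrap(eventBytes);
    }

    // Little-endian unsigned integer of the given byte width,
    // as used throughout the binlog format.
    long readLong(int length) {
        long result = 0;
        for (int i = 0; i < length; i++) {
            result |= ((long) (buffer.get() & 0xFF)) << (i * 8);
        }
        return result;
    }

    byte[] read(int length) {
        byte[] bytes = new byte[length];
        buffer.get(bytes);
        return bytes;
    }

    int available() {
        return buffer.remaining();
    }
}

class EventDataReaderDemo {
    public static void main(String[] args) {
        byte[] event = {0x01, 0x02, 0x00, 0x00, (byte) 0xAB};
        EventDataReader reader = new EventDataReader(event);
        System.out.println(reader.readLong(4)); // prints 513 (0x00000201, little-endian)
        System.out.println(reader.available()); // prints 1
    }
}
```

Keeping the method names close to the old stream interface (read, available, etc.) is what makes the byte-for-byte migration of the existing deserializers mechanical.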

@shirolimit force-pushed the shirolimit/optimize-event-deserialization branch from 5cba571 to 85fcc35 on January 23, 2025 at 10:06
@shirolimit (Collaborator, Author)

Force-pushed after rebasing on the latest main.


@jmlw left a comment


I don't see any obvious issues, especially since the majority of the changes are duplication to utilize a different input object.

@shirolimit force-pushed the shirolimit/optimize-event-deserialization branch from 2ac1ea7 to 888c63f on January 27, 2025 at 17:49

@DylanFlanders left a comment


Some minor comments, but overall it's looking good to me.

return BitSet.valueOf(bytes);
}

public int available() {


I'm having trouble understanding the blockLength in the ByteArrayInputStream version of available(). Do you have some more insight there? It seems that in the new data-reader version the buffer is the "block" and we resize the buffer limit in enterBlock, whereas the old input-stream version seems a bit more hacky and has to block off portions of the input stream.

@shirolimit (Collaborator, Author)


In ByteArrayInputStream, blockLength was used as a running value: it held the number of bytes still available in the current block. The stream updated blockLength on every read, and it couldn't go below zero unless we called skipToTheEndOfTheBlock. What this achieves is preventing reads from crossing a block boundary, which is needed when parsing some events.

The reader version achieves the same by manipulating the ByteBuffer's limit.
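A minimal sketch of that limit-based approach, under assumed names (only enterBlock, skipToTheEndOfTheBlock, and available are taken from the discussion above; the class itself is hypothetical):

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: shows how shrinking a ByteBuffer's limit can
// replace the running blockLength counter of the old input-stream version.
class BlockScopedReader {
    private final ByteBuffer buffer;
    private int savedLimit = -1;

    BlockScopedReader(byte[] data) {
        this.buffer = ByteBuffer.wrap(data);
    }

    // Restrict subsequent reads to the next blockLength bytes.
    void enterBlock(int blockLength) {
        savedLimit = buffer.limit();
        buffer.limit(buffer.position() + blockLength);
    }

    // Discard whatever is left of the block and restore the full limit.
    void skipToTheEndOfTheBlock() {
        buffer.position(buffer.limit());
        buffer.limit(savedLimit);
    }

    int available() {
        // While inside a block this can never exceed the block's remainder,
        // with no per-read bookkeeping required.
        return buffer.remaining();
    }

    int readByte() {
        // Reading past the block end fails fast with BufferUnderflowException.
        return buffer.get() & 0xFF;
    }
}

class BlockScopedReaderDemo {
    public static void main(String[] args) {
        BlockScopedReader reader = new BlockScopedReader(new byte[]{1, 2, 3, 4, 5, 6});
        reader.enterBlock(3);
        reader.readByte();
        System.out.println(reader.available()); // prints 2 (block remainder), not 5
        reader.skipToTheEndOfTheBlock();
        System.out.println(reader.readByte()); // prints 4, the first byte after the block
    }
}
```

The invariant the old stream maintained by decrementing blockLength on every read falls out of ByteBuffer's own position/limit bookkeeping here.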

new BigDecimal("237.00"),
new BigDecimal("10.00"),
1,
0,


It looks like this covers a number of branches in AbstractRowsEventDataDeserializer#deserializeCell, but not all. Are some column types more difficult to add than others? I'd be happy to try to help here so we can cover the AbstractRowsEventDataDeserializer changes.

@shirolimit (Collaborator, Author)


That's true, we don't cover all possible data types and branches here. I agree that ideally we should cover all the branches and methods, but I decided not to in order to keep the PR smaller. I also didn't set a goal of raising the library's code coverage to 100%, only of adding enough tests to be confident in my changes.

For the row deserialization classes, most of the changes are just overloads that reuse the same method names, so I decided to add only a few tests.

  1. Would it be better to add more unit tests to cover all cases and branches?
    It definitely would; the library is a crucial part of binlog syncs and we want it to be stable.

  2. Is it required to add them to this PR?
    I believe it shouldn't be a blocker, considering the changes introduced to the class.

I'd propose leaving it as is for now and creating a backlog task to improve the library's code coverage.
What do you think?


Okay, that makes sense to me. And thank you for the many tests that were added here!

@shirolimit force-pushed the shirolimit/optimize-event-deserialization branch from 43ceba9 to b28c388 on January 29, 2025 at 17:05