Configuring Jitsu Bulker for Multi-Partition Kafka Topics #17

Open
ZiyaadQasem opened this issue Aug 23, 2024 · 2 comments

ZiyaadQasem commented Aug 23, 2024

In a Kubernetes deployment of Jitsu, the Bulker component is responsible for batching events and sending them to a ClickHouse instance. Currently, the Kafka topic that Bulker creates and consumes is configured with only one partition.

Challenges Encountered:

  • Partition Limitation: The single-partition setup leads to performance bottlenecks and limits scalability.
  • Data Rebalancing Issues: Attempting to manually increase the number of partitions on the existing Kafka topic results in data rebalancing problems, which can disrupt the data flow and processing.

Questions:

  • How can I instruct Jitsu Bulker to create Kafka topics with multiple partitions during their initial creation? (One possible pre-creation workaround is sketched after these questions.)

  • Are there specific configuration settings or parameters within Jitsu or Bulker that allow specifying the desired number of partitions for Kafka topics?
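
For reference, one pre-creation workaround (an untested sketch, not an official Bulker setting): if the topic already exists with the desired partition count before Bulker first writes to it, Bulker should pick it up as-is rather than creating a single-partition one. The broker address and the topic name below are assumptions; check the actual topic name Bulker generates in your deployment.

```typescript
// Pre-create the destination topic with multiple partitions before Bulker
// first writes to it, using the kafkajs admin client.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ brokers: ["kafka:9092"] }); // assumed broker address
const admin = kafka.admin();

async function preCreateTopic(topic: string, numPartitions: number) {
  await admin.connect();
  try {
    const created = await admin.createTopics({
      topics: [{ topic, numPartitions, replicationFactor: 3 }],
    });
    // createTopics resolves to false if the topic already existed
    console.log(created ? "topic created" : "topic already exists");
  } finally {
    await admin.disconnect();
  }
}

// Hypothetical Bulker-style topic name; verify against your deployment.
preCreateTopic("in.id.myDestinationId.m.batch.t.events", 4).catch(console.error);
```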

vklimontovich (Contributor) commented

We discussed this internally for a while and decided not to implement parallel processing at the moment. For data streams with deduplication enabled, parallel processing can break it: for example, if two consumers run MERGE statements in parallel, most databases won't guarantee correctness.

For non-deduped streams it can give you a performance boost, but most of the use cases we see require deduplication.

If we ever decide to go forward with this issue, here's what we would do:

  • Allow multiple partitions only for non-dedup streams; or
  • Run MERGE sequentially using a cluster-wide lock (a sketch of this option follows below)
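
To make the second option concrete, here is a minimal sketch of what serializing MERGE behind a cluster-wide lock could look like, using a Redis `SET NX PX` lock via ioredis. The lock key, TTL, retry interval, and `runMerge` callback are illustrative assumptions, not Bulker internals.

```typescript
// Sketch: serialize MERGE statements across instances with a cluster-wide
// Redis lock (SET NX PX). Not how Bulker is actually implemented.
import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis("redis://redis:6379"); // assumed Redis address

async function withMergeLock(table: string, runMerge: () => Promise<void>) {
  const key = `bulker:merge-lock:${table}`; // hypothetical key scheme
  const token = randomUUID();
  // Acquire: set the key only if it does not exist, with a 60s safety TTL.
  while ((await redis.set(key, token, "PX", 60_000, "NX")) !== "OK") {
    await new Promise((r) => setTimeout(r, 250)); // back off and retry
  }
  try {
    await runMerge(); // only one instance runs MERGE for this table at a time
  } finally {
    // Release only if we still own the lock (compare-and-delete via Lua).
    await redis.eval(
      `if redis.call("get", KEYS[1]) == ARGV[1] then
         return redis.call("del", KEYS[1])
       end
       return 0`,
      1,
      key,
      token
    );
  }
}
```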

Meanwhile, I suggest implementing parallelization by using a different destination for each table.

absorbb (Contributor) commented Aug 26, 2024

> Meanwhile, I suggest implementing parallelization by using a different destination for each table.

Actually, topics are created per table, so we already have that kind of parallelism.

To work around the current limitations, you can duplicate the destination and connection, then rotate writeKeys on the client side or split traffic using a JavaScript function (a sketch follows below). Deduplication may still work unreliably in this scenario.
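
To illustrate the rotation part of that workaround, a minimal client-side sketch: hash a stable user identifier to pick one of the duplicated connections' writeKeys, so each user's events always flow through the same connection. The key list, the FNV-1a hash choice, and the SDK call in the closing comment are assumptions; adapt them to however your client initializes the Jitsu SDK.

```typescript
// Sketch: rotate writeKeys on the client side by hashing a stable user id,
// so each user consistently maps to one of the duplicated connections.
// The writeKeys below are placeholders.
const WRITE_KEYS = ["wk_connection_1", "wk_connection_2", "wk_connection_3"];

function pickWriteKey(userId: string): string {
  // Simple FNV-1a hash; any stable hash works.
  let h = 2166136261;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return WRITE_KEYS[(h >>> 0) % WRITE_KEYS.length];
}

// Use the picked key when initializing the Jitsu client for this user,
// e.g. with @jitsu/js: jitsuAnalytics({ host, writeKey: pickWriteKey(userId) }).
```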
