bug: segmentation fault on accessing a dangling channel when sending error from debezium #19783

Open
BugenZhao opened this issue Dec 12, 2024 · 0 comments
Labels
component/connector type/bug Something isn't working

Describe the bug

The JVM thread running the Debezium engine may access a dangling channel sender in its completionCallback, leading to a segmentation fault.

Error message/log

2024-12-12T16:49:21.062787+08:00 ERROR risingwave_connector_node: engine#16 terminated with error. message: Stopping connector after error in the application's handler method: null: java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1261)
	at java.base/java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:317)
	at java.base/java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:364)
	at com.risingwave.connector.source.core.DbzChangeEventConsumer.handleBatch(DbzChangeEventConsumer.java:293)
	at io.debezium.embedded.ConvertingEngineBuilder$ConvertingChangeConsumer.handleBatch(ConvertingEngineBuilder.java:75)
	at io.debezium.embedded.EmbeddedEngine.pollRecords(EmbeddedEngine.java:727)
	at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:466)
	at io.debezium.embedded.ConvertingEngineBuilder$1.run(ConvertingEngineBuilder.java:163)
	at com.risingwave.connector.source.core.DbzCdcEngine.run(DbzCdcEngine.java:67)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
 thread="rw-dbz-engine-runner-16" class="com.risingwave.connector.source.core.DbzCdcEngineRunner"

[PROCESS EXIT WITH CODE 139]
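
Exit code 139 is 128 + 11, i.e. the process was killed by SIGSEGV. The Java-side error that precedes it is an InterruptedException raised from ArrayBlockingQueue.put inside handleBatch. The sketch below reproduces only that first half in isolation: a producer blocked on a full queue is interrupted and sees the exception. The class name, thread name, and queue contents are made up for illustration, and the reason the engine thread gets interrupted in RisingWave (presumably the source being dropped) is an assumption, not something the log states.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo class; not part of RisingWave or Debezium.
class InterruptedPutDemo {
    /** Returns true if a producer blocked on a full queue observed InterruptedException. */
    static boolean producerSeesInterrupt() throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.put("first"); // the queue is now full, so the next put() cannot complete

        boolean[] interrupted = {false};
        Thread producer = new Thread(() -> {
            try {
                // Nobody drains the queue, so this call can only end via interruption.
                queue.put("second");
            } catch (InterruptedException e) {
                interrupted[0] = true;
            }
        }, "demo-engine-runner");

        producer.start();
        producer.interrupt(); // stands in for whatever interrupts the real engine thread
        producer.join();      // join() makes the write to interrupted[0] visible here
        return interrupted[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupted: " + producerSeesInterrupt());
    }
}
```

In the real engine, this exception is what triggers the error path that then tries to report the failure through the channel.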

To Reproduce

I first encountered this bug in #19595 (comment). It appears that using the same MySQL instance for both the CDC tests and the meta store consistently reproduces the issue. Although this setup is unlikely in production, it clearly shows that our JNI code is unsound.

  1. Edit risedev to drop the Application::Metastore guard:
     - ServiceConfig::MySql(c) if c.application != Application::Metastore => {
     + ServiceConfig::MySql(c) => {
  2. Run ./risedev ci-start xx, with profile xx defined as:
     xx:
       steps:
         - use: minio
         - use: mysql
           application: metastore
         - use: meta-node
           meta-backend: mysql
         - use: compute-node
           parallelism: 3
         - use: frontend
         - use: compactor
  3. Run ./risedev slt-clean 'e2e_test/source_inline/cdc/mysql/mysql_create_drop.slt.serial'
  4. Check the logs.

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

main, or 957f5f566854ea9adfe7bdeb77d41a581dab7011

Additional context

Attaching a debugger shows where it fails: in the error path below, where the callback passes channelPtr to Binding.sendCdcSourceErrorToChannel.

LOG.error(
        "engine#{} terminated with error. message: {}",
        sourceId,
        message,
        error);
String errorMsg =
        (error != null && error.getMessage() != null
                ? error.getMessage()
                : message);
if (!Binding.sendCdcSourceErrorToChannel(
        channelPtr, errorMsg)) {
    LOG.warn(
            "engine#{} unable to send error message: {}",
            sourceId,
            errorMsg);
}
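
The hazard in this snippet is that channelPtr is a raw native address: if the channel has already been freed on the native side, the call dereferences dangling memory. One possible way to rule that out, sketched here purely as an illustration and not as the project's actual fix, is to wrap the pointer in a guard that is invalidated before the channel is freed. Everything in this block is hypothetical: the GuardedChannel class, its close() hook, and the injected nativeSend function (which stands in for a native call such as Binding.sendCdcSourceErrorToChannel) are assumptions. It also assumes the owner of the channel is able to call close() before releasing it.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BiPredicate;

// Hypothetical wrapper; not part of RisingWave.
class GuardedChannel {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final BiPredicate<Long, String> nativeSend; // stand-in for the native send call
    private long channelPtr; // 0 means "closed, must not be passed to native code"

    GuardedChannel(long channelPtr, BiPredicate<Long, String> nativeSend) {
        this.channelPtr = channelPtr;
        this.nativeSend = nativeSend;
    }

    /** Sends the message if the channel is still open; returns false otherwise. */
    boolean sendError(String message) {
        lock.readLock().lock(); // senders may run concurrently with each other
        try {
            if (channelPtr == 0) {
                return false; // already closed: never touch the stale pointer
            }
            return nativeSend.test(channelPtr, message);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Must be called by the channel's owner before the native channel is freed. */
    void close() {
        lock.writeLock().lock(); // waits until in-flight sends have finished
        try {
            channelPtr = 0;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With such a guard, a late callback would take the LOG.warn branch instead of crashing the process. The read/write lock is chosen so that close() cannot return while a send is still using the pointer; a plain volatile flag would leave a window between the check and the native call.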
