Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-8637]: storage: fix race between disk_log_impl::new_segment() and disk_log_impl::close() #24635

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Dec 20, 2024

disk_log_impl::close() and disk_log_impl::remove() both set _closed to true. There is an assert in new_segment() which checks that the log is not closed under _segments_rolling_lock. However, neither close() or remove() previously respected this lock, which could lead to a race condition and a triggered assert, "cannot add log segment to closed log".

Await the _segments_rolling_lock in close() and remove(), and indicate it as broken() for future waiters before setting _closed = true to avoid this race.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes a bug in which a segment being added to a log while the log was being closed could race, leading to a triggered vassert().

`disk_log_impl::close()` and `disk_log_impl::remove()` both set
`_closed` to `true`. There is an assert in `new_segment()` which checks
that the log is not closed under `_segments_rolling_lock`. However,
neither `close()` or `remove()` previously respected this lock, which
could lead to a race condition and a triggered assert,
`"cannot add log segment to closed log"`.

Await the `_segments_rolling_lock` in `close()` and `remove()`, and
additional indicate it as `broken()` for future waiters before setting
`_closed = true` to avoid this race.
@WillemKauf WillemKauf changed the title storage: fix race between disk_log_impl::new_segment() and disk_log_impl::close() [CORE-8637]: storage: fix race between disk_log_impl::new_segment() and disk_log_impl::close() Dec 20, 2024
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#60020

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/e2e_shadow_indexing_test.py::ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy@{"cloud_storage_type":2,"short_retention":true}

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 21, 2024

CI test results

test results on build#60020
test_id test_kind job_url test_status passed
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/60020#0193e60d-4720-48e4-843b-4906a54a4e23 FAIL 0/6
test results on build#60032
test_id test_kind job_url test_status passed
rptest.tests.archival_test.ArchivalTest.test_all_partitions_leadership_transfer.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/60032#0193e6e8-dbd4-4f0e-9a3b-b530ebfd1d4c FAIL 0/1
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60032#0193e704-7372-422b-8491-ae9f722727e7 FLAKY 2/6

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#60032

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/archival_test.py::ArchivalTest.test_all_partitions_leadership_transfer@{"cloud_storage_type":2}

@nvartolomei
Copy link
Contributor

@WillemKauf I'm wondering who initiates the segment roll while a log is being closed? Is there a missing higher level locking or bad management of resources?

Is it possible than fixing the problem this way will just lead to another class of issues like asserts writing/reading from a closed log?

@WillemKauf
Copy link
Contributor Author

I'm wondering who initiates the segment roll while a log is being closed?

It could be disk_log_impl::apply_segment_ms(). Consider,

F1: Enters disk_log_impl::apply_segment_ms(), calls _compaction_housekeeping_gate.hold() and suspends at some scheduling point in the function.
F2: Enters one of disk_log_impl::close()/remove(), sets _closed = true before waiting on _compaction_housekeeping_gate.close().
F1: Resumes, we make it into disk_log_impl::new_segment(), where we hit the triggered vassert.

By waiting to obtain _segments_rolling_lock in close()/remove() we allow new_segment() to complete its work before shutting the log down.

Is it possible than fixing the problem this way will just lead to another class of issues like asserts writing/reading from a closed log?

I don't believe these changes would introduce those issues, but I'm also not going to claim they couldn't already exist as a separate race condition. Will have to re-read some code to form a concrete opinion here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants