Skip to content

Make CoroBaseTask thread-safe#69

Merged
Simn merged 12 commits intomasterfrom
threadsafe-tasks-step-2
Jan 19, 2026
Merged

Make CoroBaseTask thread-safe#69
Simn merged 12 commits intomasterfrom
threadsafe-tasks-step-2

Conversation

@Simn
Copy link
Member

@Simn Simn commented Jan 19, 2026

This deals with awaitingContinuations and awaitChildren in CoroBaseTask in a way that is hopefully robust. The implementation for awaitingContinuations is not great because it's just a mutexed array now, so I made sure to give it an awkward name because that will make me want to look into it later.

The test updates a mostly just mutexes around array.push operations because that's very much not thread-safe. I don't want to merge this with the CoroRun change, so it's a draft for now.

@Aidan63 I keep getting test hangs in Issue124.testPrime without that change to BoundedWriter. I haven't properly looked into this yet, but my tentative conclusion is that the comment there can't be accurate. Unless something else goes wrong, which it might.

@Simn
Copy link
Member Author

Simn commented Jan 19, 2026

Interesting neko failure just now:

Running issues.hf.Issue32...
    test
src/issues/hf/Issue32.hx:43: Channel closed

It somehow hit a ChannelClosedException, which I think could only come from the write operation. But in that test all that happens inside a single task:

						node.async(_ -> {
							for (v in expected) {
								channel.write(v);
							}

							channel.close();
						});

No idea sure what's going on.


Other than that, cancellation stuff is still failing sometimes, and that's not always related to channels.

@Simn
Copy link
Member Author

Simn commented Jan 19, 2026

I've identified another data race which is giving me a bit of a headache:

  1. Thread 1 is on onCancellationRequested and has just matched state.load() to something other than Cancelling | Cancelled
  2. Thread 2 is in cancel where it updates the state to Cancelling and flushes the cancellationManager.
  3. Thread 1 now finds a null-value on cancellationManager.load(), creates a new one and adds the callback to it. But this new cancellation manager will never be flushed.

I'm not trying to understand if it's sufficient to check again for Cancelling | Cancelled at the end of onCancellationRequested and flushing the new manager.

@Simn
Copy link
Member Author

Simn commented Jan 19, 2026

I just had a concurrency revelation with regards to how we can modify things without needing a mutex. By having an atomic int with some kind of Modifying state we can CAS loop against that:

	public function addCallback(callback:ICancellationCallback):ICancellationHandle {
		final handle = new CancellationHandle(callback, this);
		while (true) {
			switch (state.compareExchange(Ready, Modifying)) {
				case Ready:
					break;
				case Modifying:
					// loop
				case Finished:
					handle.run();
					return CancellationHandle.noOpCancellationHandle;
			}
		}
		handles ??= [];
		handles.push(handle);
		state.store(Ready);
		return handle;
	}

This should of course only be used for very short-running sections, but that's most of the data structure manipulations I've been dealing with recently.

@Simn
Copy link
Member Author

Simn commented Jan 19, 2026

I've restarted CI several times and everything was green, until neko decided to fail on testLocalTypeParameters of all things:

	function testLocalTypeParameters() {
		CoroRun.run(@:coroutine function f<T>():T {
			return null;
		});
		Assert.pass(); // The test is that this doesn't cause an unbound type parameter
	}

That is... quite fascinating. It seems unlikely that this is due to anything we're doing here, so maybe this is some quirk related to how it spawns threads or something like this.

I'll restart CI a few more times and if nothing else comes up I'll revert the CoroRun part and merge this. We'll have to find a way to test multi-threading in our tests, but let's look into that after #65 in order to avoid huge merge conflicts.

@Simn Simn marked this pull request as ready for review January 19, 2026 16:16
@Simn
Copy link
Member Author

Simn commented Jan 19, 2026

Python failed again, which I think is a legitimate timeout because threaded performance is awful, in particular for test 32. That's not very important right now though.

@Simn Simn merged commit 1d970a8 into master Jan 19, 2026
46 checks passed
@Simn Simn deleted the threadsafe-tasks-step-2 branch January 19, 2026 16:24
@Aidan63
Copy link
Contributor

Aidan63 commented Jan 19, 2026

Looking at it again I do think that entire close function needs to be guarded by the mutex. Might be able to do one of those fancy CAS modify loops instead of the lock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants