
πŸ› Bug Report β€” Runtime APIs - Durable Objects with SQLite storage - calling deleteAll() in alarm handler causes internal error #2993

Open
lucidplot opened this issue Oct 24, 2024 · 5 comments
@lucidplot

lucidplot commented Oct 24, 2024

I think I've found a bug in the SQLite storage backend for Durable Objects. When I call deleteAll() from an alarm handler, it causes an internal error, which means the alarm handler fails and keeps getting retried.

However, if I do await this.ctx.storage.deleteAlarm(); before calling deleteAll(), it works fine.

(I'm using this as a workaround with no problem. But it seems like a bug.)

Use case: I want to use alarms to delete all data, e.g. for a verification link that auto-expires.

FWIW this problem didn't seem to happen when I was using the KV storage backend.
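
For readers landing here, a minimal sketch of the pattern being described (the class, key, and timing below are illustrative assumptions, not taken from the report): a SQLite-backed Durable Object stores a value, schedules an alarm, and calls deleteAll() from the alarm handler to auto-expire the data.

import { DurableObject } from "cloudflare:workers";

// Hypothetical auto-expiring object; names are illustrative only.
export class ExpiringLink extends DurableObject {
  async fetch(request: Request): Promise<Response> {
    // Store a token and schedule expiry one hour from now.
    await this.ctx.storage.put("token", crypto.randomUUID());
    await this.ctx.storage.setAlarm(Date.now() + 60 * 60 * 1000);
    return new Response("stored");
  }

  async alarm(): Promise<void> {
    // On the SQLite backend this call reportedly fails with an internal
    // error, so the handler keeps being retried.
    await this.ctx.storage.deleteAll();
  }
}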

@joshthoward
Collaborator

@lucidplot we've confirmed this and are working on a fix. Thank you for reporting this issue!

@stayallive

I think my issue is related to this one, or at least to the fix. I see the PR was merged, but I am assuming it is not live yet. I am wondering if this will also clear up the affected DOs, since I have quite a substantial number of them stuck in a retry loop. The constructor is called, but the alarm callback is never fired for these, and deleting the storage or the alarm for those DO IDs has no effect. It's starting to rack up quite a lot of requests / CPU time (I'm assuming).

[screenshot attached]

@jclee
Contributor

jclee commented Nov 5, 2024

I think my issue is related to this one, or at least to the fix. I see the PR was merged, but I am assuming it is not live yet.

Correct... we are in the process of rolling out a version of the runtime containing the fix (as well as other updates), but at the time of your comment, it was not yet present in production.

If you wanted to address the error prior to the fix being fully deployed, one possible workaround (as lucidplot pointed out above) is for the Durable Object to explicitly delete the alarm before calling deleteAll(). It turns out that the bug only manifests when deleteAll() is called while an alarm is in place -- which is always the case during alarm handlers, if the alarm is not explicitly deleted, since the handler-running code does not actually clear the alarm until the handler successfully completes.
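
As a sketch of that workaround, the alarm handler from the example above can be adjusted to clear the alarm before wiping storage, so deleteAll() runs with no alarm in place:

async alarm(): Promise<void> {
  // Clearing the alarm first avoids the behavior described here, since the
  // handler-running code has not yet cleared it on our behalf.
  await this.ctx.storage.deleteAlarm();
  await this.ctx.storage.deleteAll();
}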

@stayallive

I have been doing that: calling deleteAlarm() and then deleteAll(). I did that via an RPC call to a function that looks like this:

	public async destroy(): Promise<void> {
		await this.ctx.blockConcurrencyWhile(async () => {
			await this.ctx.storage.deleteAlarm();
			await this.ctx.storage.deleteAll();
		});
	}

But it is possible that I did that incorrectly or did something else wrong. Because of the volume of errors I have recreated my worker to work around the issue for now. But good to hear the runtime fix is rolling out. I'll report back if the problem starts to reappear, but I'm hopeful it's solved. Appreciate the update!
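
For context on the "RPC call" part, a caller-side sketch (the MY_DO binding name and the idFromName key are assumptions, not from the comment):

// In the Worker that owns the Durable Object binding:
const id = env.MY_DO.idFromName("some-key");
const stub = env.MY_DO.get(id);
await stub.destroy(); // invokes the public destroy() method shown above over RPC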

@jclee
Contributor

jclee commented Nov 6, 2024

@stayallive Thanks for the details. As far as this bug goes, I would expect that code to work OK, on both the new and old versions of the runtime, so maybe there's some additional wrinkle I'm missing.

I suspect that the awaits and the blockConcurrencyWhile() call are not strictly necessary, as I believe that with SQLite-backed DOs, the deleteAlarm() and deleteAll() calls are effectively synchronous from the viewpoint of the worker.

But I don't think they would hurt anything either... unless something is somehow causing either deleteAlarm() or deleteAll() to throw? In that case, it would break the DO's input lock and terminate it. Per the docs, it might be worth adding exception handling inside the blockConcurrencyWhile() lambda to see if any problems might be caused by something throwing there?

(Hmm... Looking through the docs again, I suppose another potential source of failure might be if the lambda takes longer than the 30s it is allowed? But I'd be surprised that those two calls would take an excessive amount of time.)
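
A sketch of the diagnostic being suggested, wrapping the body of the blockConcurrencyWhile() lambda so any throw is logged before it propagates (the destroy() name follows the snippet above; this is an assumption about where to add the handling, not a confirmed fix):

public async destroy(): Promise<void> {
  await this.ctx.blockConcurrencyWhile(async () => {
    try {
      await this.ctx.storage.deleteAlarm();
      await this.ctx.storage.deleteAll();
    } catch (err) {
      // Surface the failure before it breaks the DO's input gate.
      console.error("destroy() failed:", err);
      throw err;
    }
  });
}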
