Race condition when scheduling commands with withoutOverlapping() using Redis/Memcached #50330
Comments
Thank you for reporting this issue! As Laravel is an open source project, we rely on the community to help us diagnose and fix issues as it is not possible to research and fix every issue reported to us via GitHub. If possible, please make a pull request fixing the issue you have described, along with corresponding tests. All pull requests are promptly reviewed by the Laravel team. Thank you!
Just wanted to say thanks for the highly detailed report. I was at the very beginning of researching what I believe is this issue. Since Sentry now provides Cron check-in reports, I was digging into a random problem of jobs not checking in. Read this issue and we indeed use Redis, both
I'm the author of #45963 and I would be very happy if we can find a way to add a In terms of backwards-compatibility, the release of Laravel 11 is only a few days away; maybe such a BC-breaking fix could be added for Laravel 11 now at the last moment? This is my suggestion of a

    public function lockExists($name) : bool
    {
        $lockName = $this->prefix.$name;

        // Read the lock key without acquiring or releasing the lock.
        return $this->lockConnection()->get($lockName) !== null;

        // alternate version (only for redis)
        // return (bool) $this->lockConnection()->eval(
        //     "return redis.call('exists',KEYS[1])", 1, $lockName,
        // );
    }

Usage in

    public function exists(Event $event)
    {
        $store = $this->cache->store($this->store);

        // Lock-capable stores (Redis, Memcached): check the lock key directly.
        if ($store->getStore() instanceof LockProvider) {
            return $store->getStore()->lockExists($event->mutexName());
        }

        // Other stores: fall back to a plain cache lookup.
        return $store->has($event->mutexName());
    }

This would at least solve the issue of the lock being acquired just to be immediately released.
I hope you are not serious about lockExists. This will inevitably lead to another race condition issue.
Actually, the solution suggested by @johanrosenson would work fine. While the information could indeed be outdated, it won't cause deadlocks, overlaps or missed runs.
Laravel Version
10.46.0
PHP Version
8.1.27
Database Driver & Version
No response
Description
We've identified an issue in multi-server environments where commands scheduled with withoutOverlapping() and onOneServer() will very occasionally not run at all on any server (about 0.2% of the time in our environment). That's not a problem for things running every minute, which is where we first spotted it, but it's more impactful when things running a few times or once per day are skipped.

Just going to start by outlining my understanding of how scheduled commands get run when withoutOverlapping() and onOneServer() are used:

1. withoutOverlapping(): a check is added to the filters to ensure the event mutex lock does not currently exist (Event::withoutOverlapping, L713)
2. serverShouldRun() is then called to check for and take a scheduling lock for the current server to run the command for the current hour/minute (ScheduleRunCommand::runSingleServerEvent, L156)
3. Event::run is called (Event::run, L217)
4. shouldSkipDueToOverlapping() is called, which checks whether withoutOverlapping() was used and checks that the event lock can be obtained (Event::shouldSkipDueToOverlapping, L237)

Just for clarity, there are two locks in play:

- withoutOverlapping(): the event mutex lock, to identify whether a command is already being run.
- onOneServer(): the scheduling lock, to check whether a command has already been run on a server for the current schedule run (identified by the current hour and minute).

The problem stems from the way in which the event mutex lock is checked for existence, which was introduced in PR #45963.
Originally, the CacheEventMutex would just check if the cache had an entry for the lock (CacheEventMutex::exists, L69) which worked fine, but PR #45963 changed that behavior. For cache stores that implement the LockProvider interface (Redis, Memcached) it now attempts to get the lock and then immediately releases it again (but this is not done atomically.)
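To make the difference between the two checks concrete, here is a minimal sketch. It does not use Laravel's classes: `InMemoryLocks`, `existsByRead()` and `existsByProbe()` are hypothetical names invented for illustration, and the `$duringProbe` callback simply stands in for whatever another server happens to do between the acquire and the release.

```php
<?php

// Hypothetical in-memory stand-in for a lock-capable cache store.
// This is an illustration only, not Laravel's LockProvider API.
final class InMemoryLocks
{
    /** @var array<string, bool> */
    private array $held = [];

    public function has(string $name): bool
    {
        return isset($this->held[$name]);
    }

    public function acquire(string $name): bool
    {
        if (isset($this->held[$name])) {
            return false;
        }
        $this->held[$name] = true;

        return true;
    }

    public function release(string $name): void
    {
        unset($this->held[$name]);
    }
}

// Read-only check: looks at the store and never changes it.
function existsByRead(InMemoryLocks $locks, string $name): bool
{
    return $locks->has($name);
}

// Probe-style check: acquire, then release, as two separate steps.
function existsByProbe(InMemoryLocks $locks, string $name, ?callable $duringProbe = null): bool
{
    if (! $locks->acquire($name)) {
        return true; // someone else already holds the lock
    }
    if ($duringProbe !== null) {
        $duringProbe(); // another server acts inside the window
    }
    $locks->release($name);

    return false;
}

$locks = new InMemoryLocks();
$otherServerGotLock = null;

// While the probe holds the lock, another server tries to take it for real.
$exists = existsByProbe($locks, 'event-mutex', function () use ($locks, &$otherServerGotLock) {
    $otherServerGotLock = $locks->acquire('event-mutex');
});

var_dump($exists, $otherServerGotLock);
```

The probe correctly reports that the lock did not exist, yet the other server's genuine attempt to take it fails during the window. A read-only check has no such side effect.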
So the race condition looks something like this if Server B's system clock is just slightly behind Server A's:
So in this scenario the scheduled command is never run, there's no error message, and the ScheduledTaskSkipped event doesn't even get fired because the call to run() just returns if shouldSkipDueToOverlapping() returns true.

In terms of a solution, it might make sense to be able to check for the existence of a lock without taking and releasing it, maybe a new abstract exists() on Lock (https://github.com/laravel/framework/blob/10.x/src/Illuminate/Cache/Lock.php) which CacheEventMutex could use, and for which RedisLock/MemcachedLock could implement their own atomic means of checking for lock existence, although that probably constitutes a backwards-compatibility break. Happy to put together a PR for whatever the agreed solution is.

As a workaround, I've just bound an older version of CacheEventMutex to Illuminate\Console\Scheduling\EventMutex to override the changes implemented in #45963, which has solved the issue in our project for now.

Steps To Reproduce
Given it's a race condition, it's very hard to reproduce, but hopefully there are enough details in the description to understand what's going on.
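Although the timing can't be forced on real servers, the interleaving can be replayed deterministically against an in-memory stand-in. The sketch below is one ordering consistent with the description above, not a capture from a real system: `FakeLockStore`, `simulateMissedRun()` and the lock names are all hypothetical, and each step is simply executed in the order the two servers are assumed to reach it.

```php
<?php

// Hypothetical in-memory stand-in for a shared lock store such as Redis.
final class FakeLockStore
{
    /** @var array<string, string> lock name => owner */
    private array $locks = [];

    // Take the lock if it is free; returns true on success.
    public function acquire(string $name, string $owner): bool
    {
        if (isset($this->locks[$name])) {
            return false;
        }
        $this->locks[$name] = $owner;

        return true;
    }

    // Release the lock only if this owner currently holds it.
    public function release(string $name, string $owner): void
    {
        if (($this->locks[$name] ?? null) === $owner) {
            unset($this->locks[$name]);
        }
    }
}

/** @return string[] names of the servers that actually ran the command */
function simulateMissedRun(): array
{
    $store = new FakeLockStore();
    $ran = [];

    // 1. Server A: existence check takes and releases the event mutex lock.
    $aFilterPassed = $store->acquire('event-mutex', 'A');
    $store->release('event-mutex', 'A');

    // 2. Server A: takes the scheduling lock for this minute.
    $aScheduled = $aFilterPassed && $store->acquire('schedule-12:00', 'A');

    // 3. Server B (slightly behind): existence check takes the event mutex lock.
    $bFilterPassed = $store->acquire('event-mutex', 'B');

    // 4. Server A: tries to take the event mutex lock to run the command,
    //    but B's check still holds it, so A returns without running.
    if ($aScheduled && $store->acquire('event-mutex', 'A')) {
        $ran[] = 'A';
    }

    // 5. Server B: releases the lock from its check, then finds the
    //    scheduling lock already taken by A and skips.
    $store->release('event-mutex', 'B');
    if ($bFilterPassed && $store->acquire('schedule-12:00', 'B')) {
        $ran[] = 'B';
    }

    return $ran;
}

echo 'servers that ran: ' . count(simulateMissedRun()) . PHP_EOL;
```

In this ordering neither server runs the command: A holds the scheduling lock but loses the event mutex lock to B's check, and B passes its check but loses the scheduling lock to A.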