Scheduler is a Sidekiq-based application that, in the standard lifecycle of a build request, comes third in line, after Gatekeeper.
Scheduler's only purpose is to evaluate whether jobs can be run, based on the concurrency limits for the given owner group.
An owner group is the group of owners that a given job belongs to. Groups can be set up via "delegation" configuration. Most groups only contain a single owner, which is the owner of the given job.
When a job can be run (i.e. the job's owner group is not at its concurrency limit), Scheduler sets the job to the `queued` state and queues it for the Workers.
Scheduler evaluates jobs for an owner group when a job is created (notified by Gatekeeper) and when a job's state changes (notified by Hub).
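As a rough illustration of this evaluation, here is a minimal sketch in Python. The names (`OwnerGroup`, `concurrency_limit`, `publish`) and the set of states counted against the limit are assumptions for the sketch, not Scheduler's actual (Ruby) code:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative data model only: the real Scheduler reads jobs and
# concurrency limits from the main database.
@dataclass
class Job:
    id: int
    state: str = "created"

@dataclass
class OwnerGroup:
    concurrency_limit: int
    jobs: List[Job] = field(default_factory=list)

def evaluate(group: OwnerGroup, publish: Callable[[Job], None]) -> None:
    """Move `created` jobs to `queued` while the group is under its limit."""
    # States assumed to count against the concurrency limit (illustrative).
    running = sum(1 for j in group.jobs if j.state in ("queued", "received", "started"))
    capacity = max(group.concurrency_limit - running, 0)
    for job in [j for j in group.jobs if j.state == "created"][:capacity]:
        job.state = "queued"  # persisted to the main database in the real app
        publish(job)          # handed off for publishing to RabbitMQ
```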
Before publishing a job for the Workers, Scheduler also determines the RabbitMQ queue specific to the infrastructure, based on application and job configuration.
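For illustration, that routing might look roughly like the sketch below; apart from the `builds.*` naming pattern, the infrastructure keys, queue names, and default are placeholders, not the actual configuration:

```python
# Hypothetical mapping from infrastructure to a builds.* RabbitMQ queue.
# In the real application this comes from application and job configuration.
QUEUES = {
    "gce": "builds.gce",
    "docker": "builds.docker",
    "macstadium": "builds.macstadium",
}

def queue_for(job_config: dict, default: str = "builds.default") -> str:
    infra = job_config.get("infrastructure")
    return QUEUES.get(infra, default)

# queue_for({"infrastructure": "docker"})  # => "builds.docker"
```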
Scheduler also has a separate thread that periodically "pings" owner groups. This thread publishes a single "ping" message to Sidekiq for Scheduler itself. Scheduler then looks at all owner groups that have jobs in the `created` state and publishes a message to itself in order to evaluate these owner groups. The purpose of the ping is to prevent lost messages from leaving jobs stuck in the `created` state.
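A minimal sketch of that safety net, with the database lookup and the Sidekiq enqueue injected as callables (the interval is an arbitrary placeholder, not Scheduler's actual setting):

```python
import time
from typing import Callable, Iterable

def ping_loop(
    find_groups_with_created_jobs: Callable[[], Iterable[int]],
    enqueue_evaluation: Callable[[int], None],
    interval_seconds: int = 300,
) -> None:
    """Periodically re-enqueue an evaluation for every owner group that
    still has jobs in the `created` state, so that a single lost message
    cannot leave those jobs stuck forever."""
    while True:
        for group_id in find_groups_with_created_jobs():
            enqueue_evaluation(group_id)
        time.sleep(interval_seconds)
```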
Scheduler uses the following main resources:
- Redis for incoming messages (Sidekiq jobs)
- Redis for outgoing messages
- RabbitMQ for queueing messages to the workers
- the main database
It also uses:
- Sentry
- Librato
- Papertrail
- Gatekeeper for job creation
- Hub for all job state updates
- Worker for running a job (on the RabbitMQ `builds.*` queues)
Scheduler also queues messages for itself for:
- Serializing the worker payload and publishing it to RabbitMQ (see the sketch after this list)
- Periodically pinging owner groups
- Addon notifications (sending outgoing messages)
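The payload publishing step could look roughly like the sketch below, using the Python `pika` client purely for illustration (Scheduler itself runs on Sidekiq, i.e. Ruby); the AMQP URL, queue name, and payload shape are placeholders:

```python
import json
import pika

def publish_job(amqp_url: str, queue: str, payload: dict) -> None:
    """Serialize a worker payload and publish it to a builds.* queue."""
    connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
    try:
        channel = connection.channel()
        channel.queue_declare(queue=queue, durable=True)
        channel.basic_publish(
            exchange="",
            routing_key=queue,
            body=json.dumps(payload),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
    finally:
        connection.close()

# publish_job("amqp://localhost", "builds.docker", {"job": {"id": 12345}})
```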
```sh
# check dynos
heroku ps -a travis-scheduler-production
heroku ps -a travis-pro-scheduler-prod

# scale dynos (replace N with the target dyno count)
heroku ps:scale gator=N -a travis-scheduler-production
heroku ps:scale gator=N -a travis-pro-scheduler-prod

# start a console
heroku run console -a travis-scheduler-production
heroku run console -a travis-pro-scheduler-prod
```
Scheduler might not be able to process the amount of incoming traffic, and the queue might back up.
There are two known conditions under which this has happened in the past:
- A user with massive build matrices cancelled lots of builds
- Incoming requeues from workers increase drastically, propagating from Hub to Scheduler
The underlying issue with cancellations of huge matrices should be resolved by now, though, and Scheduler should be more efficient in this scenario. It is unclear whether the same scenario would happen again.
Other potential reasons for the queue backing up are:
- The main database is unusually slow, or queries are blocked.
- Scheduler cannot obtain connections to the main database.
- Publishing to RabbitMQ is extremely slow and all Sidekiq threads are waiting for RabbitMQ.
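One way to confirm that the backlog really is on the Sidekiq side is to inspect the queue sizes in Redis directly. A diagnostic sketch, assuming Sidekiq's standard Redis layout (a `queues` set plus one `queue:<name>` list per queue) and a placeholder Redis URL:

```python
import redis

def queue_sizes(redis_url: str) -> dict:
    """Return the number of pending Sidekiq jobs per queue."""
    r = redis.Redis.from_url(redis_url)
    names = sorted(q.decode() for q in r.smembers("queues"))
    return {name: r.llen(f"queue:{name}") for name in names}

# Point this at Scheduler's Sidekiq Redis instance, e.g.:
# print(queue_sizes("redis://localhost:6379/0"))
```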
If Scheduler is functional (i.e. it can connect to the database and publish to RabbitMQ), then the solution can be to scale up more Scheduler dynos in order to work through the queue faster (see the scaling commands above).