The DLCS Composite Handler is an implementation of DLCS RFC011.
The component is written in Python and utilises Django with a number of extensions, including django-q (task queue) and django-environ (configuration).
The project ships with a `docker-compose.yml` that can be used to get a local version of the component running:

```bash
docker compose up
```
Note that for the Composite Handler to be able to interact with the target S3 bucket, the Docker Compose setup assumes that the `AWS_PROFILE` environment variable has been set and a valid AWS session is available.
This will create a PostgreSQL instance, bootstrap it with the required tables, deploy a single instance of the API, and deploy three instances of the engine. Requests can then be targeted at `localhost:8000`.
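For reference, the sketch below illustrates the kind of topology described above. It is illustrative only - the service names, image tags, and wiring are assumptions, and the `docker-compose.yml` shipped with the repository is the source of truth:

```yaml
version: "3.8"

services:
  postgres:
    image: postgres:14
    environment:
      POSTGRES_USER: dlcs
      POSTGRES_PASSWORD: password
      POSTGRES_DB: compositedb

  api:
    build: .
    entrypoint: /srv/dlcs/entrypoint-api.sh
    ports:
      - "8000:8000"
    environment:
      AWS_PROFILE: ${AWS_PROFILE}  # profile-based credentials typically also need ~/.aws mounted
      DATABASE_URL: postgresql://dlcs:password@postgres:5432/compositedb
    depends_on:
      - postgres

  engine:
    build: .
    entrypoint: /srv/dlcs/entrypoint-worker.sh
    environment:
      AWS_PROFILE: ${AWS_PROFILE}
      DATABASE_URL: postgresql://dlcs:password@postgres:5432/compositedb
    depends_on:
      - postgres
    deploy:
      replicas: 3  # three engine instances
```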
The component can also be run directly, either in an IDE or from the CLI. The component must first be configured, either via the creation of a `.env` file (see `.env.dist` for an example configuration) or via a set of environment variables (see the Configuration section).
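For example, a minimal `.env` might look like the following. The variable names come from the Configuration section below; the values are illustrative only:

```
DJANGO_DEBUG=True
DJANGO_SECRET_KEY=change-me-to-a-random-50-character-string
DATABASE_URL=postgresql://dlcs:password@postgres:5432/compositedb
CACHE_URL=dbcache://app_cache
SCRATCH_DIRECTORY=/tmp/scratch
DLCS_API_ROOT=https://api.dlcs.digirati.io
DLCS_S3_BUCKET_NAME=dlcs-composite-images
```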
Once configuration is in place, the following commands will start the API and/or engine:

- API: `python manage.py runserver 0.0.0.0:8000`
- Engine: `python manage.py qcluster`
Should the required tables not exist in the target database, the following commands should be run first:

```bash
python manage.py migrate
python manage.py createcachetable
```
Once the API is running, an administrator interface can be accessed via the browser at `http://localhost:8000/admin`. To create an administrator login, run the following command:

```bash
python manage.py createsuperuser
```
The administrator user can be used to browse the database and manage the queue (including deleting tasks and resubmitting failed tasks into the queue).
There are 3 possible entrypoints to make the above easier:

- `entrypoint.sh` - waits for Postgres to be available, then runs `manage.py migrate` and `manage.py createcachetable` if `MIGRATE=True`. It will run `manage.py createsuperuser` if `INIT_SUPERUSER=True` (this also needs the `DJANGO_SUPERUSER_*` envvars).
- `entrypoint-api.sh` - runs the above, then starts an nginx instance fronting a gunicorn process.
- `entrypoint-worker.sh` - runs the above, then runs `python manage.py qcluster`.
The following environment variables are supported:

| Environment Variable | Default Value | Component(s) | Description |
|---|---|---|---|
| `DJANGO_DEBUG` | `True` | API, Engine | Whether Django should run in debug mode. Useful for development purposes but should be set to `False` in production. |
| `DJANGO_SECRET_KEY` | None | API, Engine | The secret key used by Django when generating sensitive tokens. This should be a randomly generated 50-character string. |
| `SCRATCH_DIRECTORY` | `/tmp/scratch` | Engine | A locally accessible filesystem path where work-in-progress files are written during rasterization. |
| `WEB_SERVER_SCHEME` | `http` | API | The HTTP scheme used when generating URIs. |
| `WEB_SERVER_HOSTNAME` | `localhost:8000` | API | The hostname (and optional port) used when generating URIs. |
| `ORIGIN_CHUNK_SIZE` | `8192` | Engine | The chunk size, in bytes, used when retrieving objects from origins. Tailoring this value can theoretically improve download speeds. |
| `DATABASE_URL` | None | API, Engine | The URL of the target PostgreSQL database, in a format acceptable to django-environ, e.g. `postgresql://dlcs:password@postgres:5432/compositedb`. |
| `CACHE_URL` | None | API, Engine | The URL of the target cache, in a format acceptable to django-environ, e.g. `dbcache://app_cache`. |
| `PDF_RASTERIZER_THREAD_COUNT` | `3` | Engine | The number of concurrent Poppler threads spawned when a worker is rasterizing a PDF. Each thread typically consumes 100% of a CPU core. |
| `PDF_RASTERIZER_DPI` | `500` | Engine | The DPI of images generated during the rasterization process. For JPEGs, the default value of 500 typically produces images approximately 1.5MiB to 2MiB in size. |
| `PDF_RASTERIZER_FALLBACK_DPI` | `200` | Engine | The DPI to use for images that exceed the pdftoppm memory size and produce a 1x1 pixel image (see Belval/pdf2image#34). |
| `PDF_RASTERIZER_FORMAT` | `jpg` | Engine | The format in which rasterized images are generated. Supported values are `ppm`, `jpeg` / `jpg`, `png` and `tiff`. |
| `PDF_RASTERIZER_MAX_LENGTH` | `0` | Engine | Optional; the maximum size, in pixels, of the longest edge that will be saved. If a rasterized image exceeds this, it will be resized, maintaining aspect ratio. |
| `DLCS_API_ROOT` | `https://api.dlcs.digirati.io` | Engine | The root URI of the API of the target DLCS deployment, without the trailing slash. |
| `DLCS_S3_BUCKET_NAME` | `dlcs-composite-images` | Engine | The S3 bucket that the Composite Handler will push rasterized images to, for consumption by the wider DLCS. Both the Composite Handler and the DLCS must have access to this bucket. |
| `DLCS_S3_OBJECT_KEY_PREFIX` | `composites` | Engine | The S3 key prefix to use when pushing images to the `DLCS_S3_BUCKET_NAME` - in other words, the folder within the S3 bucket into which images are stored. |
| `DLCS_S3_UPLOAD_THREADS` | `8` | Engine | The number of concurrent threads to use when pushing images to the S3 bucket. A higher number of threads will significantly lower the amount of time spent pushing images to S3; however, too high a value will cause issues with Boto3. 8 is a tested and sensible value. |
| `ENGINE_WORKER_COUNT` | `2` | Engine | The number of workers a single instance of the engine will spawn. Each worker handles the processing of a single PDF, so the total number of PDFs that can be processed concurrently is `engine_count * worker_count`. |
| `ENGINE_WORKER_TIMEOUT` | `3600` | Engine | The number of seconds that a task (i.e. the processing of a single PDF) can run for before being terminated and treated as a failure. This value is useful for purging "stuck" tasks which haven't technically failed but are occupying a worker. |
| `ENGINE_WORKER_RETRY` | `4500` | Engine | The number of seconds since a task was presented for processing before a worker will re-run it, regardless of whether it is still running or has failed. As such, this value must be higher than `ENGINE_WORKER_TIMEOUT`. |
| `ENGINE_WORKER_MAX_ATTEMPTS` | `0` | Engine | The number of processing attempts a single task will undergo before it is abandoned. Setting this value to `0` will cause a task to be retried forever. |
| `MIGRATE` | None | API, Engine | If `"True"`, migrations and `createcachetable` will be run on startup (when an entrypoint script is used). |
| `INIT_SUPERUSER` | None | API, Engine | If `"True"`, an attempt will be made to create a superuser. Needs the standard Django envvars (e.g. `DJANGO_SUPERUSER_USERNAME`, `DJANGO_SUPERUSER_EMAIL`, `DJANGO_SUPERUSER_PASSWORD`) to be set (when an entrypoint script is used). |
| `GUNICORN_WORKERS` | `2` | API | The value of the `--workers` arg when running gunicorn. |
| `SQS_BROKER_QUEUE_NAME` | None | API, Engine | If set, the django-q SQS broker will be used, and the queue created if it doesn't exist. If empty, the default Django ORM broker is used. |
Note that in order to access the S3 bucket, the Composite Handler assumes that valid AWS credentials are available in the environment - these can take the form of environment variables, or of ambient credentials.
By default, Django Q will use the default Django ORM broker. The SQS broker can be configured by specifying the `SQS_BROKER_QUEUE_NAME` environment variable; the default SQS broker behaviour is to create this queue if it is not found. As with S3, above, the Composite Handler assumes that valid AWS credentials are available in the environment.
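For illustration, broker selection could be wired into Django Q roughly as follows. This is a hypothetical sketch (the variable wiring, region handling, and defaults are assumptions), not the project's actual configuration:

```python
# Hypothetical Q_CLUSTER wiring - the project's actual settings may differ.
import os

Q_CLUSTER = {
    "name": os.environ.get("SQS_BROKER_QUEUE_NAME") or "composite-handler",
    "workers": int(os.environ.get("ENGINE_WORKER_COUNT", "2")),
    "timeout": int(os.environ.get("ENGINE_WORKER_TIMEOUT", "3600")),
    "retry": int(os.environ.get("ENGINE_WORKER_RETRY", "4500")),
    "max_attempts": int(os.environ.get("ENGINE_WORKER_MAX_ATTEMPTS", "0")),
}

if os.environ.get("SQS_BROKER_QUEUE_NAME"):
    # django-q's SQS broker; with no explicit keys supplied here, boto3
    # resolves ambient credentials (env vars, profile, instance role).
    Q_CLUSTER["sqs"] = {"aws_region": os.environ.get("AWS_REGION", "eu-west-1")}
else:
    Q_CLUSTER["orm"] = "default"  # fall back to the default Django ORM broker
```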
The project ships with a `Dockerfile`:

```bash
docker build -t dlcs/composite-handler:local .
```
This will produce a single image that can be used to execute any of the supported Django commands, including running the API and the engine:

```bash
docker run dlcs/composite-handler:local python manage.py migrate          # Apply any pending DB schema changes
docker run dlcs/composite-handler:local python manage.py createcachetable # Create the cache table (if it doesn't exist)
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-api.sh    # Run the API
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-worker.sh # Run the engine
docker run dlcs/composite-handler:local python manage.py qmonitor         # Monitor the workers
```