DevOps
(Internal) notes related to the operation of the portal
Because every pipeline run also includes reference data for reproducibility, we occasionally run out of space. Since reference data changes little over time, we can hardlink identical files to save significant amounts of space: instead of keeping several copies of a file, one copy is chosen as the "representative" and the other names point to it. Hardlinks are similar to symlinks, but they operate at the filesystem level, so they are transparent to typical software.
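The effect of hardlinking can be seen with standard tools; a minimal sketch using illustrative /tmp paths:

```shell
# Minimal hardlink demonstration (illustrative /tmp paths).
mkdir -p /tmp/hardlink_demo && cd /tmp/hardlink_demo
echo "reference data" > copy_a.txt
ln copy_a.txt copy_b.txt             # hardlink: a second name for the same file
stat -c '%i' copy_a.txt copy_b.txt   # both names report the same inode
stat -c '%h' copy_a.txt              # link count is now 2
```

Deleting one name only decrements the link count; the data is freed when the last name is removed.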
We use jdupes to find and hardlink duplicates in the file system.
Since the version shipped with the OS is not the most recent (and it's probably a good idea to use a recent version of a tool that can potentially cause data loss), jdupes was installed manually. This was done by cloning the code repo and following the installation instructions (which went smoothly).
jdupes can potentially lead to data loss if something goes wrong. This is rather unlikely, but it cannot be excluded. For this reason we only apply it to reference data, which is also where we can save the most space.
A few commands:
- deduplicate reference data
cd /data/monthly_releases && jdupes --loud -H -1 -r --linkhard data_release*/re*
- deduplicate reference data used by VICTOR (bayesdel calculations):
jdupes --loud -H -1 -r --linkhard /data/victor > jdupes.out 2>&1
To check for duplicates without doing any hardlinking, replace --linkhard in the commands above with --printwithsummary.
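After a --linkhard run, deduplicated files share an inode, which can be verified with standard tools. A sketch on a scratch directory, with `ln` standing in for what jdupes does (paths and file names are illustrative):

```shell
# Verify that deduplication happened: hardlinked duplicates share one inode.
mkdir -p /tmp/dedup_check && cd /tmp/dedup_check
echo "chr1 sequence" > ref_a.fa
cp ref_a.fa ref_b.fa                 # two independent copies
ln -f ref_a.fa ref_b.fa              # roughly what --linkhard does to duplicates
stat -c '%h' ref_a.fa                # link count 2: both names, one inode
find . -samefile ref_a.fa            # lists every name sharing that inode
```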
The entire prod webserver is regularly snapshotted on the pipeline machine. The snapshots use dss to save space via hardlinks.
See /data/brca_prod_backup on the pipeline machine and
https://github.com/BRCAChallenge/brca-backup-configuration for more details.
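To gauge how much space the hardlink-based snapshots actually save, GNU du can be run with and without -l (--count-links): by default du counts each inode once per invocation, while -l counts every name. A sketch with illustrative /tmp directories (on the pipeline machine the directory of interest would be /data/brca_prod_backup):

```shell
# Two "snapshots" sharing one unchanged file via a hardlink.
mkdir -p /tmp/snap1 /tmp/snap2
head -c 1M /dev/zero > /tmp/snap1/site.tar
ln -f /tmp/snap1/site.tar /tmp/snap2/site.tar  # snapshots share unchanged files
du -shcl /tmp/snap1 /tmp/snap2   # -l: every name counted, no savings visible
du -shc  /tmp/snap1 /tmp/snap2   # shared inode counted once: the real usage
```

The difference between the two totals is the space saved by hardlinking.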