
(Internal) notes related to the operation of the portal

Pipeline Server

Space Saving with Deduplication

Because every pipeline run also includes reference data for reproducibility, we occasionally run out of space. Since reference data doesn't change much over time, we can hardlink identical files to save significant amounts of space: instead of keeping several copies of a file, one file is chosen as the "representative" and the other files point to it. Hardlinks are similar to symlinks, but operate at the filesystem level (all names share one inode), so the deduplication is transparent to typical software.
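
As a minimal illustration (the file names here are made up for the example): hardlinked paths share one inode, so the data is stored only once and any program reading either path sees the same content.

  # create a file and a hardlink to it
  echo "reference data" > ref_a.txt
  ln ref_a.txt ref_b.txt
  # both names share one inode; the link count is 2, the data is stored once
  stat -c '%i %h %n' ref_a.txt ref_b.txt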

We use jdupes to find and hardlink duplicates in the file system.

Installing jdupes

Since the version shipped with the OS is not the most recent (and it's probably a good idea to use a recent version of a tool that can potentially cause data loss), jdupes was installed manually. This was done by cloning the code repository and following its installation instructions, which went smoothly.
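
For reference, a from-source install typically looks like the sketch below (the repository URL and default install prefix are assumptions; follow the project's README for the authoritative steps):

  git clone https://github.com/jbruchon/jdupes.git
  cd jdupes
  make                # build the jdupes binary
  sudo make install   # installs under /usr/local by default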

Running jdupes

jdupes can potentially lead to data loss if something goes wrong. This is rather unlikely, but cannot be excluded. For this reason we only apply it to reference data, which is also where we can save the most space.

A few commands:

  • deduplicate reference data: cd /data/monthly_releases && jdupes --loud -H -1 -r --linkhard data_release*/re*
  • deduplicate reference data used by VICTOR (bayesdel calculations): jdupes --loud -H -1 -r --linkhard /data/victor > jdupes.out 2>&1

To check for duplicates without doing any hardlinking, replace --linkhard in the commands above with --printwithsummary, as in the example below.
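
For example, a dry run over the reference data (same paths as above):

  cd /data/monthly_releases && jdupes --loud -H -1 -r --printwithsummary data_release*/re*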

Internal Snapshotting of Prod Server

The entire prod webserver gets snapshotted regularly on the pipeline machine. The snapshots use dss to save space via hardlinks.
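
As a rough illustration of the hardlink-snapshot idea only (this generic rsync invocation is not what dss actually runs, and all paths and names are hypothetical): each new snapshot hardlinks files that are unchanged relative to the previous snapshot, so only changed files consume additional space.

  # new snapshot; unchanged files become hardlinks into the previous one
  rsync -a --link-dest=/data/brca_prod_backup/latest \
      prod-server:/var/www/ /data/brca_prod_backup/2020-12-16/
  # point "latest" at the new snapshot
  ln -sfn /data/brca_prod_backup/2020-12-16 /data/brca_prod_backup/latest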

See /data/brca_prod_backup on the pipeline machine and https://github.com/BRCAChallenge/brca-backup-configuration for more details.