ChromOptimise is a pipeline that identifies the optimal number of states that should be used with ChromHMM's LearnModel command for a particular genomic dataset.
For more specific information, please head over to the wiki.
When using ChromHMM to learn hidden Markov models for genomic data, it is often difficult to determine how many states to include:
- Including too many states overfits your data and introduces redundant states
- Including too few states underfits your data, lowering model accuracy
This pipeline identifies the optimal number of states to use by finding a model that avoids both of these problems.
After using this pipeline, the user will have a better understanding of their dataset in the context of ChromHMM, allowing them to make more informed decisions in further downstream analysis.
- Clone this repository
- Ensure all required software is installed
- If using LDSC, download 1000 genomes files (or similar) from this repository
- Copy the configuration files to a memorable location (recommended: next to
your data) and then fill them in using the templates provided. DO NOT CHANGE
THE NAMES OF THESE FILES.
- If you are feeling lazy, you can just edit the files where they already are. The suggestion to move them is to accommodate having multiple configs for different projects.
- Run the setup executable, providing the path to the directory containing the config files as the first argument:
./setup path/to/configuration/directory
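In practice, the steps above might look like the following sketch. All paths here are illustrative, and the location of the template files within the cloned repository is an assumption; the key point is that the copied files keep their original names:

```shell
# Illustrative only -- substitute your own paths. The template location
# ("configuration/") is an assumption; check the repository/wiki for the
# actual location of the config templates.
mkdir -p ~/my_project/configuration
cp configuration/* ~/my_project/configuration/   # keep the file names unchanged
# ... fill in the copied configuration files for your data ...
./setup ~/my_project/configuration
```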
After completing 'getting started', run the master script (ChromOptimise.sh) in the command line with:
bash ChromOptimise.sh path/to/your/configuration/directory
Alternatively, you can run each of the shell scripts in JobSubmission sequentially.
sbatch 1_BinarizeFiles.sh path/to/your/configuration/directory
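If you do run the JobSubmission scripts individually, SLURM job dependencies can chain them so each script waits for the previous one to succeed. This is a sketch: only `1_BinarizeFiles.sh` is named in this README, so the glob assumes the remaining scripts follow the same numbered naming convention:

```shell
# Submit each numbered JobSubmission script in order, each one waiting for
# the previous job to finish successfully. The glob pattern is an assumption
# based on the naming of 1_BinarizeFiles.sh -- list JobSubmission/ to confirm.
config=path/to/your/configuration/directory
previous=""
for script in JobSubmission/[0-9]*_*.sh; do
    if [[ -z "$previous" ]]; then
        jobid=$(sbatch --parsable "$script" "$config")
    else
        jobid=$(sbatch --parsable --dependency=afterok:"$previous" "$script" "$config")
    fi
    previous="$jobid"
done
```

`sbatch --parsable` prints just the job ID, which makes it easy to feed into the next job's `--dependency=afterok:` flag.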
For further information please see the pipeline explanation.
Supplementary scripts also exist that provide further information on your chosen dataset. Most importantly, the thresholds used in redundancy analysis can be inferred from the results of Redundancy_Threshold_Optimisation. Further details on these scripts can be found in the wiki.
This pipeline requires a unix-flavoured OS with the following software installed:
- Bash (>=4.2.46(2))
- SLURM Workload Manager (>=20.02.3)
- conda (>=23.10.0)
- ChromHMM (>=1.23)
- sed (>=4.2.2)
- LDSC (>=aa33296)
- gzip (>=1.5)
- awk (>=4.0.2)
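A quick way to confirm the versions available on your system (a sketch; ChromHMM, SLURM and LDSC are omitted here because how they are invoked depends on your cluster setup):

```shell
# Print the version of each core command-line requirement.
bash --version | head -n 1   # expect >= 4.2
sed --version  | head -n 1   # expect >= 4.2.2
gzip --version | head -n 1   # expect >= 1.5
awk --version 2>/dev/null | head -n 1   # expect >= 4.0.2 (GNU awk)
```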
Additionally, conda environments are created for you, providing:
- R v4.4.1
- java-jdk v8.0.112
- bedtools v2.27.1
This study makes use of data generated by the Blueprint Consortium. A full list of the investigators who contributed to the generation of the data is available from www.blueprint-epigenome.eu. Funding for the project was provided by the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 282510 – BLUEPRINT.
For any further enquiries, please open an issue or contact Sam Fletcher:
[email protected]