EpiBench is a software tool designed for predicting DNA methylation levels using genomic sequence data and histone modification marks. It employs a multi-branch Convolutional Neural Network (CNN) architecture (`SeqCNNRegressor`) specifically tailored for integrating these data types to achieve high prediction accuracy.
- Overview
- Installation
- Quick Start
- Basic Usage Examples
- Logging, Run Tracking, and Analysis (New in v1.2)
- Configuration
- Orchestration
- Environment Validation
- Contributing
- License
## Overview

The tool encompasses a pipeline for:
- Data Processing: Converting raw genomic data (BED, BigWig) into model-ready matrices.
- Model Training: Training the `SeqCNNRegressor` model.
- Hyperparameter Optimization: Using Optuna to find optimal model parameters.
- Evaluation: Assessing model performance using various regression metrics.
- Prediction: Generating methylation predictions on new data.
- Interpretation: Understanding model predictions using Integrated Gradients.
- Comparative Analysis: Comparing models trained/evaluated on different sample groups.
It is primarily operated via a Command-Line Interface (CLI) (`epibench`).
### Input Data Format

The `SeqCNNRegressor` model expects input data in HDF5 format, generated by the `epibench process-data` command. Each sample, corresponding to a specific genomic region defined in the input BED file, is represented as a matrix.
- Shape: `(sequence_length, num_channels)`
  - `sequence_length`: typically 10,000 base pairs.
  - `num_channels`: usually 11 (4 for DNA sequence + 6 for histone marks + 1 for mask), but can vary based on the number of input histone BigWig files.
- Channels:
  - 0-3: One-hot encoded DNA sequence (A, C, G, T).
  - 4-9 (example): Normalized histone modification signals (e.g., H3K4me3, H3K27ac). The exact number and order depend on the BigWig files provided during processing.
  - Last channel: A binary mask indicating the boundaries of the original BED region within the fixed-size window (1 inside the region, 0 outside).
The corresponding target methylation value (e.g., beta value) for each region is stored separately within the HDF5 file.
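To make the channel layout concrete, here is a minimal NumPy sketch that assembles a single sample matrix with the shape described above. The helper function and the example region boundaries are illustrative assumptions, not part of the EpiBench API:

```python
# Illustrative only: builds a matrix with the channel layout described above.
# The helper and example values are assumptions, not EpiBench internals.
import numpy as np

SEQ_LEN = 10_000   # fixed window size
BASES = "ACGT"     # channels 0-3

def one_hot_encode(seq: str) -> np.ndarray:
    """One-hot encode a DNA sequence into shape (len(seq), 4)."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            out[i, BASES.index(base)] = 1.0
    return out

# Hypothetical inputs for one genomic window
sequence = "ACGT" * (SEQ_LEN // 4)                        # 10 kb DNA sequence
histone = np.random.rand(SEQ_LEN, 6).astype(np.float32)  # 6 normalized histone tracks
mask = np.zeros((SEQ_LEN, 1), dtype=np.float32)
mask[2_000:7_500] = 1.0  # 1 inside the original BED region, 0 outside

# Channels: 0-3 sequence, 4-9 histone signals, last channel is the mask
sample = np.concatenate([one_hot_encode(sequence), histone, mask], axis=1)
print(sample.shape)  # (10000, 11)
```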
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/Bonney96/epibench.git
  cd epibench
  ```
- Set up a Python virtual environment (recommended):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # Linux/macOS
  # or ".venv\Scripts\activate" on Windows
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Install EpiBench in development mode:

  ```bash
  pip install -e .
  ```

  This allows you to modify the code and have the changes reflected immediately when running the `epibench` command.
## Quick Start

This guide helps you get EpiBench running quickly.
- Install EpiBench: Follow the steps in the Installation section.
- Verify Installation: Check that the `epibench` command is available:

  ```bash
  epibench --version
  epibench --help
  ```
- Explore Commands: Get help for specific subcommands:

  ```bash
  epibench process-data --help
  epibench train --help
  # ... and so on for evaluate, predict, interpret, compare
  ```
- Review Example Configurations: Examine the files in the `config/` directory (e.g., `config/process_config_example.yaml`, `config/train_config_example.yaml`) to understand the required parameters and structure for different pipeline steps.
- Run a Basic Workflow (Tutorial): Follow the steps outlined in the Training and Evaluating a Model Tutorial. You'll use the example configuration files, replacing placeholder paths with the actual paths to your data files.
- Run the Orchestration Script: For a more automated run (after setting up configuration files), use the script:

  ```bash
  # Example for a single sample defined in args
  python scripts/run_full_pipeline.py --output-dir ./quickstart_out --single-sample-name test_sample --process-data-config config/process_config_example.yaml --train-config config/train_config_example.yaml

  # Example using a multi-sample config file
  python scripts/run_full_pipeline.py --output-dir ./quickstart_out --samples-config config/samples_config_example.yaml --max-workers 2
  ```
(Remember to replace placeholder paths in the config files with actual paths to your data before running).
## Basic Usage Examples

The `epibench` tool utilizes a command-line interface structured around several subcommands. Below are examples for common tasks:
- Process Data: Convert raw data (BED, BigWig, FASTA) into the HDF5 format required by the model.

  ```bash
  epibench process-data --config config/process_config.yaml -o output/processed_data
  ```
- Train Model: Train the `SeqCNNRegressor` using processed data.

  ```bash
  epibench train --config config/train_config.yaml --output-dir output/training_run_01
  ```
- Evaluate Model: Assess the performance of a trained model on test data.

  ```bash
  epibench evaluate --config config/train_config.yaml --checkpoint output/training_run_01/best_model.pth --test-data output/processed_data/test.h5 -o output/evaluation_results
  ```
- Generate Predictions: Use a trained model to predict methylation levels for new input data.

  ```bash
  epibench predict --config config/train_config.yaml --checkpoint output/training_run_01/best_model.pth --input-data data/new_samples.h5 -o output/predictions
  ```
- Interpret Model: Calculate feature attributions (e.g., using Integrated Gradients) to understand which input features (sequence bases, histone marks) contribute most to the model's predictions for specific regions. See the Interpretation Tutorial for detailed instructions.

  ```bash
  epibench interpret --config config/interpret_config.yaml --checkpoint output/training_run_01/best_model.pth --input-data output/processed_data/interpret_subset.h5 -o output/interpretation_results
  ```
- Compare Models/Groups: Perform comparative analyses, such as evaluating model performance differences across various sample groups defined in the configuration.

  ```bash
  epibench compare --config config/compare_config.yaml -o output/comparative_analysis
  ```
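For scripted, sequential runs without the orchestrator, the same subcommands can be chained from Python. The sketch below simply shells out to the CLI using the example paths from the commands above; it adds nothing beyond the documented invocations:

```python
# A minimal sketch chaining the documented epibench commands via subprocess.
# Paths and config names are the example values from the commands above.
import subprocess

steps = [
    ["epibench", "process-data", "--config", "config/process_config.yaml",
     "-o", "output/processed_data"],
    ["epibench", "train", "--config", "config/train_config.yaml",
     "--output-dir", "output/training_run_01"],
    ["epibench", "evaluate", "--config", "config/train_config.yaml",
     "--checkpoint", "output/training_run_01/best_model.pth",
     "--test-data", "output/processed_data/test.h5",
     "-o", "output/evaluation_results"],
]

for cmd in steps:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop the pipeline on the first failure
```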
## Logging, Run Tracking, and Analysis (New in v1.2)

EpiBench now includes a comprehensive logging system for experiment tracking, reproducibility, and analysis. All pipeline runs are logged with detailed metadata, and you can manage and analyze logs via the CLI:
```bash
epibench logs list --log-dir logs/ --status completed --format table
epibench logs show <execution_id> --log-dir logs/ --section all --format rich
epibench logs search --metric "r_squared>0.9" --config "model.name=SeqCNNRegressor" --format table
epibench logs compare <run1> <run2> --focus metrics --format table
epibench logs export --format csv --output logs_export.csv --fields execution_id mse r_squared
epibench logs analyze --analysis-type summary --metric r_squared --plot
```
See `docs/logging.md` for full documentation, schema details, and advanced examples.
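As a quick downstream example, the CSV produced by the `epibench logs export` command above can be analyzed with standard tooling. This sketch assumes only the three fields requested in that example (`execution_id`, `mse`, `r_squared`):

```python
# A minimal sketch reading the CSV produced by `epibench logs export` above.
# Assumes the exported fields requested in the example command.
import csv

with open("logs_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Keep well-performing runs, mirroring the `logs search` filter above
good = [r for r in rows if float(r["r_squared"]) > 0.9]
for r in sorted(good, key=lambda r: float(r["mse"])):
    print(r["execution_id"], r["mse"], r["r_squared"])
```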
## Configuration

EpiBench relies heavily on configuration files (YAML format preferred; JSON also supported) to define parameters for data processing, model architecture, training settings, evaluation metrics, interpretation methods, and comparison setups.
Example configuration files demonstrating required parameters and structure can be found in the `config/` directory.
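Because configs are plain YAML, they can also be loaded and inspected programmatically. The sketch below uses PyYAML; the `training` keys are invented for illustration, so consult the example files in `config/` for the actual schema:

```python
# A minimal sketch loading a YAML config with PyYAML.
# The `training` keys are hypothetical; see config/*_example.yaml for real ones.
import yaml

example = """
model:
  name: SeqCNNRegressor
training:
  epochs: 50
  learning_rate: 0.001
"""

config = yaml.safe_load(example)
print(config["model"]["name"])       # SeqCNNRegressor
print(config["training"]["epochs"])  # 50
```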
## Orchestration

For running common end-to-end workflows (e.g., processing, training, evaluating, and predicting sequentially), you can use the provided orchestration scripts located in the `scripts/` directory.
Example:

```bash
python scripts/run_full_pipeline.py --output-dir ./pipeline_runs --samples-config config/samples_to_run.yaml --max-workers 4
```
See `scripts/README.md` for detailed usage of the orchestration scripts.
## Environment Validation

Before running complex pipelines, it's crucial to ensure your environment (Python packages, external tools, environment variables) is set up correctly. EpiBench includes a validation script for this purpose:

```bash
python scripts/check_environment.py
```
This script checks:
- Python Packages: Verifies that all packages listed in `requirements.txt` are installed and meet the specified version constraints.
- External Tools: Checks for the presence (and optionally, minimum versions) of required external command-line tools in your system's PATH.
- Environment Variables: Ensures that necessary environment variables are set.
The script provides detailed error messages and suggestions for fixing any detected issues.
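As a rough illustration of the package check (not the actual `scripts/check_environment.py` implementation), comparing installed versions against `requirements.txt` might look like this simplified sketch, which ignores editable installs and environment markers:

```python
# A simplified illustration of a requirements check; not the actual
# scripts/check_environment.py implementation.
from importlib.metadata import version, PackageNotFoundError
from packaging.requirements import Requirement

with open("requirements.txt") as f:
    reqs = [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]

for line in reqs:
    req = Requirement(line)
    try:
        installed = version(req.name)
    except PackageNotFoundError:
        print(f"MISSING           {req.name}")
        continue
    ok = req.specifier.contains(installed, prereleases=True)
    print(f"{'OK' if ok else 'VERSION MISMATCH'}  {req.name}=={installed}")
```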
The main orchestration script (`scripts/run_full_pipeline.py`) automatically runs this validation at the beginning. If you need to bypass this check (e.g., in a tightly controlled environment where you are certain of the setup), you can use the `--skip-validation` flag:

```bash
python scripts/run_full_pipeline.py --skip-validation ... [other arguments]
```
## Contributing

Contributions are welcome! Please refer to the contributing guidelines for details on how to submit pull requests, report issues, or suggest enhancements.
## License

This project is licensed under the MIT License; see the LICENSE file for details.