NMF K-Optimization Pipeline for ATAC-seq Data

Overview

A Python-based CLI pipeline for Non-negative Matrix Factorization (NMF) k-optimization on ATAC-seq data. It processes TCGA Z-scores to identify the optimal number of components through:

Binary accessibility mapping from Z-scores
Sample grouping by embryonic origin
NMF evaluation across k-values (2-26)
Metric calculation (F1, AUPRC, Reconstruction Error)
Model weight storage and visualization

Quick Start

# Setup
python3 -m venv nmf_env && source nmf_env/bin/activate
pip install -r requirements.txt

# Configure paths in config.json
# Run pipeline
python prepare_data.py
python run_group_nmf_cli.py Ectoderm
python run_allsamples_nmf_cli.py

Repository Structure

├── config.json                    # Main configuration
├── emb.json                       # Sample groupings
├── requirements.txt               # Dependencies
├── data_utils.py                  # Data processing utilities
├── nmf_evaluation.py              # Evaluation metrics
├── nmf_plotting.py                # Visualization
├── nmf_workflow.py                # Core pipeline
├── prepare_data.py                # Data preparation CLI
├── run_group_nmf_cli.py           # Group-specific NMF CLI
├── run_allsamples_nmf_cli.py      # All-samples NMF CLI
└── embryonic_group_nmf_outputs_cli/
  ├── preprocessed_data/         # Intermediate files
  ├── Ect_NMF_K_opt_2_26/        # Ectoderm results
  ├── Mes_NMF_K_opt_2_26/        # Mesoderm results
  └── AllSamples_NMF_K_opt_2_26/ # Combined results

Configuration

`config.json` - Essential Settings

{
  "TCGA_ZSCORES_PATH": "/path/to/TCGA_zscores.parquet",
  "K_RANGE_EMBRYONIC_START": 2,
  "K_RANGE_EMBRYONIC_END": 26,
  "TOP_N_FEATURES_CUTOFF": 117442,
  "N_JOBS_PARALLEL": -1
}
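
As a rough illustration of how these settings might be consumed, the sketch below parses the excerpt above and expands the k range. The helper name `k_values_from_config` is hypothetical, and treating `K_RANGE_EMBRYONIC_END` as inclusive is an assumption based on the "2-26" range quoted in the Overview; the pipeline's own code may handle this differently.

```python
import json
from typing import List

# Keys copied from the config.json excerpt above; the path value is a placeholder.
CONFIG_TEXT = """
{
  "TCGA_ZSCORES_PATH": "/path/to/TCGA_zscores.parquet",
  "K_RANGE_EMBRYONIC_START": 2,
  "K_RANGE_EMBRYONIC_END": 26,
  "TOP_N_FEATURES_CUTOFF": 117442,
  "N_JOBS_PARALLEL": -1
}
"""


def k_values_from_config(config: dict) -> List[int]:
    """Return the k values to evaluate (hypothetical helper).

    Assumption: the END key is inclusive, matching the "2-26" range in the Overview.
    """
    start = config["K_RANGE_EMBRYONIC_START"]
    end = config["K_RANGE_EMBRYONIC_END"]
    if start < 1 or end < start:
        raise ValueError(f"Invalid k range: {start}..{end}")
    return list(range(start, end + 1))


config = json.loads(CONFIG_TEXT)
k_values = k_values_from_config(config)
print(k_values[0], k_values[-1], len(k_values))  # prints: 2 26 25
```

Validating the range up front means a typo in `config.json` fails immediately rather than partway through a long run.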

`emb.json` - Sample Groupings

{
  "organ_system_groupings": [
  {
    "group_name": "Ectoderm",
    "cancer_codes": ["BRCA", "SKCM", "HNSC"]
  }
  ]
}
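
The grouping file can be inverted into a cancer-code lookup, which is convenient when assigning samples to groups. The following is a minimal sketch, not the pipeline's own code: `build_code_to_group` is a hypothetical helper, and rejecting a code that appears in two groups is an assumed policy.

```python
import json

# Structure copied from the emb.json excerpt above.
EMB_TEXT = """
{
  "organ_system_groupings": [
    {
      "group_name": "Ectoderm",
      "cancer_codes": ["BRCA", "SKCM", "HNSC"]
    }
  ]
}
"""


def build_code_to_group(emb: dict) -> dict:
    """Map each cancer code to its group name (hypothetical helper)."""
    code_to_group = {}
    for group in emb["organ_system_groupings"]:
        for code in group["cancer_codes"]:
            # Assumed policy: a cancer code may belong to only one group.
            if code in code_to_group:
                raise ValueError(f"{code} is listed in more than one group")
            code_to_group[code] = group["group_name"]
    return code_to_group


emb = json.loads(EMB_TEXT)
code_to_group = build_code_to_group(emb)
print(code_to_group["SKCM"])  # prints: Ectoderm
```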

Input Requirements

TCGA Z-scores: Parquet file with features as rows, samples as columns (starting column 7)
Sample IDs: Must follow TCGA format for cancer type extraction
Python 3.7+ with packages from requirements.txt
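
The exact sample ID layout is not spelled out here, and the pipeline's real logic lives in `get_cancer_type_from_sample_id()` in `data_utils.py`. Purely as an illustration, the sketch below assumes each ID starts with an uppercase cancer code followed by an underscore; that layout, the example IDs, and the function name `extract_cancer_code` are all assumptions.

```python
import re
from typing import Optional

# Assumed layout: "<CANCER_CODE>_<rest of ID>", e.g. "BRCA_sample001" (made-up ID).
_CODE_PREFIX = re.compile(r"^([A-Z]{2,4})_")


def extract_cancer_code(sample_id: str) -> Optional[str]:
    """Return the leading cancer code, or None if the ID does not match the assumed layout."""
    match = _CODE_PREFIX.match(sample_id)
    return match.group(1) if match else None


for sample_id in ["BRCA_sample001", "OV_sample002", "unexpected-id"]:
    print(sample_id, "->", extract_cancer_code(sample_id))
```

Returning `None` for unmatched IDs makes it easy to list the samples that need attention before any grouping is attempted.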

Usage

Single Commands

# Data preparation (run once)
python prepare_data.py

# Group-specific analysis
python run_group_nmf_cli.py Ectoderm
python run_group_nmf_cli.py Mesoderm --n-jobs 4

# All samples analysis
python run_allsamples_nmf_cli.py

Background Execution

# Using tmux
tmux new -s nmf_pipeline
source nmf_env/bin/activate
python prepare_data.py && \
python run_group_nmf_cli.py Ectoderm && \
python run_allsamples_nmf_cli.py
# Ctrl+b, d to detach

Output Structure

Each analysis produces:

{k}NMF/weights/: W.npy (features×k), H.npy (samples×k) matrices
*_evaluation_summary.csv: Metrics for all k values
summary_figures/: K-selection plots
*_run_parameters.json: Analysis parameters

Key Metrics

Max Mean F1: Higher = better component separation
AUPRC: Higher = better reconstruction quality
Reconstruction Error: Lower = better fit
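
To make the reconstruction error concrete, here is a small NumPy sketch of a k sweep on a toy matrix. It is not the pipeline's implementation: it substitutes plain Lee-Seung multiplicative updates for whatever solver `nmf_workflow.py` uses, and it assumes the error is the Frobenius norm of V - W Hᵀ, with W (features×k) and H (samples×k) shaped as described under Output Structure.

```python
import numpy as np


def nmf_multiplicative(V, k, n_iter=300, seed=0, eps=1e-9):
    """Factor V (features x samples) as W @ H.T with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = V.shape
    W = rng.random((n_features, k)) + eps  # features x k
    H = rng.random((n_samples, k)) + eps   # samples x k
    for _ in range(n_iter):
        # H <- H * (V^T W) / (H W^T W)
        H *= (V.T @ W) / (H @ (W.T @ W) + eps)
        # W <- W * (V H) / (W H^T H)
        W *= (V @ H) / (W @ (H.T @ H) + eps)
    return W, H


def reconstruction_error(V, W, H):
    """Frobenius norm of the residual (assumed definition of the metric)."""
    return float(np.linalg.norm(V - W @ H.T, ord="fro"))


# Toy binary matrix: 12 features x 9 samples with three planted blocks.
V = np.kron(np.eye(3), np.ones((4, 3)))

errors = {}
for k in range(1, 5):
    W, H = nmf_multiplicative(V, k)
    errors[k] = reconstruction_error(V, W, H)

for k, err in errors.items():
    print(f"k={k}: reconstruction error = {err:.3f}")
```

On data with this kind of planted structure, the error typically drops sharply until k reaches the number of blocks and then flattens, which is the "elbow" that k-selection plots look for.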

Troubleshooting

Issue	Solution
Wrong cancer type extraction	Modify `get_cancer_type_from_sample_id()` in `data_utils.py`
Memory issues	Reduce `N_JOBS_PARALLEL` or `BOOL_MAP_CHUNK_SIZE`
Path errors	Verify all paths in `config.json`
Polars issues	Update Polars version, check `enable_string_cache()` usage

FAIR Compliance

Findable: Clear naming, version control support
Accessible: Open-source, CLI-based, standard dependencies
Interoperable: JSON, Parquet, NPY/NPZ, CSV formats
Reusable: Modular design, configurable parameters

Preprocessed Data Workflow

Overview

The NMF analysis pipeline relies on preprocessed data stored in /koptlib/preprocessed_data/. These files are generated during the data preparation stage and serve as inputs for all downstream analyses.

Key Files

bool_map_overall_sparse_feat_x_sample.npz: Sparse boolean matrix with features as rows and samples as columns
all_tcga_samples.json: Complete list of sample IDs
sample_to_cancer_type_map.json: Maps sample IDs to their corresponding cancer types
emb_groupings.json: Defines sample groupings by embryonic origin

Data Preparation Process

The boolean matrix is created when `prepare_data.py` runs. The script:

Reads all input data files
Extracts features for each sample
Constructs a complete feature × sample boolean matrix
Converts it to a sparse format for memory efficiency
Saves all necessary files to the preprocessed data directory
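
The binarization and sparse-conversion steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `prepare_data.py`: the function name `zscores_to_sparse_bool` and the `z_threshold` cutoff of 0.0 are hypothetical, since the threshold the pipeline actually applies is not given here.

```python
import numpy as np
from scipy import sparse


def zscores_to_sparse_bool(zscores, z_threshold=0.0):
    """Binarize a features x samples Z-score array and store it sparsely.

    z_threshold is a hypothetical cutoff; the pipeline's real rule may differ.
    """
    accessible = zscores > z_threshold  # dense boolean array, NaN compares as False
    return sparse.csr_matrix(accessible)


# Toy input: 2 features x 3 samples.
zscores = np.array([
    [1.2, -0.3, 0.0],
    [-1.0, 2.5, 0.7],
])
bool_map = zscores_to_sparse_bool(zscores)
print(bool_map.shape, bool_map.nnz)  # prints: (2, 3) 3
```

A matrix built this way can be written with `scipy.sparse.save_npz` and read back with `scipy.sparse.load_npz`, which matches the `.npz` extension of the boolean map listed under Key Files.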

Adding New Datasets

When introducing new datasets:

Always run prepare_data.py to regenerate all preprocessed files
Ensure new sample IDs follow the expected format
Verify new cancer types are properly mapped in groupings
Check that the matrix maintains the correct dimensions (features × samples)

The NMF workflow dynamically subsets this matrix based on sample groups without modifying the original files.
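
A minimal sketch of this kind of subsetting is shown below, using a dense NumPy array for brevity. The helper `subset_by_group` and the sample IDs are hypothetical; the real workflow operates on the sparse matrix and the JSON maps listed under Key Files.

```python
import numpy as np


def subset_by_group(matrix, sample_ids, sample_to_cancer, group_codes):
    """Return the columns (samples) whose cancer type belongs to the group.

    Integer-list indexing returns a copy, so the input matrix is left untouched.
    """
    codes = set(group_codes)
    keep = [i for i, sid in enumerate(sample_ids) if sample_to_cancer.get(sid) in codes]
    return matrix[:, keep], [sample_ids[i] for i in keep]


# Toy features x samples matrix with made-up sample IDs.
matrix = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=bool)
sample_ids = ["s1", "s2", "s3", "s4"]
sample_to_cancer = {"s1": "BRCA", "s2": "KIRC", "s3": "SKCM", "s4": "LUAD"}

sub, sub_ids = subset_by_group(matrix, sample_ids, sample_to_cancer, ["BRCA", "SKCM", "HNSC"])
print(sub.shape, sub_ids)  # prints: (3, 2) ['s1', 's3']
```

The same column indexing works on a SciPy CSC or CSR matrix, so the full boolean map never needs to be densified to extract one group.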
