Comprehensive music feature extraction pipeline for conditioning Stable Audio Tools and similar audio generation models. Extracts 97+ numeric MIR features, 496 AI classification labels, and 5 natural language descriptions from audio files.
Status: Work-in-progress but functional. Core analysis scripts are tested; pipeline glue may lag behind. Scripts have built-in --help.
- Organizes audio files into structured folders
- Separates stems (drums, bass, other, vocals) via Demucs or BS-RoFormer
- Extracts rhythm, loudness, spectral, harmonic, timbral, and aesthetic features (see the sketch after this list)
- Classifies genre (400 labels), mood (56), and instruments (40) via Essentia
- Generates 5 AI text descriptions via Music Flamingo (8B params)
- Benchmarks caption quality across Music Flamingo, LLM revision, and Qwen2.5-Omni
- Transcribes drums to MIDI via ADTOF-PyTorch
- Creates beat-aligned training crops with feature migration
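For a flavor of what the feature extraction computes, here is a minimal librosa-based sketch. It is illustrative only: the actual extractors live under src/rhythm/, src/spectral/, src/harmonic/, and src/timbral/ and compute far more, per stem.

```python
# Illustrative sketch of the kinds of MIR features the pipeline extracts.
# Not the project's code; the real modules under src/ are far more thorough.
import librosa

y, sr = librosa.load("/path/to/audio.flac", sr=44100, mono=True)

tempo, beats = librosa.beat.beat_track(y=y, sr=sr)        # rhythm: BPM + beat frames
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral: brightness per frame
rms = librosa.feature.rms(y=y)                            # loudness proxy per frame
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)           # harmonic: pitch-class energy

print(f"tempo={tempo}, mean centroid={centroid.mean():.0f} Hz, "
      f"mean RMS={rms.mean():.4f}, chroma bins={chroma.shape[0]}")
```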
All features are saved to `.INFO` JSON files with atomic writes (existing data is never clobbered).
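The atomic-write guarantee comes from the usual temp-file-then-rename pattern. A minimal sketch of that pattern (the project's actual JSON handler lives in src/core/; the function name here is hypothetical):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON so readers never observe a partial file (hypothetical
    helper; the real handler is the JSON module under src/core/)."""
    # Create the temp file in the destination directory so os.replace()
    # stays on one filesystem and is therefore atomic on POSIX.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)),
                               suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic swap; a crash leaves the old file intact
    except BaseException:
        os.unlink(tmp)
        raise
```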
Requirements:

- Python 3.12+
- GPU: AMD ROCm 7.2+ (tested on RX 9070 XT / RDNA4) or NVIDIA CUDA
- VRAM: 5-13 GB depending on workload (up to 10 GB for captioning benchmark)
- OS: Linux (tested on Arch)
Key dependencies:

| Package | Purpose |
|---|---|
| PyTorch (ROCm/CUDA) | GPU compute |
| Demucs / BS-RoFormer | Stem separation |
| Essentia + TensorFlow | Classification (genre/mood/instrument) |
| llama.cpp (HIP build) | Music Flamingo GGUF inference |
| llama-cpp-python | LLM revision (captioning benchmark) |
| autoawq, qwen-omni-utils | Qwen2.5-Omni-7B-AWQ (captioning benchmark) |
| librosa, soundfile | Audio I/O and analysis |
| timbral_models | Audio Commons perceptual features (patched, cloned via setup script) |
See requirements.txt for the full list.
```bash
# Setup
python -m venv mir && source mir/bin/activate
pip install -r requirements.txt
bash scripts/setup_external_repos.sh
```
```bash
# Test all features on a single file
python src/test_all_features.py "/path/to/audio.flac"

# Full pipeline (config-driven)
python src/master_pipeline.py --config config/master_pipeline.yaml

# Audio captioning benchmark (compare Flamingo, LLM revision, Qwen-Omni)
python tests/poc_lmm_revise.py "/path/to/audio.flac" --genre "Goa Trance" -v
```

All ROCm environment variables are centralized in src/core/rocm_env.py and documented in config/master_pipeline.yaml. Every GPU-using script calls setup_rocm_env() before importing torch.
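The call pattern looks roughly like this (a sketch assuming src/ is on sys.path; the exact import path may differ):

```python
# Configure the ROCm environment before torch is imported: the variables
# below are read at import/initialization time, so ordering matters.
from core.rocm_env import setup_rocm_env  # src/core/rocm_env.py

setup_rocm_env()  # applies the defaults below unless already exported

import torch  # noqa: E402  # deliberately imported after setup_rocm_env()
```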
Key variables (set automatically, shell exports override):
```bash
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=0
export PYTORCH_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512
export HIP_FORCE_DEV_KERNARG=1
export TORCH_COMPILE=0  # buggy with FA on RDNA
```

Documentation:

- USER_MANUAL.md - Usage guide, module reference, troubleshooting
- MUSIC_FLAMINGO.md - Music Flamingo setup and usage
- FEATURES_STATUS.md - Feature implementation tracker
- config/master_pipeline.yaml - All pipeline and ROCm settings
Project layout:

```text
src/
  core/            # Utilities: JSON handler, file utils, rocm_env, text normalization
  preprocessing/   # File organization, stem separation (Demucs, BS-RoFormer), loudness
  rhythm/          # Beat detection, BPM, syncopation, onsets, per-stem rhythm
  spectral/        # Spectral features, multiband RMS
  harmonic/        # Chroma, per-stem harmonic movement
  timbral/         # Audio Commons features, AudioBox aesthetics
  classification/  # Essentia, Music Flamingo (GGUF + Transformers)
  transcription/   # MIDI drum transcription (ADTOF, Drumsep)
  tools/           # Metadata lookup, training crops, statistics
  crops/           # Crop-specific pipeline and feature extraction
tests/             # Benchmarks (audio captioning comparison)
config/            # YAML pipeline configuration
models/            # GGUF model files (Qwen3, GPT-OSS, Granite, Music Flamingo)
repos/             # External repos (cloned by setup script, not tracked)
```
TBD