A speech-to-language model pipeline that unifies audio generation and natural language understanding through knowledge distillation and reinforced behavior alignment.
IRA bridges speech synthesis and language model reasoning through a unified multimodal system. It trains a SpeechLM to generate audio tokens directly compatible with LLM representations, enabling seamless integration without intermediate transcription steps.
- Staged Training Pipeline: A robust 3-stage process (Pre-train $\to$ Align $\to$ Fine-tune) to ensure stability.
- Native Multimodal Alignment: No ASR steps; the model learns to "understand" raw audio tokens.
- Voice Style Control: CLIP-style encoder for semantic voice descriptions.
- Hardware Optimized: Supports training on consumer hardware (Mac M-series, RTX laptops) via quantization and gradient accumulation (see the sketch below).
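For reference, the memory-saving trick behind the laptop configs is plain gradient accumulation. The loop below is a generic, self-contained illustration with a tiny placeholder model and batch sizes; it is not IRA's actual trainer:

```python
import torch
from torch import nn

# Generic gradient-accumulation loop (illustrative only, not IRA's trainer).
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(2, 16)            # tiny micro-batch keeps peak memory low
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```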
Ensure you have Python 3.10+ installed.
```bash
# Clone the repository
git clone https://github.com/Khushiyant/ira.git
cd ira

# Install dependencies
pip install -e .
```

**For Mac (M1/M2/M3):**
Mac users should rely on accelerate and MPS (Metal Performance Shaders). Do not install bitsandbytes, as it requires CUDA.

```bash
pip install accelerate
```
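Before training, you can confirm that PyTorch actually sees the Metal backend (this uses the standard PyTorch API, not an IRA utility):

```python
import torch

# Verify the Metal (MPS) backend is available before pointing the config at it.
if torch.backends.mps.is_available():
    print('MPS backend available: set device: "mps" in the config')
else:
    print("MPS backend not available: training will fall back to CPU")
```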
**For Nvidia GPUs (Linux/Windows):**
To enable 4-bit quantization for lower VRAM usage:
```bash
pip install bitsandbytes accelerate
```
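For context, 4-bit loading goes through bitsandbytes via the Hugging Face transformers API. The snippet below is a standalone illustration of that mechanism; in IRA itself the `load_in_4bit` config flag controls this, and the model name is simply the LLM referenced later in this README:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit load via bitsandbytes (CUDA only); IRA toggles this via load_in_4bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B",   # same LLM used in the inference example below
    quantization_config=bnb_config,
    device_map="auto",
)
```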
The data pipeline is designed to be plug-and-play; you have two options.

The model can automatically download and format the LJSpeech-1.1 dataset. Simply set `download: true` and `dataset_name: "ljspeech"` in your config file.
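If you prefer to script that change, a rough sketch is shown below; note that the top-level placement of these keys and the output filename are assumptions, not the actual config schema:

```python
import yaml

# Sketch: enable the automatic LJSpeech download in an existing config.
# The top-level placement of these keys is an assumption about the schema.
with open("configs/laptop_config.yaml") as f:
    config = yaml.safe_load(f)

config["dataset_name"] = "ljspeech"
config["download"] = True

with open("configs/my_ljspeech_config.yaml", "w") as f:   # hypothetical output name
    yaml.safe_dump(config, f)
```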
To verify the pipeline without downloading large datasets, generate synthetic data:
```bash
python scripts/create_dummy_data.py
```

Two configuration files are provided:
- configs/train_config.yaml: Standard setup for A100/H100 GPUs.
- configs/laptop_config.yaml: Optimized for consumer hardware (12-16 GB VRAM/RAM).
Mac Users: Ensure your config has `load_in_4bit: false` and `device: "mps"`.
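A quick sanity check before a Mac run might look like this (the key names follow this README; their exact nesting in the config is an assumption):

```python
import yaml

# Sketch: sanity-check a laptop config before training on Apple Silicon.
with open("configs/laptop_config.yaml") as f:
    config = yaml.safe_load(f)

assert config.get("device") == "mps", 'Set device: "mps" for Apple Silicon'
assert config.get("load_in_4bit") is False, "bitsandbytes 4-bit requires CUDA; disable it on Mac"
```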
IRA uses a Staged Training approach to ensure convergence. You must run these stages in order.
**Stage 1:** Trains the model to generate valid acoustic tokens (audio reconstruction).
```bash
python main.py --stage 1 --config configs/laptop_config.yaml
```
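Conceptually, Stage 1 is next-token prediction over acoustic codes. The sketch below illustrates that objective with placeholder sizes and random tokens; it is not IRA's training code, though the 512-dim, 8-layer shape mirrors the SpeechLM described in the architecture section:

```python
import torch
from torch import nn

# Illustrative Stage 1 objective: next-token prediction over acoustic (EnCodec) codes.
# Vocab size, batch, and sequence length are placeholders.
vocab_size, dim = 1024, 512
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
speech_lm = nn.TransformerEncoder(layer, num_layers=8)
embed = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim, vocab_size)

codes = torch.randint(0, vocab_size, (2, 64))                       # dummy acoustic token ids
causal_mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1) - 1)
hidden = speech_lm(embed(codes[:, :-1]), mask=causal_mask)          # predict each next token
loss = nn.functional.cross_entropy(
    head(hidden).reshape(-1, vocab_size), codes[:, 1:].reshape(-1)
)
loss.backward()
```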
**Stage 2:** Freezes the SpeechLM and LLM, then trains the Adapter to map audio embeddings into the LLM's text space. Requires the checkpoint from Stage 1.

```bash
python main.py \
--stage 2 \
--config configs/laptop_config.yaml \
--resume_checkpoint checkpoints_laptop/stage1_epoch50.pt
```
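The freezing pattern in Stage 2 is the standard PyTorch one; the sketch below uses stand-in modules rather than IRA's actual classes:

```python
import torch
from torch import nn

# Sketch of the Stage 2 pattern: freeze SpeechLM and LLM, train only the adapter.
speech_lm = nn.Linear(512, 512)      # placeholder for the SpeechLM
llm = nn.Linear(2048, 2048)          # placeholder for the LLM
adapter = nn.Linear(512, 2048)       # placeholder for the audio adapter

for module in (speech_lm, llm):
    for param in module.parameters():
        param.requires_grad = False  # frozen: no gradients, no updates

# Only the adapter's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```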
**Stage 3:** Fine-tunes the model using reinforcement learning (PPO). The LLM acts as a judge, rewarding the SpeechLM for generating audio that "makes sense" semantically. Requires the checkpoint from Stage 2.

```bash
python main.py \
--stage 3 \
--config configs/laptop_config.yaml \
--resume_checkpoint checkpoints_laptop/stage2_epoch50.pt
```
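The PPO objective at the core of Stage 3 is the usual clipped surrogate. The sketch below shows only that piece with dummy tensors; how IRA converts the LLM judge's score into advantages is not shown here:

```python
import torch

# Generic PPO clipped-surrogate loss (illustration only; IRA's Stage 3 details may differ).
def ppo_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# Dummy tensors standing in for the SpeechLM's token log-probs and for
# advantages derived from the LLM judge's reward.
new_lp = torch.randn(32, requires_grad=True)
old_lp = new_lp.detach() + 0.1 * torch.randn(32)
advantages = torch.randn(32)
loss = ppo_loss(new_lp, old_lp, advantages)
loss.backward()
```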
To generate speech and process it with the LLM:

```python
from src.inference import create_pipeline_from_checkpoints
# Load your Stage 3 checkpoint
pipeline = create_pipeline_from_checkpoints(
speech_lm_checkpoint="checkpoints_laptop/stage3_epoch50.pt",
llm_name_or_path="HuggingFaceTB/SmolLM2-1.7B",
adapter_checkpoint="checkpoints_laptop/stage3_epoch50.pt", # Adapter weights are saved here too
device="mps" # or "cuda"
)
# Generate and Reason
result = pipeline.end_to_end_pipeline(
input_text="Hello, analyze the sentiment of this speech.",
return_audio=True,
)
# Save audio
pipeline.save_audio(result["audio_waveform"], "output.wav")
print("LLM Response:", result["llm_response"])- SpeechLM: Autoregressive transformer (512 dim, 8 layers) generating EnCodec tokens.
- Audio Adapter: Perceiver-style bridge that compresses audio sequences into fixed-length queries for the LLM (see the sketch after this list).
- Multimodal LLM: SmolLM2-1.7B serving as both the semantic grounding (Teacher) and the reasoning engine.
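For intuition, a minimal Perceiver-style adapter can be sketched as a set of learned queries cross-attending over the audio sequence; the dimensions and module layout below are illustrative, not IRA's implementation:

```python
import torch
from torch import nn

# Minimal Perceiver-style adapter sketch: a fixed set of learned queries
# cross-attends over a variable-length audio sequence and yields a
# fixed-length output for the LLM.
class PerceiverAdapter(nn.Module):
    def __init__(self, audio_dim=512, llm_dim=2048, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim))
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_embeddings):               # (batch, seq_len, audio_dim)
        batch = audio_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, audio_embeddings, audio_embeddings)
        return self.proj(attended)                      # (batch, num_queries, llm_dim)

adapter = PerceiverAdapter()
out = adapter(torch.randn(2, 300, 512))                 # -> torch.Size([2, 32, 2048])
```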
MIT License. See LICENSE file for details.