Support data-processing transformations specified through a static list of input files #66

@aldbr

Description

Story

As a CTAO transformation manager submitting CWL workflows,

I want to specify input files directly in the CWL inputs and have them automatically grouped into jobs based on a group-size parameter,

so that I can efficiently process large datasets by distributing files across multiple jobs without manually creating job definitions, and easily keep track of them.

Acceptance Criteria

1. Enhanced TransformationSubmissionModel

  • The TransformationSubmissionModel is extended to accept CWL inputs in addition to the CWL workflow
  • CWL inputs can contain a list of file paths (either LFNs or local paths)
  • The model validates that CWL inputs conform to the expected format
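The extended model could look roughly like the following. This is a minimal stdlib sketch only (the real `TransformationSubmissionModel` is presumably a pydantic model, and the field names `workflow` and `inputs` are assumptions, not taken from the codebase):

```python
from dataclasses import dataclass, field


@dataclass
class TransformationSubmissionModel:
    """Hypothetical sketch: a CWL workflow plus optional CWL inputs."""

    workflow: str  # serialized CWL workflow definition
    inputs: dict = field(default_factory=dict)  # CWL inputs, e.g. {"input_files": [...]}

    def __post_init__(self):
        # Validate that input_files, if present, is a list of path strings.
        files = self.inputs.get("input_files", [])
        if not isinstance(files, list) or not all(isinstance(f, str) for f in files):
            raise ValueError("input_files must be a list of LFN/path strings")
```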

2. Input File Specification

  • Users can specify a list of input files in the CWL inputs
  • Each file path should be an LFN (no sandbox support is planned for now)
  • The system accepts arbitrarily large lists of input files

3. Automatic Job Grouping

  • A group_size parameter controls how many files are processed per job
  • The system automatically calculates the number of jobs as total_files / group_size, rounded up
  • Files are distributed evenly across jobs according to the group size
  • Each job receives its subset of input files from the original list

4. Job Creation Logic

  • Given N input files and a group size of M, the system creates N/M jobs (rounded up if needed)
  • Each job is configured to process exactly M files (or fewer for the last job if N is not divisible by M)
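The job-creation rule above can be sketched as a simple chunking helper (the function name is illustrative, not from the codebase):

```python
import math


def group_files(files, group_size):
    """Split files into consecutive chunks of at most group_size each.

    Produces ceil(len(files) / group_size) jobs; the last chunk may be
    smaller when len(files) is not divisible by group_size.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [files[i:i + group_size] for i in range(0, len(files), group_size)]


# 7 files with group_size 3 -> 3 jobs holding 3, 3, and 1 files
jobs = group_files([f"file{i:03d}.root" for i in range(1, 8)], 3)
assert len(jobs) == math.ceil(7 / 3) == 3
assert [len(j) for j in jobs] == [3, 3, 1]
```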

Example Usage

Transformation Submission with 150 Files

# CWL Workflow (existing)
workflow: <cwl_workflow_definition>

# CWL Inputs (new)
inputs:
  input_files:
    - lfn://dirac/prod/dataset/file001.root
    - lfn://dirac/prod/dataset/file002.root
    - lfn://dirac/prod/dataset/file003.root
    # ... (147 more files)

# Transformation Hints
group_size: 5

Result:

  • Total input files: 150
  • Group size: 5
  • Jobs created: 150 ÷ 5 = 30 jobs
  • Each job processes 5 files

Job Distribution Example

  • Job 1: files 1-5
  • Job 2: files 6-10
  • Job 3: files 11-15
  • ...
  • Job 30: files 146-150
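The 150-file example above can be reproduced under the same chunking rule with a few lines (a sketch, not DIRAC API code; the LFN pattern mirrors the example):

```python
# Build the 150 example LFNs and chunk them with group_size = 5.
files = [f"lfn://dirac/prod/dataset/file{i:03d}.root" for i in range(1, 151)]
group_size = 5
jobs = [files[i:i + group_size] for i in range(0, len(files), group_size)]

assert len(jobs) == 30                        # 150 / 5 = 30 jobs
assert jobs[0][0].endswith("file001.root")    # Job 1 starts at file 1
assert jobs[29][-1].endswith("file150.root")  # Job 30 ends at file 150
```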

Context

This solution reuses the existing CWL inputs mechanism rather than introducing a new option in TransformationExecutionHooksHint. The approach provides a clean separation between workflow definition and input data while enabling efficient batch processing of large datasets through automatic job grouping.

Related to the discussion in #61

Dependencies

  • Requires enhancement to TransformationSubmissionModel to accept CWL inputs
  • Depends on implementation of job grouping logic based on group_size parameter

Technical Notes

  • The group_size parameter determines the granularity of parallelization
  • Users can optimize job submission based on file size and processing requirements
  • The solution maintains consistency with CWL standards while adding DIRAC-specific batch processing capabilities
