Story
As a CTAO transformation manager submitting CWL workflows,
I want to specify input files directly in the CWL inputs and have them automatically grouped into jobs based on a group-size parameter,
So that I can efficiently process large datasets by distributing files across multiple jobs, without manually creating job definitions, and can easily keep track of them.
Acceptance Criteria
1. Enhanced TransformationSubmissionModel
- The `TransformationSubmissionModel` is extended to accept CWL inputs in addition to the CWL workflow
- CWL inputs can contain a list of file paths (either LFNs or local paths)
- The model validates that CWL inputs conform to the expected format
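As a minimal sketch of what the extended model could look like, the snippet below uses a plain dataclass with LFN validation. The field names (`workflow`, `inputs`, `group_size`) and the `lfn:` prefix check are illustrative assumptions, not the actual DiracX API.

```python
from dataclasses import dataclass, field


@dataclass
class TransformationSubmissionModel:
    """Illustrative sketch: existing workflow field plus new CWL inputs."""

    workflow: str                        # serialized CWL workflow (existing)
    inputs: dict = field(default_factory=dict)  # CWL inputs, e.g. {"input_files": [...]}
    group_size: int = 1                  # files processed per job

    def __post_init__(self):
        # Validate that every input file path looks like an LFN
        # (hypothetical check; the real model may use pydantic validators).
        for files in self.inputs.values():
            for path in files:
                if not path.startswith("lfn:"):
                    raise ValueError(f"expected an LFN, got {path!r}")
```

A submission with local paths would then fail validation up front, before any jobs are created.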
2. Input File Specification
- Users can specify a list of input files in the CWL inputs
- Each file path should be an LFN (no sandbox mechanism is planned for now)
- The system accepts arbitrarily large lists of input files
3. Automatic Job Grouping
- A `group_size` parameter controls how many files are processed per job
- The system automatically calculates the number of jobs as `total_files / group_size`
- Files are distributed evenly across jobs according to the group size
- Each job receives its subset of input files from the original list
4. Job Creation Logic
- Given N input files and a group size of M, the system creates `N/M` jobs (rounded up if needed)
- Each job is configured to process exactly M files (or fewer for the last job, if N is not divisible by M)
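The grouping logic above can be sketched in a few lines; `group_files` is a hypothetical helper name, not part of the existing codebase. Slicing the list in strides of `group_size` yields `ceil(N/M)` jobs, with only the last job possibly smaller.

```python
import math


def group_files(files, group_size):
    """Split a flat file list into consecutive chunks of at most group_size."""
    return [files[i:i + group_size] for i in range(0, len(files), group_size)]


# 7 files with group_size=3 -> ceil(7/3) = 3 jobs of sizes 3, 3, 1
files = [f"file{i:03d}.root" for i in range(1, 8)]
jobs = group_files(files, 3)
assert len(jobs) == math.ceil(len(files) / 3)
assert [len(j) for j in jobs] == [3, 3, 1]
```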
Example Usage
Transformation Submission with 150 Files
```yaml
# CWL Workflow (existing)
workflow: <cwl_workflow_definition>

# CWL Inputs (new)
inputs:
  input_files:
    - lfn://dirac/prod/dataset/file001.root
    - lfn://dirac/prod/dataset/file002.root
    - lfn://dirac/prod/dataset/file003.root
    # ... (147 more files)

# Transformation Hints
group_size: 5
```

Result:
- Total input files: 150
- Group size: 5
- Jobs created: 150 ÷ 5 = 30 jobs
- Each job processes 5 files
Job Distribution Example
- Job 1: files 1-5
- Job 2: files 6-10
- Job 3: files 11-15
- ...
- Job 30: files 146-150
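The 150-file distribution above can be reproduced with the same chunking approach (the `group_files` helper here is an illustrative sketch, not existing code):

```python
def group_files(files, group_size):
    """Split a flat file list into consecutive chunks of at most group_size."""
    return [files[i:i + group_size] for i in range(0, len(files), group_size)]


files = [f"lfn://dirac/prod/dataset/file{i:03d}.root" for i in range(1, 151)]
jobs = group_files(files, 5)
assert len(jobs) == 30                             # 150 / 5 = 30 jobs
assert jobs[0][-1].endswith("file005.root")        # Job 1: files 1-5
assert jobs[-1][0].endswith("file146.root")        # Job 30: files 146-150
```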
Context
This solution reuses the existing CWL inputs mechanism rather than introducing a new option in `TransformationExecutionHooksHint`. The approach provides a clean separation between workflow definition and input data, while enabling efficient batch processing of large datasets through automatic job grouping.
Related to the discussion in #61
Dependencies
- Requires enhancement to `TransformationSubmissionModel` to accept CWL inputs
- Depends on implementation of job grouping logic based on the `group_size` parameter
Technical Notes
- The `group_size` parameter determines the granularity of parallelization
- Users can optimize job submission based on file size and processing requirements
- The solution maintains consistency with CWL standards while adding DIRAC-specific batch processing capabilities