Implement New Workflow Approach with PVC and S3 Integration #38

Open
7 tasks
mosoriob opened this issue Sep 23, 2024 · 0 comments

mosoriob commented Sep 23, 2024

Description

We need to implement a new approach for running workflows in our MINT ensemble manager. This approach will use Kubernetes Jobs, Persistent Volume Claims (PVCs), and S3 for efficient data handling and processing.

Proposed Implementation

  1. Job Creation and Data Download:

    • Ensemble manager creates a per-run PVC and a Kubernetes Job
    • The Job mounts the PVC
    • The Job includes an initContainer that downloads the input data (from HTTP or S3) to the PVC (see the Job spec sketch after this list)
  2. Main Processing:

    • The main container mounts the same PVC
    • It runs the processing script on the downloaded inputs
  3. Data Upload:

    • After the job completes, the ensemble manager creates a second Job
    • This upload Job mounts the same PVC and copies the outputs to S3
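
A minimal sketch of the Job body for steps 1 and 2, written as the plain object a Kubernetes client would submit. The image names (data-downloader, data-processor) come from the code outline below; the resource names, mount paths, storage size, and backoffLimit are placeholders, not decisions.

// Sketch only: names, paths, and inputDataUrl are placeholders.
const inputDataUrl = 'https://example.org/inputs.tar.gz';

const jobBody = {
  apiVersion: 'batch/v1',
  kind: 'Job',
  metadata: { name: 'workflow-run-example' },
  spec: {
    backoffLimit: 2,                    // let Kubernetes retry failed pods
    template: {
      spec: {
        restartPolicy: 'Never',
        volumes: [
          { name: 'workdir', persistentVolumeClaim: { claimName: 'workflow-run-example-pvc' } }
        ],
        initContainers: [{
          name: 'download-inputs',
          image: 'data-downloader',
          command: ['download', inputDataUrl, '/data'],
          volumeMounts: [{ name: 'workdir', mountPath: '/data' }]
        }],
        containers: [{
          name: 'process',
          image: 'data-processor',
          command: ['process', '/data/input', '/data/output'],
          volumeMounts: [{ name: 'workdir', mountPath: '/data' }]
        }]
      }
    }
  }
};

The upload Job in step 3 would look much the same, minus the initContainer, with an uploader image that reads /data/output and writes to S3.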

Technical Details

  • Use the Kubernetes Jobs API for creating and managing jobs
  • Implement PVC creation and management
  • Develop an initContainer image for data download (supporting both HTTP and S3)
  • Implement S3 upload functionality (a sketch of the upload step follows this list)
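
For the S3 upload item, a minimal sketch of what the upload Job's container could run, assuming Node.js and the AWS SDK v3 (@aws-sdk/client-s3); the bucket, key, and local path are placeholders.

// Sketch of the upload Job's container logic; bucket, key, and path are placeholders.
const fs = require('fs');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

async function uploadOutput(localPath, bucket, key) {
  // Region and credentials are expected to come from the container environment.
  const s3 = new S3Client({});
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: fs.readFileSync(localPath)  // for large outputs, multipart upload via @aws-sdk/lib-storage is preferable
  }));
}

// e.g. uploadOutput('/data/output/result.tar.gz', 'mint-workflow-outputs', 'runs/some-run-id/result.tar.gz')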

Code Outline

async function createWorkflowJob(inputDataUrl, jobParams) {
  // Create a per-run PVC that the workflow's containers will share.
  const pvc = await createPVC();

  // One Job: the initContainer stages the inputs, the main container processes them.
  const job = await createKubernetesJob({
    pvc,
    initContainer: {
      image: 'data-downloader',
      command: ['download', inputDataUrl, '/data']
    },
    mainContainer: {
      image: 'data-processor',
      command: ['process', '/data/input', '/data/output']
    }
  });

  // Wait for the processing Job to finish before uploading.
  await monitorJob(job);

  // Second Job: mounts the same PVC and pushes the outputs to S3.
  const uploadJob = await createUploadJob(pvc);
  await monitorJob(uploadJob);

  // Remove the PVC and both completed Jobs.
  await cleanupResources(pvc, job, uploadJob);
}

// Helper functions
async function createPVC() { /* ... */ }
async function createKubernetesJob(config) { /* ... */ }
async function monitorJob(job) { /* ... */ }
async function createUploadJob(pvc) { /* ... */ }
async function cleanupResources(pvc, ...jobs) { /* ... */ }
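
Two of the helpers could be filled in with the official Kubernetes JavaScript client (@kubernetes/client-node). A rough sketch, assuming the pre-1.0 positional call style (method signatures vary across client versions) and placeholder names and sizes; these helpers take name/namespace arguments rather than the objects used in the outline above.

// Rough sketch with @kubernetes/client-node; signatures vary by client version.
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);
const batch = kc.makeApiClient(k8s.BatchV1Api);

async function createPVC(name, namespace) {
  const pvc = {
    apiVersion: 'v1',
    kind: 'PersistentVolumeClaim',
    metadata: { name },
    spec: {
      accessModes: ['ReadWriteOnce'],
      resources: { requests: { storage: '10Gi' } }  // placeholder size; see the PVC size question below
    }
  };
  return core.createNamespacedPersistentVolumeClaim(namespace, pvc);
}

async function monitorJob(name, namespace) {
  // Poll the Job status until it either succeeds or fails.
  for (;;) {
    const res = await batch.readNamespacedJobStatus(name, namespace);
    const status = res.body.status || {};
    if (status.succeeded) return;
    if (status.failed) throw new Error(`Job ${name} failed`);
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}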

Benefits

  1. Clear separation of data download, processing, and upload steps
  2. Efficient use of Kubernetes resources
  3. Flexibility to handle both HTTP and S3 data sources
  4. Improved data isolation with per-job PVCs
  5. No direct dependency between the ensemble manager and the PVC

Acceptance Criteria

  • Implement job creation with PVC and initContainer for data download
  • Develop main container logic for data processing
  • Create separate job for S3 upload after processing
  • Implement proper error handling and logging
  • Ensure cleanup of resources (PVC, completed jobs) after workflow completion (see the sketch after this list)
  • Add unit and integration tests for new components
  • Update documentation to reflect the new workflow approach
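
For the cleanup criterion above, one possible shape, reusing the core/batch clients from the earlier sketch; again, the exact delete-call signatures depend on the client version.

// Sketch of cleanup: delete the completed Jobs, then the per-run PVC.
async function cleanupResources(pvcName, namespace, ...jobNames) {
  for (const jobName of jobNames) {
    // Setting ttlSecondsAfterFinished on the Job spec, or a 'Background'/'Foreground'
    // propagationPolicy in the delete options, also removes the Job's finished pods.
    await batch.deleteNamespacedJob(jobName, namespace);
  }
  await core.deleteNamespacedPersistentVolumeClaim(pvcName, namespace);
}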

Additional Considerations

  • Evaluate performance impact of creating/deleting PVCs for each job
  • Consider implementing retry logic for failed jobs (see the note below)
  • Assess security implications of accessing various data sources
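
On the retry consideration, one note: Kubernetes Jobs already retry failed pods declaratively, so part of this may not need custom code; only whole-workflow retries (re-creating a Job after it is marked failed) would live in the ensemble manager.

// Declarative retries on the Job itself (values are placeholders):
// spec: {
//   backoffLimit: 3,              // pod retries before the Job is marked failed
//   activeDeadlineSeconds: 3600   // hard cap on total Job runtime
// }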

Questions

  • Do we need to support any data sources other than HTTP and S3?
  • Are there any size limitations we should be aware of for the PVCs?
  • Should we implement any specific monitoring or alerting for long-running jobs?