Implement New Workflow Approach with PVC and S3 Integration #38

Open
7 tasks
mosoriob opened this issue Sep 23, 2024 · 0 comments

mosoriob commented Sep 23, 2024

Description

We need to implement a new approach for running workflows in our MINT ensemble manager. This approach will use Kubernetes Jobs, Persistent Volume Claims (PVCs), and S3 for efficient data handling and processing.

Proposed Implementation

  1. Job Creation and Data Download:

    • Ensemble manager creates a per-run PVC and a Kubernetes Job
    • The Job mounts the PVC
    • The Job includes an initContainer that downloads the input data (from HTTP or S3) to the PVC (see the Job spec sketch after this list)
  2. Main Processing:

    • The main container mounts the same PVC
    • It runs the processing script on the downloaded inputs
  3. Data Upload:

    • After the job completes, the ensemble manager creates a second Job
    • This upload Job mounts the same PVC and copies the outputs to S3
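
A minimal sketch of the Job body for steps 1 and 2, written as the plain object a Kubernetes client would submit. The image names (data-downloader, data-processor) come from the code outline below; the resource names, mount paths, storage size, and backoffLimit are placeholders, not decisions.

// Sketch only: names, paths, and inputDataUrl are placeholders.
const inputDataUrl = 'https://example.org/inputs.tar.gz';

const jobBody = {
  apiVersion: 'batch/v1',
  kind: 'Job',
  metadata: { name: 'workflow-run-example' },
  spec: {
    backoffLimit: 2,                    // let Kubernetes retry failed pods
    template: {
      spec: {
        restartPolicy: 'Never',
        volumes: [
          { name: 'workdir', persistentVolumeClaim: { claimName: 'workflow-run-example-pvc' } }
        ],
        initContainers: [{
          name: 'download-inputs',
          image: 'data-downloader',
          command: ['download', inputDataUrl, '/data'],
          volumeMounts: [{ name: 'workdir', mountPath: '/data' }]
        }],
        containers: [{
          name: 'process',
          image: 'data-processor',
          command: ['process', '/data/input', '/data/output'],
          volumeMounts: [{ name: 'workdir', mountPath: '/data' }]
        }]
      }
    }
  }
};

The upload Job in step 3 would look much the same, minus the initContainer, with an uploader image that reads /data/output and writes to S3.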

Technical Details

  • Use the Kubernetes Jobs API for creating and managing jobs
  • Implement PVC creation and management
  • Develop an initContainer image for data download (supporting both HTTP and S3)
  • Implement S3 upload functionality (a sketch of the upload step follows this list)
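
For the S3 upload item, a minimal sketch of what the upload Job's container could run, assuming Node.js and the AWS SDK v3 (@aws-sdk/client-s3); the bucket, key, and local path are placeholders.

// Sketch of the upload Job's container logic; bucket, key, and path are placeholders.
const fs = require('fs');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

async function uploadOutput(localPath, bucket, key) {
  // Region and credentials are expected to come from the container environment.
  const s3 = new S3Client({});
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: fs.readFileSync(localPath)  // for large outputs, multipart upload via @aws-sdk/lib-storage is preferable
  }));
}

// e.g. uploadOutput('/data/output/result.tar.gz', 'mint-workflow-outputs', 'runs/some-run-id/result.tar.gz')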

Code Outline

async function createWorkflowJob(inputDataUrl, jobParams) {
  // Create a per-run PVC that the workflow's containers will share.
  const pvc = await createPVC();

  // One Job: the initContainer stages the inputs, the main container processes them.
  const job = await createKubernetesJob({
    pvc,
    initContainer: {
      image: 'data-downloader',
      command: ['download', inputDataUrl, '/data']
    },
    mainContainer: {
      image: 'data-processor',
      command: ['process', '/data/input', '/data/output']
    }
  });

  // Wait for the processing Job to finish before uploading.
  await monitorJob(job);

  // Second Job: mounts the same PVC and pushes the outputs to S3.
  const uploadJob = await createUploadJob(pvc);
  await monitorJob(uploadJob);

  // Remove the PVC and both completed Jobs.
  await cleanupResources(pvc, job, uploadJob);
}

// Helper functions
async function createPVC() { /* ... */ }
async function createKubernetesJob(config) { /* ... */ }
async function monitorJob(job) { /* ... */ }
async function createUploadJob(pvc) { /* ... */ }
async function cleanupResources(pvc, ...jobs) { /* ... */ }
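
Two of the helpers could be filled in with the official Kubernetes JavaScript client (@kubernetes/client-node). A rough sketch, assuming the pre-1.0 positional call style (method signatures vary across client versions) and placeholder names and sizes; these helpers take name/namespace arguments rather than the objects used in the outline above.

// Rough sketch with @kubernetes/client-node; signatures vary by client version.
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);
const batch = kc.makeApiClient(k8s.BatchV1Api);

async function createPVC(name, namespace) {
  const pvc = {
    apiVersion: 'v1',
    kind: 'PersistentVolumeClaim',
    metadata: { name },
    spec: {
      accessModes: ['ReadWriteOnce'],
      resources: { requests: { storage: '10Gi' } }  // placeholder size; see the PVC size question below
    }
  };
  return core.createNamespacedPersistentVolumeClaim(namespace, pvc);
}

async function monitorJob(name, namespace) {
  // Poll the Job status until it either succeeds or fails.
  for (;;) {
    const res = await batch.readNamespacedJobStatus(name, namespace);
    const status = res.body.status || {};
    if (status.succeeded) return;
    if (status.failed) throw new Error(`Job ${name} failed`);
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}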

Benefits

  1. Clear separation of data download, processing, and upload steps
  2. Efficient use of Kubernetes resources
  3. Flexibility to handle both HTTP and S3 data sources
  4. Improved data isolation with per-job PVCs
  5. No direct dependency between the ensemble manager and the PVC

Acceptance Criteria

  • Implement job creation with PVC and initContainer for data download
  • Develop main container logic for data processing
  • Create separate job for S3 upload after processing
  • Implement proper error handling and logging
  • Ensure cleanup of resources (PVC, completed jobs) after workflow completion (see the sketch after this list)
  • Add unit and integration tests for new components
  • Update documentation to reflect the new workflow approach
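
For the cleanup criterion above, one possible shape, reusing the core/batch clients from the earlier sketch; again, the exact delete-call signatures depend on the client version.

// Sketch of cleanup: delete the completed Jobs, then the per-run PVC.
async function cleanupResources(pvcName, namespace, ...jobNames) {
  for (const jobName of jobNames) {
    // Setting ttlSecondsAfterFinished on the Job spec, or a 'Background'/'Foreground'
    // propagationPolicy in the delete options, also removes the Job's finished pods.
    await batch.deleteNamespacedJob(jobName, namespace);
  }
  await core.deleteNamespacedPersistentVolumeClaim(pvcName, namespace);
}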

Additional Considerations

  • Evaluate performance impact of creating/deleting PVCs for each job
  • Consider implementing retry logic for failed jobs (see the note below)
  • Assess security implications of accessing various data sources
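
On the retry consideration, one note: Kubernetes Jobs already retry failed pods declaratively, so part of this may not need custom code; only whole-workflow retries (re-creating a Job after it is marked failed) would live in the ensemble manager.

// Declarative retries on the Job itself (values are placeholders):
// spec: {
//   backoffLimit: 3,              // pod retries before the Job is marked failed
//   activeDeadlineSeconds: 3600   // hard cap on total Job runtime
// }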

Questions

  • Do we need to support any data sources other than HTTP and S3?
  • Are there any size limitations we should be aware of for the PVCs?
  • Should we implement any specific monitoring or alerting for long-running jobs?