
feat(kfp): restructure pipeline to allow mocking sections #17

Merged (2 commits) on Sep 11, 2024

Conversation

tumido (Member) commented Sep 10, 2024

This PR restructures our approach to KFP pipelines. It also adds a CLI interface:

python pipeline.py --help    
Usage: pipeline.py [OPTIONS]

Options:
  --mock [sdg|train|eval]  Mock part of the pipeline
  --help                   Show this message and exit
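A minimal `argparse` sketch of what such a `--mock` option could look like (hypothetical; the actual `pipeline.py` has its own CLI wiring, and the helper name `build_parser` is made up for illustration):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the CLI shown above. `append` lets the
    # flag be repeated so `mock` behaves as a collection of stage names.
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument(
        "--mock",
        choices=["sdg", "train", "eval"],
        action="append",
        default=[],
        help="Mock part of the pipeline",
    )
    return parser
```

Parsing `--mock sdg` would then yield `mock == ["sdg"]`, which the pipeline code can test with a simple `in` check.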

Since we can't pass data between nested pipelines (kubeflow/pipelines#10041), we need to use a flat, single pipeline.

The components for each stage live in their respective folder as a Python package:

sdg
 ├── __init__.py        // Package for the stage
 ├── components.py      // Individual components
 └── faked
      ├── __init__.py        // Package for the faked stage
      └── components.py      // Individual faked components

In order to provide a better dev experience, each stage can be mocked individually. This is done via component substitution: it preserves the component's signature but replaces its body. This way developers can mock individual outputs (where it matters) without breaking continuity.

In this example I provide mocked components for the SDG stage:

if 'sdg' in mock:
    from sdg.faked import git_clone_op, sdg_op
else:
    from sdg import git_clone_op, sdg_op
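One way to generalize that substitution (a hypothetical sketch, not the PR's actual code) is to resolve the module name dynamically:

```python
import importlib
from types import ModuleType

def load_stage(stage: str, mock: list[str]) -> ModuleType:
    """Import `<stage>.faked` when the stage is mocked, else `<stage>`.

    Hypothetical helper: the PR instead uses explicit if/else imports,
    which keeps the substitution visible per stage.
    """
    name = f"{stage}.faked" if stage in mock else stage
    return importlib.import_module(name)
```

Because the faked module exposes the same names (`git_clone_op`, `sdg_op`) with identical signatures, the rest of the pipeline definition stays unchanged either way.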

In order to provide the output artifact from SDG, I've used a trick for Python lightweight components (yes, it's a hack). The trick bypasses the "hermetic" nature of KFP Python lightweight components: I create an empty shell Python package that ships data via setuptools data-files. This package is installed on the fly into the component runtime, and the faked component then copies the data to the output artifact.
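The body of such a faked component might boil down to copying packaged fixture files into the output artifact path, roughly like this (hypothetical sketch; the real code runs inside a KFP lightweight component with the fixtures package in `packages_to_install`, and `fake_sdg` is a made-up name):

```python
import shutil
from pathlib import Path

def fake_sdg(fixtures_dir: str, output_dir: str) -> int:
    """Copy pre-baked fixture files to the output artifact location.

    `fixtures_dir` stands in for wherever the shell package's
    data-files land after the on-the-fly pip install.
    Returns the number of files copied.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for f in Path(fixtures_dir).iterdir():
        if f.is_file():
            shutil.copy2(f, out / f.name)
            count += 1
    return count
```

The point is that the component's inputs and outputs are identical to the real SDG step; only the body is swapped.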

In order to test the pipeline with mocked SDG:

  1. Replace the package URL at main...tumido:kfp-to-cli?expand=1#diff-8c5a3a6ccb with:
    git+https://github.com/tumido/ilab-on-ocp.git@kfp-to-cli#subdirectory=sdg/faked/fixtures
    
  2. Run
    python pipeline.py --mock sdg
    
  3. This will update pipeline.yaml, replacing each SDG step with its faked component. Upload the new pipeline.yaml to KFP and run.
  4. Once executed, you get the same execution flow as if nothing were faked - but the steps don't actually do the work. The API and component signatures are maintained, but no heavy lifting happens. 🙂
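For context, the fixtures shell package referenced in step 1 might look roughly like this (a hypothetical packaging sketch shown as a config fragment, assuming setuptools data-files; the package name and file paths are made up):

```python
# setup.py for the hypothetical "sdg-fixtures" shell package:
# no Python code, just pre-generated SDG outputs shipped as data files
# so the faked component can locate and copy them at runtime.
from setuptools import setup

setup(
    name="sdg-fixtures",
    version="0.1.0",
    packages=[],
    data_files=[("share/sdg-fixtures", ["fixtures/data.jsonl"])],
)
```

Installing it via `pip install git+...#subdirectory=sdg/faked/fixtures` then makes the fixture data available inside the component runtime.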

@boarder7395

@tumido out of curiosity, why not just fix the bug that prevents nested pipelines from passing artifacts? I ask because I plan to start looking at that issue and am wondering if you already found some insurmountable problem that makes the hack approach the better option.

tumido (Member, Author) commented Sep 10, 2024

@boarder7395 At this moment I'm glad I know KFP well enough from the user perspective. It would take me a while to build the skills to work on KFP itself; it's a codebase I've never touched. Besides that, I need the pipeline running on our current KFP deployment. I can't wait for a PR to get merged into KFP, then bubble through the KFP release process, then have the new KFP version adopted by RHOAI, then get RHOAI updated on our cluster. It's a lot faster to work around the issue.

Especially when my job right now is to get the pipeline working, and I can't justify weeks of studying how KFP works internally, how to set up a dev environment, how to write tests, etc. I know... in an ideal world it would be nice to always resolve the root cause. 🤷

Besides, this is not much of a hack. Instead of having a master pipeline chaining multiple per-stage pipelines from an umbrella, we just define a single pipeline that runs all the steps. The mocking would be needed in either case: we want to develop the stages independently, but each stage depends on data from the previous stage, and that data needs to be provided by something somehow. Fixing the issue above would change nothing here.

@boarder7395

@tumido That makes sense to me - coming from the one now looking down the barrel of doing exactly those steps. I was familiar with KFP 1.0, but 2.0 is something I've avoided. Just wanted to make sure there wasn't already a dead end at the end of the tunnel :)

MichaelClifford (Collaborator) left a comment:
LGTM 🚀

tumido (Member, Author) commented Sep 11, 2024

Then I'm gonna self-merge. Because I can. 😄 🙌

@tumido tumido merged commit 06b0ce2 into opendatahub-io:main Sep 11, 2024