---
layout: docs
title: Working with data science pipelines
permalink: /docs/working-with-data-science-pipelines
custom_css: asciidoc.css
---
As a data scientist, you can enhance your data science projects on {productname-short} by building portable machine learning (ML) workflows with data science pipelines, using Docker containers. This enables you to standardize and automate machine learning workflows so that you can develop and deploy your data science models.
For example, a machine learning workflow might include steps such as data extraction, data processing, feature extraction, model training, model validation, and model serving. Automating these activities enables your organization to develop a continuous process of retraining and updating a model based on newly received data. This can help resolve challenges related to building an integrated machine learning deployment and continuously operating it in production.
You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information, see Working with pipelines in JupyterLab.
Starting with {productname-long} version 2.10.0, data science pipelines are based on Kubeflow Pipelines (KFP) 2.0. For more information, see Migrating to data science pipelines 2.0.
To use a data science pipeline in {productname-short}, you need the following components:
- Pipeline server: A server that is attached to your data science project and hosts your data science pipeline.
- Pipeline: A pipeline defines the configuration of your machine learning workflow and the relationship between each component in the workflow.
  - Pipeline code: A definition of your pipeline in a YAML file (see the sketch after this list).
  - Pipeline graph: A graphical illustration of the steps executed in a pipeline run and the relationship between them.
- Pipeline experiment: A workspace where you can try different configurations of your pipelines. You can use experiments to organize your runs into logical groups.
  - Archived pipeline experiment: An archived pipeline experiment.
  - Pipeline artifact: An output artifact produced by a pipeline component.
  - Pipeline execution: The execution of a task in a pipeline.
- Pipeline run: An execution of your pipeline.
  - Active run: A pipeline run that is executing, or stopped.
  - Scheduled run: A pipeline run that is scheduled to execute at least once.
  - Archived run: An archived pipeline run.
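To make these concepts concrete, the following is a minimal sketch of pipeline code written with the Kubeflow Pipelines (KFP) 2.0 SDK and compiled into a YAML file. The component names, base image, and `learning_rate` parameter are illustrative assumptions, not values defined by {productname-short}.

```python
# Minimal sketch, assuming the KFP 2.0 SDK is installed (pip install kfp).
# All names (components, pipeline, parameter) are illustrative placeholders.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def train_model(learning_rate: float) -> str:
    # Placeholder training step; a real component would train and persist a model.
    return f"trained with learning_rate={learning_rate}"


@dsl.component(base_image="python:3.11")
def validate_model(model_info: str):
    # Placeholder validation step that consumes the training step's output.
    print(f"validating: {model_info}")


@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    # Passing the output of train_model into validate_model defines the
    # relationship between the two steps in the pipeline graph.
    train_task = train_model(learning_rate=learning_rate)
    validate_model(model_info=train_task.output)


if __name__ == "__main__":
    # Compile the pipeline definition into an IR YAML file.
    compiler.Compiler().compile(
        pipeline_func=training_pipeline,
        package_path="training_pipeline.yaml",
    )
```

The resulting `training_pipeline.yaml` file is the pipeline code that you can import into your data science project.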
This feature is based on Kubeflow Pipelines 2.0. Use the latest Kubeflow Pipelines 2.0 SDK to build your data science pipeline in Python code. After you have built your pipeline, use the SDK to compile it into an Intermediate Representation (IR) YAML file. The {productname-short} user interface enables you to track and manage pipelines, experiments, and pipeline runs. To view a record of previously executed, scheduled, and archived runs, you must first select the experiment from the Experiments → Experiments and Runs page in the {productname-short} interface. After selecting the experiment, you can access all of its pipeline runs from the Runs page.
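You can also work with experiments and runs programmatically through the KFP SDK client, as in the following sketch. The pipeline server URL, authentication token, compiled file name, and the experiment and run names are assumptions for your environment, not values defined by {productname-short}.

```python
# Hedged sketch: submitting a compiled pipeline to a pipeline server with the
# KFP 2.0 SDK client. The URL, token, and file name below are placeholders.
from kfp import Client

client = Client(
    host="https://<your-pipeline-server-route>",   # placeholder route
    existing_token="<your-authentication-token>",  # placeholder token
)

# Start a run from the compiled IR YAML file, grouped under an experiment.
result = client.create_run_from_pipeline_package(
    "training_pipeline.yaml",
    arguments={"learning_rate": 0.01},
    run_name="example-run",
    experiment_name="example-experiment",
)
print(f"Submitted run: {result.run_id}")
```

Because the client talks to the same pipeline server, runs submitted this way are also tracked in the {productname-short} interface.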
You can store your pipeline artifacts in an S3-compatible object storage bucket so that you do not consume local storage. To do this, you must first configure write access to your S3 bucket on your storage account.
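As a quick way to confirm that write access is configured, you can attempt a small test upload with an S3 client, as in the following sketch. The endpoint, credentials, and bucket name are placeholders for your environment.

```python
# Hedged sketch: confirming write access to an S3-compatible bucket used for
# pipeline artifacts. Endpoint, credentials, and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<s3-compatible-endpoint>",  # placeholder
    aws_access_key_id="<access-key>",                 # placeholder
    aws_secret_access_key="<secret-key>",             # placeholder
)

bucket = "<artifact-bucket>"  # placeholder bucket name

# Write and then delete a small test object; both calls succeed only if the
# credentials grant write access to the bucket.
s3.put_object(Bucket=bucket, Key="write-access-check.txt", Body=b"ok")
s3.delete_object(Bucket=bucket, Key="write-access-check.txt")
print("Write access confirmed")
```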