
Familiarize with Intel DSS environment #145

Closed
orfeas-k opened this issue Jul 15, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@orfeas-k
Contributor

Why it needs to get done

In order to tackle #144, we'll first need to spend some time familiarizing ourselves with the Intel DSS environment.

What needs to get done

Interact with the Intel DSS environment and document instructions for it.

When is the task considered done

We have familiarized ourselves with the Intel DSS environment and documented how to interact with it.

@orfeas-k orfeas-k added the enhancement New feature or request label Jul 15, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6002.

This message was autogenerated

@misohu
Member

misohu commented Jul 17, 2024

To proceed with the spec for the DSS Intel integration we need to answer the following questions:

  • How to install the GPU operator for Intel hardware on MicroK8s?
  • How to get Jupyter-backed images for PyTorch and TensorFlow?
  • How to support multiple containers on one Intel GPU device?
  • How to support Intel iGPU and dGPU devices at the same time?
  • How to support Intel and NVIDIA workloads at the same time?
  • Where are we going to develop the DSS Intel support?
  • How are we going to run CI for Intel support?

I went through the older PoC guide for setting up Intel support on MicroK8s, and I also went through the new spec we received.

How to install the GPU operator for Intel hardware on MicroK8s?

To install the Intel GPU plugin on MicroK8s we need:

  1. The Node Feature Discovery manifests and rules. These are responsible for labeling the nodes with the required labels and annotations.
  2. The GPU plugin DaemonSet, which installs the plugin on nodes that have the correct labels.

NOTE: The script to generate the YAMLs is here.
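The two-step install above could look roughly like this; a minimal sketch, assuming the kustomize layout of the upstream intel-device-plugins-for-kubernetes repository (the directory paths and pinned release tag are assumptions to verify against upstream), and rendering the manifests locally rather than applying remote URLs directly:

```shell
# Sketch: render the NFD and GPU plugin manifests locally, then apply them
# with MicroK8s. RELEASE is an assumed tag; pin it to a tested upstream release.
RELEASE=v0.30.0
REPO="https://github.com/intel/intel-device-plugins-for-kubernetes"

# 1. Node Feature Discovery plus the Intel NodeFeatureRules (node labeling)
kubectl kustomize "${REPO}/deployments/nfd?ref=${RELEASE}" > nfd.yaml
kubectl kustomize "${REPO}/deployments/nfd/overlays/node-feature-rules?ref=${RELEASE}" > nfd-rules.yaml

# 2. The GPU plugin DaemonSet (runs only on correctly labeled nodes)
kubectl kustomize "${REPO}/deployments/gpu_plugin?ref=${RELEASE}" > gpu-plugin.yaml

microk8s kubectl apply -f nfd.yaml -f nfd-rules.yaml -f gpu-plugin.yaml
```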

How to get Jupyter-backed images for PyTorch and TensorFlow?

Currently we should be using:

  • ITEX - the Intel Extension for TensorFlow - intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
  • IPEX - the Intel Extension for PyTorch - intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter

Both images ship with Jupyter preinstalled. Keep in mind that there is a setting on the pod's side that we need to apply in order to run them correctly in DSS.

How to support multiple containers on one Intel GPU device?

There is a setting in the intel-gpu-plugin which enables sharing the GPU across multiple containers. Without it, only one container can get the device, so we need this setting to assign the GPU to more than one container. Here is the discussion about the setting.
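As far as I can tell, the setting in question is the plugin's -shared-dev-num argument; an illustrative DaemonSet container fragment (the image tag and the value 10 are assumptions, not tested values):

```yaml
# Illustrative fragment of the intel-gpu-plugin DaemonSet container spec.
# -shared-dev-num lets N containers share one GPU device; without it the
# default is 1, so only a single container can claim gpu.intel.com/i915.
containers:
  - name: intel-gpu-plugin
    image: intel/intel-gpu-plugin:0.30.0
    args:
      - "-shared-dev-num=10"
```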

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

How to support Intel and NVIDIA workloads at the same time?

The initial tests in this doc show that it is possible without problems.

Where are we going to develop the DSS Intel support?

Waiting for access to machines with an iGPU and a dGPU. After that I will rerun all the tests from this doc.

How are we going to run CI for Intel support?

This might be a challenge; we might need a way to access an on-demand instance with an iGPU/dGPU for CI testing.

@misohu
Member

misohu commented Jul 19, 2024

Today I got access to the Dell device lab and successfully executed the test cases from this spec. Namely:

  • I was able to deploy DSS with the Intel plugin.
  • I was able to run the IPEX and ITEX images as DSS notebooks without problems.
  • I executed ML workloads for the given IPEX/ITEX ML frameworks.
  • I also tried running multiple notebooks at the same time.

The process to get access to the lab:

  • Install testflinger to run workload jobs in the Intel device lab
  • Set up the TW or US VPN
  • Ask in the testflinger channel to be added to the canonical-vpn-taipei-vpn (here is my thread)
  • Prepare a testflinger job to create the instance, add your Launchpad ID, and run sleep so you can SSH into the machine.
❯ cat dell-precision3470-c30322.yaml
job_queue: dell-precision-3470-c30322
provision_data:
  distro: noble
test_data:
  test_cmds: |
    ssh $DEVICE_IP sudo apt -y install git
reserve_data:
  ssh_keys:
    - lp:michalhucko
  timeout: 43200
  • Execute the job
testflinger submit --poll dell-precision3470-c30322.yaml
  • Wait for the job to be scheduled and the machine to be created. After that you will see SSH instructions on how to access the machine. Take note of the job id (it will be printed to the screen at the end). Be sure to stay on the VPN
  • SSH into machine
  • At the end kill the job to release the resources
# example id
testflinger-cli cancel b525f94b-ab53-4310-83d7-04664c569303

Note: read more about the procedure here.

@misohu
Member

misohu commented Jul 19, 2024

Changes needed for DSS Intel support

  1. Add Intel status to the dss status command.

Right now the dss status command outputs this information

[INFO] MLflow deployment: Ready
[INFO] MLflow URL: http://10.152.183.68:5000
[INFO] GPU acceleration: Disabled

We need to add one more row about the Intel status. The correct way to get the Intel device info is under discussion here. The first idea is to check for Intel GPU labels on the Kubernetes node.
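The node-label idea could be sketched as a shell check; the label key here is an assumption based on the upstream NFD rules and must be verified against what the deployed rules actually set:

```shell
# Sketch of the "check node labels" idea behind an Intel row in `dss status`:
# list nodes that NFD has labeled as having an Intel GPU and report the status.
# The label key intel.feature.node.kubernetes.io/gpu is an assumption.
kubectl get nodes -l "intel.feature.node.kubernetes.io/gpu=true" --no-headers 2>/dev/null \
  | grep -q . \
  && echo "[INFO] Intel GPU acceleration: Enabled" \
  || echo "[INFO] Intel GPU acceleration: Disabled"
```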

  2. Add functionality to create Intel GPU instances with the create command
    In order to create a Kubernetes pod with Intel GPU acceleration enabled we must:
  • Have the Intel GPU operator enabled in the Kubernetes cluster.
  • Have the Kubernetes resources section filled with the gpu.intel.com/i915 section. Example here.
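A minimal sketch of such a resources section (the container name is illustrative; only the gpu.intel.com/i915 line is the essential part):

```yaml
# Illustrative notebook pod fragment requesting one Intel GPU device.
# The gpu.intel.com/i915 resource is advertised by the Intel GPU plugin.
containers:
  - name: notebook
    image: intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
    resources:
      limits:
        gpu.intel.com/i915: 1
```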

After discussing with the team, we decided to drop the --gpu intel argument from the dss create command. If Intel acceleration is enabled (by the user manually deploying the Intel GPU operator), all the notebooks will have the Intel resources section filled automatically, meaning that notebooks with the correct image can use Intel hardware. This is not a problem for images without Intel libraries, as they will not use the resource anyway.

Because of this, dss create should check for the presence of the Intel GPU plugin. If the plugin is there, it will automatically populate the resources section.

Because we are using the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images for Intel ML notebooks, we also need to adjust the command and args section (check the example). We can add these settings globally to all DSS notebook deployments, as the non-Intel ones set these in their Dockerfiles anyway (this I need to test).
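For illustration only, an override along these lines; the exact Jupyter flags are assumptions and must be taken from the referenced example, not from this sketch:

```yaml
# Hypothetical command/args override for the ITEX/IPEX Jupyter images;
# verify the actual invocation against the example in the spec.
command: ["/bin/bash", "-c"]
args:
  - jupyter notebook --ip=0.0.0.0 --no-browser --allow-root
```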

We also need to add the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images as recommendations to dss create --help.

  3. Docs on how to set up the Intel device plugin
    We can use the setup described in this spec. The procedure is to deploy the device plugin manifests, which we now keep in the dss repo here. There is a MicroK8s problem when deploying manifests from a URL. When fixed, we can deploy directly from upstream.

  4. Docs on how to spin up a notebook from Intel with IPEX or ITEX
    After implementing points 1 and 2, users can simply deploy Intel notebooks with the following commands (this is only possible when the device plugin is enabled; otherwise the notebooks will be deployed but the resources will not be available).

dss create my-itex-notebook --image=intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
dss create my-ipex-notebook --image=intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter
  5. Documentation on how to run simple calculations with Intel ML frameworks in DSS
  6. Documentation on supported versions of Intel GPUs
    Regarding this point I need to reach out to the Intel team

@misohu
Member

misohu commented Jul 22, 2024

As part of this task we have opened the following issues:

#146
#147
#148
#149
#150

When designing the spec we need to align on the following open problems:

How are we going to recommend installation of the Intel device plugin?

According to this spec, we need to instruct the user to build the manifests from the upstream repository, as MicroK8s has problems with remote URLs for its customization feature. The aforementioned spec recommends keeping the built manifests in the DSS repository. This is not an ideal solution, as DSS should not be responsible for installing the device plugin.

Should we be specific about Intel GPUs' versions which we support with DSS?

As DSS is not responsible for setting up the plugin, it should not care about the versions of the underlying Intel GPUs. The user should handle the correct plugin installation for the correct GPU device.

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

@mvlassis
Contributor

@misohu Your exploration of the Intel DSS environment has been very thorough, and you have laid out clearly defined tasks to achieve the integration. Great job!

The only thing I find missing is to clearly determine whether Intel iGPU and dGPU devices will be supported simultaneously, before we proceed with the spec.

@misohu
Member

misohu commented Jul 25, 2024

Thanks @mvlassis

The thing is that devices with both Intel iGPUs and dGPUs will be supported; we just cannot specify in the resources section whether the workload should be deployed to the iGPU or the dGPU.

@mvlassis
Contributor

@misohu If that is the case, we should add a note/warning in the DSS documentation for that specific use case.
