
Familiarize with Intel DSS environment #145

Closed
orfeas-k opened this issue Jul 15, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@orfeas-k
Contributor

Why it needs to get done

In order to tackle #144, we'll first need to spend some time familiarizing ourselves with the Intel DSS environment.

What needs to get done

Interact with the Intel DSS environment and document instructions for it.

When is the task considered done

We have familiarized ourselves with the Intel DSS environment and documented how to interact with it.

@orfeas-k orfeas-k added the enhancement New feature or request label Jul 15, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6002.

This message was autogenerated

@misohu
Member

misohu commented Jul 17, 2024

To proceed with the spec for the DSS Intel integration we need to answer the following questions:

  • How to install the GPU operator for Intel hardware on MicroK8s?
  • How to get Jupyter-backed images for PyTorch and TensorFlow?
  • How to support multiple containers on one Intel GPU device?
  • How to support Intel iGPU and dGPU devices at the same time?
  • How to support Intel and NVIDIA workloads at the same time?
  • Where are we going to develop the DSS Intel support?
  • How are we going to run CI for Intel support?

I went through the older PoC guide for setting up Intel support on MicroK8s, and I also went through the new spec we received.

How to install the GPU operator for Intel hardware on MicroK8s?

To install the Intel GPU plugin on MicroK8s we need:

  1. The Node Feature Discovery manifests and rules. These are responsible for labeling the nodes with the required labels and annotations.
  2. The GPU plugin DaemonSet, which installs the plugin on nodes that have the correct labels.

NOTE: The script to generate the YAMLs is here.
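The two-step install above could look roughly like this; a minimal sketch, assuming the kustomize layout of the upstream intel-device-plugins-for-kubernetes repository (the directory paths and pinned release tag are assumptions to verify against upstream), and rendering the manifests locally rather than applying remote URLs directly:

```shell
# Sketch: render the NFD and GPU plugin manifests locally, then apply them
# with MicroK8s. RELEASE is an assumed tag; pin it to a tested upstream release.
RELEASE=v0.30.0
REPO="https://github.com/intel/intel-device-plugins-for-kubernetes"

# 1. Node Feature Discovery plus the Intel NodeFeatureRules (node labeling)
kubectl kustomize "${REPO}/deployments/nfd?ref=${RELEASE}" > nfd.yaml
kubectl kustomize "${REPO}/deployments/nfd/overlays/node-feature-rules?ref=${RELEASE}" > nfd-rules.yaml

# 2. The GPU plugin DaemonSet (runs only on correctly labeled nodes)
kubectl kustomize "${REPO}/deployments/gpu_plugin?ref=${RELEASE}" > gpu-plugin.yaml

microk8s kubectl apply -f nfd.yaml -f nfd-rules.yaml -f gpu-plugin.yaml
```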

How to get Jupyter-backed images for PyTorch and TensorFlow?

Currently we should be using:

  • ITEX - the Intel Extension for TensorFlow - intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
  • IPEX - the Intel Extension for PyTorch - intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter

Both images ship with Jupyter preinstalled. Keep in mind that there is a setting on the pod's side that we need to apply in order to run them correctly in DSS.

How to support multiple containers on one Intel GPU device?

There is a setting in the intel-gpu-plugin which enables sharing the GPU across multiple containers. Without it, only one container can get the device, so we need this setting to assign the GPU to more than one container. Here is the discussion about the setting.
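As far as I can tell, the setting in question is the plugin's -shared-dev-num argument; an illustrative DaemonSet container fragment (the image tag and the value 10 are assumptions, not tested values):

```yaml
# Illustrative fragment of the intel-gpu-plugin DaemonSet container spec.
# -shared-dev-num lets N containers share one GPU device; without it the
# default is 1, so only a single container can claim gpu.intel.com/i915.
containers:
  - name: intel-gpu-plugin
    image: intel/intel-gpu-plugin:0.30.0
    args:
      - "-shared-dev-num=10"
```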

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

How to support Intel and NVIDIA workloads at the same time?

The initial tests in this doc show that it is possible without problems.

Where are we going to develop the DSS Intel support?

Waiting for access to machines with an iGPU and a dGPU. After that I will rerun all the tests from this doc.

How are we going to run CI for Intel support?

This might be a challenge; we might need a way to access an on-demand instance with an iGPU/dGPU for CI testing.

@misohu
Member

misohu commented Jul 19, 2024

Today I got access to the Dell device lab and successfully executed the test cases from this spec. Namely:

  • I was able to deploy DSS with the Intel plugin.
  • I was able to run the IPEX and ITEX images as DSS notebooks without problems.
  • I executed ML workloads for the given IPEX/ITEX ML frameworks.
  • I also tried running multiple notebooks at the same time.

The process to get access to the lab:

  • Install testflinger to run workload jobs in the Intel device lab
  • Set up the TW or US VPN
  • Ask in the testflinger channel to be added to the canonical-vpn-taipei-vpn (here is my thread)
  • Prepare a testflinger job to create the instance, add your Launchpad ID, and run sleep so you can SSH into the machine.
❯ cat dell-precision3470-c30322.yaml
job_queue: dell-precision-3470-c30322
provision_data:
  distro: noble
test_data:
  test_cmds: |
    ssh $DEVICE_IP sudo apt -y install git
reserve_data:
  ssh_keys:
    - lp:michalhucko
  timeout: 43200
  • Execute the job
testflinger submit --poll dell-precision3470-c30322.yaml
  • Wait for the job to be scheduled and the machine to be created. After that you will see SSH instructions on how to access the machine. Take note of the job id (it will be printed to the screen at the end). Be sure to stay on the VPN
  • SSH into machine
  • At the end kill the job to release the resources
# example id
testflinger-cli cancel b525f94b-ab53-4310-83d7-04664c569303

Note: read more about the procedure here.

@misohu
Member

misohu commented Jul 19, 2024

Changes needed for DSS Intel support

  1. Add Intel status to the dss status command.

Right now the dss status command outputs this information

[INFO] MLflow deployment: Ready
[INFO] MLflow URL: http://10.152.183.68:5000
[INFO] GPU acceleration: Disabled

We need to add one more row about the Intel status. The correct way to get the Intel device info is under discussion here. The first idea is to check for Intel GPU labels on the Kubernetes node.
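The node-label idea could be sketched as a shell check; the label key here is an assumption based on the upstream NFD rules and must be verified against what the deployed rules actually set:

```shell
# Sketch of the "check node labels" idea behind an Intel row in `dss status`:
# list nodes that NFD has labeled as having an Intel GPU and report the status.
# The label key intel.feature.node.kubernetes.io/gpu is an assumption.
kubectl get nodes -l "intel.feature.node.kubernetes.io/gpu=true" --no-headers 2>/dev/null \
  | grep -q . \
  && echo "[INFO] Intel GPU acceleration: Enabled" \
  || echo "[INFO] Intel GPU acceleration: Disabled"
```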

  2. Add functionality to create Intel GPU instances with the create command
    In order to create a Kubernetes pod with Intel GPU acceleration enabled we must:
  • Have the Intel GPU operator enabled in the Kubernetes cluster.
  • Have the Kubernetes resources section filled with the gpu.intel.com/i915 section. Example here.
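A minimal sketch of such a resources section (the container name is illustrative; only the gpu.intel.com/i915 line is the essential part):

```yaml
# Illustrative notebook pod fragment requesting one Intel GPU device.
# The gpu.intel.com/i915 resource is advertised by the Intel GPU plugin.
containers:
  - name: notebook
    image: intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
    resources:
      limits:
        gpu.intel.com/i915: 1
```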

After discussing with the team, we decided to drop the --gpu intel argument from the dss create command. If Intel acceleration is enabled (by the user manually deploying the Intel GPU operator), all the notebooks will have the Intel resources section filled automatically, meaning that notebooks with the correct image can use Intel hardware. This is not a problem for images without Intel libraries, as they will not use the resource anyway.

Because of this, dss create should check for the presence of the Intel GPU plugin. If the plugin is there, it will automatically populate the resources section.

Because we are using the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images for Intel ML notebooks, we also need to adjust the command and args section (check the example). We can add these settings globally to all DSS notebook deployments, as the non-Intel ones set these in their Dockerfiles anyway (this I need to test).
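For illustration only, an override along these lines; the exact Jupyter flags are assumptions and must be taken from the referenced example, not from this sketch:

```yaml
# Hypothetical command/args override for the ITEX/IPEX Jupyter images;
# verify the actual invocation against the example in the spec.
command: ["/bin/bash", "-c"]
args:
  - jupyter notebook --ip=0.0.0.0 --no-browser --allow-root
```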

We also need to add the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images as recommendations to dss create --help.

  3. Docs on how to set up the Intel device plugin
    We can use the setup described in this spec. The procedure is to deploy the device plugin manifests, which we now keep in the dss repo here. There is a MicroK8s problem when deploying manifests from a URL. When fixed, we can deploy directly from upstream.

  4. Docs on how to spin up a notebook from Intel with IPEX or ITEX
    After implementing points 1 and 2, users can simply deploy Intel notebooks with the following commands (this is only possible when the device plugin is enabled; otherwise the notebooks will be deployed but the resources will not be available).

dss create my-itex-notebook --image=intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
dss create my-ipex-notebook --image=intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter
  5. Documentation on how to run simple calculations with Intel ML frameworks in DSS
  6. Documentation on supported versions of Intel GPUs
    Regarding this point I need to reach out to the Intel team

@misohu
Member

misohu commented Jul 22, 2024

As part of this task we have opened the following issues:

#146
#147
#148
#149
#150

When designing the spec we need to align on the following open problems:

How are we going to recommend installation of the Intel device plugin?

According to this spec, we need to instruct the user to build the manifests from the upstream repository, as MicroK8s has problems with remote URLs for its customization feature. The aforementioned spec recommends keeping the built manifests in the DSS repository. This is not an ideal solution, as DSS should not be responsible for installing the device plugin.

Should we be specific about Intel GPUs' versions which we support with DSS?

As DSS is not responsible for setting up the plugin, it should not care about the versions of the underlying Intel GPUs. The user should handle the correct plugin installation for the correct GPU device.

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

@mvlassis
Contributor

@misohu Your exploration of the Intel DSS environment has been very thorough, and you have laid out clearly defined tasks to achieve the integration. Great job!

The only thing I find missing is to clearly determine whether Intel iGPU and dGPU devices will be supported simultaneously, before we proceed with the spec.

@misohu
Member

misohu commented Jul 25, 2024

Thanks @mvlassis

The thing is that devices with both Intel iGPUs and dGPUs will be supported; we just cannot specify in the resources section whether the workload should be deployed to the iGPU or the dGPU.

@mvlassis
Contributor

@misohu If that is the case, we should add a note/warning in the DSS documentation for that specific use case.
