Commit 43e24e6

Merge pull request #65 from sophongo/en1
add more files
2 parents b82802e + 7d01e93 commit 43e24e6

589 files changed

+11232
-45
lines changed

@@ -0,0 +1,95 @@
1+
---
2+
MTPE: windsonsea
3+
Date: 2024-07-30
4+
---
5+
6+
# Add Job Scheduler
7+
8+
DCE 5.0 AI Lab provides a job scheduler to help you better manage jobs.
9+
In addition to the basic scheduler, it also supports custom schedulers.
10+
11+
## Introduction to Job Scheduler
12+
13+
In Kubernetes, the job scheduler is responsible for deciding which node to assign a Pod to.
14+
It considers various factors such as resource requirements, hardware/software constraints,
15+
affinity/anti-affinity rules, and data locality.
16+
17+
The default scheduler (`kube-scheduler`) is a core component of a Kubernetes cluster.
18+
The sections below describe how it selects a node for a Pod and
19+
how to configure schedulers for AI Lab jobs.
20+
21+
### Scheduler Workflow
22+
23+
The workflow of the default scheduler can be divided into two main phases: filtering and scoring.
24+
25+
#### Filtering Phase
26+
27+
The scheduler traverses all nodes and excludes those that do not meet the Pod's requirements,
28+
considering factors such as:
29+
30+
- Resource requirements
31+
- Node selectors
32+
- Node affinity
33+
- Taints and tolerations
34+
35+
These parameters can be set through advanced configurations when creating a job.
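
As a rough illustration of what these settings look like at the Kubernetes level, the Pod spec below combines the four filter inputs listed above. It is a sketch, not output generated by AI Lab: the Pod name, image, the `gpu-type` label, the zone value, and the `dedicated` taint key are placeholder assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo                # hypothetical name
spec:
  containers:
    - name: trainer
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
      resources:
        requests:                  # nodes without this much allocatable CPU/memory are excluded
          cpu: "2"
          memory: 4Gi
  nodeSelector:                    # node must carry this exact label
    gpu-type: a100                 # hypothetical label
  affinity:
    nodeAffinity:
      # "required" rules are hard constraints, applied while filtering
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"] # hypothetical zone
  tolerations:                     # permits nodes tainted dedicated=training:NoSchedule
    - key: dedicated               # hypothetical taint key
      operator: Equal
      value: training
      effect: NoSchedule
```

A node that fails any one of these checks is dropped before scoring begins.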
36+
37+
<!-- add screenshot later -->
38+
39+
#### Scoring Phase
40+
41+
The scheduler scores the nodes that passed the filtering phase and selects
42+
the highest-scoring node to run the Pod. Factors considered include:
43+
44+
- Resource utilization
45+
- Pod affinity/anti-affinity
46+
- Node affinity
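
Node affinity appears in both phases because Kubernetes distinguishes hard rules (`requiredDuringScheduling...`, used for filtering) from soft rules (`preferredDuringScheduling...`, used for scoring). The fragment below is a minimal sketch of soft rules that only influence the score; the Pod name, image, labels, and weights are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scoring-demo               # hypothetical name
  labels:
    app: trainer                   # hypothetical label
spec:
  containers:
    - name: trainer
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
  affinity:
    nodeAffinity:
      # soft rule: matching nodes receive a higher score but others stay eligible
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80               # valid range is 1-100
          preference:
            matchExpressions:
              - key: disk-type     # hypothetical node label
                operator: In
                values: ["ssd"]
    podAntiAffinity:
      # soft rule: prefer nodes that do not already run a Pod labeled app=trainer
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: trainer
            topologyKey: kubernetes.io/hostname
```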
47+
48+
## Scheduler Plugins
49+
50+
In addition to basic job scheduling, AI Lab supports Scheduler Plugins,
51+
a set of scheduler plugins maintained by Kubernetes SIG Scheduling,
52+
including `Coscheduling (Gang Scheduling)` and other features.
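
For orientation, gang scheduling with the Coscheduling plugin is typically expressed through a `PodGroup` resource that states how many Pods must be schedulable together. The sketch below assumes the `scheduling.x-k8s.io/v1alpha1` API and the `scheduling.x-k8s.io/pod-group` Pod label used by recent scheduler-plugins releases; check both against the version you deployed. All names and values are hypothetical.

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-gang              # hypothetical group name
spec:
  minMember: 4                     # schedule nothing unless 4 Pods can be placed together
  scheduleTimeoutSeconds: 60       # illustrative timeout
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                   # hypothetical name
  labels:
    scheduling.x-k8s.io/pod-group: training-gang   # ties the Pod to the PodGroup (assumed label key)
spec:
  schedulerName: scheduler-plugins-scheduler       # assumed name of the secondary scheduler
  containers:
    - name: worker
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
```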
53+
54+
### Deploy Scheduler Plugins
55+
56+
To deploy a secondary scheduler plugin in a worker cluster, refer to
57+
[Deploying Secondary Scheduler Plugin](../../kpanda/user-guide/clusters/cluster-scheduler-plugin.md).
58+
59+
### Enable Scheduler Plugins in AI Lab
60+
61+
!!! danger
62+
63+
    Improper operations when adding scheduler plugins may affect the stability of the entire cluster.
64+
    Test changes in a test environment first, or contact our technical support team.
65+
66+
To use additional scheduler plugins in training jobs, first install them manually
67+
in the worker cluster. Then, when deploying `baize-agent` in that cluster,
68+
add the corresponding scheduler plugin configuration.
69+
70+
In the container management UI, you can use **Helm Apps**
71+
to deploy scheduler plugins in the cluster.
72+
73+
<!-- add screenshot later -->
74+
75+
Then, click **Install** in the upper-right corner.
76+
(If `baize-agent` has already been deployed, update it from the Helm Apps list instead.)
77+
Then add the scheduler to the configuration.
78+
79+
<!-- add screenshot later -->
80+
81+
Pay attention to the parameter hierarchy of the scheduler configuration. After adding it, click **OK**.
82+
83+
> Note: Do not omit this configuration when updating the `baize-agent` in the future.
84+
85+
## Specify Scheduler When Creating a Job
86+
87+
Once you have successfully deployed the corresponding scheduler plugin in the cluster and
88+
correctly added the corresponding scheduler configuration in the `baize-agent`,
89+
you can specify the scheduler when creating a job.
90+
91+
If everything is set up correctly, you will see the scheduler plugin you deployed in the scheduler dropdown menu.
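
At the Kubernetes level, choosing a scheduler corresponds to the `schedulerName` field of the Pod spec; Pods that omit it are handled by `default-scheduler`. The fragment below is a minimal sketch of that field only. How AI Lab sets it on the Pods of a job is not described here, and the scheduler name, Pod name, and image are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-demo      # hypothetical name
spec:
  # Must match the name the secondary scheduler registers under (assumed value).
  schedulerName: scheduler-plugins-scheduler
  containers:
    - name: trainer
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
```

If no running scheduler has the given name, the Pod stays in `Pending`, which is a quick way to spot a mismatch between the job setting and the deployed plugin.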
92+
93+
<!-- add screenshot later -->
94+
95+
This concludes the instructions for configuring and using the scheduler options in AI Lab.
@@ -0,0 +1,201 @@
1+
# Update Built-in Notebook Images
2+
3+
By default, the Notebook offers several base images for developers to choose from.
4+
In most cases, these meet developers' needs.
5+
6+
![Creating notebook interface](../images/notebook-images.png)
7+
8+
DaoCloud provides a default Notebook image that contains all necessary development tools and resources.
9+
10+
```markdown
baize/baize-notebook
```
13+
14+
This Notebook includes basic development tools. Taking `baize-notebook:v0.5.0` (May 30, 2024) as an example, the relevant dependencies and versions are as follows:
15+
16+
| Dependency    | Version  | Description                                                      |
| ------------- | -------- | ---------------------------------------------------------------- |
| Ubuntu        | 22.04.3  | Default OS                                                       |
| Python        | 3.11.6   | Default Python version                                           |
| pip           | 23.3.1   |                                                                  |
| conda (mamba) | 23.3.1   |                                                                  |
| jupyterlab    | 3.6.6    | JupyterLab, providing a complete Notebook experience             |
| codeserver    | v4.89.1  | Widely used code editor, for a familiar development experience   |
| *baizectl     | v0.5.0   | DaoCloud built-in CLI task management tool                       |
| *SSH          | -        | Supports direct SSH access from a local machine to the Notebook container |
| *kubectl      | v1.27    | Kubernetes CLI for managing container resources within Notebook  |
27+
28+
!!! note
29+
30+
    DCE 5.0 maintains and updates this image with each version iteration.
31+
32+
However, sometimes users may need custom images. This page explains how to update images and add them to the Notebook creation interface for selection.
33+
34+
## Build Custom Images (For Reference Only)
35+
36+
!!! note
37+
38+
    Building a new image **requires using `baize-notebook` as the base image** to ensure the Notebook runs properly.
39+
40+
Before building a custom image, review the Dockerfile of
41+
the `baize-notebook` image to see how the base image is put together.
42+
43+
### Dockerfile for baize-notebook
44+
45+
```dockerfile
ARG BASE_IMG=docker.m.daocloud.io/kubeflownotebookswg/jupyter:v1.8.0

FROM $BASE_IMG

USER root

# install - useful linux packages
RUN export DEBIAN_FRONTEND=noninteractive \
    && apt-get -yq update \
    && apt-get -yq install --no-install-recommends \
        openssh-server git git-lfs bash-completion \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# remove default s6 jupyterlab run script
RUN rm -rf /etc/services.d/jupyterlab

# install - useful jupyter plugins
RUN mamba install -n base -y jupyterlab-language-pack-zh-cn \
    && mamba clean --all -y

ARG CODESERVER_VERSION=4.89.1
ARG TARGETARCH

RUN curl -fsSL "https://github.com/coder/code-server/releases/download/v$CODESERVER_VERSION/code-server_${CODESERVER_VERSION}_$TARGETARCH.deb" -o /tmp/code-server.deb \
    && dpkg -i /tmp/code-server.deb \
    && rm -f /tmp/code-server.deb

ARG CODESERVER_PYTHON_VERSION=2024.4.1
ARG CODESERVER_JUPYTER_VERSION=2024.3.1
ARG CODESERVER_LANGUAGE_PACK_ZH_CN=1.89.0
ARG CODESERVER_YAML=1.14.0
ARG CODESERVER_DOTENV=1.0.1
ARG CODESERVER_EDITORCONFIG=0.16.6
ARG CODESERVER_TOML=0.19.1
ARG CODESERVER_GITLENS=15.0.4

# configure for code-server extensions
# # https://github.com/kubeflow/kubeflow/blob/709254159986d2cc99e675d0fad5a128ddeb0917/components/example-notebook-servers/codeserver-python/Dockerfile
# # and
# # https://github.com/kubeflow/kubeflow/blob/709254159986d2cc99e675d0fad5a128ddeb0917/components/example-notebook-servers/codeserver/Dockerfile
RUN code-server --list-extensions --show-versions \
    && code-server --list-extensions --show-versions \
    && code-server \
        --install-extension MS-CEINTL.vscode-language-pack-zh-hans@$CODESERVER_LANGUAGE_PACK_ZH_CN \
        --install-extension ms-python.python@$CODESERVER_PYTHON_VERSION \
        --install-extension ms-toolsai.jupyter@$CODESERVER_JUPYTER_VERSION \
        --install-extension redhat.vscode-yaml@$CODESERVER_YAML \
        --install-extension mikestead.dotenv@$CODESERVER_DOTENV \
        --install-extension EditorConfig.EditorConfig@$CODESERVER_EDITORCONFIG \
        --install-extension tamasfe.even-better-toml@$CODESERVER_TOML \
        --install-extension eamodio.gitlens@$CODESERVER_GITLENS \
        --install-extension catppuccin.catppuccin-vsc-pack \
        --force \
    && code-server --list-extensions --show-versions

# configure for code-server
RUN mkdir -p /home/${NB_USER}/.local/share/code-server/User \
    && chown -R ${NB_USER}:users /home/${NB_USER} \
    && cat <<EOF > /home/${NB_USER}/.local/share/code-server/User/settings.json
{
  "gitlens.showWelcomeOnInstall": false,
  "workbench.colorTheme": "Catppuccin Mocha",
}
EOF

RUN mkdir -p /tmp_home/${NB_USER}/.local/share \
    && mv /home/${NB_USER}/.local/share/code-server /tmp_home/${NB_USER}/.local/share

# set ssh configuration
RUN mkdir -p /run/sshd \
    && chown -R ${NB_USER}:users /etc/ssh \
    && chown -R ${NB_USER}:users /run/sshd \
    && sed -i "/#\?Port/s/^.*$/Port 2222/g" /etc/ssh/sshd_config \
    && sed -i "/#\?PasswordAuthentication/s/^.*$/PasswordAuthentication no/g" /etc/ssh/sshd_config \
    && sed -i "/#\?PubkeyAuthentication/s/^.*$/PubkeyAuthentication yes/g" /etc/ssh/sshd_config \
    && rclone_version=v1.65.0 && \
    arch=$(uname -m | sed -E 's/x86_64/amd64/g;s/aarch64/arm64/g') && \
    filename=rclone-${rclone_version}-linux-${arch} && \
    curl -fsSL https://github.com/rclone/rclone/releases/download/${rclone_version}/${filename}.zip -o ${filename}.zip && \
    unzip ${filename}.zip && mv ${filename}/rclone /usr/local/bin && rm -rf ${filename} ${filename}.zip

# Init mamba
RUN mamba init --system

# init baize-base environment for essential python packages
RUN mamba create -n baize-base -y python \
    && /opt/conda/envs/baize-base/bin/pip install tensorboard \
    && mamba clean --all -y \
    && ln -s /opt/conda/envs/baize-base/bin/tensorboard /usr/local/bin/tensorboard

# prepare baize-runtime-env directory
RUN mkdir -p /opt/baize-runtime-env \
    && chown -R ${NB_USER}:users /opt/baize-runtime-env

ARG APP
ARG PROD_NAME
ARG TARGETOS

COPY out/$TARGETOS/$TARGETARCH/data-loader /usr/local/bin/
COPY out/$TARGETOS/$TARGETARCH/baizectl /usr/local/bin/

RUN chmod +x /usr/local/bin/baizectl /usr/local/bin/data-loader && \
    echo "source /etc/bash_completion" >> /opt/conda/etc/profile.d/conda.sh && \
    echo "source <(baizectl completion bash)" >> /opt/conda/etc/profile.d/conda.sh && \
    echo "source <(kubectl completion bash)" >> /opt/conda/etc/profile.d/conda.sh && \
    echo '[ -f /run/baize-env ] && export $(cat /run/baize-env | xargs)' >> /opt/conda/etc/profile.d/conda.sh && \
    echo 'alias conda="mamba"' >> /opt/conda/etc/profile.d/conda.sh

USER ${NB_UID}
```
157+
158+
### Build Your Image
159+
160+
```dockerfile
ARG BASE_IMG=release.daocloud.io/baize/baize-notebook:v0.5.0

FROM $BASE_IMG
USER root

# Do Customization
RUN mamba install -n baize-base -y pytorch torchvision torchaudio cpuonly -c pytorch \
    && mamba install -n baize-base -y tensorflow \
    && mamba clean --all -y

USER ${NB_UID}
```
173+
174+
## Add to the Notebook Image List (Helm)
175+
176+
!!! warning
177+
178+
    This operation must be performed by a platform administrator. Make changes with caution.
179+
180+
Currently, the image selector needs to be modified by updating the `Helm` parameters of `baize`. The specific steps are as follows:
181+
182+
In the `Helm Applications` list of the kpanda-global-cluster global management cluster,
183+
find baize, enter the update page, and modify the Notebook image in the `YAML` parameters:
184+
185+
![Update Baize](../images/update-baize.png)
186+
187+
Note the parameter modification path `global.config.notebook_images`:
188+
189+
```yaml
...
global:
  ...
  config:
    notebook_images:
      ...
      names: release.daocloud.io/baize/baize-notebook:v0.5.0
      # Add your image information here
```
199+
200+
After the update completes and the Helm application restarts successfully,
201+
the new image appears in the image selector of the Notebook creation interface.
