no import change notebooks updates #830

Open · wants to merge 8 commits into base: branch-25.02

Conversation

eordentlich (Collaborator)

No description provided.

Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
eordentlich (Collaborator Author)

build

rishic3 (Collaborator) left a comment

Awesome new feature! Minor comments that are mostly stylistic.

```
- Upload the zip file and the initialization script to S3.
or if you wish to run the [no-import-change](../README.md#no-import-change) example notebook to:
```
Collaborator:

Could the scripts be combined and the no-import-change logic triggered via an argument to the script, e.g.

```
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/${INIT_SCRIPT},Args=["no-import-change"]
```
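For illustration, a minimal sketch of how a combined bootstrap script might branch on such an argument; the argument value, setup step, and structure below are hypothetical, not the PR's actual implementation:

```bash
#!/bin/bash
# Hypothetical combined bootstrap script: common setup always runs,
# no-import-change kernel setup only when requested via an argument.
set -euo pipefail

# ... common install steps (cudf, cuml, spark-rapids-ml, etc.) ...

if [[ "${1:-}" == "no-import-change" ]]; then
  # Placeholder for the no-import-change setup, e.g. arranging for
  # "import spark_rapids_ml.install" to run at kernel startup.
  echo "enabling no-import-change UX"
fi
```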

Collaborator Author:

Done.

```bash
sudo /usr/local/bin/pip3.10 install scikit-learn numpy~=1.0

# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
```
Collaborator:

`cuXX-cu12~=$RAPIDS_VERSION` for these installs, and likewise for the other script?
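For example, the pinned form being suggested might look like the following, assuming a `RAPIDS_VERSION` variable is set earlier in the script (the version value here is only illustrative):

```bash
RAPIDS_VERSION=24.12  # illustrative; use the release this repo actually targets

# pin the RAPIDS wheels to a compatible release rather than taking the latest
sudo /usr/local/bin/pip3.10 install --no-cache-dir \
    "cudf-cu12~=${RAPIDS_VERSION}" "cuml-cu12~=${RAPIDS_VERSION}" "cuvs-cu12~=${RAPIDS_VERSION}" \
    --extra-index-url=https://pypi.nvidia.com --verbose
```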

Collaborator Author:

Done.

Collaborator:

Nit - not changed in this PR, but it looks like `"spark.rapids.memory.pinnedPool.size":"2G"` is duplicated in this config.

Collaborator Author:

Deleted dupe.

```
import spark_rapids_ml.install
```
After executing a cell with this command, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, with the above single import statement, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
or by modifying the PySpark/Jupyter launch command above to use a CLI `pyspark-rapids` installed by our `pip` package to start Jupyter with pyspark as follows:
Collaborator:

nit: unindent this block?

Collaborator Author:

Done

@@ -17,34 +17,44 @@ If you already have a Databricks account, you can run the example notebooks on a
```bash
export SAVE_DIR="/path/to/save/artifacts"
databricks fs cp spark_rapids_ml.zip dbfs:${SAVE_DIR}/spark_rapids_ml.zip --profile ${PROFILE}
```
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) init script to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) and [init-pip-cuda-11.8-no-import.sh](init-pip-cuda-11.8-no-import.sh) init scripts to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
Collaborator:

Might be cleaner to pass the `SPARK_RAPIDS_ML_ZIP` path as a cluster environment variable which the init scripts can access, avoiding the file edit.
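A rough sketch of the suggestion, assuming the cluster config exports `SPARK_RAPIDS_ML_ZIP` to the init script; the fallback path is a placeholder, and the actual install step would stay whatever the script already does with the zip:

```bash
# Prefer a cluster-provided environment variable over a hand-edited constant.
# The default below is only a placeholder for illustration.
SPARK_RAPIDS_ML_ZIP="${SPARK_RAPIDS_ML_ZIP:-/dbfs/FileStore/spark_rapids_ml.zip}"

echo "Using spark-rapids-ml zip at: ${SPARK_RAPIDS_ML_ZIP}"
# ... existing unzip/install logic consuming ${SPARK_RAPIDS_ML_ZIP} ...
```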

Collaborator Author:

Just went with `pip install` from PyPI, like the other clusters.

- updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
- downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
- installs various `cuXX` dependencies via pip.
- in the case of `init-pip-cuda-11.8-no-import.sh`, it also modifies a Databricks notebook kernel startup script to enable the no-import-change UX. See the [no-import-change](../README.md#no-import-change) section.
Collaborator:

Similarly might be cleaner if the scripts are combined and no-import-change logic triggered by a cluster environment variable.

Collaborator Author:

Done

```bash
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
```
For Mac
```bash
databricks workspace import --format AUTO --content $(base64 -i init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
databricks workspace import --format AUTO --content $(base64 -i ${INIT_SCRIPT}) ${WS_SAVE_DIR}/${INIT_SCRIPT} --profile ${PROFILE}
```
rishic3 (Collaborator) commented on Jan 25, 2025:

Would just `--file ${INIT_SCRIPT}` work here for both platforms?
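For reference, the suggested form would look roughly like this; newer Databricks CLI releases accept `--file` for `workspace import`, which sidesteps the macOS/Linux `base64` difference, though this should be checked against the CLI version the README assumes:

```bash
# same invocation on macOS and Linux; no base64 encoding of the script needed
databricks workspace import --format AUTO --file ${INIT_SCRIPT} \
    ${WS_SAVE_DIR}/${INIT_SCRIPT} --profile ${PROFILE}
```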

Collaborator Author:

Yes. Apparently that argument wasn't available in earlier CLI versions.

@@ -21,8 +21,16 @@ If you already have a Dataproc account, you can run the example notebooks on a D
```bash
gcloud storage buckets create gs://${GCS_BUCKET}
```
- Upload the initialization scripts to your GCS bucket:
First, set the init script name for running the default notebooks:
Collaborator:

Similarly wondering if the scripts could be combined and no-import-change toggled via metadata.
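A sketch of how a metadata toggle might look; the `no-import-change` metadata key is hypothetical, and the metadata helper shown is the one typically available on Dataproc images:

```bash
# at cluster creation time (hypothetical key name):
#   gcloud dataproc clusters create ... --metadata no-import-change=true

# inside the init script:
NO_IMPORT_CHANGE=$(/usr/share/google/get_metadata_value attributes/no-import-change || echo "false")
if [[ "${NO_IMPORT_CHANGE}" == "true" ]]; then
  echo "enabling no-import-change UX"   # placeholder for the actual setup step
fi
```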

Collaborator Author:

Done.

@@ -194,6 +194,12 @@ and if the app is deployed using `spark-submit` the following included CLI (inst
```bash
spark-rapids-submit --master <master> <other spark submit options> application.py <application options>
```

Collaborator:

nit: maybe we should have a short table of contents at the top of this file? Imagining this no-import section would be lost below all the examples.

Collaborator Author:

Added.


```bash
# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install --no-cache-dir cuml-cu12 cuvs-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
```
Collaborator:

Also noticed that the EMR script does not install Raft/RMM whereas the Databricks/Dataproc scripts do; not sure if this is intentional.

Collaborator Author:

Only cudf, cuml, and cuvs are needed; they pull in these other dependencies, so I removed the explicit installs in all scripts.
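If helpful, this is easy to confirm on an installed environment (a quick check, not something the scripts need):

```bash
# rmm / pylibraft / raft-dask should show up as transitive requirements of cuml
/usr/local/bin/pip3.10 show cuml-cu12 | grep -i requires
```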

Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
eordentlich (Collaborator Author)

Great suggestions. Will incorporate in revision.

"Uncommenting and running the next cell will redirect all subsequent `pyspark.ml` imports (e.g. `pyspark.ml.clustering.Kmeans`) to GPU accelerated `spark_rapids_ml` counterparts if they are supported. Comment out and restart the kernel to revert to `pyspark.ml` CPU imports.\n",
"\n",
"If you are running this notebook in local mode after starting `pyspark` with the `spark-rapids-ml` CLI `pyspark-rapids` or on EMR, Databricks, or Dataproc by following the respective READMEs and have selected the \\*no-import\\* init/bootstrap scripts the cell need not be run as the import was already run as part of the kernel launch.\n",
"Note that in these cases, there is no way to revert to CPU mode without creating a new (CPU) cluster or using the baseline `pyspark` command for local mode."
Collaborator:

Nit: would it be possible to rephrase the command to be simpler in one sentence like "If you are running this notebook with pyspark-rapids, there is no way to revert to GPU mode."

Collaborator Author:

Done

```bash
--extra-index-url=https://pypi.nvidia.com

# set up no-import-change
sed -i /databricks/python_shell/dbruntime/monkey_patches.py -e '1 s/\(.*\)/import spark_rapids_ml.install\n\1/g'
```
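For context, the `sed` call above prepends `import spark_rapids_ml.install` as the new first line of Databricks' notebook kernel startup file, so every notebook session gets the redirect without a user-visible import. One way to sanity-check the effect after the init script runs (same file path as above):

```bash
# confirm the import is now line 1 of the kernel startup script
head -n 1 /databricks/python_shell/dbruntime/monkey_patches.py
# expected: import spark_rapids_ml.install
```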
Collaborator:

Good finding.

lijinf2 (Collaborator) left a comment

Looks good to me!

Signed-off-by: Erik Ordentlich <[email protected]>
eordentlich (Collaborator Author)

build

lijinf2 (Collaborator) commented on Jan 28, 2025:

@eordentlich I have a question: would it be beneficial to move "no-code-change" to an independent folder for better organization and future reference? Since this is an important feature, having a centralized location (or a pointer to one) could be helpful. Do you have any thoughts on how cuDF or cuML direct users to "no-code-change"?
