no import change notebooks updates #830

Open · wants to merge 8 commits into base: branch-25.02

Conversation

eordentlich (Collaborator)

No description provided.

Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
eordentlich (Collaborator Author)

build

rishic3 (Collaborator) left a comment

Awesome new feature! Minor comments that are mostly stylistic.

```
- Upload the zip file and the initialization script to S3.
or if you wish to run the [no-import-change](../README.md#no-import-change) example notebook to:
```
Collaborator:

Could the scripts be combined and the no-import-change logic triggered via an argument to the script, e.g.

```
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/${INIT_SCRIPT},Args=["no-import-change"]
```
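For illustration, a minimal sketch of how a combined bootstrap script might branch on such an argument; the argument value, setup step, and structure below are hypothetical, not the PR's actual implementation:

```bash
#!/bin/bash
# Hypothetical combined bootstrap script: common setup always runs,
# no-import-change kernel setup only when requested via an argument.
set -euo pipefail

# ... common install steps (cudf, cuml, spark-rapids-ml, etc.) ...

if [[ "${1:-}" == "no-import-change" ]]; then
  # Placeholder for the no-import-change setup, e.g. arranging for
  # "import spark_rapids_ml.install" to run at kernel startup.
  echo "enabling no-import-change UX"
fi
```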

Collaborator Author:

Done.

```bash
sudo /usr/local/bin/pip3.10 install scikit-learn numpy~=1.0

# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
```
Collaborator:

`cuXX-cu12~=$RAPIDS_VERSION` for these installs, and likewise for the other script?
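For example, the pinned form being suggested might look like the following, assuming a `RAPIDS_VERSION` variable is set earlier in the script (the version value here is only illustrative):

```bash
RAPIDS_VERSION=24.12  # illustrative; use the release this repo actually targets

# pin the RAPIDS wheels to a compatible release rather than taking the latest
sudo /usr/local/bin/pip3.10 install --no-cache-dir \
    "cudf-cu12~=${RAPIDS_VERSION}" "cuml-cu12~=${RAPIDS_VERSION}" "cuvs-cu12~=${RAPIDS_VERSION}" \
    --extra-index-url=https://pypi.nvidia.com --verbose
```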

Collaborator Author:

Done.

Collaborator:

Nit - not changed in this PR, but it looks like `"spark.rapids.memory.pinnedPool.size":"2G"` is duplicated in this config.

Collaborator Author:

Deleted dupe.

```
import spark_rapids_ml.install
```
After executing a cell with this command, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, with the above single import statement, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statement or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
or by modifying the PySpark/Jupyter launch command above to use a CLI `pyspark-rapids` installed by our `pip` package to start Jupyter with pyspark as follows:
Collaborator:

nit: unindent this block?

Collaborator Author:

Done

@@ -17,34 +17,44 @@ If you already have a Databricks account, you can run the example notebooks on a
```bash
export SAVE_DIR="/path/to/save/artifacts"
databricks fs cp spark_rapids_ml.zip dbfs:${SAVE_DIR}/spark_rapids_ml.zip --profile ${PROFILE}
```
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) init script to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) and [init-pip-cuda-11.8-no-import.sh](init-pip-cuda-11.8-no-import.sh) init scripts to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
Collaborator:

Might be cleaner to pass the `SPARK_RAPIDS_ML_ZIP` path as a cluster environment variable which the init scripts can access, avoiding the file edit.
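A rough sketch of the suggestion, assuming the cluster config exports `SPARK_RAPIDS_ML_ZIP` to the init script; the fallback path is a placeholder, and the actual install step would stay whatever the script already does with the zip:

```bash
# Prefer a cluster-provided environment variable over a hand-edited constant.
# The default below is only a placeholder for illustration.
SPARK_RAPIDS_ML_ZIP="${SPARK_RAPIDS_ML_ZIP:-/dbfs/FileStore/spark_rapids_ml.zip}"

echo "Using spark-rapids-ml zip at: ${SPARK_RAPIDS_ML_ZIP}"
# ... existing unzip/install logic consuming ${SPARK_RAPIDS_ML_ZIP} ...
```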

Collaborator Author:

Just went with `pip install` from PyPI, like the other clusters.

- updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
- downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
- installs various `cuXX` dependencies via pip.
- in the case of `init-pip-cuda-11.8-no-import.sh`, it also modifies a Databricks notebook kernel startup script to enable the no-import-change UX. See the [no-import-change](../README.md#no-import-change) section.
Collaborator:

Similarly might be cleaner if the scripts are combined and no-import-change logic triggered by a cluster environment variable.

Collaborator Author:

Done

```bash
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
```
For Mac
```bash
databricks workspace import --format AUTO --content $(base64 -i init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
databricks workspace import --format AUTO --content $(base64 -i ${INIT_SCRIPT}) ${WS_SAVE_DIR}/${INIT_SCRIPT} --profile ${PROFILE}
```
rishic3 (Collaborator) commented on Jan 25, 2025:

Would just `--file ${INIT_SCRIPT}` work here for both platforms?
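For reference, the suggested form would look roughly like this; newer Databricks CLI releases accept `--file` for `workspace import`, which sidesteps the macOS/Linux `base64` difference, though this should be checked against the CLI version the README assumes:

```bash
# same invocation on macOS and Linux; no base64 encoding of the script needed
databricks workspace import --format AUTO --file ${INIT_SCRIPT} \
    ${WS_SAVE_DIR}/${INIT_SCRIPT} --profile ${PROFILE}
```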

Collaborator Author:

Yes. Apparently that argument wasn't available in earlier CLI versions.

@@ -21,8 +21,16 @@ If you already have a Dataproc account, you can run the example notebooks on a D
```bash
gcloud storage buckets create gs://${GCS_BUCKET}
```
- Upload the initialization scripts to your GCS bucket:
First, set the init script name for running the default notebooks:
Collaborator:

Similarly wondering if the scripts could be combined and no-import-change toggled via metadata.
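A sketch of how a metadata toggle might look; the `no-import-change` metadata key is hypothetical, and the metadata helper shown is the one typically available on Dataproc images:

```bash
# at cluster creation time (hypothetical key name):
#   gcloud dataproc clusters create ... --metadata no-import-change=true

# inside the init script:
NO_IMPORT_CHANGE=$(/usr/share/google/get_metadata_value attributes/no-import-change || echo "false")
if [[ "${NO_IMPORT_CHANGE}" == "true" ]]; then
  echo "enabling no-import-change UX"   # placeholder for the actual setup step
fi
```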

Collaborator Author:

Done.

@@ -194,6 +194,12 @@ and if the app is deployed using `spark-submit` the following included CLI (inst
```bash
spark-rapids-submit --master <master> <other spark submit options> application.py <application options>
```

Collaborator:

nit: maybe we should have a short table of contents at the top of this file? Imagining this no-import section would be lost below all the examples.

Collaborator Author:

Added.


```bash
# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install --no-cache-dir cuml-cu12 cuvs-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
```
Collaborator:

Also noticed that the EMR script does not install Raft/RMM whereas the Databricks/Dataproc scripts do; not sure if this is intentional.

Collaborator Author:

Only cudf, cuml, and cuvs are needed; they pull in these other dependencies, so I removed the explicit installs in all scripts.
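If helpful, this is easy to confirm on an installed environment (a quick check, not something the scripts need):

```bash
# rmm / pylibraft / raft-dask should show up as transitive requirements of cuml
/usr/local/bin/pip3.10 show cuml-cu12 | grep -i requires
```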

Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
eordentlich (Collaborator Author)

Great suggestions. Will incorporate in revision.

"Uncommenting and running the next cell will redirect all subsequent `pyspark.ml` imports (e.g. `pyspark.ml.clustering.Kmeans`) to GPU accelerated `spark_rapids_ml` counterparts if they are supported. Comment out and restart the kernel to revert to `pyspark.ml` CPU imports.\n",
"\n",
"If you are running this notebook in local mode after starting `pyspark` with the `spark-rapids-ml` CLI `pyspark-rapids` or on EMR, Databricks, or Dataproc by following the respective READMEs and have selected the \\*no-import\\* init/bootstrap scripts the cell need not be run as the import was already run as part of the kernel launch.\n",
"Note that in these cases, there is no way to revert to CPU mode without creating a new (CPU) cluster or using the baseline `pyspark` command for local mode."
Collaborator:

Nit: would it be possible to rephrase the command to be simpler in one sentence like "If you are running this notebook with pyspark-rapids, there is no way to revert to GPU mode."

Collaborator Author:

Done

```bash
--extra-index-url=https://pypi.nvidia.com

# set up no-import-change
sed -i /databricks/python_shell/dbruntime/monkey_patches.py -e '1 s/\(.*\)/import spark_rapids_ml.install\n\1/g'
```
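For context, the `sed` call above prepends `import spark_rapids_ml.install` as the new first line of Databricks' notebook kernel startup file, so every notebook session gets the redirect without a user-visible import. One way to sanity-check the effect after the init script runs (same file path as above):

```bash
# confirm the import is now line 1 of the kernel startup script
head -n 1 /databricks/python_shell/dbruntime/monkey_patches.py
# expected: import spark_rapids_ml.install
```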
Collaborator:

Good finding.

lijinf2 (Collaborator) left a comment

Looks good to me!

Signed-off-by: Erik Ordentlich <[email protected]>
eordentlich (Collaborator Author)

build

lijinf2 (Collaborator) commented on Jan 28, 2025:

@eordentlich I have a question: would it be beneficial to move "no-code-change" to an independent folder for better organization and future reference? Since this is an important feature, having a centralized location (or a pointer to one) could be helpful. Do you have any thoughts on how cuDF or cuML direct users to "no-code-change"?
