no import change notebooks updates #830
base: branch-25.02
Conversation
…ripts, doc updates Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
build
Awesome new feature! Minor comments that are mostly stylistic.
notebooks/aws-emr/README.md
Outdated
```
- Upload the zip file and the initialization script to S3.
or if you wish to run the [no-import-change](../README.md#no-import-change) example notebook to:
Could the scripts be combined and the no-import-change logic triggered via an argument to the script, e.g.
--bootstrap-actions Name='Spark Rapids ML Bootstrap action',Path=s3://${S3_BUCKET}/${INIT_SCRIPT},Args=["no-import-change"]
Done.
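For illustration, a combined bootstrap script could branch on the optional bootstrap-action argument roughly as sketched below; the argument name and the conditional step are assumptions, not the actual script contents:

```bash
#!/bin/bash
# Sketch of a single EMR bootstrap action that handles both modes.
# EMR passes the Args values to the script as positional parameters.
set -euo pipefail

# ... common setup (pip installs, plugin download, etc.) runs unconditionally ...

if [[ "${1:-}" == "no-import-change" ]]; then
  # extra steps that enable the no-import-change notebook UX would go here
  echo "enabling no-import-change kernel setup"
fi
```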
sudo /usr/local/bin/pip3.10 install scikit-learn numpy~=1.0

# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
How about `cuXX-cu12~=$RAPIDS_VERSION` for these installs, and likewise for the other script?
Done.
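As a sketch of the suggestion (the version value shown is only an example; the real scripts define `RAPIDS_VERSION` themselves):

```bash
# pin the RAPIDS wheels to a compatible release instead of installing latest
RAPIDS_VERSION=${RAPIDS_VERSION:-25.2.0}
sudo /usr/local/bin/pip3.10 install --no-cache-dir \
  "cudf-cu12~=${RAPIDS_VERSION}" "cuml-cu12~=${RAPIDS_VERSION}" "cuvs-cu12~=${RAPIDS_VERSION}" \
  --extra-index-url=https://pypi.nvidia.com --verbose
```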
Nit - not changed in this PR, but it seems like `"spark.rapids.memory.pinnedPool.size":"2G"` is duplicated in this config.
Deleted dupe.
```
import spark_rapids_ml.install
```
After executing a cell with this command, all subsequent imports and accesses of supported accelerated classes from `pyspark.ml` will automatically redirect and return their counterparts in `spark_rapids_ml`. Unaccelerated classes will import from `pyspark.ml` as usual. Thus, with the above single import statement, all supported acceleration in an existing `pyspark` notebook is enabled with no additional import statements or code changes. Directly importing from `spark_rapids_ml` also still works (needed for non-MLlib algorithms like UMAP).
or by modifying the PySpark/Jupyter launch command above to use the `pyspark-rapids` CLI installed by our `pip` package to start Jupyter with PySpark as follows:
```bash
nit: unindent this block?
Done
notebooks/databricks/README.md
Outdated
@@ -17,34 +17,44 @@ If you already have a Databricks account, you can run the example notebooks on a
export SAVE_DIR="/path/to/save/artifacts"
databricks fs cp spark_rapids_ml.zip dbfs:${SAVE_DIR}/spark_rapids_ml.zip --profile ${PROFILE}
```
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) init script to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
- Edit the [init-pip-cuda-11.8.sh](init-pip-cuda-11.8.sh) and [init-pip-cuda-11.8-no-import.sh](init-pip-cuda-11.8-no-import.sh) init scripts to set the `SPARK_RAPIDS_ML_ZIP` variable to the DBFS location used above.
Might be cleaner to pass the `SPARK_RAPIDS_ML_ZIP` path as a cluster environment variable which the init scripts can access, avoiding the file edit.
Just went with pip install from PyPI like the other clusters.
notebooks/databricks/README.md
Outdated
- updates the CUDA runtime to 11.8 (required for Spark Rapids ML dependencies).
- downloads and installs the [Spark-Rapids](https://github.com/NVIDIA/spark-rapids) plugin for accelerating data loading and Spark SQL.
- installs various `cuXX` dependencies via pip.
- in the case of `init-pip-cuda-11.8-no-import.sh`, it also modifies a Databricks notebook kernel startup script to enable the no-import-change UX. See the [no-import-change](../README.md#no-import-change) section.
Similarly might be cleaner if the scripts are combined and no-import-change logic triggered by a cluster environment variable.
Done
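One way a single init script could honor a cluster environment variable is sketched below; the variable name is an assumption, while the sed step mirrors the no-import variant of the script quoted later in this review:

```bash
# Sketch: toggle no-import-change via a Databricks cluster environment variable
# (set under Advanced options -> Spark -> Environment variables).
if [[ "${SPARK_RAPIDS_ML_NO_IMPORT_ENABLED:-0}" == "1" ]]; then
  # prepend the install import to the notebook kernel startup script
  sed -i /databricks/python_shell/dbruntime/monkey_patches.py \
    -e '1 s/\(.*\)/import spark_rapids_ml.install\n\1/g'
fi
```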
notebooks/databricks/README.md
Outdated
```bash
export WS_SAVE_DIR="/path/to/directory/in/workspace"
databricks workspace mkdirs ${WS_SAVE_DIR} --profile ${PROFILE}
```
For Mac
```bash
databricks workspace import --format AUTO --content $(base64 -i init-pip-cuda-11.8.sh) ${WS_SAVE_DIR}/init-pip-cuda-11.8.sh --profile ${PROFILE}
databricks workspace import --format AUTO --content $(base64 -i ${INIT_SCRIPT}) ${WS_SAVE_DIR}/${INIT_SCRIPT} --profile ${PROFILE}
Would just `--file ${INIT_SCRIPT}` work here for both platforms?
Yes. Apparently, that arg wasn't available in earlier CLI versions.
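The platform-agnostic form would then look roughly like the sketch below, based on the command quoted above:

```bash
# works on both Mac and Linux with a sufficiently recent Databricks CLI
databricks workspace import --format AUTO --file ${INIT_SCRIPT} ${WS_SAVE_DIR}/${INIT_SCRIPT} --profile ${PROFILE}
```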
notebooks/dataproc/README.md
Outdated
@@ -21,8 +21,16 @@ If you already have a Dataproc account, you can run the example notebooks on a D
gcloud storage buckets create gs://${GCS_BUCKET}
```
- Upload the initialization scripts to your GCS bucket:
First, set the init script name for running the default notebooks:
Similarly wondering if the scripts could be combined and no-import-change toggled via metadata.
Done.
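A sketch of how a single Dataproc init action might read a cluster metadata key; the key name is assumed for illustration:

```bash
# Sketch: toggle no-import-change via cluster metadata, e.g.
#   gcloud dataproc clusters create ... --metadata no-import-change=true
NO_IMPORT_CHANGE=$(/usr/share/google/get_metadata_value attributes/no-import-change || echo "false")
if [[ "${NO_IMPORT_CHANGE}" == "true" ]]; then
  # extra setup for the no-import-change notebook UX would go here
  echo "enabling no-import-change"
fi
```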
@@ -194,6 +194,12 @@ and if the app is deployed using `spark-submit` the following included CLI (inst
```bash
spark-rapids-submit --master <master> <other spark submit options> application.py <application options>
```
nit: maybe we should have a short table of contents at the top of this file? Imagining this no-import section would be lost below all the examples.
Added.
# install cudf and cuml
sudo /usr/local/bin/pip3.10 install --no-cache-dir cudf-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
sudo /usr/local/bin/pip3.10 install --no-cache-dir cuml-cu12 cuvs-cu12 --extra-index-url=https://pypi.nvidia.com --verbose
Also noticed that EMR script does not install Raft/RMM whereas Databricks/Dataproc do; not sure if this is intentional.
Only cudf, cuml, and cuvs need to be installed explicitly since they pull in these other dependencies, so I removed them in all scripts.
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
Great suggestions. Will incorporate in revision.
"Uncommenting and running the next cell will redirect all subsequent `pyspark.ml` imports (e.g. `pyspark.ml.clustering.Kmeans`) to GPU accelerated `spark_rapids_ml` counterparts if they are supported. Comment out and restart the kernel to revert to `pyspark.ml` CPU imports.\n", | ||
"\n", | ||
"If you are running this notebook in local mode after starting `pyspark` with the `spark-rapids-ml` CLI `pyspark-rapids` or on EMR, Databricks, or Dataproc by following the respective READMEs and have selected the \\*no-import\\* init/bootstrap scripts the cell need not be run as the import was already run as part of the kernel launch.\n", | ||
"Note that in these cases, there is no way to revert to CPU mode without creating a new (CPU) cluster or using the baseline `pyspark` command for local mode." |
Nit: would it be possible to rephrase this to be simpler, in one sentence like "If you are running this notebook with pyspark-rapids, there is no way to revert to CPU mode."
Done
--extra-index-url=https://pypi.nvidia.com

# set up no-import-change
sed -i /databricks/python_shell/dbruntime/monkey_patches.py -e '1 s/\(.*\)/import spark_rapids_ml.install\n\1/g'
Good finding.
Looks good to me!
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
build
@eordentlich I have a question: would it be beneficial to move "no-code-change" to an independent folder for better organization and future reference? Since this is an important feature, having a centralized location for reference, or a pointer to one, could be helpful. Do you have any thoughts on how cuDF or cuML direct users to "no-code-change"?