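An example of running a Ray Tune experiment on Grid.ai. The examples show how to set up a development environment, test the experiment locally, run it on Grid.ai with zero code modification, use a custom Dockerfile, and run a model that is not on GitHub.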

Get started with Development Setup

  • Set up the development environment
# Grid.ai minimum is python=3.8
conda create --name ray python=3.8
conda activate ray
# Python modules required
cat >requirements.txt <<EOF
ray
ray[tune]
ray[default]
pandas
tabulate
tensorboardX
EOF
# Install Python modules for the experiment
pip install --ignore-requires-python -v -r requirements.txt
# Install the Grid.ai CLI
pip install lightning-grid --upgrade
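
Because Grid.ai requires Python 3.8 or later, it can help to confirm the interpreter version before installing anything. The following is a minimal sketch, not part of the original example: `python_meets_minimum` is a hypothetical helper name, and the check assumes a plain `major.minor[.patch]` version string.

```shell
# Hypothetical helper: succeeds when a "major.minor[.patch]" version string
# is at least 3.8, the Grid.ai minimum noted in the setup comments.
python_meets_minimum() {
  version="$1"
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # text before the next dot, if any
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# Usage: check the interpreter in the active conda environment, if there is one.
if command -v python >/dev/null 2>&1; then
  current="$(python -c 'import platform; print(platform.python_version())')"
  if python_meets_minimum "$current"; then
    echo "python $current is new enough"
  else
    echo "python $current is older than 3.8" >&2
  fi
fi
```

Running this right after `conda activate ray` should report the version provided by the `ray` environment.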

Unit test by running experiment locally

python ray-tune-quickstart.py

Run on Grid.ai Cloud with zero code modification

  • Log in to Grid.ai
grid login
  • Run using the default Grid.ai container. Use the CLI command below or click the Grid.ai Run badge.
grid run ray-tune-quickstart.py

Advanced Dockerfile usage on Grid.ai

To run the GitHub-hosted example in a customized container, pass the Dockerfile with the --dockerfile gridray.dockerfile flag.

  • Run with a manually specified Dockerfile. Use the CLI command below.
grid run --dockerfile gridray.dockerfile --name ray-dk-$(date '+%m%d-%H%M%S') ray-tune-quickstart.py
  • Use a spot instance and override the Run name with a timestamped name (ray-sp-dk-MMDD-HHMMSS) so it is easier to find later. Use the CLI command below.
grid run --dockerfile gridray.dockerfile --use_spot --name ray-sp-dk-$(date '+%m%d-%H%M%S') ray-tune-quickstart.py
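
The timestamped Run names used above all follow the same `<prefix>-MMDD-HHMMSS` pattern. As a small sketch, the pattern can be wrapped in a function so the `date` format is written only once; `timestamped_name` is a hypothetical helper name, not something provided by the Grid.ai CLI.

```shell
# Hypothetical helper: build "<prefix>-MMDD-HHMMSS" from the current local time.
timestamped_name() {
  printf '%s-%s\n' "$1" "$(date '+%m%d-%H%M%S')"
}

# Usage: generate a name with the same prefix as the spot-instance example.
name="$(timestamped_name ray-sp-dk)"
echo "$name"
```

The result can then be passed to `grid run` through `--name "$name"`.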

Use Grid.ai when the model is not on GitHub

Runs started with --localdir cannot use the Grid.ai cloning feature.

  • Let Grid.ai build the container
grid run --name ray-local-$(date '+%m%d-%H%M%S') --localdir ray-tune-quickstart.py
  • Use the container specification from the Dockerfile
grid run --dockerfile gridray.dockerfile --use_spot --name ray-sp-dk-lc-$(date '+%m%d-%H%M%S') --localdir ray-tune-quickstart.py
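
The `grid run` variants shown so far differ only in which of `--dockerfile`, `--use_spot`, and `--localdir` are present. The sketch below assembles the command line as text so it can be reviewed before anything is submitted. `grid_run_command` is a hypothetical helper, it only prints the command, and it uses no flags beyond those already shown in this README.

```shell
# Hypothetical helper: print (do not execute) a grid run command line.
# Arguments: <run-name> <script> [dockerfile] [use_spot: yes|no] [localdir: yes|no]
grid_run_command() {
  name="$1"
  script="$2"
  dockerfile="${3:-}"
  use_spot="${4:-no}"
  localdir="${5:-no}"

  cmd="grid run"
  if [ -n "$dockerfile" ]; then cmd="$cmd --dockerfile $dockerfile"; fi
  if [ "$use_spot" = "yes" ]; then cmd="$cmd --use_spot"; fi
  cmd="$cmd --name $name"
  if [ "$localdir" = "yes" ]; then cmd="$cmd --localdir"; fi
  printf '%s %s\n' "$cmd" "$script"
}

# Usage: preview the local-directory run with a custom container on a spot instance.
grid_run_command "ray-sp-dk-lc-$(date '+%m%d-%H%M%S')" ray-tune-quickstart.py gridray.dockerfile yes yes
```

Printing the command first is a deliberate choice here: the output can be checked by eye and then pasted into the terminal.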

Troubleshooting Tips

  • Review grid history
grid history | grep -e Run -e ray -e $(date '+%Y-%m-%d')
┃ Run                              ┃               Created At ┃ Experiments ┃ Failed ┃ Stopped ┃ Completed ┃
│ ray-sp-dk-lc-0720-105956         │ 2021-07-20 15:00:09+0000 │           1 │      0 │       0 │         1 │
│ ray-local-0720-105916            │ 2021-07-20 14:59:30+0000 │           1 │      0 │       0 │         1 │
│ ray-sp-dk-0720-105713            │ 2021-07-20 14:57:25+0000 │           1 │      0 │       0 │         1 │
│ ray-dk-0720-105640               │ 2021-07-20 14:56:53+0000 │           1 │      0 │       0 │         1 │
│ fervent-tamarin-146              │ 2021-07-20 14:55:39+0000 │           1 │      0 │       0 │         1 │
  • Review grid status
for run in $(grid history | grep -e Run -e ray -e $(date '+%Y-%m-%d') | awk -F'│' '{print $2}'); do
  echo "$run"
  grid status "$run"
done
ray-sp-dk-lc-0720-105956
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Experiment                    ┃                 Command ┃    Status ┃    Duration ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ ray-sp-dk-lc-0720-105956-exp0 │ ray-tune-quickstart.py] │ succeeded │ 0d-00:01:28 │
└───────────────────────────────┴─────────────────────────┴───────────┴─────────────┘
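
Pulling Run names out of the `grid history` table depends on how the box-drawing column separators are handled. A sketch that avoids them is to split each row on whitespace and keep the second field when it starts with a chosen prefix. `run_names` is a hypothetical helper, and the sample rows are copied from the history output above.

```shell
# Hypothetical helper: read `grid history` rows on stdin and print the Run
# names (second whitespace-separated field) that start with the given prefix.
run_names() {
  awk -v prefix="${1:-ray-}" 'index($2, prefix) == 1 { print $2 }'
}

# Usage with rows copied from the history output above.
run_names <<'EOF'
┃ Run                              ┃               Created At ┃ Experiments ┃ Failed ┃ Stopped ┃ Completed ┃
│ ray-sp-dk-lc-0720-105956         │ 2021-07-20 15:00:09+0000 │           1 │      0 │       0 │         1 │
│ fervent-tamarin-146              │ 2021-07-20 14:55:39+0000 │           1 │      0 │       0 │         1 │
EOF
```

Against the live CLI, the same function could feed the status loop, for example `grid history | run_names | while read -r run; do grid status "$run"; done`.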