TensorFlow (TF) is an open-source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code.
TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research. The system is general enough to be applicable in a wide variety of other domains, as well.
The instructions below are intended to help you set up TF on the FASRC cluster. The specific example illustrates the installation of TF version 2.16.1 with Python version 3.10, CUDA version 12.1.0, and cuDNN version 9.0.0.312. Please refer to our documentation on running GPU jobs on the FASRC cluster.
The two recommended methods for setting up TF in your user environment are installing TF in a conda environment in your user space, or using a TF singularity container.
Installing TF in a Conda Environment
You can install your own TF instance following these simple steps:
- Load required software modules, e.g.,
module load python/3.10.13-fasrc01
- Create a new conda environment with Python:
mamba create -n tf2.16.1_cuda12.1 python=3.10 pip wheel
- Activate the new conda environment, e.g.,
source activate tf2.16.1_cuda12.1
- Install CUDA and cuDNN with conda/mamba and pip:
mamba install -c "nvidia/label/cuda-12.1.0" cuda-toolkit=12.1.0
pip install nvidia-cudnn-cu12==9.0.0.312
- Configure the system paths. You can do this with the following commands every time you start a new terminal, after activating your conda environment:
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib
For your convenience, it is recommended to automate this with the following commands; the system paths will then be configured automatically whenever you activate this conda environment:
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
- Install extra packages required for data analytics, e.g.,
mamba install -c conda-forge numpy scipy pandas matplotlib seaborn h5py jupyterlab jupyterlab-spellchecker scikit-learn
- Install TF plus required GPU libraries with pip, e.g.,
pip install --upgrade tensorflow[and-cuda]==2.16.*
- Set up the KERAS backend (required for KERAS version 3.0 and above):
export KERAS_BACKEND="tensorflow"
NOTE: Starting with version 2.16.1, TF includes KERAS version 3.0. Please refer to the TensorFlow 2.16.1 release notes for important changes.
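Once the installation completes, you can quickly confirm that TF detects the GPUs. The one-liner below is a minimal sketch; it assumes you run it on a GPU node with the tf2.16.1_cuda12.1 environment activated:
python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"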
Pull a TF singularity container
Alternatively, one can pull and use a TensorFlow singularity container:
singularity pull --name tf2.16.1_gpu.simg docker://tensorflow/tensorflow:2.16.1-gpu
This will result in the image tf2.16.1_gpu.simg. The image can then be used with, e.g.,
$ KERAS_BACKEND="tensorflow" singularity exec --nv tf2.16.1_gpu.simg python3
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
>>> import tensorflow as tf
>>> print(tf.__version__)
2.16.1
>>> print(tf.reduce_sum(tf.random.normal([1000, 1000])))
tf.Tensor(1365.5554, shape=(), dtype=float32)
>>> print(tf.config.list_physical_devices('GPU'))
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
Note: Please notice the use of the --nv option. This is required to make use of the NVIDIA GPU card on the host system. Please also notice the use of the KERAS_BACKEND="tensorflow" environment variable, which is required to set the KERAS backend to TF.
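The container can also be used to run a script non-interactively. For example, following the same pattern, the tf_test.py script shown later on this page could be run with (a sketch; it assumes tf_test.py is in the current working directory):
KERAS_BACKEND="tensorflow" singularity exec --nv tf2.16.1_gpu.simg python3 tf_test.py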
Alternatively, you can pull a container from the NVIDIA NGC Catalog, e.g.,
singularity pull docker://nvcr.io/nvidia/tensorflow:24.03-tf2-py3
This will result in the image tensorflow_24.03-tf2-py3.sif, which has TF version 2.15.0.
The NGC catalog provides access to optimized containers of many popular apps.
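The NGC image can be used in the same way as the Docker Hub image above, e.g. (a sketch following the same --nv pattern; since this container provides TF 2.15.0, which still bundles KERAS 2, setting KERAS_BACKEND is not required):
singularity exec --nv tensorflow_24.03-tf2-py3.sif python3 -c "import tensorflow as tf; print(tf.__version__)"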
Similarly to the GPU installation, for CPU-only use you can either install TF in a conda environment or use a TF singularity container.
Installing TF in a Conda Environment
# (1) Load required software modules
module load python/3.10.13-fasrc01
# (2) Create conda environment
mamba create -n tf2.16.1_cpu python=3.10 pip wheel
# (3) Activate the conda environment
source activate tf2.16.1_cpu
# (4) Install required packages for data analytics, e.g.,
mamba install -c conda-forge numpy scipy pandas matplotlib seaborn h5py jupyterlab jupyterlab-spellchecker scikit-learn
# (5) Install a CPU version of TF with pip
pip install --upgrade tensorflow-cpu==2.16.*
# (6) Set up KERAS backend to use TF
export KERAS_BACKEND="tensorflow"
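To verify the CPU installation, you can run a quick check (a minimal sketch; it assumes the tf2.16.1_cpu environment is activated):
python -c "import tensorflow as tf; print(tf.__version__); print(tf.reduce_sum(tf.random.normal([1000, 1000])))"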
Pull a TF singularity container
singularity pull --name tf2.16.1_cpu.simg docker://tensorflow/tensorflow:2.16.1
This will result in the image tf2.16.1_cpu.simg. The image can then be used with, e.g.,
KERAS_BACKEND="tensorflow" singularity exec tf2.16.1_cpu.simg python3 -c "import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'; import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
tf.Tensor(2878.413, shape=(), dtype=float32)
For an interactive session to work with the GPUs, you can use the following:
salloc -p gpu_test -t 0-06:00 --mem=8000 --gres=gpu:1
While on the GPU node, you can run nvidia-smi to get information about the assigned GPUs.
[username@holygpu7c26306 ~]$ nvidia-smi
Fri Apr 5 16:00:55 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:E3:00.0 Off | On |
| N/A 25C P0 46W / 400W | 259MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 2 0 0 | 37MiB / 19968MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Load required modules, and source your TF environment:
[username@holygpu7c26306 ~]$ module load python/3.10.13-fasrc01 && source activate tf2.16.1_cuda12.1
(tf2.16.1_cuda12.1) [username@holygpu7c26306 ~]$
Test TF:
(Example adapted from here.)
(tf2.16.1_cuda12.1) [username@holygpu7c26306 ~]$ python tf_test.py
2.16.1
Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 839us/step - accuracy: 0.7867 - loss: 0.6247
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 829us/step - accuracy: 0.8600 - loss: 0.3855
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 3s 827us/step - accuracy: 0.8788 - loss: 0.3373
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 831us/step - accuracy: 0.8852 - loss: 0.3124
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 828us/step - accuracy: 0.8912 - loss: 0.2915
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 3s 830us/step - accuracy: 0.8961 - loss: 0.2773
Epoch 7/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 828us/step - accuracy: 0.9025 - loss: 0.2625
Epoch 8/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 830us/step - accuracy: 0.9044 - loss: 0.2606
Epoch 9/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 828us/step - accuracy: 0.9081 - loss: 0.2489
Epoch 10/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 829us/step - accuracy: 0.9109 - loss: 0.2405
313/313 - 2s - 6ms/step - accuracy: 0.8804 - loss: 0.3411
Test accuracy: 0.8804000020027161
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
[1.0222636e-07 7.9844620e-09 4.7857565e-11 5.2755653e-09 2.7131367e-10
2.1757800e-04 5.9717085e-09 6.6847289e-03 4.5007189e-07 9.9309713e-01]
9
In the above example we used the following test code, tf_test.py:
#!/usr/bin/env python
from __future__ import absolute_import, division, print_function, unicode_literals
import os
# Reduce TensorFlow logging verbosity (3 = print only fatal messages)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
from tensorflow import keras
import numpy as np
print(tf.__version__)
# Load the Fashion-MNIST dataset (60,000 training and 10,000 test images)
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
# Scale pixel values from [0, 255] to [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0
# Define a simple fully connected network
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train for 10 epochs and evaluate on the test set
model.fit(train_images, train_labels, epochs=10)
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy:', test_acc)
# Predict class probabilities for the test images and print the prediction for the first one
predictions = model.predict(test_images)
print(predictions[0])
print(np.argmax(predictions[0]))
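The last two print statements show the raw probability vector for the first test image and the index of the most likely class. If you also want the human-readable label, you could add one more line (not part of the original script) that looks up the index in the class_names list defined above:
print(class_names[np.argmax(predictions[0])])  # index 9 corresponds to 'Ankle boot'
To run the same test as a batch job rather than interactively, a submission script along the following lines could be used (a sketch; the resource values mirror the salloc request above, and the output file names are illustrative):
#!/bin/bash
#SBATCH -p gpu_test
#SBATCH -t 0-06:00
#SBATCH --mem=8000
#SBATCH --gres=gpu:1
#SBATCH -o tf_test.out
#SBATCH -e tf_test.err

module load python/3.10.13-fasrc01
source activate tf2.16.1_cuda12.1
export KERAS_BACKEND="tensorflow"
python tf_test.py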
You may pull a TensorFlow version 2.12.0 singularity image with the command below:
# Pull a singularity container with version 2.12.0
singularity pull --name tf2.12_gpu.simg docker://tensorflow/tensorflow:2.12.0-gpu
This image comes with a number of basic Python packages. If you need additional packages, you could use the example singularity definition file tf-2.12.def to build the singularity image:
Bootstrap: docker
From: tensorflow/tensorflow:2.12.0-gpu
%post
pip install --upgrade pip
pip install matplotlib
pip install seaborn
pip install scipy
pip install scikit-learn
pip install jupyterlab
pip install notebook
You could install additional packages directly in the image with pip by adding them to the %post section of the definition file, as illustrated above. Please refer to our documentation on how to build singularity images from definition files.
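As a sketch, the image could then be built from this definition file with a command along the following lines (the output image name is illustrative; depending on the cluster configuration, building from a definition file may require the --fakeroot option or a dedicated build environment):
singularity build tf-2.12.simg tf-2.12.def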