OpenEmbedding

English version | 中文版 (Chinese version)

About

OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.

Nowadays, many machine learning and deep learning applications are built on parameter servers, which are used to efficiently store and update model weights. When a model has a large number of sparse features (e.g., Wide&Deep and DeepFM for CTR prediction), the number of weights easily runs into billions to trillions. In such cases, traditional synchronization solutions (such as the Allreduce-based solution adopted by Horovod) are unable to achieve high performance because of the massive communication overhead introduced by the tremendous number of sparse features. To achieve efficiency for such sparse models, we developed OpenEmbedding, which enhances the parameter server especially for sparse model training and inference.

Highlights

Efficiency

  • We propose an efficient customized sparse format to handle sparse features, together with fine-grained optimizations such as cache-conscious algorithms, asynchronous cache reads and writes, and lightweight locks to maximize parallelism. As a result, OpenEmbedding achieves a speedup of 3-8x over the Allreduce-based solution on a single machine equipped with 8 GPUs for sparse model training.

Ease-of-use

  • We have integrated OpenEmbedding into TensorFlow. Only three lines of code changes are required to utilize OpenEmbedding in TensorFlow for both training and inference, as sketched below.
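
In essence (a minimal sketch; the full walkthrough is in the User Guide below), the three lines are:

import openembedding.tensorflow as embed

optimizer = embed.distributed_optimizer(optimizer)  # wrap an existing tf.keras optimizer
model = embed.distributed_model(model)              # store the Embedding layers on the parameter server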

Adaptability

  • In addition to TensorFlow, it is straightforward to integrate OpenEmbedding into other popular frameworks. We have demonstrated the integration with DeepCTR and Horovod in the examples.

Benchmark


Models that contain sparse features are difficult to speed up with the Allreduce-based framework Horovod alone, whereas using OpenEmbedding together with Horovod yields much better acceleration. On a single machine with 8 GPUs, the speedup ratio is 3 to 8x, and many models achieve 3 to 7 times the performance of Horovod.

Install & Quick Start

You can install and run OpenEmbedding by following the steps below. The examples show the whole process of training on the Criteo dataset with OpenEmbedding and predicting with TensorFlow Serving.

Docker

NVIDIA Docker is required to use GPUs inside the image. The OpenEmbedding image can be obtained from Docker Hub.

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
#    "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
#    "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
#    "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
docker run --rm --gpus all -v /tmp/criteo:/openembedding/tmp/criteo \
    4pdosc/openembedding:latest examples/run/criteo_deepctr_standalone.sh 

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
        -v /tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
docker run --rm --network host 4pdosc/openembedding:latest examples/run/criteo_deepctr_restful.sh

# Clean up the Docker container.
docker stop serving-example
docker rm serving-example

Ubuntu

# Install the dependencies required by OpenEmbedding.
apt update && apt install -y gcc-7 g++-7 python3 libpython3-dev python3-pip
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding

# Install the dependencies required by examples.
apt install -y git cmake mpich curl 
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py

# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
#    "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
#    "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
#    "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh 

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
        -v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh

# Clean up the Docker container.
docker stop serving-example
docker rm serving-example

CentOS

# Install the dependencies required by OpenEmbedding.
yum install -y centos-release-scl
yum install -y python3 python3-devel devtoolset-7
scl enable devtoolset-7 bash
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding

# Install the dependencies required by examples.
yum install -y git cmake mpich curl 
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py

# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
#    "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
#    "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
#    "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh 

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
        -v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh

# Clean up the Docker container.
docker stop serving-example
docker rm serving-example

Note

The installation usually requires g++ 7 or higher, or a compiler compatible with tf.version.COMPILER_VERSION. The compiler can be specified by the environment variables CC and CXX. Currently, OpenEmbedding can only be installed on Linux.

CC=gcc CXX=g++ pip3 install openembedding

If TensorFlow is upgraded, you need to reinstall OpenEmbedding.

pip3 uninstall openembedding && pip3 install --no-cache-dir openembedding

User Guide

A sample program for common usage is as follows.

Create Model and Optimizer.

import tensorflow as tf
from deepctr.models import WDL
# feature_columns: the DeepCTR feature column definitions prepared beforehand.
optimizer = tf.keras.optimizers.Adam()
model = WDL(feature_columns, feature_columns, task='binary')

Transform them into a distributed Model and a distributed Optimizer. The Embedding layers will be stored on the parameter server.

import horovod.tensorflow.keras as hvd
import openembedding.tensorflow as embed
hvd.init()

optimizer = embed.distributed_optimizer(optimizer)
optimizer = hvd.DistributedOptimizer(optimizer)

model = embed.distributed_model(model)

Here, embed.distributed_optimizer converts the TensorFlow optimizer into an optimizer that supports the parameter server, so that the parameters stored on the parameter server can be updated. The function embed.distributed_model replaces the Embedding layers in the model and overrides their methods to support saving and loading with parameter servers. The Embedding.call method pulls the parameters from the parameter server, and a registered backpropagation function pushes the gradients back to the parameter server.
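
The following is an illustrative sketch only, not the actual OpenEmbedding implementation: it mimics the pull/push cycle described above with a plain dict standing in for the parameter server and plain SGD standing in for the optimizer.

import numpy as np

parameter_server = {}  # hypothetical stand-in: sparse feature id -> embedding row
DIM, LR = 8, 0.01

def pull(ids):
    # Forward pass: fetch (and, for simplicity, zero-initialize) only the
    # rows referenced by this batch.
    return np.stack([parameter_server.setdefault(i, np.zeros(DIM)) for i in ids])

def push(ids, grads):
    # Backward pass: send the sparse gradients for those rows back to the
    # server, which applies the optimizer update.
    for i, g in zip(ids, grads):
        parameter_server[i] = parameter_server[i] - LR * g

ids = [3, 1000000007]             # sparse feature ids appearing in a batch
emb = pull(ids)                   # what Embedding.call does conceptually
push(ids, np.ones_like(emb))      # what the registered gradient function does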

Data parallelism with Horovod.

model.compile(optimizer, "binary_crossentropy", metrics=['AUC'],
              experimental_run_tf_function=False)
callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),
              hvd.callbacks.MetricAverageCallback() ]
model.fit(dataset, epochs=10, verbose=2, callbacks=callbacks)

Export as a stand-alone SavedModel so that it can be loaded by TensorFlow Serving.

if hvd.rank() == 0:
    # Must specify include_optimizer=False explicitly
    model.save_as_original_model('model_path', include_optimizer=False)
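
Once the exported model is loaded by TensorFlow Serving, it can be queried through the standard REST API. Below is a minimal sketch; the input keys and values are placeholders, since the real keys depend on the model's serving signature (the criteo_deepctr_restful.sh example script performs the equivalent request).

import json
import urllib.request

# Hypothetical feature values; replace with the keys from the model's signature.
body = json.dumps({"instances": [{"C1": [1], "I1": [0.5]}]}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:8501/v1/models/criteo:predict",
    data=body, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(request) as response:
    print(json.load(response))  # {"predictions": [...]}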

More examples can be found in the examples directory.

Build

Docker Build

docker build -t 4pdosc/openembedding-base:0.1.0 -f docker/Dockerfile.base .
docker build -t 4pdosc/openembedding:0.0.0-build -f docker/Dockerfile.build .

Native Build

The compiler needs to be compatible with tf.version.COMPILER_VERSION (>= 7). Install all prpc dependencies into tools or /usr/local, and then run build.sh to complete the compilation. The build.sh script will automatically install prpc (pico-core) and parameter-server (pico-ps) into the tools directory.

git submodule update --init --checkout --recursive
pip3 install tensorflow
./build.sh clean && ./build.sh build
pip3 install ./build/openembedding-*.tar.gz

Features

TensorFlow 2

  • dtype: float32, float64.
  • tensorflow.keras.initializers
    • RandomNormal, RandomUniform, Constant, Zeros, Ones.
    • The parameter seed is currently ignored.
  • tensorflow.keras.optimizers
    • Adadelta, Adagrad, Adam, Adamax, Ftrl, RMSprop, SGD.
    • decay and LearningRateSchedule are not supported.
    • Adam(amsgrad=True) is not supported.
    • RMSProp(centered=True) is not supported.
    • The parameter server uses a sparse update method, which may cause different training results for optimizers with momentum (see the sketch after this list).
  • tensorflow.keras.layers.Embedding
    • Supports an array for known input_dim and a hash table for unknown input_dim (2**63 id range).
    • Embeddings can still be stored on workers and use the dense update method.
    • embeddings_regularizer and embeddings_constraint should not be used.
  • tensorflow.keras.Model
    • Can be converted to distributed Model and automatically ignore or convert incompatible settings (such as embeddings_constraint).
    • Distributed save, save_weights, load_weights and ModelCheckpoint.
    • Saving the distributed Model as a stand-alone SavedModel, which can be loaded by TensorFlow Serving.
    • Training multiple distributed Models in one task is not supported.
  • Can collaborate with Horovod. Training with MirroredStrategy or MultiWorkerMirroredStrategy is experimental.
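
The following sketch illustrates the momentum caveat noted above; it is an assumed simplification, not OpenEmbedding code. A dense SGD-with-momentum step decays the momentum accumulator of every row on every step, while a sparse step only touches the rows that appear in the batch, so the two can diverge on untouched rows.

import numpy as np

def dense_momentum_step(w, m, rows, grad, mu=0.9, lr=0.1):
    g = np.zeros_like(w)
    g[rows] = grad
    m[:] = mu * m + g               # momentum decays on ALL rows
    w -= lr * m

def sparse_momentum_step(w, m, rows, grad, mu=0.9, lr=0.1):
    m[rows] = mu * m[rows] + grad   # only the touched rows are updated
    w[rows] -= lr * m[rows]

w1, m1 = np.ones((4, 2)), np.ones((4, 2))
w2, m2 = np.ones((4, 2)), np.ones((4, 2))
dense_momentum_step(w1, m1, [0], np.ones((1, 2)))
sparse_momentum_step(w2, m2, [0], np.ones((1, 2)))
print(np.allclose(w1[0], w2[0]))  # True: the touched row matches
print(np.allclose(w1[1], w2[1]))  # False: untouched rows drift apart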

TODO

  • Improve performance
  • Support PyTorch training
  • Support tf.feature_column.embedding_column
  • Approximate embedding_regularizer, LearningRateSchedule, etc.
  • Improve the support for Initializer and Optimizer
  • Training multiple distributed Models in one task
  • Support ONNX

Designs

Authors

Persistent Memory (PMem)

Currently, the interface for persistent memory is experimental. PMem-based OpenEmbedding provides a lightweight checkpointing scheme with performance comparable to its DRAM version. For long-running deep learning recommendation model training, PMem-based OpenEmbedding provides not only an efficient but also a reliable training process.

Publications