layout: page
title: QAT Support in Velox Backend
nav_order: 1
parent: Getting-Started

Intel® QuickAssist Technology (QAT) support

Gluten supports using Intel® QuickAssist Technology (QAT) for data compression during Spark Shuffle. It benefits from QAT hardware-based acceleration of compression and decompression, and supports Gzip and Zstd as compression formats, whose higher compression ratios reduce the pressure on disks and network transmission.

This feature is based on the QAT driver library and the QATzip library. Both depend on the software listed below.

Software Requirements

  • Download QAT driver for your system, and follow its README to build and install on all Driver and Worker nodes: Intel® QuickAssist Technology Driver for Linux* – HW Version 2.0.
  • The following compression libraries must be installed on all Driver and Worker nodes:
    • Zlib* library of version 1.2.7 or higher
    • ZSTD* library of version 1.5.4 or higher
    • LZ4* library
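
The minimum versions above can be checked with a small helper. The sketch below is only an illustration: `version_ge` is a hypothetical helper name, the installed versions are placeholder values to replace with what your package manager reports, and it assumes a `sort` that supports `-V` (GNU coreutils).

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder values (assumption): substitute the versions installed on your nodes.
zlib_version="1.2.11"
zstd_version="1.5.5"

version_ge "$zlib_version" 1.2.7 && echo "zlib $zlib_version meets the 1.2.7 minimum"
version_ge "$zstd_version" 1.5.4 && echo "zstd $zstd_version meets the 1.5.4 minimum"
```

Running the same check on every Driver and Worker node helps catch a node with an older library before the build step.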

Build Gluten with QAT

  1. Set the ICP_ROOT environment variable to the directory where the QAT driver is extracted. This environment variable is required both when building Gluten and when running Spark applications. We recommend putting it in .bashrc on the Driver and Worker nodes.
echo "export ICP_ROOT=/path/to/QAT_driver" >> ~/.bashrc
source ~/.bashrc

# Also set for root if running as non-root user
sudo su - 
echo "export ICP_ROOT=/path/to/QAT_driver" >> ~/.bashrc
exit
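
Before moving on, it can help to confirm that ICP_ROOT is actually visible in the current shell. The following is a minimal sketch; `check_icp_root` is a hypothetical helper name, not part of Gluten or the QAT driver.

```shell
#!/bin/sh
# check_icp_root: succeeds only when ICP_ROOT is set and names an existing directory.
check_icp_root() {
    if [ -z "${ICP_ROOT:-}" ]; then
        echo "ICP_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$ICP_ROOT" ]; then
        echo "ICP_ROOT=$ICP_ROOT is not a directory" >&2
        return 1
    fi
    echo "ICP_ROOT=$ICP_ROOT"
}

check_icp_root || echo "fix ICP_ROOT before building Gluten"
```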
  1. This step is required if your application runs as a non-root user. After the QAT driver is installed, add the user to the 'qat' group and raise the maximum locked memory limit for that group by specifying it in /etc/security/limits.conf.
sudo su -
usermod -aG qat username # log out and back in for this to take effect

# To set 500MB add a line like this in /etc/security/limits.conf
echo "@qat - memlock 500000" >> /etc/security/limits.conf

exit
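
After logging back in, the new limit can be verified from the user's shell. This is a sketch under the assumption that `ulimit -l` reports the locked-memory limit in KB, the same unit as the memlock entry in limits.conf; `memlock_sufficient` is a hypothetical helper name.

```shell
#!/bin/sh
# memlock_sufficient LIMIT REQUIRED_KB: succeeds when LIMIT (a value as printed
# by `ulimit -l`, either a number of KB or the word "unlimited") is at least
# REQUIRED_KB.
memlock_sufficient() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

# Fall back to 0 if this shell does not support `ulimit -l`.
current="$(ulimit -l 2>/dev/null || echo 0)"
if memlock_sufficient "$current" 500000; then
    echo "memlock limit is sufficient: $current"
else
    echo "memlock limit too low: $current (want at least 500000 KB)"
fi
```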
  1. Enable huge pages. This step must be repeated after every system reboot, so we recommend using systemd to run it at system startup. Adjust the values of "max_huge_pages" and "max_huge_pages_per_process" so that there are enough resources for your workload. For Spark applications, one process corresponds to one executor, and each task within the executor is allocated at most 5 huge pages.
sudo su -

cat << EOF > /usr/local/bin/qat_startup.sh
#!/bin/bash
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
rmmod usdm_drv
insmod $ICP_ROOT/build/usdm_drv.ko max_huge_pages=1024 max_huge_pages_per_process=32
EOF

chmod +x /usr/local/bin/qat_startup.sh

cat << EOF > /etc/systemd/system/qat_startup.service
[Unit]
Description=Configure QAT

[Service]
ExecStart=/usr/local/bin/qat_startup.sh

[Install]
WantedBy=multi-user.target
EOF

systemctl enable qat_startup.service
systemctl start qat_startup.service # setup immediately
systemctl status qat_startup.service

exit
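
The sizing rule in this step (one process per executor, at most 5 huge pages per task) can be turned into a quick calculation. The sketch below assumes a hypothetical workload of 8 executors per node with 6 concurrent tasks each; the function names are illustrative only.

```shell
#!/bin/sh
# Each task may use up to 5 huge pages (figure taken from the step above).
PAGES_PER_TASK=5

# huge_pages_per_process TASKS_PER_EXECUTOR
# Lower bound for max_huge_pages_per_process.
huge_pages_per_process() {
    echo $(( $1 * PAGES_PER_TASK ))
}

# huge_pages_total EXECUTORS_PER_NODE TASKS_PER_EXECUTOR
# Lower bound for max_huge_pages on one node.
huge_pages_total() {
    echo $(( $1 * $2 * PAGES_PER_TASK ))
}

# Hypothetical workload (assumption): 8 executors per node, 6 tasks per executor.
echo "max_huge_pages_per_process >= $(huge_pages_per_process 6)"   # 30
echo "max_huge_pages >= $(huge_pages_total 8 6)"                   # 240
```

With these numbers, the values used in the startup script (32 per process, 1024 in total) leave headroom. A workload with more than 6 concurrent tasks per executor would need a larger max_huge_pages_per_process.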
  1. After the setup, you are now ready to build Gluten with QAT. Use the command below to enable this feature:
cd /path/to/gluten

## The script builds four jars, for Spark 3.2.2, 3.3.1, 3.4.3 and 3.5.1.
./dev/buildbundle-veloxbe.sh --enable_qat=ON

Enable QAT with Gzip/Zstd for shuffle compression

  1. To offload shuffle compression to QAT, first make sure you have the right QAT configuration file at /etc/4xxx_devX.conf, where X is the device index. We provide an example configuration file (docs/qat/4x16.conf in the Gluten repository). It allows up to 4 processes to bind to one QAT device, and each process can use up to 16 QAT DC (data compression) instances.
## run as root
## Overwrite QAT configuration file.
cd /etc
for i in {0..7}; do cp -f /path/to/gluten/docs/qat/4x16.conf "4xxx_dev$i.conf"; done
## Restart QAT after updating configuration files.
adf_ctl restart
  1. Check the QAT status and make sure every device reports state "up":
adf_ctl status

The output should look like this:

Checking status of all devices.
There is 8 QAT acceleration device(s) in the system:
 qat_dev0 - type: 4xxx,  inst_id: 0,  node_id: 0,  bsf: 0000:6b:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev1 - type: 4xxx,  inst_id: 1,  node_id: 1,  bsf: 0000:70:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev2 - type: 4xxx,  inst_id: 2,  node_id: 2,  bsf: 0000:75:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev3 - type: 4xxx,  inst_id: 3,  node_id: 3,  bsf: 0000:7a:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev4 - type: 4xxx,  inst_id: 4,  node_id: 4,  bsf: 0000:e8:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev5 - type: 4xxx,  inst_id: 5,  node_id: 5,  bsf: 0000:ed:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev6 - type: 4xxx,  inst_id: 6,  node_id: 6,  bsf: 0000:f2:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev7 - type: 4xxx,  inst_id: 7,  node_id: 7,  bsf: 0000:f7:00.0,  #accel: 1 #engines: 9 state: up
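
For scripted checks, the status output can be reduced to a count of devices that are up. This is a rough sketch that matches the `state: up` text shown above; `count_up_devices` is a hypothetical helper, and the two-device sample (including the `state: down` line) is made up for illustration.

```shell
#!/bin/sh
# count_up_devices: reads `adf_ctl status` output on stdin and prints the
# number of lines that report "state: up".
count_up_devices() {
    # grep -c exits non-zero when the count is 0; keep the function's status at 0.
    grep -c 'state: up' || true
}

# Illustrative sample (assumption), shaped like the output above.
sample='Checking status of all devices.
There is 2 QAT acceleration device(s) in the system:
 qat_dev0 - type: 4xxx,  inst_id: 0,  node_id: 0,  bsf: 0000:6b:00.0,  #accel: 1 #engines: 9 state: up
 qat_dev1 - type: 4xxx,  inst_id: 1,  node_id: 1,  bsf: 0000:70:00.0,  #accel: 1 #engines: 9 state: down'

printf '%s\n' "$sample" | count_up_devices   # prints 1
```

On a real node the input would come from `adf_ctl status | count_up_devices`, and the result can be compared against the expected device count (8 in the output above).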
  1. Extra Gluten configurations are required when starting a Spark application:
--conf spark.gluten.sql.columnar.shuffle.codec=gzip # Valid options are gzip and zstd
--conf spark.gluten.sql.columnar.shuffle.codecBackend=qat
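
When these flags are generated by a launch script, a small wrapper can reject codecs other than the two listed as valid. This is a sketch only; `qat_shuffle_conf` is a hypothetical helper and not part of Gluten.

```shell
#!/bin/sh
# qat_shuffle_conf CODEC: prints the two --conf flags from the step above,
# rejecting any codec other than gzip or zstd.
qat_shuffle_conf() {
    case "$1" in
        gzip|zstd) ;;
        *)
            echo "unsupported codec: $1 (expected gzip or zstd)" >&2
            return 1
            ;;
    esac
    printf '%s %s\n' \
        "--conf spark.gluten.sql.columnar.shuffle.codec=$1" \
        "--conf spark.gluten.sql.columnar.shuffle.codecBackend=qat"
}

qat_shuffle_conf zstd
```

The output can be spliced into a launch command with unquoted command substitution, for example `spark-submit $(qat_shuffle_conf gzip)` followed by the remaining arguments.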
  1. You can use the command below to check whether QAT is working normally at run time. The values in fw_counters should keep increasing during shuffle.
while :; do cat /sys/kernel/debug/qat_4xxx_0000:6b:00.0/fw_counters; sleep 1; done
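
As an alternative to watching the loop by eye, two snapshots of the counters file can be compared. The sketch below is format-agnostic: it only checks that the snapshots differ and does not parse them. `counters_changed` is a hypothetical helper, and the snapshot contents in the demo are placeholder text, not the real fw_counters layout.

```shell
#!/bin/sh
# counters_changed BEFORE AFTER: succeeds when the two snapshot files differ.
counters_changed() {
    ! cmp -s "$1" "$2"
}

# Demo with placeholder snapshots (assumption). On a real node, take them with
#   cat /sys/kernel/debug/qat_4xxx_0000:6b:00.0/fw_counters > "$before"
# then wait a few seconds during shuffle and capture "$after" the same way.
before=$(mktemp)
after=$(mktemp)
printf 'placeholder counter 10\n' > "$before"
printf 'placeholder counter 25\n' > "$after"

if counters_changed "$before" "$after"; then
    echo "counters changed between snapshots"
else
    echo "counters did not change"
fi
rm -f "$before" "$after"
```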

QAT driver references

Documentation

README Text Files (README_QAT20.L.1.0.0-00021.txt)

Release Notes

Check out the Intel® QuickAssist Technology Software for Linux* - Release Notes for the latest changes in this release.

Getting Started Guide

Check out the Intel® QuickAssist Technology Software for Linux* - Getting Started Guide for detailed installation instructions.

Programmer's Guide

Check out the Intel® QuickAssist Technology Software for Linux* - Programmer's Guide for software usage guidelines.

For more Intel® QuickAssist Technology resources, go to Intel® QuickAssist Technology (Intel® QAT).