ai-center
2021.10
false
AI Center Installation Guide
Automation CloudAutomation SuiteStandalone
Last updated Jun 6, 2024

Provisioning a GPU

Note: A GPU can be installed only on an agent node, not a server node. Do not use or modify the gpu_support flag from the cluster_config.json. Instead, follow the instructions below to add a dedicated agent node with GPU support to the cluster.

Currently, Automation Suite only supports Nvidia GPU Drivers. See the list of GPU-supported operating systems.

You can find cloud-specific instance types for the nodes here:

Follow the steps from Adding a new node to the cluster to ensure the agent node is added correctly.

For more examples on how to deploy NVIDIA CUDA on a GPU, check this page.

Installing a GPU driver

  1. Run the following command to install the GPU driver on the agent node:
    sudo yum install kernel kernel-tools kernel-headers kernel-devel 
    sudo reboot
    sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
    sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel.repo
    sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel-modular.repo
    sudo yum config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo yum install cudasudo yum install kernel kernel-tools kernel-headers kernel-devel 
    sudo reboot
    sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
    sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel.repo
    sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel-modular.repo
    sudo yum config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo yum install cuda
  2. Run the following command to install the container toolkits:
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \\
              && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    sudo dnf clean expire-cache
    sudo yum install -y nvidia-container-runtime.x86_64distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \\
              && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    sudo dnf clean expire-cache
    sudo yum install -y nvidia-container-runtime.x86_64

Verify if drivers are installed properly

Run sudo nvidia-smi command on the node to verify if the drivers were installed properly.


Note: After the cluster has been provisioned, additional steps are required to configure the provisioned GPUs.

At this point, the GPU drivers have been installed and that the GPU nodes have been added to the cluster.

Adding the GPU to the agent node

Run below two command to update contianerd configuration of agent node.
cat <<EOF > gpu_containerd.sh
if ! nvidia-smi &>/dev/null;
then
  echo "GPU Drivers are not installed on the VM. Please refer the documentation."
  exit 0
fi
if ! which nvidia-container-runtime &>/dev/null;
then
  echo "Nvidia container runtime is not installed on the VM. Please refer the documentation."
  exit 0 
fi
grep "nvidia-container-runtime" /var/lib/rancher/rke2/agent/etc/containerd/config.toml &>/dev/null && info "GPU containerd changes already applied" && exit 0
awk '1;/plugins.cri.containerd]/{print "  default_runtime_name = \\"nvidia-container-runtime\\""}' /var/lib/rancher/rke2/agent/etc/containerd/config.toml > /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
echo -e '\
[plugins.linux]\
  runtime = "nvidia-container-runtime"' >> /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
echo -e '\
[plugins.cri.containerd.runtimes.nvidia-container-runtime]\
  runtime_type = "io.containerd.runc.v2"\
  [plugins.cri.containerd.runtimes.nvidia-container-runtime.options]\
    BinaryName = "nvidia-container-runtime"' >> /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
EOFsudo bash gpu_containerd.shcat <<EOF > gpu_containerd.sh
if ! nvidia-smi &>/dev/null;
then
  echo "GPU Drivers are not installed on the VM. Please refer the documentation."
  exit 0
fi
if ! which nvidia-container-runtime &>/dev/null;
then
  echo "Nvidia container runtime is not installed on the VM. Please refer the documentation."
  exit 0 
fi
grep "nvidia-container-runtime" /var/lib/rancher/rke2/agent/etc/containerd/config.toml &>/dev/null && info "GPU containerd changes already applied" && exit 0
awk '1;/plugins.cri.containerd]/{print "  default_runtime_name = \\"nvidia-container-runtime\\""}' /var/lib/rancher/rke2/agent/etc/containerd/config.toml > /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
echo -e '\
[plugins.linux]\
  runtime = "nvidia-container-runtime"' >> /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
echo -e '\
[plugins.cri.containerd.runtimes.nvidia-container-runtime]\
  runtime_type = "io.containerd.runc.v2"\
  [plugins.cri.containerd.runtimes.nvidia-container-runtime.options]\
    BinaryName = "nvidia-container-runtime"' >> /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
EOFsudo bash gpu_containerd.sh
Now run below command to restart rke2-agent
[[ "$(sudo systemctl is-enabled rke2-server 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-server
[[ "$(sudo systemctl is-enabled rke2-agent 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-agent[[ "$(sudo systemctl is-enabled rke2-server 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-server
[[ "$(sudo systemctl is-enabled rke2-agent 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-agent

Enabling the GPU driver post-installation

Run the below commands from any of the primary server nodes.

Navigate to UiPathAutomationSuite folder.
cd /opt/UiPathAutomationSuitecd /opt/UiPathAutomationSuite

Enable in online install

DOCKER_REGISTRY_URL=$(cat defaults.json | jq -er ".registries.docker.url")
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonsetDOCKER_REGISTRY_URL=$(cat defaults.json | jq -er ".registries.docker.url")
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset

Enable in offline install

DOCKER_REGISTRY_URL=localhost:30071
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonsetDOCKER_REGISTRY_URL=localhost:30071
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset

GPU taints

GPU workloads get scheduled on GPU nodes automatically when a workload requests for it. But normal CPU workloads also might get scheduled on these nodes, reserving the capacity. If you want only GPU workloads to be scheduled on these nodes you can add taints to these nodes using following commands from the first node.

  • nvidia.com/gpu=present:NoSchedule - non-GPU workloads do not get scheduled on this node unless explicitly specified
  • nvidia.com/gpu=present:PreferNoSchedule - this makes it a preferred condition rather than a hard one like the first option
Replace <node-name> with the corresponding GPU node name in your cluster and <taint-name> with one of the above 2 options in following command
kubectl taint node <node-name> <taint-name>kubectl taint node <node-name> <taint-name>

Validating GPU node provisioning

To ensure you have added the GPU nodes successfully, run the following command in the terminal. The output should show nvidia.com/gpu as an output along with the CPU and RAM resources.
kubectl describe node <node-name>kubectl describe node <node-name>

Was this page helpful?

Get The Help You Need
Learning RPA - Automation Courses
UiPath Community Forum
Uipath Logo White
Trust and Security
© 2005-2024 UiPath. All rights reserved.