High Utilization in NVIDIA GPU

Problem:

In recent GPUs, you may notice that somehow the GPU is getting high utilization while there is no process running. Based on explaination in http://docs.nvidia.com/deploy/driver-persistence/, this is happened because the kernel module is loaded but the GPU is not initialized yet. By default, GPU will be initialized when there is a GPU process start working on it, and then deinitialized when the process is completed.

sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:01:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M40 Off | 0000:02:00.0 Off | 0 |
| 0% 33C P0 67W / 250W | 55MiB / 11519MiB | 67% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

 

Solution:

We can keep the GPU to be initialized all the time. This is called “persistence mode”. To enable it:
sysadmin@sap-dl:~$ sudo nvidia-smi -i 0 -pm 1
[sudo] password for sysadmin:
Enabled persistence mode for GPU 0000:02:00.0.
All done.

sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:02:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M40 On | 0000:02:00.0 Off | 0 |
| 0% 36C P0 66W / 250W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

However, this setting will be reset when the server is rebooted. Hence, we will need to set it in rc.local to make it starts during startup.

Limiting CPU Usage of A Process in CentOS/RHEL 7

In HPC, we may need to protect head node from unnecessary heavy process that may cause login problem for users. One of the solutions is by using cpulimit. We can create a cronjob to monitor all processes and set certain limit for them. This is how I usually did in CentOS/RHEL 7.x.

  1. Install cpulimit package from EPEL repo.

yum install cpulimit

  1. Create a script to monitor the process. The script below is a modified version of the script in this forum. You can modify inputs of the first 3 variables: CPU_LIMIT, BLACK_PROCESSES_LIST, and WHITE_PROCESSES_LIST.

Continue reading