Killing Processes Occupying a Specific Nvidia GPU

Today, while using kill to stop a Python script that was training a model, I forgot to add -9, and as a result some processes were left holding Nvidia GPU memory, as shown below.

# nvidia-smi output before kill
Tue Feb xx       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 49x.xx.05    Driver Version: 49x.xx.05    CUDA Version: 11.x     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  Off  | 00000000:45:00.0 Off |                    0 |
| N/A   26C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100S-PCI...  Off  | 00000000:47:00.0 Off |                    0 |
| N/A   59C    P0    66W / 250W |  14227MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    3   N/A  N/A     26538      C   python                          14223MiB |
+-----------------------------------------------------------------------------+
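
As an aside, the PID shown in the Processes table can also be queried directly instead of being read off the table. A minimal sketch, assuming GPU index 3 as above and using nvidia-smi's regular query flags:

# Print the PID and name of each compute process on GPU 3, without the table decoration
$ nvidia-smi -i 3 --query-compute-apps=pid,process_name --format=csv,noheader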

After killing the process with kill 26538, the GPU memory was still occupied, even though nvidia-smi no longer listed any process:

Tue Feb xx       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 49x.xx.05    Driver Version: 49x.xx.05    CUDA Version: 11.x     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  Off  | 00000000:45:00.0 Off |                    0 |
| N/A   26C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100S-PCI...  Off  | 00000000:47:00.0 Off |                    0 |
| N/A   59C    P0    66W / 250W |  14227MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|                                                                             |
+-----------------------------------------------------------------------------+
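
The plain kill used above sends SIGTERM (signal 15), which a process can catch, ignore, or fail to pass on to its children; kill -9 sends SIGKILL, which cannot be caught or ignored. A minimal illustration, using the PID from the output above:

# kill with no signal sends SIGTERM (15); the process may trap it or leave children behind
$ kill 26538
# kill -9 sends SIGKILL, which cannot be caught or ignored
$ kill -9 26538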

Solution: use the fuser command

Use the fuser command to find the processes that are still holding the Nvidia GPU device files, then kill each of them, as shown below.

$ fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia3:        gzdx      13077 F...m python
                     gzdx      13369 F...m python
/dev/nvidiactl:      gzdx      13077 F...m python
                     gzdx      13369 F...m python
/dev/nvidia-uvm:     gzdx      13077 F...m python
                     gzdx      13369 F...m python
$ ps -ef | grep 13077
gzdx     13077     1  0 16:48 ?        00:00:03 python train.py
gzdx     15828 81724  0 17:01 pts/1    00:00:00 grep --color=auto 13077

$ ps -ef | grep 13369
gzdx     13369     1  0 16:48 ?        00:00:02 python train.py
gzdx     15858 81724  0 17:01 pts/1    00:00:00 grep --color=auto 13369

$ kill -9 13077
$ kill -9 13369
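
If there are many leftover processes, fuser can also send the signal itself instead of killing each PID by hand. A minimal sketch, assuming GPU 3 (/dev/nvidia3) is the device to clear and that every process holding it really should be killed:

# -k sends SIGKILL to every process that has /dev/nvidia3 open;
# -i asks for confirmation before each kill, which is safer on a shared machine
$ fuser -k -i /dev/nvidia3

Note that /dev/nvidiactl and /dev/nvidia-uvm are opened by processes using any of the GPUs, so it is safer to target the per-GPU device file (/dev/nvidia3 here) rather than those shared devices.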
