Today, while using kill to stop a Python script that was training a model, I forgot to add -9. As a result, processes were still occupying Nvidia GPU memory, as shown below.
# nvidia-smi output before kill
Tue Feb xx
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 49x.xx.05 Driver Version: 49x.xx.05 CUDA Version: 11.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... Off | 00000000:3D:00.0 Off | 0 |
| N/A 27C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... Off | 00000000:41:00.0 Off | 0 |
| N/A 28C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100S-PCI... Off | 00000000:45:00.0 Off | 0 |
| N/A 26C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100S-PCI... Off | 00000000:47:00.0 Off | 0 |
| N/A 59C P0 66W / 250W | 14227MiB / 32510MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 3 N/A N/A 26538 C python 14223MiB |
+-----------------------------------------------------------------------------+
After terminating the process with kill 26538, GPU memory was still occupied: GPU 3 still reports 14227MiB in use, yet the process list below is empty. This is likely because plain kill only sends SIGTERM, and subprocesses spawned by the training script kept holding the CUDA context.
# nvidia-smi output after kill
Tue Feb xx
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 49x.xx.05 Driver Version: 49x.xx.05 CUDA Version: 11.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... Off | 00000000:3D:00.0 Off | 0 |
| N/A 27C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... Off | 00000000:41:00.0 Off | 0 |
| N/A 28C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100S-PCI... Off | 00000000:45:00.0 Off | 0 |
| N/A 26C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100S-PCI... Off | 00000000:47:00.0 Off | 0 |
| N/A 59C P0 66W / 250W | 14227MiB / 32510MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| |
+-----------------------------------------------------------------------------+
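In cases like this, the leftover processes are usually still visible to ordinary process tools even though nvidia-smi no longer lists them. As a quick first check, something like the following can be used (a minimal sketch; train.py is the script name that appears in the example further below):

# list python processes still running, regardless of which GPU they hold
$ ps -ef | grep python | grep -v grep

# or match on the training script's command line
$ pgrep -af train.py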
Solution: use the fuser command
Use the fuser command to list the processes that are holding the Nvidia GPU device files, then kill them one by one, as shown below.
$ fuser -v /dev/nvidia*
USER PID ACCESS COMMAND
/dev/nvidia3: gzdx 13077 F...m python
gzdx 13369 F...m python
/dev/nvidiactl: gzdx 13077 F...m python
gzdx 13369 F...m python
/dev/nvidia-uvm: gzdx 13077 F...m python
gzdx 13369 F...m python
$ ps -ef | grep 13077
gzdx 13077 1 0 16:48 ? 00:00:03 python train.py
gzdx 15828 81724 0 17:01 pts/1 00:00:00 grep --color=auto 13077
$ ps -ef | grep 13369
gzdx 13369 1 0 16:48 ? 00:00:02 python train.py
gzdx 15858 81724 0 17:01 pts/1 00:00:00 grep --color=auto 13369
$ kill -9 13077
$ kill -9 13369
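If there are many leftover processes, killing them one by one becomes tedious. Two shortcuts that may help (a sketch, not part of the original workflow above): fuser -k sends SIGKILL by default to every process that has the device files open, so be careful on shared machines; and killing the training script's whole process group in the first place avoids orphaned workers, assuming the workers stay in the parent's process group (the default unless they call setsid).

# kill every process currently holding the Nvidia device files
# (-k sends SIGKILL by default, -v prints what gets signalled)
$ fuser -vk /dev/nvidia*

# prevention: kill the training script's entire process group, so that
# data-loading workers and other children do not survive the parent
# (26538 is the main training PID from the first nvidia-smi output above)
$ kill -9 -- -$(ps -o pgid= -p 26538 | tr -d ' ')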
Document information
- Author: Bookstall
- Link: https://bookstall.github.io/fragment/2023-02-28-find-process-run-in-nvidia-gpu/
- Copyright: free to reproduce for non-commercial use, no derivatives, attribution required (Creative Commons 3.0 license)