I was running some CUDA jobs that use a lot of both VRAM and RAM. At some point the system ran out of memory, and the OOM killer killed one of these processes.
ps now shows that process as defunct, with a PPID of 1 and in state "Zl".
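As a side note on the "Zl" state: in ps, "Z" means zombie and "l" means the process is multi-threaded. One possible explanation (an assumption, not something confirmed here) is that the main thread has exited while another thread of the same process is still stuck, for example in uninterruptible sleep ("D"). A minimal Python sketch for checking the state of each thread through /proc could look like this. The function names `parse_state` and `thread_states` are made up for illustration.

```python
import os

def parse_state(stat_line):
    # A stat line looks like "pid (comm) state ...". The comm field may
    # contain spaces or ')', so split on the LAST ')' rather than on whitespace.
    return stat_line.rsplit(")", 1)[1].split()[0]

def thread_states(pid, proc_root="/proc"):
    """Return {tid: state letter} for every thread of `pid`."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    states = {}
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                states[int(tid)] = parse_state(f.read())
        except OSError:
            continue  # thread disappeared while we were reading
    return states
```

Calling `thread_states(12345)` (12345 being a placeholder for the defunct pid) returns a dict mapping each thread id to its state letter. Any thread that is not in state "Z" has not finished exiting yet.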
In the past, everyone has told me "zombies don't use any resources, you don't have to worry about them".
Except for one small issue: the VRAM allocated by this process is still in use! The pid doesn't show up in nvidia-smi, but nvidia-smi does show the VRAM as in use. This is a big problem. I'm using this system for a lot of other tasks, so a reboot would mean major downtime. The fact that a zombie is holding system resources and currently blocking my work seems to contradict everything I was taught about zombie processes.
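Since nvidia-smi no longer lists the pid, one way to cross-check is to look for any task that still has an NVIDIA device file open, roughly what `fuser -v /dev/nvidia*` reports. The sketch below is a diagnostic only, not a fix. It assumes the device nodes live under /dev/nvidia*, the function name `device_holders` is hypothetical, and it needs root to see other users' file descriptors.

```python
import os

def device_holders(prefix="/dev/nvidia", proc_root="/proc"):
    """Return {pid: set of device paths} for tasks with matching open files."""
    holders = {}
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited or access denied
        for tid in tids:
            fd_dir = os.path.join(task_dir, tid, "fd")
            try:
                fds = os.listdir(fd_dir)
            except OSError:
                continue
            for fd in fds:
                try:
                    target = os.readlink(os.path.join(fd_dir, fd))
                except OSError:
                    continue  # fd was closed in the meantime
                if target.startswith(prefix):
                    holders.setdefault(int(pid), set()).add(target)
    return holders
```

If the defunct pid (or one of its threads) appears in the result, the device is still held open from user space. If nothing appears, the memory is presumably being held inside the driver itself.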
How can I manually free the VRAM so I can get back up and running without rebooting? I don't mind compiling and running some C code if that would fix the issue.
Statistics: Posted by kerryhall on 2025-04-09 01:33