Notifying users of processes killed by the in-kernel OOM killer
Posted on Wed 31 August 2022 in Sysadmin
Updates:
- 2023-05-01: Moved code to GitLab. Corrections to reflect how new version works.
- 2022-09-09: Updated script to add Bcc option.
We've got servers with up to 512GB of RAM, and we've got users who periodically use it all up. Their process then gets killed by the kernel, and the user doesn't know it happened. They come back a few days later to find their job died. It would be nice if we could notify them right when this happens.
The problem is knowing which user was running the program that got killed. This information is not available when the kernel logs the OOM kill.
I solved this in the following way:
- Use auditd to keep a record of all process executions.
- Use rsyslog to run a script when an OOM message is logged. For example:
  Aug 30 14:35:36 jammy kernel: [95970.737826] Out of memory: Killed process 105857 (tail) total-vm:695
- The script then queries the auditd logs by PID to get the UID and command name, and looks up the username from the UID. How you get an email address from the username depends on your system (we've got an LDAP/NSSwitch setup). The script looks up the email address, then sends an email. The script example below can use ldapsearch to get the email address, or just assume that mail sent to the username will go to the right place.
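As a rough sketch of the first thing such a script has to do, the Python below pulls the PID and command name out of the kernel message. The regular expression is an assumption based on the single sample line above, and parse_oom_line is a made-up name for illustration, not something taken from the script in the repository.

```python
import re

# Assumed message shape, based on the sample line in this post:
#   "Out of memory: Killed process <pid> (<comm>) ..."
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
)

SAMPLE = (
    "Aug 30 14:35:36 jammy kernel: [95970.737826] Out of memory: "
    "Killed process 105857 (tail) total-vm:695"
)


def parse_oom_line(line):
    """Return (pid, comm) from an OOM kill log line, or None if it does not match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("comm")


if __name__ == "__main__":
    print(parse_oom_line(SAMPLE))
```

Returning None for lines that do not match lets the caller ignore unrelated kernel messages instead of failing on them.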
You can create your own script. It just needs to:
- Grab the PID from the syslog line passed as the first argument
- Run:
  ausearch -p $PID
  and grab the UID or AUID from this line:
  type=SYSCALL msg=audit(1661888129.477:153157): arch=c000003e syscall=59 success=yes exit=0 a0=556ce2d2cb10 a1=556ce2d2d220 a2=556ce2d507d0 a3=8 items=2 ppid=105855 pid=105857 auid=2185 uid=2185 gid=100 euid=2185 suid=2185 fsuid=2185 egid=100 sgid=100 fsgid=100 tty=pts1 ses=2177 comm="tail" exe="/usr/bin/tail" subj=? key=(null)
- Look up the username from the UID:
  id -un $UID
- Look up the email address (site dependent)
- Send the message
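The steps above could be sketched in Python roughly as follows. This is a minimal illustration, not the code from the repository: all function names are hypothetical, the parser assumes ausearch output with numeric IDs (no -i interpretation) listed in chronological order, and the sender address and message wording are placeholders.

```python
import pwd
import re
from email.message import EmailMessage

# key=value pairs as they appear in an audit record; values may be quoted.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

# The SYSCALL record shown in this post, as sample input.
SAMPLE = (
    "type=SYSCALL msg=audit(1661888129.477:153157): arch=c000003e syscall=59 "
    "success=yes exit=0 a0=556ce2d2cb10 a1=556ce2d2d220 a2=556ce2d507d0 a3=8 "
    "items=2 ppid=105855 pid=105857 auid=2185 uid=2185 gid=100 euid=2185 "
    "suid=2185 fsuid=2185 egid=100 sgid=100 fsgid=100 tty=pts1 ses=2177 "
    'comm="tail" exe="/usr/bin/tail" subj=? key=(null)\n'
)

UNSET_AUID = "4294967295"  # value auditd records when no login UID is set


def parse_syscall_record(record):
    """Turn the key=value pairs of one audit record into a dict."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(record)}


def uid_for_pid(ausearch_output, pid):
    """Return the AUID (or UID if the AUID is unset) for a PID, or None.

    PIDs get reused, so the last matching SYSCALL record wins
    (assumption: the output is in chronological order).
    """
    found = None
    for line in ausearch_output.splitlines():
        if not line.startswith("type=SYSCALL"):
            continue
        fields = parse_syscall_record(line)
        if fields.get("pid") != str(pid):
            continue
        auid = fields.get("auid")
        if auid and auid != UNSET_AUID:
            found = int(auid)
        else:
            found = int(fields["uid"])
    return found


def username_for_uid(uid):
    """Resolve a UID through NSS, like `id -un`; None if it is unknown."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def build_notice(address, pid, comm, sender="root@localhost"):
    """Build the notification email (sender and wording are placeholders)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = address
    msg["Subject"] = f"Your process {comm} (PID {pid}) was killed: out of memory"
    msg.set_content(
        f"The kernel OOM killer stopped your process {comm} (PID {pid}) "
        "because the machine ran out of memory."
    )
    return msg


if __name__ == "__main__":
    uid = uid_for_pid(SAMPLE, 105857)
    print(uid, username_for_uid(uid))
```

In a real script the text passed to uid_for_pid would come from running ausearch -p with the PID (for example through subprocess.run), which normally needs root. The address passed to build_notice would come from the site-specific lookup, and the message could then be handed to a local mail server with smtplib, assuming one is listening on the host.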
The GitLab repository has the latest code and more information: