Wow, after many years with Linux I just stubled upon some ultra useful functionality and now I feel kind of stupid because I didn’t know about it all that time, lol. :)
I have a multi-threaded application – namely Asterisk, a software PBX – that was always at 250% CPU for many months without a visible reason. Googling for Asterisk + high CPU brought me to this site: https://moythreads.com/wordpress/2009/05/06/why-does-asterisk-consume-100-cpu
Basically everything is already explained there but I’ll give another example. Here sample output from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16780 asterisk -11 0 4681m 115m 10m S 207.5 3.0 649:34.98 asterisk
So, Asterisk going berserk at 207% CPU. Let’s install pstack:
yum install gdb
Let’s look at all the threads of Asterisk, PID 16780, using the magic parameters -LlFm to ps which will show all threads of that process:
ps -LlFm 16780
The output will be something like the following. Look for the C column which means CPU usage:
F S UID PID PPID LWP C NLWP PRI NI ADDR SZ WCHAN RSS PSR STIME TTY TIME CMD
[...]
1 S asterisk - - 32498 0 - 49 - - - futex_ - 7 04:15 - 00:00:00 -
1 S asterisk - - 32499 0 - 49 - - - poll_s - 1 04:15 - 00:00:00 -
1 S asterisk - - 418 0 - 49 - - - inotif - 4 04:33 - 00:00:00 -
1 R asterisk - - 3967 99 - 49 - - - - - 0 05:59 - 12:14:44 -
1 R asterisk - - 4367 99 - 49 - - - - - 1 06:05 - 12:08:27 -
1 S asterisk - - 22668 0 - 49 - - - poll_s - 4 16:40 - 00:00:19 -
1 S asterisk - - 23627 0 - 49 - - - poll_s - 7 17:20 - 00:00:12 -
1 S asterisk - - 23641 0 - 49 - - - poll_s - 2 17:20 - 00:00:11 -
[...]
Notice those two entries with 99 CPU: LWP 3967 and 4367! Let’s look at these little f*ckers in more detail:
pstack 16780 > /tmp/asterisk.stack.txt
Let’s look into /tmp/asterisk.stack.txt and search for our two LWPs 3967 and 4367…
Thread 35 (Thread 0x7ff211a9b700 (LWP 3967)):
#0 0x00007ff248cb96ec in recv () from /lib64/libc.so.6
#1 0x00007ff23c07b8b1 in ooSocketRecv () from /usr/lib64/asterisk/modules/chan_ooh323.so
#2 0x00007ff23c06461f in ooH2250Receive () from /usr/lib64/asterisk/modules/chan_ooh323.so
#3 0x00007ff23c064fba in ooProcessCallFDSETsAndTimers () from /usr/lib64/asterisk/modules/chan_ooh323.so
#4 0x00007ff23c06518e in ooMonitorCallChannels () from /usr/lib64/asterisk/modules/chan_ooh323.so
#5 0x00007ff23c14ed95 in ooh323c_call_thread () from /usr/lib64/asterisk/modules/chan_ooh323.so
#6 0x000000000057a1a8 in dummy_start ()
#7 0x00007ff2476eaaa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ff248cb893d in clone () from /lib64/libc.so.6
Thread 34 (Thread 0x7ff21195e700 (LWP 4367)):
#0 0x00007ff248cb96ec in recv () from /lib64/libc.so.6
#1 0x00007ff23c07b8b1 in ooSocketRecv () from /usr/lib64/asterisk/modules/chan_ooh323.so
#2 0x00007ff23c06461f in ooH2250Receive () from /usr/lib64/asterisk/modules/chan_ooh323.so
#3 0x00007ff23c064fba in ooProcessCallFDSETsAndTimers () from /usr/lib64/asterisk/modules/chan_ooh323.so
#4 0x00007ff23c06518e in ooMonitorCallChannels () from /usr/lib64/asterisk/modules/chan_ooh323.so
#5 0x00007ff23c14ed95 in ooh323c_call_thread () from /usr/lib64/asterisk/modules/chan_ooh323.so
#6 0x000000000057a1a8 in dummy_start ()
#7 0x00007ff2476eaaa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ff248cb893d in clone () from /lib64/libc.so.6
Aha! Something about ooh323 which is a module responsible for H.323 which we really don’t need in 2019. Let’s deactivate the culprit and put this into /etc/asterisk/modules.conf:
noload => chan_ooh323.so
Restart asterisk and voila:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25893 asterisk -11 0 4601m 64m 10m S 14.3 1.7 3:16.92 asterisk
14% CPU is about what Asterisk really consumes given the load. Problem solved. pstack and ps to the rescue!