Bug #8323

[API] Puma hangs forever on a futex, requiring restart

Added by Nico C├ęsar over 1 year ago. Updated 12 days ago.

Status:ResolvedStart date:01/29/2016
Priority:NormalDue date:
Assignee:-% Done:

0%

Category:API
Target version:-
Story points-
Velocity based estimate-

Description

puma process is behaving like the following:

top - 20:22:26 up 52 days,  7:39,  1 user,  load average: 2.47, 2.36, 2.42
Threads:   7 total,   1 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.0 us,  2.2 sy,  0.0 ni, 65.4 id,  1.3 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  28811768 total, 27909092 used,   902676 free,   155872 buffers
KiB Swap:        0 total,        0 used,        0 free. 21708188 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ nTH WCHAN      COMMAND                                                                                   
 2395 www-data  20   0 3310868 856216  11496 R 99.2  3.0   2525:32   7 -          ruby                                                                                      
 2279 www-data  20   0 3310868 856216  11496 S  0.0  3.0   0:03.06   7 futex_wai+ ruby                                                                                      
 2360 www-data  20   0 3310868 856216  11496 S  0.0  3.0   0:23.95   7 poll_sche+ ruby-timer-thr                                                                            
 2404 www-data  20   0 3310868 856216  11496 S  0.0  3.0   0:02.41   7 poll_sche+ ruby                                                                                      
 2405 www-data  20   0 3310868 856216  11496 S  0.0  3.0   0:09.23   7 futex_wai+ ruby                                                                                      
 2406 www-data  20   0 3310868 856216  11496 S  0.0  3.0   0:00.10   7 poll_sche+ ruby                                                                                      
 2534 www-data  20   0 3310868 856216  11496 S  0.0  3.0   1:49.18   7 futex_wai+ ruby                                                                                      

Look at WCHAN that has nothing in a process that has almost 100% of CPU

I'm comparing this to 2 healthy puma processes in other API server:

top - 20:24:08 up 57 days, 22:03,  1 user,  load average: 1.79, 1.64, 1.45
Threads:   7 total,   0 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.2 us,  1.4 sy,  0.0 ni, 80.1 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  28811768 total, 28386492 used,   425276 free,   126632 buffers
KiB Swap:        0 total,        0 used,        0 free. 22583144 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ nTH WCHAN      COMMAND                                                                                   
 1880 www-data  20   0  725132 144572  10940 S 11.0  0.5 825:43.18   7 poll_sche+ ruby                                                                                      
 1404 www-data  20   0  725132 144572  10940 S  0.0  0.5   0:03.18   7 futex_wai+ ruby                                                                                      
 1619 www-data  20   0  725132 144572  10940 S  0.0  0.5   0:04.50   7 poll_sche+ ruby-timer-thr                                                                            
 1886 www-data  20   0  725132 144572  10940 S  0.0  0.5   0:10.08   7 poll_sche+ ruby                                                                                      
 1887 www-data  20   0  725132 144572  10940 S  0.0  0.5   1:27.25   7 futex_wai+ ruby                                                                                      
 1888 www-data  20   0  725132 144572  10940 S  0.0  0.5   0:00.33   7 poll_sche+ ruby                                                                                      
 1935 www-data  20   0  725132 144572  10940 S  0.0  0.5   0:33.05   7 poll_sche+ ruby                                                                                      
top - 20:24:45 up 52 days, 23:37,  1 user,  load average: 0.56, 0.70, 0.67
Threads:   7 total,   0 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.5 us,  0.8 sy,  0.0 ni, 96.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  28811768 total, 13967572 used, 14844196 free,   162308 buffers
KiB Swap:        0 total,        0 used,        0 free. 10219944 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ nTH WCHAN      COMMAND                                                                                   
14792 www-data  20   0  928992 263040  11348 S  0.0  0.9   0:02.88   7 futex_wai+ ruby                                                                                      
14876 www-data  20   0  928992 263040  11348 S  0.0  0.9   0:02.19   7 poll_sche+ ruby-timer-thr                                                                            
14907 www-data  20   0  928992 263040  11348 S  0.0  0.9 413:08.24   7 poll_sche+ ruby                                                                                      
14918 www-data  20   0  928992 263040  11348 S  0.0  0.9   0:05.22   7 poll_sche+ ruby                                                                                      
14919 www-data  20   0  928992 263040  11348 S  0.0  0.9   1:28.96   7 futex_wai+ ruby                                                                                      
14920 www-data  20   0  928992 263040  11348 S  0.0  0.9   0:00.44   7 poll_sche+ ruby                                                                                      
14952 www-data  20   0  928992 263040  11348 S  0.0  0.9   0:25.55   7 poll_sche+ ruby                                                                                      

my feeling is that the master puma thread is doing something that will never ask the kernel to wait for a resources ( something like while true: pass will cause this)

making 1 core hot and thus delaying any other processes in that box.

There could be the possibility that the "worker" threads are working and if everything goes well the functionality isn't affected at all. (maybe only in the shutdown process).

just another hint that the process isn't calling anything in kernelspace:

wx7k5:~# cat /proc/2395/stack 
[<ffffffffffffffff>] 0xffffffffffffffff

while the others:

su92l:~# cat /proc/14792/stack
[<ffffffff810ede0e>] futex_wait_queue_me+0xde/0x140
[<ffffffff810ee9a2>] futex_wait+0x182/0x290
[<ffffffff810f0a8e>] do_futex+0xde/0x5d0
[<ffffffff810f0ff1>] SyS_futex+0x71/0x150
[<ffffffff817b6dcd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

(all libraries are stripped so we have no symbols to do a ptrace. :/ )

How can we debug further this? (and avoid the 100%CPU usage)


Related issues

Duplicated by Arvados - Bug #8277: puma stops responding: "Connection reset by peer" Duplicate 01/21/2016

History

#1 Updated by Brett Smith about 1 year ago

  • Subject changed from [API] strange puma behavior to [API] Puma hangs forever on a futex, requiring restart
  • Category set to API

#2 Updated by Brett Smith about 1 year ago

Went digging through Puma's issues to see if it's been reported. This is the closest, although I don't understand the resolution. This isn't far, although it doesn't mention some of the specific underlying symptoms we've seen.

#3 Updated by Brett Smith about 1 year ago

I hate to make it something so simple but we might want to consider trying the latest Puma. We're using 2.8.2. There have been several substantial releases since then, and some of those are referenced in these GitHub issues.

#4 Updated by Lucas Di Pentima 12 days ago

  • Status changed from New to Resolved

Also available in: Atom PDF