Bug #14110
closed
crunch-dispatch-slurm is DoSing SLURM
Description
Now that locking has been relaxed in crunch-dispatch-slurm (we are now running a c-d-s built against master @ ac2cc876733c6137d525d12780275f2c02d84383), it appears to be effectively mounting a denial-of-service attack on our SLURM control daemon.
# squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
It might be good to limit the number of concurrent sbatch, scancel, and scontrol operations that c-d-s runs?
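One way to picture the kind of limit suggested above is a counting semaphore wrapped around every external command. The following is only a minimal sketch in Python, not the c-d-s implementation (c-d-s is written in Go) and not the fix that was eventually merged; the class name `CommandLimiter` and its methods are hypothetical and exist only for illustration.

```python
import subprocess
import threading


class CommandLimiter:
    """Caps how many external commands run at the same time.

    Hypothetical helper for illustration; not part of Arvados or SLURM.
    """

    def __init__(self, max_concurrent):
        # Each running command holds one slot; extra callers block.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Wait for a free slot, run fn while holding it, then release.
        with self._slots:
            return fn(*args, **kwargs)

    def run_command(self, argv, timeout=None):
        # Run an external command (for example a scancel invocation)
        # while holding one slot, capturing its output as text.
        return self.call(
            subprocess.run,
            argv,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
```

With a limiter shared by all goroutine-like workers, a burst of several hundred cancellations would queue inside the dispatcher instead of all hitting the SLURM control daemon at once. The right value for `max_concurrent` is site-specific and would need to be found by experiment.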
Updated by Joshua Randall over 6 years ago
The problem has subsided somewhat: the DoSing is now periodic rather than continuous. The good news is that we now have some information to help figure out the likely culprit...
It looks like the problem may be caused mostly by concurrent `scancel` runs. There are also concurrent `squeue` runs, but those are more likely a symptom than a cause: they are taking a long time to complete, and c-d-s launches a new one every PollPeriod (10s on our system):
$ while true; do ps -e -o comm= | egrep '^(scontrol|sbatch|scancel|squeue|sinfo)$' | sort | uniq -c; /usr/bin/time -f "%e" squeue > /dev/null; echo; sleep 20; done
1 scontrol
0.57

108 scancel
1 scontrol
1 sinfo
3 squeue
56.21

497 scancel
1 scontrol
1 sinfo
14 squeue
55.73

153 scancel
1 scontrol
14 squeue
19.89

608 scancel
1 scontrol
15 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
60.01

527 scancel
1 scontrol
1 sinfo
14 squeue
84.59

431 scancel
1 scontrol
15 squeue
53.64

318 scancel
1 scontrol
1 sinfo
14 squeue
47.97

13 scancel
1 scontrol
11 squeue
2.95

156 scancel
1 scontrol
11 squeue
16.02

42 scancel
1 scontrol
12 squeue
6.11
Updated by Joshua Randall over 6 years ago
1 scontrol
1 squeue
0.43

0.38

29 scancel
1 squeue
3.47

70 scancel
2 squeue
11.88

1674 sbatch
2664 scancel
1 scontrol
1 sinfo
11 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
63.05

1440 scancel
1 scontrol
15 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
60.07

1030 scancel
1 scontrol
11 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
63.05
Updated by Joshua Randall over 6 years ago
Incidentally, pretty much all of the SLURM commands that c-d-s is issuing concurrently are failing:
# tail -f /var/log/syslog | grep crunch-dispatch-slurm | grep error | head
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-hgseq4jj6nrf674" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation"
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-vl7g2x3cdhms5x7" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation"
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-u2ucq5xx9p9zwm7" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation"
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-tz6s7m3wci0lubl" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation"
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-e26sn3tssp9jki4" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation"
Updated by Joshua Randall over 6 years ago
I have implemented a fix for this and am testing it now.
Updated by Joshua Randall over 6 years ago
Fix appears to have resolved all issues: https://github.com/curoverse/arvados/pull/80
Updated by Joshua Randall over 6 years ago
After this fix, c-d-s is able to cancel 20-30 containers per second (rather than close to 0 without it):
Aug 23 20:08:17 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:17 container eglyx-dz642-vlgrhhin46kiulf is done: cancel slurm job [53/1007]
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-50kn85fspvzpfjt is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-axh2mzzixs1adwo is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 Start monitoring container eglyx-dz642-fdw2o8fjd5zzya6 in state "Complete"
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-gcdf3gpjzl6o698 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-gce33osvzwpg64y is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-j6wnhm6a7g52s84 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-utrr0tcfm92pwdp is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-9aqb0qk454crjpe is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-t4l3sun2nqsnx3h is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-axmp03rf7tn745h is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-znovgekyc4i8y89 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-l8njhtmw5uty8ku is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-1gop2zujkotb5s3 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-axtahv9g2kqzmvm is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-760mc7t8zs5jq3f is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-nw22ethee7kjqpd is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-63xl96tauyow4k1 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-mz70nrmrqukyoa3 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-619j03hocr5s9s2 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-at73drsraltfx4u is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-ojnf7cmwnwf8b0w is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-bw7jdwfg64vxx1m is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-ncozkktwelas9dx is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-w17titg3ktyh93h is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-jti36e2caoi45g0 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-dkjo6b8rv8manbp is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-hltkg6qhqv58v74 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ofbrn0scu6v0qj3 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-a65nsc9ssd26shy is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-22x9ehy4pz66z1i is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-7ffkiz1lt542av3 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-u8slyngxnsburpw is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ph5zqcka7sex952 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-k4pticgidsvqh89 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-1by9vgy4xyq366g is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-q46103c1dimc9ve is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-tp769xhvmhkalzy is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-8zbjtpfvocxe4mn is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-8tbcegm7xxmqafw is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-b1mllj4fyywuxsf is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ltwsi49j1co15uh is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-0jfbh5i6gm4t70z is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-l3hl61kdm7tb68s is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-tlwllwm5m0k6aat is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-u0yr2evodxnvih6 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-bqi2557yjdh98g9 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-pay7jbhr837ghpv is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-yb84hvcy21jlhrz is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-2ptlbnwxwqye01d is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ed9jsmyicg2ou9z is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-bodp3jwpv9sbh2b is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-xg35sjt7xw8suzu is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-ny74spj2gwq539m is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-qmeoopeyqk2tmcq is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-5odg3n3nrpddbi7 is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-l85g1gwr2hyf0fn is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-ey0ek4co49gw0ub is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-9mzt5j1gipz62wa is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-yxph79y15yeffbg is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-mc1eqo8uwxib9y5 is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-aw8lhl38ybz7ows is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-daiq3w8vmqq34tx is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-oygai68qp7nqja8 is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-c5xh24sit9wo0fn is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-2t2eaxdbtnyq9az is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-dypk9m6rcrdfppf is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-mwfv3z3a6sdk3qx is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-jubow08z70imfqa is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-sdc6q02xibm4hdj is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-t7ce4i2p2d58pew is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-4jr1ypk61bofzum is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-7ikp3xtiplnnnzo is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-tpz3vhdaasy3muc is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-0bqjpy9moott62i is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-j27y5xchso0zmdp is done: cancel slurm job
Aug 23 20:08:21 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:21 container eglyx-dz642-n2lh4go6s02piig is done: cancel slurm job
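For anyone wanting to check a per-second cancel rate against a log excerpt like the one above, a small tally by timestamp is enough. This is a rough sketch in Python, assuming the message format shown in the excerpt ("container ... is done: cancel slurm job" preceded by a `YYYY/MM/DD HH:MM:SS` timestamp); the function name `cancels_per_second` is made up for this example.

```python
import re
from collections import Counter

# Captures the second-resolution timestamp that precedes each cancel message.
# Whitespace is matched loosely so wrapped log text is still recognised.
CANCEL_RE = re.compile(
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+container\s+\S+"
    r"\s+is\s+done:\s+cancel\s+slurm\s+job"
)


def cancels_per_second(log_text):
    """Return a Counter mapping each timestamp to its number of cancel messages."""
    return Counter(CANCEL_RE.findall(log_text))
```

Passing the syslog excerpt as one string to `cancels_per_second` gives a count per second. The first and last seconds of any excerpt are likely to be only partially captured, so the complete seconds in between are the ones to compare.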
Updated by Peter Amstutz over 6 years ago
- Status changed from New to Resolved