Bug #14110

closed

crunch-dispatch-slurm is DoSing SLURM

Added by Joshua Randall over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Story points: -

Description

Now that locking has been relaxed in crunch-dispatch-slurm (we are now running a c-d-s built against master @ ac2cc876733c6137d525d12780275f2c02d84383), it appears to be effectively mounting a denial-of-service attack on our SLURM control daemon:

# squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

It might be good to limit the number of concurrent sbatch, scancel, and scontrol operations that c-d-s runs.
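
To illustrate the kind of limit suggested here, below is a minimal Go sketch of a counting semaphore wrapped around external commands. It is not the code from c-d-s or from any patch attached to this issue; the names `cmdLimiter`, `newCmdLimiter`, `do`, `inFlight`, and `runCommand` are hypothetical, and the cap value is an arbitrary assumption that would need tuning per site. `main` is omitted, so the functions are meant to be called from an existing program.

```go
package main

import "os/exec"

// cmdLimiter caps how many operations may run at the same time.
// Hypothetical type for illustration only; not part of c-d-s.
type cmdLimiter struct {
	slots chan struct{} // buffered channel used as a counting semaphore
}

// newCmdLimiter returns a limiter that allows at most max concurrent operations.
func newCmdLimiter(max int) *cmdLimiter {
	return &cmdLimiter{slots: make(chan struct{}, max)}
}

// do blocks until a slot is free, runs fn, then releases the slot.
func (l *cmdLimiter) do(fn func() error) error {
	l.slots <- struct{}{}        // acquire (blocks while max operations are running)
	defer func() { <-l.slots }() // release
	return fn()
}

// inFlight reports how many slots are currently held.
func (l *cmdLimiter) inFlight() int {
	return len(l.slots)
}

// runCommand runs an external command (for example scancel) under the limit
// and returns its combined stdout and stderr.
func (l *cmdLimiter) runCommand(name string, args ...string) ([]byte, error) {
	var out []byte
	err := l.do(func() error {
		var e error
		out, e = exec.Command(name, args...).CombinedOutput()
		return e
	})
	return out, err
}
```

With something like `limiter := newCmdLimiter(4)` shared by all goroutines, a call such as `limiter.runCommand("scancel", "--name="+uuid, "--state=pending")` would queue inside the dispatcher instead of piling up as hundreds of simultaneous client processes against the control daemon.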

Actions #1

Updated by Joshua Randall over 5 years ago

The problem has subsided somewhat: the DoSing is now periodic rather than continuous. The good news is that we now have some information pointing to the likely culprit.

It looks like the problem may be mostly caused by concurrent `scancel` runs. There are also concurrent `squeue` processes, but those are more likely a symptom than a cause: each one takes a long time to finish, and c-d-s launches a new one every PollPeriod, which is 10s on our system:

$ while true; do ps -e -o comm= | egrep '^(scontrol|sbatch|scancel|squeue|sinfo)$' | sort | uniq -c; /usr/bin/time -f "%e" squeue > /dev/null; echo; sleep 20; done
      1 scontrol
0.57

    108 scancel
      1 scontrol
      1 sinfo
      3 squeue
56.21

    497 scancel
      1 scontrol
      1 sinfo
     14 squeue
55.73

    153 scancel
      1 scontrol
     14 squeue
19.89

    608 scancel
      1 scontrol
     15 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
60.01

    527 scancel
      1 scontrol
      1 sinfo
     14 squeue
84.59

    431 scancel
      1 scontrol
     15 squeue
53.64

    318 scancel
      1 scontrol
      1 sinfo
     14 squeue
47.97

     13 scancel
      1 scontrol
     11 squeue
2.95

    156 scancel
      1 scontrol
     11 squeue
16.02

     42 scancel
      1 scontrol
     12 squeue
6.11
Actions #2

Updated by Joshua Randall over 5 years ago

      1 scontrol
      1 squeue
0.43

0.38

     29 scancel
      1 squeue
3.47

     70 scancel
      2 squeue
11.88

   1674 sbatch
   2664 scancel
      1 scontrol
      1 sinfo
     11 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
63.05

   1440 scancel
      1 scontrol
     15 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
60.07

   1030 scancel
      1 scontrol
     11 squeue
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Command exited with non-zero status 1
63.05
Actions #3

Updated by Joshua Randall over 5 years ago

Incidentally, pretty much all of the SLURM commands that c-d-s is issuing concurrently are failing:

# tail -f /var/log/syslog | grep crunch-dispatch-slurm | grep error | head
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-hgseq4jj6nrf674" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation" 
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-vl7g2x3cdhms5x7" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation" 
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-u2ucq5xx9p9zwm7" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation" 
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-tz6s7m3wci0lubl" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation" 
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 scancel: /usr/bin/scancel: exit status 1 ("scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation")
Aug 23 17:51:45 arvados-master-eglyx crunch-dispatch-slurm[1577]: 2018/08/23 17:51:45 "/usr/bin/scancel" ["scancel" "--name=eglyx-dz642-e26sn3tssp9jki4" "--state=pending"]: "scancel: error: slurm_receive_msg: Socket timed out on send/recv operation\nslurm_load_jobs error: Socket timed out on send/recv operation" 

Actions #4

Updated by Joshua Randall over 5 years ago

I have implemented a fix for this and am testing it now.

Actions #5

Updated by Joshua Randall over 5 years ago

Fix appears to have resolved all issues: https://github.com/curoverse/arvados/pull/80

Actions #6

Updated by Joshua Randall over 5 years ago

After this fix, c-d-s is able to cancel 20-30 containers per second (rather than close to 0 without it):

Aug 23 20:08:17 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:17 container eglyx-dz642-vlgrhhin46kiulf is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-50kn85fspvzpfjt is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-axh2mzzixs1adwo is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 Start monitoring container eglyx-dz642-fdw2o8fjd5zzya6 in state "Complete" 
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-gcdf3gpjzl6o698 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-gce33osvzwpg64y is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-j6wnhm6a7g52s84 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-utrr0tcfm92pwdp is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-9aqb0qk454crjpe is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-t4l3sun2nqsnx3h is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-axmp03rf7tn745h is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-znovgekyc4i8y89 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-l8njhtmw5uty8ku is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-1gop2zujkotb5s3 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-axtahv9g2kqzmvm is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-760mc7t8zs5jq3f is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-nw22ethee7kjqpd is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-63xl96tauyow4k1 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-mz70nrmrqukyoa3 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-619j03hocr5s9s2 is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-at73drsraltfx4u is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-ojnf7cmwnwf8b0w is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-bw7jdwfg64vxx1m is done: cancel slurm job
Aug 23 20:08:18 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:18 container eglyx-dz642-ncozkktwelas9dx is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-w17titg3ktyh93h is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-jti36e2caoi45g0 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-dkjo6b8rv8manbp is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-hltkg6qhqv58v74 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ofbrn0scu6v0qj3 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-a65nsc9ssd26shy is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-22x9ehy4pz66z1i is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-7ffkiz1lt542av3 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-u8slyngxnsburpw is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ph5zqcka7sex952 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-k4pticgidsvqh89 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-1by9vgy4xyq366g is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-q46103c1dimc9ve is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-tp769xhvmhkalzy is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-8zbjtpfvocxe4mn is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-8tbcegm7xxmqafw is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-b1mllj4fyywuxsf is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ltwsi49j1co15uh is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-0jfbh5i6gm4t70z is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-l3hl61kdm7tb68s is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-tlwllwm5m0k6aat is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-u0yr2evodxnvih6 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-bqi2557yjdh98g9 is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-pay7jbhr837ghpv is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-yb84hvcy21jlhrz is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-2ptlbnwxwqye01d is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-ed9jsmyicg2ou9z is done: cancel slurm job
Aug 23 20:08:19 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:19 container eglyx-dz642-bodp3jwpv9sbh2b is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-xg35sjt7xw8suzu is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-ny74spj2gwq539m is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-qmeoopeyqk2tmcq is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-5odg3n3nrpddbi7 is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-l85g1gwr2hyf0fn is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-ey0ek4co49gw0ub is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-9mzt5j1gipz62wa is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-yxph79y15yeffbg is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-mc1eqo8uwxib9y5 is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-aw8lhl38ybz7ows is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-daiq3w8vmqq34tx is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-oygai68qp7nqja8 is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-c5xh24sit9wo0fn is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-2t2eaxdbtnyq9az is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-dypk9m6rcrdfppf is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-mwfv3z3a6sdk3qx is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-jubow08z70imfqa is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-sdc6q02xibm4hdj is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-t7ce4i2p2d58pew is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-4jr1ypk61bofzum is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-7ikp3xtiplnnnzo is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-tpz3vhdaasy3muc is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-0bqjpy9moott62i is done: cancel slurm job
Aug 23 20:08:20 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:20 container eglyx-dz642-j27y5xchso0zmdp is done: cancel slurm job
Aug 23 20:08:21 arvados-master-eglyx crunch-dispatch-slurm[5326]: 2018/08/23 20:08:21 container eglyx-dz642-n2lh4go6s02piig is done: cancel slurm job
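
For anyone who wants to check the cancellation rate against a log like the one above, here is a small Go sketch that tallies "cancel slurm job" lines per syslog second. It assumes the line layout shown in this excerpt (month, day, and time as the first three whitespace-separated fields); the function name `cancelsPerSecond` is hypothetical, and `main` is omitted.

```go
package main

import "strings"

// cancelsPerSecond counts log lines reporting a cancelled slurm job,
// keyed by the syslog timestamp prefix (e.g. "Aug 23 20:08:18").
// Assumes the first three whitespace-separated fields are month, day, time.
func cancelsPerSecond(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		// Skip unrelated messages such as "Start monitoring container ...".
		if !strings.Contains(line, "is done: cancel slurm job") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // not a syslog-style line
		}
		counts[strings.Join(fields[:3], " ")]++
	}
	return counts
}
```

Feeding it the lines of the excerpt yields one count per second, which is how a figure like 20-30 cancellations per second can be read off directly.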

Actions #7

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to Resolved