Bug #13095

closed

when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log`

Added by Joshua Randall about 6 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Assigned To:
Category: -
Target version: -
Story points: -

Description

If a crunch2 job exceeds its memory limit (with cgroup memory limits enabled), SLURM kills it, but nothing in the Arvados system records that it was killed for exceeding the memory limit, and no container logs are saved to Keep. The only trace of the reason is the SLURM job log file on the execution host.

For example, the SLURM job output (in a slurm-NNNNNN.out file on the compute node on which it ran) ends with:

ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.080392297Z Uploading 26/f0/a1a25d70795beec266b79ee9872a (1031 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.086622228Z Uploading 27/05/f017f46d301002f0ccf08933080c (2106 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.098248469Z Arv-mount exit error: signal: killed
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.100037472Z Uploading 27/0f/3e796a1b7c345e83762fa735bfe1 (1414 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.102255057Z Uploading 27/13/e855b2f8f1734239f0523b6200ce (1258 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.104311641Z Uploading 27/17/6699c6daad8aa9991c140bb4dceb (3510 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.106535847Z Uploading 27/40/aeab0828fa2d4f2789d353a40b8c (3531 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.108438975Z Uploading 27/56/f6ee4f5780acce31e995443508b6 (280839 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.111356416Z Uploading 27/66/2470b962dfea212da33f14909142 (2901 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.113493188Z Uploading 27/6e/5a73fe80f25706975973fca81151 (3633 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.115871012Z Uploading 27/77/151312e10fd89af514515090fcca (1066 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.117901158Z Uploading 27/8b/5a30cbb6401633951fabf455eafd (107 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.119899508Z Uploading 27/bc/09a272621c3414cbe34cd809347d (1687 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.122210713Z Uploading 27/c3/653c8e9470329db519a7e855887b (2288 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.124479403Z Uploading 27/c6/2e92a203ea221c1feb53bd57832c (110 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.126484799Z Uploading 27/db/03ab4ca9e14af33b697876aaa754 (2306 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.128529112Z Uploading 27/e6/1fe700513169fa02b0a6b9224fb6 (1236 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.131777506Z Uploading 28/13/badf81db953a968eb7d49bc2882f (5691 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.134346387Z Uploading 28/28/e63b8edc5e845bf48e75fbad2926 (153799 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.137404631Z Uploading 28/33/baebca84548dd4e37d79642db779 (3517 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.139425467Z Uploading 28/3f/8d7892baa81b510a015719ca7b0b (115169878 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.441013475Z Uploading 28/51/7f9e5682de5e74b42ba7c84a9d41 (157053 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.443749845Z Uploading 28/5c/47826e725f4442c18898156ac4fa (2134 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.446060602Z Uploading 28/80/4e3d50bf7764ac6b1897b1128790 (2318 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.448867967Z Uploading 28/84/95a36e0f4b1fc8e90cff6d4734eb (2024 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.451210076Z Uploading 28/9a/de2a84e2a669f4950d905de18ef3 (3349 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.453552225Z Uploading 28/a2/c878efec380df24a5544738eda91 (1064 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.455813466Z Uploading 28/a6/9648439a2c02d46750eee929b20d (56134 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.458422277Z Uploading 28/c6/0286011df7faa5b7167d8dc7a5f2 (5985 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.460823226Z Uploading 28/c9/6decb9c1d76b9a76b1ee75e2421d (2786 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.463194660Z Uploading 28/d2/54c1b706d0bf3b10df07746930cf (1116 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.467028813Z Uploading 29/0c/e55097cb95fa8d25b0e618e59c5c (109 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.550383451Z Uploading 29/1b/cca126e0b971eaa6fde409109815 (1802 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.553371569Z Uploading 29/29/cff1249a75eefb4cfffd96d62464 (1090 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.557162276Z Uploading 29/47/ea9bc4c28b64abba44002460a38f (3517 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.560143019Z Uploading 29/48/653361f974fbed3e26a4dfbf332c (376187 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.564951400Z Uploading 29/79/a6085bfe28e3ad6f552f361ed74d (48129895 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.706068567Z Uploading 29/87/b50923434c160ee2c370f1a0665f (1214 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.710009625Z Uploading 29/96/b120a5a5e15dab6555f0bf92e374 (79590 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.713794190Z Uploading 29/b2/87f217fd519ac38852bf9f3b89cd (1697 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.716769995Z Uploading 29/c3/93ec5d6804ee5dc06b435ffc02f5 (5834 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.719981456Z Uploading 29/d6/d49a19ae259feb012f492ab83ce0 (133535 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.723532503Z Uploading 29/e7/c064fe54ab6b7fd8accb0363821b (58661 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.727616356Z Uploading 29/f3/cd44f918aff7d3164e28bfddb98c (1063 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.732104351Z Uploading 2a/2c/8f5ba64eb0110f2de0d8b74a5d98 (3217 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.735249938Z Uploading 2a/2e/2e5511bde7fdb8b549aa15447423 (115 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.740863719Z Uploading 2a/3b/4956ecd118211708b9d7c428818f (1981 bytes)
ncucu-dz642-ejbrs5a65rt8hjq 2018-02-15T23:27:48.743539380Z Uploading 2a/49/da07788e874c277958962ecb0df3 (1216 bytes)
slurmstepd: error: Exceeded step memory limit at some point.
slurmstepd: error: get_exit_code task 0 died by signal

However, the container in question has `"log": null` on the API server, presumably because the arv-mount and/or crunch-run processes were killed before they could write the log to Keep.
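Until this is recorded in Arvados itself, the slurm-NNNNNN.out file can be checked by hand or by script. Below is a minimal Python sketch, not part of Arvados or SLURM, that picks out the memory-limit message from the job output. The function name `find_memory_limit_errors` is made up for illustration, and the pattern is taken only from the slurmstepd line quoted above; other SLURM versions may word the message differently.

```python
import re

# Pattern copied from the slurmstepd message in this report. This is an
# assumption about the wording; other SLURM versions may phrase it differently.
MEMORY_LIMIT_RE = re.compile(r"^slurmstepd: error: Exceeded step memory limit")


def find_memory_limit_errors(lines):
    """Return the slurmstepd lines that report an exceeded memory limit."""
    return [line.rstrip("\n") for line in lines if MEMORY_LIMIT_RE.match(line)]
```

It accepts any iterable of lines, so an open file handle for the job output can be passed in directly.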

I can still see the logs (i.e. the in-process logs) from horizon. The last few lines from the same container are:

2018-02-15T23:27:48.080392297Z Uploading 26/f0/a1a25d70795beec266b79ee9872a (1031 bytes) 
2018-02-15T23:27:48.086622228Z Uploading 27/05/f017f46d301002f0ccf08933080c (2106 bytes) 
2018-02-15T23:27:48.098248469Z Arv-mount exit error: signal: killed 
2018-02-15T23:27:48.100037472Z Uploading 27/0f/3e796a1b7c345e83762fa735bfe1 (1414 bytes) 
2018-02-15T23:27:48.102255057Z Uploading 27/13/e855b2f8f1734239f0523b6200ce (1258 bytes) 
2018-02-15T23:27:48.104311641Z Uploading 27/17/6699c6daad8aa9991c140bb4dceb (3510 bytes) 
2018-02-15T23:27:48.106535847Z Uploading 27/40/aeab0828fa2d4f2789d353a40b8c (3531 bytes) 
2018-02-15T23:27:48.108438975Z Uploading 27/56/f6ee4f5780acce31e995443508b6 (280839 bytes) 
2018-02-15T23:27:48.111356416Z Uploading 27/66/2470b962dfea212da33f14909142 (2901 bytes) 
2018-02-15T23:27:48.113493188Z Uploading 27/6e/5a73fe80f25706975973fca81151 (3633 bytes) 
2018-02-15T23:27:48.115871012Z Uploading 27/77/151312e10fd89af514515090fcca (1066 bytes) 
2018-02-15T23:27:48.117901158Z Uploading 27/8b/5a30cbb6401633951fabf455eafd (107 bytes) 
2018-02-15T23:27:48.119899508Z Uploading 27/bc/09a272621c3414cbe34cd809347d (1687 bytes) 
2018-02-15T23:27:48.122210713Z Uploading 27/c3/653c8e9470329db519a7e855887b (2288 bytes) 
2018-02-15T23:27:48.124479403Z Uploading 27/c6/2e92a203ea221c1feb53bd57832c (110 bytes) 
2018-02-15T23:27:48.126484799Z Uploading 27/db/03ab4ca9e14af33b697876aaa754 (2306 bytes) 
2018-02-15T23:27:48.128529112Z Uploading 27/e6/1fe700513169fa02b0a6b9224fb6 (1236 bytes) 
2018-02-15T23:27:48.131777506Z Uploading 28/13/badf81db953a968eb7d49bc2882f (5691 bytes) 
2018-02-15T23:27:48.134346387Z Uploading 28/28/e63b8edc5e845bf48e75fbad2926 (153799 bytes) 
2018-02-15T23:27:48.137404631Z Uploading 28/33/baebca84548dd4e37d79642db779 (3517 bytes) 
2018-02-15T23:27:48.139425467Z Uploading 28/3f/8d7892baa81b510a015719ca7b0b (115169878 bytes) 
2018-02-15T23:27:48.441013475Z Uploading 28/51/7f9e5682de5e74b42ba7c84a9d41 (157053 bytes) 
2018-02-15T23:27:48.443749845Z Uploading 28/5c/47826e725f4442c18898156ac4fa (2134 bytes) 
2018-02-15T23:27:48.446060602Z Uploading 28/80/4e3d50bf7764ac6b1897b1128790 (2318 bytes) 
2018-02-15T23:27:48.448867967Z Uploading 28/84/95a36e0f4b1fc8e90cff6d4734eb (2024 bytes) 
2018-02-15T23:27:48.451210076Z Uploading 28/9a/de2a84e2a669f4950d905de18ef3 (3349 bytes) 
2018-02-15T23:27:48.453552225Z Uploading 28/a2/c878efec380df24a5544738eda91 (1064 bytes) 
2018-02-15T23:27:48.455813466Z Uploading 28/a6/9648439a2c02d46750eee929b20d (56134 bytes) 
2018-02-15T23:27:48.458422277Z Uploading 28/c6/0286011df7faa5b7167d8dc7a5f2 (5985 bytes) 
2018-02-15T23:27:48.460823226Z Uploading 28/c9/6decb9c1d76b9a76b1ee75e2421d (2786 bytes) 
2018-02-15T23:27:48.463194660Z Uploading 28/d2/54c1b706d0bf3b10df07746930cf (1116 bytes) 
2018-02-15T23:27:48.467028813Z Uploading 29/0c/e55097cb95fa8d25b0e618e59c5c (109 bytes) 


Related issues

Related to Arvados - Bug #13022: crunch-run broken container loop (Resolved, Tom Clegg, 02/05/2018)
#1

Updated by Tom Clegg about 6 years ago

  • Related to Bug #13022: crunch-run broken container loop added
#2

Updated by Tom Clegg about 6 years ago

This might be related to the problem that was fixed in #13022 (but isn't in the stable repo yet). http://apt.arvados.org/pool/jessie-dev/main/c/crunch-run/crunch-run_0.1.20180212022448.f309b87-1_amd64.deb has the fix from #13022.

If crunch-run itself is getting killed by the OOM-killer (i.e., with SIGKILL), #13022 wouldn't have helped, and I'm not sure there's anything we could have done: a process cannot catch SIGKILL, so it gets no chance to save its log. I think the only way around it is to keep crunch-run out of the memory-limited cgroup so only the container gets killed, not crunch-run itself.
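To illustrate the idea, here is a minimal Python sketch (not crunch-run code; crunch-run is a separate program and the names `describe_exit` and `run_and_report` are hypothetical). It assumes a POSIX system. A supervising process that survives can see that its child died from SIGKILL and record that fact, which is only possible if the supervisor is not killed along with the child.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short, loggable explanation."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        name = signal.Signals(-returncode).name
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(argv):
    """Run a child process and return (returncode, explanation)."""
    proc = subprocess.run(argv)
    return proc.returncode, describe_exit(proc.returncode)


if __name__ == "__main__":
    # Hypothetical stand-in for the container: a child that SIGKILLs itself,
    # roughly what the supervisor would observe after an OOM kill.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_report(child))
```

The exit status alone does not say why the child was killed; attributing it to the memory limit would still need another source, such as the slurmstepd message or cgroup accounting.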

#3

Updated by Tom Morris about 6 years ago

  • Status changed from New to Feedback
  • Assigned To set to Joshua Randall
#4

Updated by Peter Amstutz about 4 years ago

  • Status changed from Feedback to Closed