Bug #17347

open

crunch-run --list fatal error out of memory

Added by Ward Vandewege over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Observed today. The following is from the arvados-dispatch-cloud (a-d-c) logs, grepped for the IP address of the affected node (10.252.254.40):

Feb 05 16:14:27 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","IdleBehavior":"run","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2021-02-05T16:14:27.992218427Z"}
Feb 05 16:15:12 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"/bin/ls /arvados-compute-node-boot.complete \u003e/dev/null 2\u003e\u00261","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2021-02-05T16:15:12.424661492Z"}
Feb 05 16:15:12 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"cd977c373a01deea67f6021c3575026d","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d","time":"2021-02-05T16:15:12.430553926Z"}
Feb 05 16:15:13 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"ProbeStart":"2021-02-05T16:15:12.131119947Z","level":"info","msg":"instance booted; will try probeRunning","time":"2021-02-05T16:15:13.087060764Z"}
Feb 05 16:15:13 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"ProbeStart":"2021-02-05T16:15:12.131119947Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2021-02-05T16:15:13.148200600Z"}
Feb 05 16:15:13 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","ContainerUUID":"2xpu4-dz642-bngy3by6mq9muye","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"level":"info","msg":"crunch-run process started","time":"2021-02-05T16:15:13.199224357Z"}
Feb 05 16:17:48 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:17:48.885442343Z"}
Feb 05 16:17:59 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:17:59.776651145Z"}
Feb 05 16:18:07 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:18:07.225690492Z"}
Feb 05 16:18:20 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:18:20.457022767Z"}
Feb 05 16:18:33 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:18:33.365603578Z"}
Feb 05 16:18:53 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:18:53.261792705Z"}
Feb 05 16:19:12 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:19:12.631116325Z"}
Feb 05 16:19:30 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:19:30.564261456Z"}
Feb 05 16:19:50 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:19:50.364492444Z"}
Feb 05 16:20:02 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:20:02.130543707Z"}
Feb 05 16:20:15 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"ssh: command sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list failed","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2021-02-05T16:20:15.384496491Z"}
Feb 05 16:20:21 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","Command":"sudo /var/lib/arvados/crunch-run~cd977c373a01deea67f6021c3575026d --list","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"error":"Process exited with status 2","level":"warning","msg":"probe failed","stderr":"fatal error: runtime: out of memory\n\nruntime stack:\nruntime.throw(0x15ef678, 0x16)\n\t/usr/lib/go-1.15/src/runtime/panic.go:1116 +0x72 fp=0x7ffe9a4de2b0 sp=0x7ffe9a4de280 pc=0x43a2f2\nruntime.sysMap(0xc000000000, 0x4000000, 0x20a8058)\n\t/usr/lib/go-1.15/src/runtime/mem_linux.go:169 +0xc6 fp=0x7ffe9a4de2f0 sp=0x7ffe9a4de2b0 pc=0x41ca46\nruntime.(*mheap).sysAlloc(0x208c4e0, 0x400000, 0x0, 0x0)\n\t/usr/lib/go-1.15/src/runtime/malloc.go:727 +0x1e5 fp=0x7ffe9a4de398 sp=0x7ffe9a4de2f0 pc=0x4101c5\nruntime.(*mheap).grow(0x208c4e0, 0x1, 0x0)\n\t/usr/lib/go-1.15/src/runtime/mheap.go:1344 +0x85 fp=0x7ffe9a4de400 sp=0x7ffe9a4de398 pc=0x42bfa5\nruntime.(*mheap).allocSpan(0x208c4e0, 0x1, 0x30312d7069002a00, 0x20a8068, 0x30342d)\n\t/usr/lib/go-1.15/src/runtime/mheap.go:1160 +0x6b6 fp=0x7ffe9a4de480 sp=0x7ffe9a4de400 pc=0x42bd56\nruntime.(*mheap).alloc.func1()\n\t/usr/lib/go-1.15/src/runtime/mheap.go:907 +0x65 fp=0x7ffe9a4de4d8 sp=0x7ffe9a4de480 pc=0x466785\nruntime.(*mheap).alloc(0x208c4e0, 0x1, 0x4012a, 0x2200000003)\n\t/usr/lib/go-1.15/src/runtime/mheap.go:901 +0x85 fp=0x7ffe9a4de528 sp=0x7ffe9a4de4d8 pc=0x42b225\nruntime.(*mcentral).grow(0x209f398, 0x0)\n\t/usr/lib/go-1.15/src/runtime/mcentral.go:506 +0x7a fp=0x7ffe9a4de570 sp=0x7ffe9a4de528 pc=0x41c41a\nruntime.(*mcentral).cacheSpan(0x209f398, 0x4648d8)\n\t/usr/lib/go-1.15/src/runtime/mcentral.go:177 +0x3e5 fp=0x7ffe9a4de5e8 sp=0x7ffe9a4de570 pc=0x41c1a5\nruntime.(*mcache).refill(0x7f478d3e4108, 0x2a)\n\t/usr/lib/go-1.15/src/runtime/mcache.go:142 +0xa5 fp=0x7ffe9a4de608 sp=0x7ffe9a4de5e8 pc=0x41bb45\nruntime.(*mcache).nextFree(0x7f478d3e4108, 0x2074f2a, 0x7f478d3e4108, 0xfffffffffffffff8, 0x7ffe9a4de698)\n\t/usr/lib/go-1.15/src/runtime/malloc.go:880 +0x8d fp=0x7ffe9a4de640 sp=0x7ffe9a4de608 pc=0x410a4d\nruntime.mallocgc(0x180, 0x15ca820, 0x7ffe9a4de701, 0x7ffe9a4de740)\n\t/usr/lib/go-1.15/src/runtime/malloc.go:1061 +0x834 fp=0x7ffe9a4de6e0 sp=0x7ffe9a4de640 pc=0x411434\nruntime.newobject(0x15ca820, 0x4654c0)\n\t/usr/lib/go-1.15/src/runtime/malloc.go:1195 +0x38 fp=0x7ffe9a4de710 sp=0x7ffe9a4de6e0 pc=0x4118d8\nruntime.malg(0x8000, 0x0)\n\t/usr/lib/go-1.15/src/runtime/proc.go:3520 +0x31 fp=0x7ffe9a4de750 sp=0x7ffe9a4de710 pc=0x444e91\nruntime.mpreinit(0x2074f80)\n\t/usr/lib/go-1.15/src/runtime/os_linux.go:340 +0x29 fp=0x7ffe9a4de770 sp=0x7ffe9a4de750 pc=0x436fa9\nruntime.mcommoninit(0x2074f80, 0xffffffffffffffff)\n\t/usr/lib/go-1.15/src/runtime/proc.go:663 +0xf7 fp=0x7ffe9a4de7b8 sp=0x7ffe9a4de770 pc=0x43e0f7\nruntime.schedinit()\n\t/usr/lib/go-1.15/src/runtime/proc.go:565 +0xa5 fp=0x7ffe9a4de810 sp=0x7ffe9a4de7b8 pc=0x43dc85\nruntime.rt0_go(0x7ffe9a4de918, 0x2, 0x7ffe9a4de918, 0x0, 0x7f478d48509b, 0x7f478d618660, 0x7ffe9a4de918, 0x28d5e1e08, 0x46e480, 0x0, ...)\n\t/usr/lib/go-1.15/src/runtime/asm_amd64.s:214 +0x125 fp=0x7ffe9a4de818 sp=0x7ffe9a4de810 pc=0x46e5c5\n","stdout":"","time":"2021-02-05T16:20:21.269848257Z"}
Feb 05 16:20:46 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","ContainerUUID":"2xpu4-dz642-bngy3by6mq9muye","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"Reason":"state=Cancelled","level":"info","msg":"killing crunch-run process","time":"2021-02-05T16:20:46.459248369Z"}
Feb 05 16:20:51 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","ContainerUUID":"2xpu4-dz642-bngy3by6mq9muye","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"Signal":15,"level":"info","msg":"sending signal","time":"2021-02-05T16:20:51.459428284Z"}
Feb 05 16:20:51 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","ContainerUUID":"2xpu4-dz642-bngy3by6mq9muye","Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"level":"info","msg":"crunch-run process ended","time":"2021-02-05T16:20:51.495719425Z"}
Feb 05 16:25:49 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: {"Address":"10.252.254.40","IdleBehavior":"run","IdleDuration":309.636266,"Instance":"i-0f017e863169ad30f","InstanceType":"t3asmall.spot","PID":23616,"State":"idle","level":"info","msg":"shutdown worker","time":"2021-02-05T16:25:49.931115324Z"}
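
For anyone reproducing the grep-by-IP step on a larger log, matching on the JSON "Address" field instead of the raw text avoids hits where the IP appears in some other field. The following is only a minimal sketch, assuming every journal line is a syslog-style prefix followed by a single JSON object, as in the excerpt above. The function names `parse_adc_line` and `events_for_address` are hypothetical and are not part of Arvados. The sample lines are abbreviated copies of two entries above, plus one made-up entry for a different address (10.0.0.1).

```python
import json


def parse_adc_line(line):
    """Split a journal line into (prefix, record).

    record is None when the line has no JSON payload or the payload
    does not parse (for example, a line that was cut in half).
    """
    start = line.find("{")
    if start == -1:
        return line, None
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return line, None
    return line[:start].rstrip(), record


def events_for_address(lines, address):
    """Yield (time, level, msg, error) for records whose Address matches."""
    for line in lines:
        _, record = parse_adc_line(line)
        if record is None or record.get("Address") != address:
            continue
        yield (
            record.get("time"),
            record.get("level"),
            record.get("msg"),
            record.get("error"),
        )


# Abbreviated sample lines; the third entry is invented for illustration.
SAMPLE = [
    'Feb 05 16:15:13 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: '
    '{"Address":"10.252.254.40","level":"info",'
    '"msg":"crunch-run process started","time":"2021-02-05T16:15:13.199224357Z"}',
    'Feb 05 16:20:21 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: '
    '{"Address":"10.252.254.40","error":"Process exited with status 2",'
    '"level":"warning","msg":"probe failed","time":"2021-02-05T16:20:21.269848257Z"}',
    'Feb 05 16:20:22 2xpu4.arvadosapi.com arvados-dispatch-cloud[23616]: '
    '{"Address":"10.0.0.1","level":"info","msg":"unrelated","time":"n/a"}',
]

for event in events_for_address(SAMPLE, "10.252.254.40"):
    print(event)
```

Lines whose payload fails to parse are skipped silently here; when investigating a real incident it may be better to count or print them, since a truncated entry can be the interesting one.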
#1

Updated by Ward Vandewege over 1 year ago

  • Description updated