Bug #13788

crunch-dispatch-slurm fatal error: concurrent map writes

Added by Joshua Randall almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
07/19/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release:
Release relationship:
Auto

Description

Our crunch-dispatch-slurm service is occasionally getting a fatal error "concurrent map writes" which triggers a panic:

root@arvados-master-eglyx:~# grep crunch-dispatch-slurm /var/log/syslog | egrep '(fatal|panic)'
Jul 11 01:43:56 arvados-master-eglyx crunch-dispatch-slurm[52796]: fatal error: concurrent map writes2018/07/11 01:43:56 Done monitoring container eglyx-dz642-9bziwtki4fkdxu1
Jul 11 01:43:56 arvados-master-eglyx crunch-dispatch-slurm[52796]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc4225e4c38 sp=0xc4225e4c18 pc=0x42abe1
Jul 11 01:56:17 arvados-master-eglyx crunch-dispatch-slurm[37362]: fatal error: concurrent map writes
Jul 11 01:56:17 arvados-master-eglyx crunch-dispatch-slurm[37362]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422c6cc38 sp=0xc422c6cc18 pc=0x42abe1
Jul 11 01:57:07 arvados-master-eglyx crunch-dispatch-slurm[45672]: fatal error: concurrent map writes
Jul 11 01:57:07 arvados-master-eglyx crunch-dispatch-slurm[45672]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc4208c2c38 sp=0xc4208c2c18 pc=0x42abe1
Jul 11 04:16:59 arvados-master-eglyx crunch-dispatch-slurm[52806]: fatal error: concurrent map writes
Jul 11 04:16:59 arvados-master-eglyx crunch-dispatch-slurm[52806]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc424164c38 sp=0xc424164c18 pc=0x42abe1
Jul 11 08:24:26 arvados-master-eglyx crunch-dispatch-slurm[40371]: fatal error: concurrent map writes
Jul 11 08:24:26 arvados-master-eglyx crunch-dispatch-slurm[40371]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1ac38 sp=0xc422f1ac18 pc=0x42abe1

Full stack trace logs from the last one above (6MB): https://paste.ubuntu.com/p/N2pdwhwd5d/plain/


Subtasks

Task #13873: review 13788-concurrent-map-writeResolvedTom Clegg

Associated revisions

Revision 8df5b745
Added by Tom Clegg almost 3 years ago

Merge branch '13788-concurrent-map-write'

fixes #13788

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Joshua Randall almost 3 years ago

This was with crunch-dispatch-slurm 1.1.4.20180524153119-1
Which I believe corresponds to commit: https://github.com/curoverse/arvados/commit/1275e305f6d13785687c72b8609592094a067cef

#2 Updated by Tom Clegg almost 3 years ago

I think these are the relevant stacks

fatal error: concurrent map writes
goroutine 2759368 [running]:
runtime.throw(0x728936, 0x15)
        /usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1ac38 sp=0xc422f1ac18 pc=0x42abe1
runtime.mapassign(0x6c87a0, 0xc44de74750, 0xc434cb7660, 0x6afc40)
        /usr/local/go/src/runtime/hashmap.go:607 +0x553 fp=0xc422f1acc8 sp=0xc422f1ac38 pc=0x409713
reflect.mapassign(0x6c87a0, 0xc44de74750, 0xc434cb7660, 0xc4416d9ef0)
        /usr/local/go/src/runtime/hashmap.go:1205 +0x3f fp=0xc422f1acf8 sp=0xc422f1acc8 pc=0x40b2af
reflect.Value.SetMapIndex(0x6c87a0, 0xc423cdc438, 0x195, 0x6b3100, 0xc434cb7660, 0x98, 0x710cc0, 0xc4416d9ef0, 0x199)
        /usr/local/go/src/reflect/value.go:1515 +0x1f6 fp=0xc422f1ad68 sp=0xc422f1acf8 pc=0x4a2f26
encoding/json.(*decodeState).object(0xc4476ab6a8, 0x6c87a0, 0xc423cdc438, 0x195)
        /usr/local/go/src/encoding/json/decode.go:813 +0xb6c fp=0xc422f1afe8 sp=0xc422f1ad68 pc=0x5c49fc
encoding/json.(*decodeState).value(0xc4476ab6a8, 0x6c87a0, 0xc423cdc438, 0x195)
        /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 fp=0xc422f1b068 sp=0xc422f1afe8 pc=0x5c2bc3
encoding/json.(*decodeState).object(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x16)
        /usr/local/go/src/encoding/json/decode.go:776 +0x132d fp=0xc422f1b2e8 sp=0xc422f1b068 pc=0x5c51bd
encoding/json.(*decodeState).value(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x16)
        /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 fp=0xc422f1b368 sp=0xc422f1b2e8 pc=0x5c2bc3
encoding/json.(*decodeState).unmarshal(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x0, 0x0)
        /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 fp=0xc422f1b3d8 sp=0xc422f1b368 pc=0x5c21b7
encoding/json.(*Decoder).Decode(0xc4476ab680, 0x6a4020, 0xc423cdc3c0, 0x75ecc0, 0xc43b5cf0c0)
        /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 fp=0xc422f1b418 sp=0xc422f1b3d8 pc=0x5d3967
git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc427f4a520, 0x1b, 0x0, 0x0, 0x0, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 fp=0xc422f1b4c8 sp=0xc422f1b418 pc=0x676109
git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc427f4a520, 0x1b, 0x0, 0x6a4020, 0xc423cdc3c0, 0x6, 0xc42e711410)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc fp=0xc422f1b548 sp=0xc422f1b4c8 pc=0x67648c
main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc427f4a520, 0x1b, 0x39f3d4a8, 0xed2d76546, 0x0, 0xc455a21ce0, 0x3, 0x6, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 fp=0xc422f1bca8 sp=0xc422f1b548 pc=0x67c8a0
main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc427f4a520, 0x1b, 0x39f3d4a8, 0xed2d76546, 0x0, 0xc455a21ce0, 0x3, 0x6, 0xc42e711410, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 fp=0xc422f1bdc0 sp=0xc422f1bca8 pc=0x680078
git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc42030e1e0, 0xc43ccc9e10)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e fp=0xc422f1bfc8 sp=0xc422f1bdc0 pc=0x67968e
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc422f1bfd0 sp=0xc422f1bfc8 pc=0x456671
created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a

goroutine 2760000 [runnable]:
reflect.Value.Field(0x712b40, 0xc429d90870, 0x199, 0xc, 0x5, 0x2d4, 0x101)
        /usr/local/go/src/reflect/value.go:780 +0x140
encoding/json.(*decodeState).object(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x16)
        /usr/local/go/src/encoding/json/decode.go:749 +0x292
encoding/json.(*decodeState).value(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x16)
        /usr/local/go/src/encoding/json/decode.go:408 +0x2d3
encoding/json.(*decodeState).unmarshal(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x0, 0x0)
        /usr/local/go/src/encoding/json/decode.go:189 +0x1e7
encoding/json.(*Decoder).Decode(0xc43aa7af00, 0x6a4020, 0xc429d90870, 0x75ecc0, 0xc437235700)
        /usr/local/go/src/encoding/json/stream.go:73 +0x1a7
git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc44c3798a0, 0x1b, 0x0, 0x0, 0x0, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199
git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc44c3798a0, 0x1b, 0x0, 0x6a4020, 0xc429d90870, 0x6, 0xc43af99110)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc
main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc44c3798a0, 0x1b, 0x1c33e110, 0xed2d77ad6, 0x0, 0xc455a21ce0, 0x3, 0x6, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780
main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc44c3798a0, 0x1b, 0x1c33e110, 0xed2d77ad6, 0x0, 0xc455a21ce0, 0x3, 0x6, 0xc43af99110, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78
git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc422f93b30, 0xc429e0cf00)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e
created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a

goroutine 2619443 [runnable]:
encoding/json.stateBeginValue(0xc43c040408, 0x522, 0x85a)
        /usr/local/go/src/encoding/json/scanner.go:212 +0x459
encoding/json.(*decodeState).value(0xc43c0403e8, 0x0, 0x0, 0x0)
        /usr/local/go/src/encoding/json/decode.go:385 +0xf3
encoding/json.(*decodeState).object(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x16)
        /usr/local/go/src/encoding/json/decode.go:776 +0x132d
encoding/json.(*decodeState).value(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x16)
        /usr/local/go/src/encoding/json/decode.go:408 +0x2d3
encoding/json.(*decodeState).unmarshal(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x0, 0x0)
        /usr/local/go/src/encoding/json/decode.go:189 +0x1e7
encoding/json.(*Decoder).Decode(0xc43c0403c0, 0x6a4020, 0xc4251e1e00, 0x75ecc0, 0xc43d1d3c00)
        /usr/local/go/src/encoding/json/stream.go:73 +0x1a7
git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc45d29f940, 0x1b, 0x0, 0x0, 0x0, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199
git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc45d29f940, 0x1b, 0x0, 0x6a4020, 0xc4251e1e00, 0x4, 0xc4524dde00)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc
main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc45d29f940, 0x1b, 0xe292ee0, 0xed2d75ec6, 0x0, 0xc43e655680, 0x3, 0x4, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780
main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc45d29f940, 0x1b, 0xe292ee0, 0xed2d75ec6, 0x0, 0xc43e655680, 0x3, 0x4, 0xc4524dde00, ...)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78
git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc423b9fb30, 0xc44313e5c0)
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e
created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start
        /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a

I think this is dispatch.checkForUpdates' fault: it's reusing a ContainerList object when fetching multiple pages, which means reusing embedded maps (mounts, runtime constraints). The containers are copied by value in checkListForUpdates(), but their maps are copied by reference.

checkForUpdates itself doesn't write the maps concurrently, but the dispatch library passes them to crunch-dispatch-slurm's runContainer in a new goroutine, which writes to the maps when getting final container state.

Aside from the concurrent write, this also means the dispatch library is effectively corrupting container objects after they've been passed to the dispatcher. This can result in c-d-s using the wrong runtime constraints, but this might be rare -- I expect c-d-s usually has enough time to consume all of the runtime_constraints/mounts content from the previous page before an incoming page of containers has a chance to mangle it.

Demonstration at https://play.golang.org/p/zATyWMInm3-?z

#3 Updated by Tom Clegg almost 3 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
  • Target version set to 2018-08-01 Sprint

#4 Updated by Tom Clegg almost 3 years ago

13788-concurrent-map-write @ d91eb0fae8d2ef4c2348769694573bf725ef16d4

#5 Updated by Lucas Di Pentima almost 3 years ago

Sneaky bug. This LGTM, thanks!

#6 Updated by Tom Clegg almost 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#7 Updated by Tom Morris over 2 years ago

  • Release set to 13

Also available in: Atom PDF