Bug #13788
closedcrunch-dispatch-slurm fatal error: concurrent map writes
Description
Our crunch-dispatch-slurm service is occasionally getting a fatal error "concurrent map writes" which triggers a panic:
root@arvados-master-eglyx:~# grep crunch-dispatch-slurm /var/log/syslog | egrep '(fatal|panic)' Jul 11 01:43:56 arvados-master-eglyx crunch-dispatch-slurm[52796]: fatal error: concurrent map writes2018/07/11 01:43:56 Done monitoring container eglyx-dz642-9bziwtki4fkdxu1 Jul 11 01:43:56 arvados-master-eglyx crunch-dispatch-slurm[52796]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc4225e4c38 sp=0xc4225e4c18 pc=0x42abe1 Jul 11 01:56:17 arvados-master-eglyx crunch-dispatch-slurm[37362]: fatal error: concurrent map writes Jul 11 01:56:17 arvados-master-eglyx crunch-dispatch-slurm[37362]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422c6cc38 sp=0xc422c6cc18 pc=0x42abe1 Jul 11 01:57:07 arvados-master-eglyx crunch-dispatch-slurm[45672]: fatal error: concurrent map writes Jul 11 01:57:07 arvados-master-eglyx crunch-dispatch-slurm[45672]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc4208c2c38 sp=0xc4208c2c18 pc=0x42abe1 Jul 11 04:16:59 arvados-master-eglyx crunch-dispatch-slurm[52806]: fatal error: concurrent map writes Jul 11 04:16:59 arvados-master-eglyx crunch-dispatch-slurm[52806]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc424164c38 sp=0xc424164c18 pc=0x42abe1 Jul 11 08:24:26 arvados-master-eglyx crunch-dispatch-slurm[40371]: fatal error: concurrent map writes Jul 11 08:24:26 arvados-master-eglyx crunch-dispatch-slurm[40371]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1ac38 sp=0xc422f1ac18 pc=0x42abe1
Full stack trace logs from the last one above (6MB): https://paste.ubuntu.com/p/N2pdwhwd5d/plain/
Updated by Joshua Randall over 6 years ago
This was with crunch-dispatch-slurm 1.1.4.20180524153119-1
Which I believe corresponds to commit: https://github.com/curoverse/arvados/commit/1275e305f6d13785687c72b8609592094a067cef
Updated by Tom Clegg over 6 years ago
I think these are the relevant stacks
fatal error: concurrent map writes goroutine 2759368 [running]: runtime.throw(0x728936, 0x15) /usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1ac38 sp=0xc422f1ac18 pc=0x42abe1 runtime.mapassign(0x6c87a0, 0xc44de74750, 0xc434cb7660, 0x6afc40) /usr/local/go/src/runtime/hashmap.go:607 +0x553 fp=0xc422f1acc8 sp=0xc422f1ac38 pc=0x409713 reflect.mapassign(0x6c87a0, 0xc44de74750, 0xc434cb7660, 0xc4416d9ef0) /usr/local/go/src/runtime/hashmap.go:1205 +0x3f fp=0xc422f1acf8 sp=0xc422f1acc8 pc=0x40b2af reflect.Value.SetMapIndex(0x6c87a0, 0xc423cdc438, 0x195, 0x6b3100, 0xc434cb7660, 0x98, 0x710cc0, 0xc4416d9ef0, 0x199) /usr/local/go/src/reflect/value.go:1515 +0x1f6 fp=0xc422f1ad68 sp=0xc422f1acf8 pc=0x4a2f26 encoding/json.(*decodeState).object(0xc4476ab6a8, 0x6c87a0, 0xc423cdc438, 0x195) /usr/local/go/src/encoding/json/decode.go:813 +0xb6c fp=0xc422f1afe8 sp=0xc422f1ad68 pc=0x5c49fc encoding/json.(*decodeState).value(0xc4476ab6a8, 0x6c87a0, 0xc423cdc438, 0x195) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 fp=0xc422f1b068 sp=0xc422f1afe8 pc=0x5c2bc3 encoding/json.(*decodeState).object(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x16) /usr/local/go/src/encoding/json/decode.go:776 +0x132d fp=0xc422f1b2e8 sp=0xc422f1b068 pc=0x5c51bd encoding/json.(*decodeState).value(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x16) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 fp=0xc422f1b368 sp=0xc422f1b2e8 pc=0x5c2bc3 encoding/json.(*decodeState).unmarshal(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 fp=0xc422f1b3d8 sp=0xc422f1b368 pc=0x5c21b7 encoding/json.(*Decoder).Decode(0xc4476ab680, 0x6a4020, 0xc423cdc3c0, 0x75ecc0, 0xc43b5cf0c0) /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 fp=0xc422f1b418 sp=0xc422f1b3d8 pc=0x5d3967 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc427f4a520, 0x1b, 0x0, 0x0, 0x0, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 fp=0xc422f1b4c8 sp=0xc422f1b418 pc=0x676109 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc427f4a520, 0x1b, 0x0, 0x6a4020, 0xc423cdc3c0, 0x6, 0xc42e711410) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc fp=0xc422f1b548 sp=0xc422f1b4c8 pc=0x67648c main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc427f4a520, 0x1b, 0x39f3d4a8, 0xed2d76546, 0x0, 0xc455a21ce0, 0x3, 0x6, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 fp=0xc422f1bca8 sp=0xc422f1b548 pc=0x67c8a0 main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc427f4a520, 0x1b, 0x39f3d4a8, 0xed2d76546, 0x0, 0xc455a21ce0, 0x3, 0x6, 0xc42e711410, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 fp=0xc422f1bdc0 sp=0xc422f1bca8 pc=0x680078 git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc42030e1e0, 0xc43ccc9e10) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e fp=0xc422f1bfc8 sp=0xc422f1bdc0 pc=0x67968e runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc422f1bfd0 sp=0xc422f1bfc8 pc=0x456671 created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a goroutine 2760000 [runnable]: reflect.Value.Field(0x712b40, 0xc429d90870, 0x199, 0xc, 0x5, 0x2d4, 0x101) /usr/local/go/src/reflect/value.go:780 +0x140 encoding/json.(*decodeState).object(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x16) /usr/local/go/src/encoding/json/decode.go:749 +0x292 encoding/json.(*decodeState).value(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x16) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 encoding/json.(*decodeState).unmarshal(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 encoding/json.(*Decoder).Decode(0xc43aa7af00, 0x6a4020, 0xc429d90870, 0x75ecc0, 0xc437235700) /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc44c3798a0, 0x1b, 0x0, 0x0, 0x0, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc44c3798a0, 0x1b, 0x0, 0x6a4020, 0xc429d90870, 0x6, 0xc43af99110) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc44c3798a0, 0x1b, 0x1c33e110, 0xed2d77ad6, 0x0, 0xc455a21ce0, 0x3, 0x6, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc44c3798a0, 0x1b, 0x1c33e110, 0xed2d77ad6, 0x0, 0xc455a21ce0, 0x3, 0x6, 0xc43af99110, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc422f93b30, 0xc429e0cf00) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a goroutine 2619443 [runnable]: encoding/json.stateBeginValue(0xc43c040408, 0x522, 0x85a) /usr/local/go/src/encoding/json/scanner.go:212 +0x459 encoding/json.(*decodeState).value(0xc43c0403e8, 0x0, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:385 +0xf3 encoding/json.(*decodeState).object(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x16) /usr/local/go/src/encoding/json/decode.go:776 +0x132d encoding/json.(*decodeState).value(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x16) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 encoding/json.(*decodeState).unmarshal(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 encoding/json.(*Decoder).Decode(0xc43c0403c0, 0x6a4020, 0xc4251e1e00, 0x75ecc0, 0xc43d1d3c00) /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc45d29f940, 0x1b, 0x0, 0x0, 0x0, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc45d29f940, 0x1b, 0x0, 0x6a4020, 0xc4251e1e00, 0x4, 0xc4524dde00) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc45d29f940, 0x1b, 0xe292ee0, 0xed2d75ec6, 0x0, 0xc43e655680, 0x3, 0x4, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc45d29f940, 0x1b, 0xe292ee0, 0xed2d75ec6, 0x0, 0xc43e655680, 0x3, 0x4, 0xc4524dde00, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc423b9fb30, 0xc44313e5c0) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a
I think this is dispatch.checkForUpdates' fault: it's reusing a ContainerList object when fetching multiple pages, which means reusing embedded maps (mounts, runtime constraints). The containers are copied by value in checkListForUpdates(), but their maps are copied by reference.
checkForUpdates itself doesn't write the maps concurrently, but the dispatch library passes them to crunch-dispatch-slurm's runContainer in a new goroutine, which writes to the maps when getting final container state.
Aside from the concurrent write, this also means the dispatch library is effectively corrupting container objects after they've been passed to the dispatcher. This can result in c-d-s using the wrong runtime constraints, but this might be rare -- I expect c-d-s usually has enough time to consume all of the runtime_constraints/mounts content from the previous page before an incoming page of containers has a chance to mangle it.
Demonstration at https://play.golang.org/p/zATyWMInm3-?z
Updated by Tom Clegg over 6 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- Target version set to 2018-08-01 Sprint
Updated by Tom Clegg over 6 years ago
13788-concurrent-map-write @ d91eb0fae8d2ef4c2348769694573bf725ef16d4
Updated by Tom Clegg over 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|8df5b745955fafef08056a742680979b81faeb1e.