Bug #13788
crunch-dispatch-slurm fatal error: concurrent map writes
100%
Description
Our crunch-dispatch-slurm service is occasionally getting a fatal error "concurrent map writes" which triggers a panic:
root@arvados-master-eglyx:~# grep crunch-dispatch-slurm /var/log/syslog | egrep '(fatal|panic)' Jul 11 01:43:56 arvados-master-eglyx crunch-dispatch-slurm[52796]: fatal error: concurrent map writes2018/07/11 01:43:56 Done monitoring container eglyx-dz642-9bziwtki4fkdxu1 Jul 11 01:43:56 arvados-master-eglyx crunch-dispatch-slurm[52796]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc4225e4c38 sp=0xc4225e4c18 pc=0x42abe1 Jul 11 01:56:17 arvados-master-eglyx crunch-dispatch-slurm[37362]: fatal error: concurrent map writes Jul 11 01:56:17 arvados-master-eglyx crunch-dispatch-slurm[37362]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422c6cc38 sp=0xc422c6cc18 pc=0x42abe1 Jul 11 01:57:07 arvados-master-eglyx crunch-dispatch-slurm[45672]: fatal error: concurrent map writes Jul 11 01:57:07 arvados-master-eglyx crunch-dispatch-slurm[45672]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc4208c2c38 sp=0xc4208c2c18 pc=0x42abe1 Jul 11 04:16:59 arvados-master-eglyx crunch-dispatch-slurm[52806]: fatal error: concurrent map writes Jul 11 04:16:59 arvados-master-eglyx crunch-dispatch-slurm[52806]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc424164c38 sp=0xc424164c18 pc=0x42abe1 Jul 11 08:24:26 arvados-master-eglyx crunch-dispatch-slurm[40371]: fatal error: concurrent map writes Jul 11 08:24:26 arvados-master-eglyx crunch-dispatch-slurm[40371]: #011/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1ac38 sp=0xc422f1ac18 pc=0x42abe1
Full stack trace logs from the last one above (6MB): https://paste.ubuntu.com/p/N2pdwhwd5d/plain/
Subtasks
Associated revisions
History
#1
Updated by Joshua Randall over 2 years ago
This was with crunch-dispatch-slurm 1.1.4.20180524153119-1
Which I believe corresponds to commit: https://github.com/curoverse/arvados/commit/1275e305f6d13785687c72b8609592094a067cef
#2
Updated by Tom Clegg over 2 years ago
I think these are the relevant stacks
fatal error: concurrent map writes goroutine 2759368 [running]: runtime.throw(0x728936, 0x15) /usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc422f1ac38 sp=0xc422f1ac18 pc=0x42abe1 runtime.mapassign(0x6c87a0, 0xc44de74750, 0xc434cb7660, 0x6afc40) /usr/local/go/src/runtime/hashmap.go:607 +0x553 fp=0xc422f1acc8 sp=0xc422f1ac38 pc=0x409713 reflect.mapassign(0x6c87a0, 0xc44de74750, 0xc434cb7660, 0xc4416d9ef0) /usr/local/go/src/runtime/hashmap.go:1205 +0x3f fp=0xc422f1acf8 sp=0xc422f1acc8 pc=0x40b2af reflect.Value.SetMapIndex(0x6c87a0, 0xc423cdc438, 0x195, 0x6b3100, 0xc434cb7660, 0x98, 0x710cc0, 0xc4416d9ef0, 0x199) /usr/local/go/src/reflect/value.go:1515 +0x1f6 fp=0xc422f1ad68 sp=0xc422f1acf8 pc=0x4a2f26 encoding/json.(*decodeState).object(0xc4476ab6a8, 0x6c87a0, 0xc423cdc438, 0x195) /usr/local/go/src/encoding/json/decode.go:813 +0xb6c fp=0xc422f1afe8 sp=0xc422f1ad68 pc=0x5c49fc encoding/json.(*decodeState).value(0xc4476ab6a8, 0x6c87a0, 0xc423cdc438, 0x195) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 fp=0xc422f1b068 sp=0xc422f1afe8 pc=0x5c2bc3 encoding/json.(*decodeState).object(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x16) /usr/local/go/src/encoding/json/decode.go:776 +0x132d fp=0xc422f1b2e8 sp=0xc422f1b068 pc=0x5c51bd encoding/json.(*decodeState).value(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x16) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 fp=0xc422f1b368 sp=0xc422f1b2e8 pc=0x5c2bc3 encoding/json.(*decodeState).unmarshal(0xc4476ab6a8, 0x6a4020, 0xc423cdc3c0, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 fp=0xc422f1b3d8 sp=0xc422f1b368 pc=0x5c21b7 encoding/json.(*Decoder).Decode(0xc4476ab680, 0x6a4020, 0xc423cdc3c0, 0x75ecc0, 0xc43b5cf0c0) /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 fp=0xc422f1b418 sp=0xc422f1b3d8 pc=0x5d3967 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc427f4a520, 0x1b, 0x0, 0x0, 0x0, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 fp=0xc422f1b4c8 sp=0xc422f1b418 pc=0x676109 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc427f4a520, 0x1b, 0x0, 0x6a4020, 0xc423cdc3c0, 0x6, 0xc42e711410) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc fp=0xc422f1b548 sp=0xc422f1b4c8 pc=0x67648c main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc427f4a520, 0x1b, 0x39f3d4a8, 0xed2d76546, 0x0, 0xc455a21ce0, 0x3, 0x6, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 fp=0xc422f1bca8 sp=0xc422f1b548 pc=0x67c8a0 main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc427f4a520, 0x1b, 0x39f3d4a8, 0xed2d76546, 0x0, 0xc455a21ce0, 0x3, 0x6, 0xc42e711410, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 fp=0xc422f1bdc0 sp=0xc422f1bca8 pc=0x680078 git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc42030e1e0, 0xc43ccc9e10) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e fp=0xc422f1bfc8 sp=0xc422f1bdc0 pc=0x67968e runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc422f1bfd0 sp=0xc422f1bfc8 pc=0x456671 created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a goroutine 2760000 [runnable]: reflect.Value.Field(0x712b40, 0xc429d90870, 0x199, 0xc, 0x5, 0x2d4, 0x101) /usr/local/go/src/reflect/value.go:780 +0x140 encoding/json.(*decodeState).object(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x16) /usr/local/go/src/encoding/json/decode.go:749 +0x292 encoding/json.(*decodeState).value(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x16) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 encoding/json.(*decodeState).unmarshal(0xc43aa7af28, 0x6a4020, 0xc429d90870, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 encoding/json.(*Decoder).Decode(0xc43aa7af00, 0x6a4020, 0xc429d90870, 0x75ecc0, 0xc437235700) /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc44c3798a0, 0x1b, 0x0, 0x0, 0x0, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc44c3798a0, 0x1b, 0x0, 0x6a4020, 0xc429d90870, 0x6, 0xc43af99110) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc44c3798a0, 0x1b, 0x1c33e110, 0xed2d77ad6, 0x0, 0xc455a21ce0, 0x3, 0x6, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc44c3798a0, 0x1b, 0x1c33e110, 0xed2d77ad6, 0x0, 0xc455a21ce0, 0x3, 0x6, 0xc43af99110, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc422f93b30, 0xc429e0cf00) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a goroutine 2619443 [runnable]: encoding/json.stateBeginValue(0xc43c040408, 0x522, 0x85a) /usr/local/go/src/encoding/json/scanner.go:212 +0x459 encoding/json.(*decodeState).value(0xc43c0403e8, 0x0, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:385 +0xf3 encoding/json.(*decodeState).object(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x16) /usr/local/go/src/encoding/json/decode.go:776 +0x132d encoding/json.(*decodeState).value(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x16) /usr/local/go/src/encoding/json/decode.go:408 +0x2d3 encoding/json.(*decodeState).unmarshal(0xc43c0403e8, 0x6a4020, 0xc4251e1e00, 0x0, 0x0) /usr/local/go/src/encoding/json/decode.go:189 +0x1e7 encoding/json.(*Decoder).Decode(0xc43c0403c0, 0x6a4020, 0xc4251e1e00, 0x75ecc0, 0xc43d1d3c00) /usr/local/go/src/encoding/json/stream.go:73 +0x1a7 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Call(0xc4203da8c0, 0x72397c, 0x3, 0x7252e6, 0xa, 0xc45d29f940, 0x1b, 0x0, 0x0, 0x0, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:362 +0x199 git.curoverse.com/arvados.git/sdk/go/arvadosclient.(*ArvadosClient).Get(0xc4203da8c0, 0x7252e6, 0xa, 0xc45d29f940, 0x1b, 0x0, 0x6a4020, 0xc4251e1e00, 0x4, 0xc4524dde00) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/arvadosclient/arvadosclient.go:393 +0xdc main.(*Dispatcher).runContainer(0xc4201be000, 0xc42036c210, 0xc45d29f940, 0x1b, 0xe292ee0, 0xed2d75ec6, 0x0, 0xc43e655680, 0x3, 0x4, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:312 +0x780 main.(*Dispatcher).(main.runContainer)-fm(0xc42036c210, 0xc45d29f940, 0x1b, 0xe292ee0, 0xed2d75ec6, 0x0, 0xc43e655680, 0x3, 0x4, 0xc4524dde00, ...) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go:166 +0x78 git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start.func1(0xc42036c210, 0xc423b9fb30, 0xc44313e5c0) /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:151 +0x9e created by git.curoverse.com/arvados.git/sdk/go/dispatch.(*Dispatcher).start /tmp/tmp.DGyedcNfCS/src/git.curoverse.com/arvados.git/sdk/go/dispatch/dispatch.go:150 +0x13a
I think this is dispatch.checkForUpdates' fault: it's reusing a ContainerList object when fetching multiple pages, which means reusing embedded maps (mounts, runtime constraints). The containers are copied by value in checkListForUpdates(), but their maps are copied by reference.
checkForUpdates itself doesn't write the maps concurrently, but the dispatch library passes them to crunch-dispatch-slurm's runContainer in a new goroutine, which writes to the maps when getting final container state.
Aside from the concurrent write, this also means the dispatch library is effectively corrupting container objects after they've been passed to the dispatcher. This can result in c-d-s using the wrong runtime constraints, but this might be rare -- I expect c-d-s usually has enough time to consume all of the runtime_constraints/mounts content from the previous page before an incoming page of containers has a chance to mangle it.
Demonstration at https://play.golang.org/p/zATyWMInm3-?z
#3
Updated by Tom Clegg over 2 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- Target version set to 2018-08-01 Sprint
#4
Updated by Tom Clegg over 2 years ago
13788-concurrent-map-write @ d91eb0fae8d2ef4c2348769694573bf725ef16d4
#5
Updated by Lucas Di Pentima over 2 years ago
Sneaky bug. This LGTM, thanks!
#6
Updated by Tom Clegg over 2 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|8df5b745955fafef08056a742680979b81faeb1e.
#7
Updated by Tom Morris about 2 years ago
- Release set to 13
Merge branch '13788-concurrent-map-write'
fixes #13788
Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tclegg@veritasgenetics.com>