Project

General

Profile

Actions

Bug #19414

closed

keep-balance panic: concurrent map read and map write

Added by Tom Clegg about 1 month ago. Updated 13 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
08/23/2022
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

It looks like this can happen when there is an "unachievable" block (referenced by a collection, but not returned by any keepstore index).

fatal error: concurrent map read and map write

goroutine 33154 [running]:
runtime.throw({0x9ca24b, 0x0})
        /usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc125a09768 sp=0xc125a09738 pc=0x437151
runtime.mapaccess1_faststr(0x927060, 0xc125a09870, {0xc106a34bd0, 0x29})
        /usr/local/go/src/runtime/map_faststr.go:21 +0x3a5 fp=0xc125a097d0 sp=0xc125a09768 pc=0x414245
main.(*BlockStateMap).get(...)
        /arvados/services/keep-balance/block_state.go:91
main.(*BlockStateMap).GetConfirmedReplication(0xc42ba0c330, {0xc86d7ee000, 0x9f8bc, 0x1010101}, {0xc0f75aeb00, 0x1, 0xc320947560})
        /arvados/services/keep-balance/block_state.go:154 +0x217 fp=0xc125a09b08 sp=0xc125a097d0 pc=0x8be617
main.(*Balancer).updateCollections.func2()
        /arvados/services/keep-balance/collection.go:187 +0x41a fp=0xc125a09fa8 sp=0xc125a09b08 pc=0x8c099a
main.goSendErr.func1()
        /arvados/services/keep-balance/collection.go:256 +0x26 fp=0xc125a09fe0 sp=0xc125a09fa8 pc=0x8c1866
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc125a09fe8 sp=0xc125a09fe0 pc=0x46a1c1
created by main.goSendErr
        /arvados/services/keep-balance/collection.go:255 +0x77

goroutine 33104 [runnable]:
main.(*BlockStateMap).GetConfirmedReplication(0xc42ba0c330, {0xc867b66000, 0x947a5, 0xc012cf1500}, {0xc0f0864880, 0x1, 0x1})
        /arvados/services/keep-balance/block_state.go:161 +0x49b
main.(*Balancer).updateCollections.func2()
        /arvados/services/keep-balance/collection.go:187 +0x41a
main.goSendErr.func1()
        /arvados/services/keep-balance/collection.go:256 +0x26
created by main.goSendErr
        /arvados/services/keep-balance/collection.go:255 +0x77

goroutine 33157 [runnable]:
main.(*BlockStateMap).GetConfirmedReplication(0xc42ba0c330, {0xc8614be000, 0x9e192, 0x1}, {0xc0e7fd1840, 0x1, 0x4146f9})
        /arvados/services/keep-balance/block_state.go:172 +0x2ec
main.(*Balancer).updateCollections.func2()
        /arvados/services/keep-balance/collection.go:187 +0x41a
main.goSendErr.func1()
        /arvados/services/keep-balance/collection.go:256 +0x26
created by main.goSendErr
        /arvados/services/keep-balance/collection.go:255 +0x77

Files

keep-balance (3.67 MB) keep-balance c3268207eeb57686c29a0f846bba1a7e6f135622-dev Tom Clegg, 08/25/2022 01:53 PM

Subtasks 1 (1 open0 closed)

Task #19415: Review 19414-keep-balance-panicIn ProgressTom Clegg08/23/2022

Actions
Actions #1

Updated by Tom Clegg about 1 month ago

  • Target version set to 2022-08-31 sprint
Actions #2

Updated by Tom Clegg about 1 month ago

Actions #3

Updated by Lucas Di Pentima about 1 month ago

Although I understand the fix, I don't get the "...when NumCPU > 2" comment, why is that necessary for the bug to happen?

LGTM, thanks!

Actions #4

Updated by Tom Clegg about 1 month ago

when NumCPU > 2

GetConfirmedReplication() is called from a goroutine that starts 1x per iteration of this loop in source:services/keep-balance/collection.go:

        // Use about 1 goroutine per 2 CPUs. Based on experiments with                                                                                                        
        // a 2-core host, using more concurrent database                                                                                                                      
        // calls/transactions makes this process slower, not faster.                                                                                                          
        for i := 0; i < runtime.NumCPU()+1/2; i++ {

...which, now that I'm paying attention, means NumCPU+0, but was surely meant to be

        for i := 0; i < (runtime.NumCPU()+1)/2; i++ {

...which means that we have only a single goroutine calling GetConfirmedReplication, and therefore no opportunity for a concurrent map read/write, on a single-core machine (or a 2-core machine, after fixing the parens).

Actions #5

Updated by Tom Clegg about 1 month ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|69158dc93fdfec57279ba227f872f3a7c01c4e78.

Actions #7

Updated by Peter Amstutz 13 days ago

  • Release set to 53
Actions

Also available in: Atom PDF