Bug #13513

closed

[keep-balance] hang on ComputeChangeSets

Added by Ward Vandewege almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

After the merge of 9918-index-timeouts, I'm observing that keep-balance appears to hang in ComputeChangeSets. It logs the start of that phase and then nothing further:

May 22 14:06:37 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:06:37 dhhck-bi6l4-pkwwh8mhe0qgmu6 (keep2.dhhck.arvadosapi.com:25107, s3): done
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 zzzzz-ivpuk-v2udip63fnkdyxf (s3:///dhhck-keep-0) on dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): add 1043919 replicas to map
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 zzzzz-ivpuk-v2udip63fnkdyxf (s3:///dhhck-keep-0) on dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): done
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): done
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 GetCurrentState: took 10m6.992266703s
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 ComputeChangeSets: start

I stopped it after ~42 minutes.

May 22 14:50:02 dhhck.arvadosapi.com systemd[1]: Stopping Arvados Keep Balance...
May 22 14:50:02 dhhck.arvadosapi.com systemd[1]: Stopped Arvados Keep Balance.
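For reference, the time between the "ComputeChangeSets: start" line and the systemd stop can be checked from the two timestamps quoted above. This is only a small sketch: the helper name `elapsed` is made up for illustration, and the year 2018 is taken from the keep-balance log lines because the syslog prefix does not carry one.

```python
from datetime import datetime, timedelta

# Syslog prefixes omit the year; 2018 comes from the keep-balance
# log lines in this report ("2018/05/22 ...").
# Note: %b expects English month abbreviations (C/English locale).
FORMAT = "%Y %b %d %H:%M:%S"
YEAR = "2018"


def elapsed(start: str, stop: str) -> timedelta:
    """Return the time between two syslog-style timestamps."""
    t0 = datetime.strptime(f"{YEAR} {start}", FORMAT)
    t1 = datetime.strptime(f"{YEAR} {stop}", FORMAT)
    return t1 - t0


# "ComputeChangeSets: start" versus "Stopping Arvados Keep Balance..."
print(elapsed("May 22 14:08:40", "May 22 14:50:02"))  # 0:41:22
```

So the process sat in ComputeChangeSets for a little over 41 minutes without logging anything before it was stopped.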

Command line:

/usr/bin/keep-balance -commit-trash

I also tried with -commit-pull enabled, and the behavior was unchanged.

Config file:

# cat /etc/arvados/keep-balance/keep-balance.yml 
###################################################################
#  THIS FILE IS MANAGED BY PUPPET -- CHANGES WILL BE OVERWRITTEN  #
###################################################################
Client:
    APIHost: dhhck.arvadosapi.com:443
    AuthToken: STRIPPED
    Insecure: false
KeepServiceTypes:
    - s3
RunPeriod: 14400s
CollectionBatchSize: 100000
CollectionBuffers: 1000

Bisecting:

0.1.20180322172032.41e612b59-1 (with extra patch to increase timeout to 20 minutes) OK
1.1.4.20180403215323-1 (with extra patch to increase timeout to 20 minutes) OK
1.1.4.20180420195921-1 (with extra patch to increase timeout to 20 minutes) OK
1.1.4.20180426154228-1 (with extra patch to increase timeout to 20 minutes) OK
1.1.4.20180426193406-1 (with extra patch to increase timeout to 20 minutes) HANGS
1.1.4.20180510200716-1 (with extra patch to increase timeout to 20 minutes) HANGS
1.1.4.20180518195015-1 HANGS

So, it looks like the problem was introduced between version 1.1.4.20180426154228-1 (fcfbbddf572db32008fcdc7d0750a13b8d6f3b1c) and version 1.1.4.20180426193406-1 (932e3d6e9a899cc662ea3934b79057d39cd88fed).
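The bisection above can be sketched as a binary search over the ordered package versions. This is an illustration only, not the procedure actually used: `first_bad` and `hangs` are hypothetical names, and `hangs` simply replays the results listed in this report, whereas in practice it would mean installing the package and running keep-balance against the cluster.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad version.

    Assumes versions are ordered oldest to newest, the first one is good,
    the last one is bad, and the behavior flips exactly once.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Versions tested in this report, oldest first.
VERSIONS = [
    "0.1.20180322172032.41e612b59-1",
    "1.1.4.20180403215323-1",
    "1.1.4.20180420195921-1",
    "1.1.4.20180426154228-1",
    "1.1.4.20180426193406-1",
    "1.1.4.20180510200716-1",
    "1.1.4.20180518195015-1",
]

# Stand-in for a real test run: the versions observed to hang above.
OBSERVED_HANGS = set(VERSIONS[4:])


def hangs(version: str) -> bool:
    return version in OBSERVED_HANGS


print(first_bad(VERSIONS, hangs))  # 1.1.4.20180426193406-1
```

With seven candidate versions, the search needs only two test runs once the endpoints are known to be good and bad.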


Subtasks: 1 (0 open, 1 closed)

Task #13529: Review 13513-balance-deadlock (Closed, Ward Vandewege, 05/29/2018)