Bug #13513

Updated by Ward Vandewege almost 6 years ago

 
 After the merge of 9918-index-timeouts, keep-balance appears to hang in ComputeChangeSets: the last line it logs is "ComputeChangeSets: start".

 <pre> 
 May 22 14:06:37 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:06:37 dhhck-bi6l4-pkwwh8mhe0qgmu6 (keep2.dhhck.arvadosapi.com:25107, s3): done 
 May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 zzzzz-ivpuk-v2udip63fnkdyxf (s3:///dhhck-keep-0) on dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): add 1043919 replicas to map 
 May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 zzzzz-ivpuk-v2udip63fnkdyxf (s3:///dhhck-keep-0) on dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): done 
 May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): done 
 May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 GetCurrentState: took 10m6.992266703s 
 May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 ComputeChangeSets: start 

 </pre> 

 I stopped it after ~42 minutes. 

 <pre> 
 May 22 14:50:02 dhhck.arvadosapi.com systemd[1]: Stopping Arvados Keep Balance... 
 May 22 14:50:02 dhhck.arvadosapi.com systemd[1]: Stopped Arvados Keep Balance. 
 </pre> 
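
 To put a number on the stall, the gap between the "ComputeChangeSets: start" line and the systemd stop line can be computed from the two log excerpts above. This is only a small sketch for checking the arithmetic (Python chosen for illustration; the variable names are mine, and the year for the systemd line is assumed to match the keep-balance log prefix):

```python
from datetime import datetime

# Timestamps copied from the log excerpts above.
# The systemd line carries no year, so 2018 is assumed from the keep-balance prefix.
FMT = "%Y/%m/%d %H:%M:%S"
compute_start = datetime.strptime("2018/05/22 14:08:40", FMT)  # ComputeChangeSets: start
service_stop = datetime.strptime("2018/05/22 14:50:02", FMT)   # Stopping Arvados Keep Balance

elapsed = service_stop - compute_start
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"ComputeChangeSets ran for {minutes}m{seconds}s without logging completion")
```

 That works out to a little over 41 minutes in ComputeChangeSets, roughly four times as long as the whole GetCurrentState phase (10m6.99s) that preceded it.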

 Command line: 

 <pre> 
 /usr/bin/keep-balance -commit-trash 
 </pre> 

 I also tried with -commit-pull enabled, and the behavior was unchanged. 

 Config file: 

 <pre> 
 # cat /etc/arvados/keep-balance/keep-balance.yml  
 ################################################################### 
 #    THIS FILE IS MANAGED BY PUPPET -- CHANGES WILL BE OVERWRITTEN    # 
 ################################################################### 
 Client: 
     APIHost: dhhck.arvadosapi.com:443 
     AuthToken: STRIPPED 
     Insecure: false 
 KeepServiceTypes: 
     - s3 
 RunPeriod: 14400s 
 CollectionBatchSize: 100000 
 CollectionBuffers: 1000 
 </pre> 

 Bisecting: 

  
 |0.1.20180322172032.41e612b59-1|(with extra patch to increase timeout to 20 minutes)|OK|
 |1.1.4.20180510200716-1|(with extra patch to increase timeout to 20 minutes)|HANGS|
 |1.1.4.20180518195015-1||HANGS|
