Bug #7979
closed
Python SDK needs more aggressive retries
Description
Pipeline https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-5izco85vnnq6vt3#Components fails with KeepReadError: failed to read 6c35d441091109f77b5457c40812dbae+26720+Aee4a4920fd4db4112a45df5b58b065d17ea31d4c@5679c987: service http://keep14.su92l.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found
nico reports:
keep17.su92l:/home/nico# ls /data/su92l-keep-*/keep/6c3/6c35d441091109f77b5457c40812dbae
/data/su92l-keep-3/keep/6c3/6c35d441091109f77b5457c40812dbae
keep17.su92l:/home/nico# md5sum /data/su92l-keep-*/keep/6c3/6c35d441091109f77b5457c40812dbae
6c35d441091109f77b5457c40812dbae /data/su92l-keep-3/keep/6c3/6c35d441091109f77b5457c40812dbae
keep17.su92l:/home/nico#
But the keepstore never gets it:
grep 6c35d441091109f77b5457c40812dbae /etc/sv/keepstore/log/main/current
1!keep17.su92l:/home/nico#
This is probably a network glitch or something similar.
Updated by Brett Smith about 9 years ago
Something went wrong here, but it's not clear that retrying would've helped. Open questions:
- Why is only one service listed in the exception? Every service the Keep client queried should be listed. Is there maybe a bug in the retry code where the exception only includes errors from the most recent run of the retry loop? Or is it possible that the client, in fact, only knew about this one service, and so it's the only one it queried? (If so, that would explain why the block wasn't found.)
- Is it possible that the request did in fact make it to keep17, but the logs had been rotated so it wasn't in the current log?
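To make the first question concrete, here is a minimal sketch of a retry loop that keeps errors from every pass, so the final exception names every service that was queried. It is not the Arvados SDK's actual code: `ReadError`, `read_with_retries`, and the `fetch` callable are hypothetical names used only for illustration, and the sketch assumes every failure surfaces as an `IOError`.

```python
class ReadError(Exception):
    """Hypothetical stand-in for KeepReadError; keeps per-service errors."""

    def __init__(self, message, service_errors):
        self.service_errors = dict(service_errors)
        details = "; ".join(
            "service {} responded with {}".format(url, err)
            for url, err in sorted(self.service_errors.items()))
        super().__init__("{}: {}".format(message, details))


def read_with_retries(locator, services, fetch, num_retries=2):
    """Query every service on every pass and report all failures.

    `fetch(url, locator)` is an assumed callable that returns the block
    data or raises IOError.
    """
    # The dict lives outside the retry loop. Resetting it at the top of
    # each pass would reproduce the suspected bug: only errors from the
    # final pass would reach the exception.
    errors = {}
    for _attempt in range(num_retries + 1):
        for url in services:
            try:
                return fetch(url, locator)
            except IOError as exc:
                errors[url] = exc  # remember the latest error per service
    raise ReadError("failed to read " + locator, errors)
```

If an exception built this way still lists a single service, the client most likely only knew about that one service, which is the second possibility raised above.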