Bug #9154

data manager sometimes refuses to continue when collections are being written while it is running

Added by Joshua Randall almost 8 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assigned To:
Category: Keep
Target version: -
Story points: -

Description

I've tried running arvados-data-manager several times on our system (a single run takes ~4 hours to get through our 2.4M collections, so it isn't easy to run it many times), and each time I end up with a "refusing to continue" error.

For example, these are the last few lines of output from my latest run from this morning:

2016/05/06 10:36:38 2366762 collections read, 10000 (9996 new) in last batch, 7423 remaining, 2016-05-06T06:22:49Z latest modified date, 12 72195 29563842 avg,max,total manifest size
2016/05/06 10:37:01 2374191 collections read, 7430 (7429 new) in last batch, 7 remaining, 2016-05-06T09:36:38Z latest modified date, 9 75633 21509238 avg,max,total manifest size
2016/05/06 10:37:01 2374198 collections read, 8 (7 new) in last batch, 0 remaining, 2016-05-06T09:36:50Z latest modified date, 0 4828 33796 avg,max,total manifest size
2016/05/06 10:37:14 singlerun: API server indicates a total of 2374200 collections available up to 2016-05-06T09:36:50Z, but we only retrieved 2374198. Refusing to continue as this could indicate an otherwise undetected failure.

This message is from code that I added to protect data manager from proceeding without a full view of all collections:
https://github.com/curoverse/arvados/blob/1c2af19398b425fb249e6fa8cc909500ce1fa80f/services/datamanager/collection/collection.go#L263-L270

I suspect the reason that it is failing is that collections are being written as data manager is running, and in each case there are a few collections added just after the last batch but before the final query for items_available (and presumably these have the same exact timestamp as the latest collection added).

I would propose fixing this by doing one final pass through the collection retrieval loop to request any collections with last modified time equal to that of the latest collection retrieved, and only then checking the total items available with last modified time <= that of the latest collection retrieved.
