Bug #9154: data manager sometimes refuses to continue when collections are being written while it is running

Added by Joshua Randall almost 8 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Keep
Target version: -
Story points: -

Description

I've tried running arvados-data-manager several times on our system (a single run takes ~4 hours to work through our 2.4M collections, so it isn't easy to run it very many times), and each time I end up with a "refusing to continue" error.

For example, these are the last few lines of output from my latest run this morning:

2016/05/06 10:36:38 2366762 collections read, 10000 (9996 new) in last batch, 7423 remaining, 2016-05-06T06:22:49Z latest modified date, 12 72195 29563842 avg,max,total manifest size
2016/05/06 10:37:01 2374191 collections read, 7430 (7429 new) in last batch, 7 remaining, 2016-05-06T09:36:38Z latest modified date, 9 75633 21509238 avg,max,total manifest size
2016/05/06 10:37:01 2374198 collections read, 8 (7 new) in last batch, 0 remaining, 2016-05-06T09:36:50Z latest modified date, 0 4828 33796 avg,max,total manifest size
2016/05/06 10:37:14 singlerun: API server indicates a total of 2374200 collections available up to 2016-05-06T09:36:50Z, but we only retrieved 2374198. Refusing to continue as this could indicate an otherwise undetected failure.

This message is from code that I added to protect data manager from proceeding without a full view of all collections:
https://github.com/curoverse/arvados/blob/1c2af19398b425fb249e6fa8cc909500ce1fa80f/services/datamanager/collection/collection.go#L263-L270
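
In rough outline, that check asks the API server for its items_available count up to the latest modtime retrieved and bails out on any mismatch (a paraphrased sketch with illustrative names, not the verbatim source):

import (
	"fmt"
	"time"
)

// Paraphrased sketch of the linked sanity check; variable names are
// illustrative, not the ones used in collection.go.
func sanityCheck(totalAvailable, totalRetrieved int, latest time.Time) error {
	if totalAvailable != totalRetrieved {
		return fmt.Errorf("API server indicates a total of %d collections "+
			"available up to %v, but we only retrieved %d. Refusing to "+
			"continue as this could indicate an otherwise undetected failure",
			totalAvailable, latest, totalRetrieved)
	}
	return nil
}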

I suspect it is failing because collections are being written while data manager is running: in each case a few collections are added just after the last batch but before the final query for items_available (and presumably these have exactly the same timestamp as the latest collection added).

I would propose fixing this by doing one final pass through the collection retrieval loop, requesting any collections whose last modified time is equal to that of the latest collection retrieved, before checking the total items available with last modified time <= that of the latest collection retrieved.
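
A minimal sketch of that final pass, assuming a hypothetical fetchCollections helper and filter syntax (neither is the actual datamanager API):

import "time"

type Collection struct {
	UUID       string
	ModifiedAt time.Time
}

// Hypothetical sketch of the proposed final pass; fetchCollections
// stands in for the real paged API call, and seen holds the UUIDs of
// every collection retrieved so far.
func finalPass(seen map[string]bool, latest time.Time,
	fetchCollections func(filter string, arg time.Time) []Collection) {
	// Re-request every collection whose modified_at exactly equals
	// the newest timestamp seen, so collections written during the
	// run with that same timestamp are counted too.
	for _, c := range fetchCollections("modified_at = ?", latest) {
		seen[c.UUID] = true
	}
	// Only after this pass should len(seen) be compared against the
	// items_available count for modified_at <= latest.
}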

#1

Updated by Joshua Randall almost 8 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 100
#2

Updated by Tom Clegg almost 8 years ago

I suspect this isn't caused by a timestamp collision on the last collection.

Data manager is using 1s precision to format its time filters. (This is a bug.)

If the last collection received had modtime 2112-12-21T21:12.211202112Z, then our sanity check will count "modified_at <= 2112-12-21T21:12Z" and therefore won't even count the last collection we know about. So the sanity check is less conservative than we want, and you're really missing 3 collections somewhere, not 2.
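
A self-contained Go illustration of that precision loss (not datamanager code; the timestamp is made up):

package main

import (
	"fmt"
	"time"
)

func main() {
	// The last collection retrieved has a sub-second modification time.
	last := time.Date(2016, 5, 6, 9, 36, 50, 700000000, time.UTC)

	// Formatting the filter at 1s precision drops the fraction...
	filter := last.Format(time.RFC3339) // "2016-05-06T09:36:50Z"

	// ...so a "modified_at <= filter" count excludes the very
	// collection the timestamp came from.
	cutoff, _ := time.Parse(time.RFC3339, filter)
	fmt.Println(last.After(cutoff)) // prints: true
}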

Looking for other explanations, it seems that our use of "offset" in this loop is unsafe.

Say b1, b2, b3 have the same modtime (to within one second, because datamanager is filtering dates with 1s precision) and page size is 2.

collections at start    a, b1, b2, b3, c, d, e, f, g, h, i, j
offset=0                a, b1
modtime>=b1, offset=0      b1, b2
                              (b2 is modified now, and therefore becomes "k")
collections database    a, b1,     b3, c, d, e, f, g, h, i, j, k
modtime>=b1, offset=2                  c, d
modtime>=c, offset=0                         e, f

(Oops, never saw b3)

Could this be responsible?
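
For comparison, one offset-free way to page that tolerates concurrent writes (a sketch reusing the Collection type from the earlier sketch; it assumes full-precision timestamps so that no more than one page of collections shares a modtime, and fetchPage is a hypothetical helper returning collections with modified_at >= since, oldest first):

// Instead of advancing an offset, advance the modified_at filter and
// skip UUIDs already processed at the boundary timestamp.
func listAll(fetchPage func(since time.Time) []Collection) map[string]bool {
	seen := make(map[string]bool)
	var since time.Time
	for {
		progress := false
		for _, c := range fetchPage(since) {
			if !seen[c.UUID] {
				seen[c.UUID] = true
				progress = true
			}
			if c.ModifiedAt.After(since) {
				since = c.ModifiedAt
			}
		}
		if !progress {
			return seen // nothing new: the listing is complete
		}
	}
}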

#3

Updated by Joshua Randall almost 8 years ago

I agree that the lack of precision in data manager timestamps is a bug, but the final pass through the loop would still apply at higher precision if it is theoretically possible for two collections to have identical high-precision timestamps. If the API server or database guarantees that no two collections can have the same high-precision timestamp, then this would not be an issue.

If the API server interprets "modified_at <= 2112-12-21T21:12Z" as being equal to "modified_at <= 2112-12-21T21:12.000000000Z", such that it would not match "2112-12-21T21:12.211202112Z", then I agree that is another potential explanation for the errors I observed, and this fix would not solve it.

This fix assumed that truncated timestamps would be compared only to the precision they specify (i.e. that a query for "modified_at <= 2112-12-21T21:12Z" would match "2112-12-21T21:12.211202112Z", because the two are equal up to the second). If that isn't true, then this fix is pretty much pointless. Obviously, when accessing postgres directly, it wouldn't match unless you date_trunc the timestamp column, but it seems reasonable for the API server to implement '==' comparisons with truncation, so that a query like "modified_at == 2112-12-21" would automatically truncate the modified_at column to the appropriate precision to make the comparison make sense. If it doesn't do that, it should probably be an error to specify comparisons against timestamps that are not given to the full precision available.
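
To make the assumed semantics concrete, truncation-aware comparison would look something like this in Go (an illustration of the proposed interpretation, not existing API server behaviour):

import "time"

// Reduce both sides to the precision the query specifies before
// comparing, e.g. prec = time.Second for second-precision queries.
func equalAtPrecision(a, b time.Time, prec time.Duration) bool {
	return a.Truncate(prec).Equal(b.Truncate(prec))
}

equalAtPrecision(a, b, time.Second) corresponds to comparing date_trunc('second', a) with date_trunc('second', b) in postgres.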

#4

Updated by Tom Clegg about 7 years ago

  • Status changed from In Progress to Closed