Project

General

Profile

Actions

Bug #9435

closed

keep-balance failing when a small page of collections has the same modified_at timestamp

Added by Joshua Randall almost 8 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
Keep
Target version:
-
Story points:
-

Description

I got an error from keep-balance:

2016/06/18 10:55:31 collections: 3253996/3273174
2016/06/18 10:57:02 collections: 3278996/3273174
2016/06/18 10:58:31 collections: 3303996/3273174
2016/06/18 11:00:00 collections: 3328996/3273174
2016/06/18 11:01:39 collections: 3353996/3273174
2016/06/18 11:03:11 collections: 3378996/3273174
2016/06/18 11:04:44 collections: 3403996/3273174
2016/06/18 11:06:09 collections: 3428996/3273174
2016/06/18 11:07:42 collections: 3453996/3273174
2016/06/18 11:09:09 collections: 3478996/3273174
2016/06/18 11:09:48 collections: 3490827/3273174
2016/06/18 11:09:52 collections: 3491730/3273174
2016/06/18 11:09:52 collections: 3491822/3273174
2016/06/18 11:09:55 collections: 3491824/3273174
2016/06/18 11:09:55 collections: 3491901/3273174
2016/06/18 11:09:55 collections: 3491904/3273174
2016/06/18 11:09:55 GetCurrentState: took 3h32m5.093893905s
2016/06/18 11:09:55 Run: took 3h32m5.150470067s
2016/06/18 11:09:55 run failed: BUG: Received an entire page with the same modified_at timestamp (2016-06-18 10:09:55.410114 +0000 UTC), cannot make
 progress


Related issues

Related to Arvados - Bug #9496: [API] Ensure 'items_available' and 'items' (in list responses) both reflect the same database stateClosed06/27/2016Actions
Actions #1

Updated by Tom Clegg almost 8 years ago

It looks like the last page received had 3 items, and the repeated timestamp is very new, so presumably the reason we got 3 items was that there are no more recent ones to get.

I think this means we need an exception to that starvation check: it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).

Actions #2

Updated by Tom Clegg almost 8 years ago

Tom Clegg wrote:

it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).

We already have this check:

        if last.ModifiedAt == nil || *last.ModifiedAt == filterTime {
            if page.ItemsAvailable > len(page.Items) {
                // TODO: use "mtime=X && UUID>Y" 
                // filters to get all collections with
                // this timestamp, then use "mtime>X" 
                // to get the next timestamp.
                return fmt.Errorf("BUG: Received an entire page with the same modified_at timestamp (%v), cannot make progress", filterTime)
            }
            break
        }

I wonder if this is a race on the API server, when a collection is added or modified between the "get items" query and the "count items for items_available" query, so they don't reflect the same database state.

Actions #3

Updated by Tom Clegg almost 8 years ago

See https://www.postgresql.org/docs/9.1/static/transaction-iso.html → 13.2.2. Repeatable Read Isolation Level

Using "repeatable read" for every API transaction sounds like it would be nice and safe, but the locking is optimistic, so we would need auto-retry and associated safeguards. OTOH, using it for "index" should be relatively unobtrusive.
  • "Applications using this level must be prepared to retry transactions due to serialization failures."
  • "Note that only updating transactions might need to be retried; read-only transactions will never have serialization conflicts."
Actions #4

Updated by Tom Clegg almost 8 years ago

We should also implement the "mtime=X and UUID>Y" fallback described in the comment, like we did in datamanager. This is useful even with the transaction behavior fixed: it will still be possible to get a full page of results with identical timestamps.

Actions #5

Updated by Tom Clegg about 7 years ago

  • Status changed from New to Resolved

In #9998 keep-balance started using count="none" and accommodating multiple pages of identical timestamps, so this shouldn't affect keep-balance any more. #9496 remains.

Actions

Also available in: Atom PDF