Bug #9435
closedkeep-balance failing when a small page of collections has the same modified_at timestamp
Description
I got an error from keep-balance:
2016/06/18 10:55:31 collections: 3253996/3273174 2016/06/18 10:57:02 collections: 3278996/3273174 2016/06/18 10:58:31 collections: 3303996/3273174 2016/06/18 11:00:00 collections: 3328996/3273174 2016/06/18 11:01:39 collections: 3353996/3273174 2016/06/18 11:03:11 collections: 3378996/3273174 2016/06/18 11:04:44 collections: 3403996/3273174 2016/06/18 11:06:09 collections: 3428996/3273174 2016/06/18 11:07:42 collections: 3453996/3273174 2016/06/18 11:09:09 collections: 3478996/3273174 2016/06/18 11:09:48 collections: 3490827/3273174 2016/06/18 11:09:52 collections: 3491730/3273174 2016/06/18 11:09:52 collections: 3491822/3273174 2016/06/18 11:09:55 collections: 3491824/3273174 2016/06/18 11:09:55 collections: 3491901/3273174 2016/06/18 11:09:55 collections: 3491904/3273174 2016/06/18 11:09:55 GetCurrentState: took 3h32m5.093893905s 2016/06/18 11:09:55 Run: took 3h32m5.150470067s 2016/06/18 11:09:55 run failed: BUG: Received an entire page with the same modified_at timestamp (2016-06-18 10:09:55.410114 +0000 UTC), cannot make progress
Related issues
Updated by Tom Clegg almost 8 years ago
It looks like the last page received had 3 items, and the repeated timestamp is very new, so presumably the reason we got 3 items was that there are no more recent ones to get.
I think this means we need an exception to that starvation check: it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).
Updated by Tom Clegg almost 8 years ago
Tom Clegg wrote:
it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).
We already have this check:
if last.ModifiedAt == nil || *last.ModifiedAt == filterTime { if page.ItemsAvailable > len(page.Items) { // TODO: use "mtime=X && UUID>Y" // filters to get all collections with // this timestamp, then use "mtime>X" // to get the next timestamp. return fmt.Errorf("BUG: Received an entire page with the same modified_at timestamp (%v), cannot make progress", filterTime) } break }
I wonder if this is a race on the API server, when a collection is added or modified between the "get items" query and the "count items for items_available" query, so they don't reflect the same database state.
Updated by Tom Clegg almost 8 years ago
See https://www.postgresql.org/docs/9.1/static/transaction-iso.html → 13.2.2. Repeatable Read Isolation Level
Using "repeatable read" for every API transaction sounds like it would be nice and safe, but the locking is optimistic, so we would need auto-retry and associated safeguards. OTOH, using it for "index" should be relatively unobtrusive.- "Applications using this level must be prepared to retry transactions due to serialization failures."
- "Note that only updating transactions might need to be retried; read-only transactions will never have serialization conflicts."
Updated by Tom Clegg almost 8 years ago
We should also implement the "mtime=X and UUID>Y" fallback described in the comment, like we did in datamanager. This is useful even with the transaction behavior fixed: it will still be possible to get a full page of results with identical timestamps.