Bug #17293
Updated by Peter Amstutz almost 4 years ago
Customer reported that a long-lived (multi-day) workflow failed and was automatically restarted. This turned out to be due to getting 404 errors when trying to update the log collection. It was determined that the log collection had been put in the trash. It seems like there is a bug where the @trash_at/delete_at@ times were intended to be pushed into the future each time the log collection is updated, but there seems to be some interaction where that is not working as intended. doesn't actually work: crunchrun.go: <pre> func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) { ... if final { updates["is_trashed"] = true } else { exp := time.Now().Add(crunchLogUpdatePeriod * 24) updates["trash_at"] = exp updates["delete_at"] = exp } </pre> trashable.rb: <pre> def default_trash_interval ... elsif delete_at_changed? && delete_at >= trash_at # Fix delete_at if needed, so it's not earlier than the expiry # time on any permission tokens that might have been given out. # In any case there are no signatures expiring after now+TTL. # Also, if the existing trash_at time has already passed, we # know we haven't given out any signatures since then. earliest_delete = [ @validation_timestamp, trash_at_was, ].compact.min + Rails.configuration.Collections.BlobSigningTTL # The previous value of delete_at is also an upper bound on the # longest-lived permission token. For example, if TTL=14, # trash_at_was=now-7, delete_at_was=now+7, then it is safe to # set trash_at=now+6, delete_at=now+8. earliest_delete = [earliest_delete, delete_at_was].compact.min # If delete_at is too soon, use the earliest possible time. if delete_at < earliest_delete self.delete_at = earliest_delete end end </pre>