Bug #17293
Updated by Peter Amstutz almost 4 years ago
Customer reported that a long-lived (multi-day) workflow failed and was automatically restarted. This turned out to be due to getting 404 errors when trying to update the log collection. It was determined that the log collection had been put in the trash. It seems like there is a bug where the @trash_at/delete_at@ times were intended to be pushed into the future each time the log collection is updated, but there seems to be some interaction where that is not working as intended. crunchrun.go: <pre> func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) { ... mt, err := runner.LogCollection.MarshalManifest(".") if err != nil { err = fmt.Errorf("error creating log manifest: %v", err) return } updates := arvadosclient.Dict{ "name": "logs for " + runner.Container.UUID, "manifest_text": mt, } if final { updates["is_trashed"] = true } else { exp := time.Now().Add(crunchLogUpdatePeriod * 24) updates["trash_at"] = exp updates["delete_at"] = exp } </pre> trashable.rb: There were reports of keep write failures. If it failed 24 consecutive times, then it would hit the <pre> def default_trash_interval ... elsif delete_at_changed? && delete_at >= trash_at time. It might be a good idea to tweak # Fix delete_at if needed, so it's not earlier than the behavior so expiry # time on any permission tokens that might have been given out. # In any case there are no signatures expiring after now+TTL. # Also, if there the existing trash_at time has already passed, we # know we haven't given out any signatures since then. earliest_delete = [ @validation_timestamp, trash_at_was, ].compact.min + Rails.configuration.Collections.BlobSigningTTL # The previous value of delete_at is also an error writing upper bound on the manifest and @runner.logUUID != ""@ # longest-lived permission token. For example, if TTL=14, # trash_at_was=now-7, delete_at_was=now+7, then it proceeds is safe to updates # set trash_at=now+6, delete_at=now+8. earliest_delete = [earliest_delete, delete_at_was].compact.min # If delete_at is too soon, use the collection anyway but does not include the manifest text in the update. earliest possible time. if delete_at < earliest_delete self.delete_at = earliest_delete end end </pre>