Bug #17293
closedLong-running container log got trashed unexpectedly
Description
Customer reported that a long-lived (multi-day) workflow failed and was automatically restarted. This turned out to be due to getting 404 errors when trying to update the log collection.
It was determined that the log collection had been put in the trash. It seems like there is a bug where the trash_at/delete_at
times were intended to be pushed into the future each time the log collection is updated, but there seems to be some interaction where that is not working as intended.
crunchrun.go:
func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) { ... mt, err := runner.LogCollection.MarshalManifest(".") if err != nil { err = fmt.Errorf("error creating log manifest: %v", err) return } updates := arvadosclient.Dict{ "name": "logs for " + runner.Container.UUID, "manifest_text": mt, } if final { updates["is_trashed"] = true } else { exp := time.Now().Add(crunchLogUpdatePeriod * 24) updates["trash_at"] = exp updates["delete_at"] = exp }
There were reports of keep write failures. If it failed 24 consecutive times, then it would hit the trash_at time. It might be a good idea to tweak the behavior so that if there is an error writing the manifest and runner.logUUID != ""
then it proceeds to updates the collection anyway but does not include the manifest text in the update.
Update: this seems to be the case:
Indeed keep was unable to write for over 12 hours. It seems this was a consequence of how we start/stop keep on the compute nodes for Arvados jobs and an issue we had in Slurm where it was unresponsive for some time. Keep may have been prematurely stopped on the compute node causing the issue. It was then started later (too late), and that caused the issue. I'll be working on fixing the way we stop keep to be more robust to some Slurm errors.
We should adjust the behavior as described so that it doesn't get trashed by accident this way.
Updated by Peter Amstutz almost 4 years ago
- Subject changed from Logs of long-running containers get trashed unexpectedly to Long-running container log got trashed unexpectedly
Updated by Peter Amstutz almost 4 years ago
- Description updated (diff)
Originally I was suspicious of this code, but now I think it is fine:
def default_trash_interval ... elsif delete_at_changed? && delete_at >= trash_at # Fix delete_at if needed, so it's not earlier than the expiry # time on any permission tokens that might have been given out. # In any case there are no signatures expiring after now+TTL. # Also, if the existing trash_at time has already passed, we # know we haven't given out any signatures since then. earliest_delete = [ @validation_timestamp, trash_at_was, ].compact.min + Rails.configuration.Collections.BlobSigningTTL # The previous value of delete_at is also an upper bound on the # longest-lived permission token. For example, if TTL=14, # trash_at_was=now-7, delete_at_was=now+7, then it is safe to # set trash_at=now+6, delete_at=now+8. earliest_delete = [earliest_delete, delete_at_was].compact.min # If delete_at is too soon, use the earliest possible time. if delete_at < earliest_delete self.delete_at = earliest_delete end end
Updated by Peter Amstutz almost 4 years ago
17293-save-logs @ 3cb8be1e32a1b2e477501c38cefa4c48bbbb8c29
Updated by Peter Amstutz almost 4 years ago
- Status changed from New to In Progress
Updated by Tom Clegg almost 4 years ago
LGTM. Just one thought: did you check that the trash/delete times do in fact get pushed ahead in an update even when the update attribute values are all identical to the previous version? (I think they do, but I haven't gone digging for a test to confirm it.)
Updated by Tom Clegg almost 4 years ago
Tom Clegg wrote:
check that the trash/delete times do in fact get pushed ahead in an update even when the update attribute values are all identical to the previous version?
Duh, sorry, "git diff" just wasn't showing me the lines where crunchrun is explicitly changing the trash/delete attrs, we're not relying on controller/railsapi to extend them implicitly. Clearly it's fine.
Updated by Peter Amstutz almost 4 years ago
- Status changed from In Progress to Resolved