Bug #17293

Long-running container log got trashed unexpectedly

Added by Peter Amstutz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

A customer reported that a long-lived (multi-day) workflow failed and was automatically restarted. This turned out to be caused by 404 errors when trying to update the log collection.

It was determined that the log collection had been put in the trash. The trash_at/delete_at times are intended to be pushed into the future each time the log collection is updated, but some interaction is preventing that from working as intended.

crunchrun.go:

func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) {
...
    mt, err := runner.LogCollection.MarshalManifest(".")
    if err != nil {
        err = fmt.Errorf("error creating log manifest: %v", err)
        return
    }
    updates := arvadosclient.Dict{
        "name":          "logs for " + runner.Container.UUID,
        "manifest_text": mt,
    }
    if final {
        updates["is_trashed"] = true
    } else {
        exp := time.Now().Add(crunchLogUpdatePeriod * 24)
        updates["trash_at"] = exp
        updates["delete_at"] = exp
    }
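
For context, crunchLogUpdatePeriod is the interval at which crunch-run periodically flushes its log collection. Assuming a default of 30 minutes (an assumption for illustration, not quoted from this ticket), the non-final branch above schedules trash_at/delete_at 12 hours out, which lines up with the roughly 12-hour keep outage reported below:

    // Illustration only: the trash window scheduled by the non-final branch above.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Assumed flush interval; check crunchrun.go for the actual value.
        crunchLogUpdatePeriod := time.Hour / 2
        // Each non-final save pushes trash_at/delete_at this far into the future.
        window := crunchLogUpdatePeriod * 24
        fmt.Println(window) // prints "12h0m0s"
    }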

There were reports of keep write failures. If the periodic log update failed 24 consecutive times, the collection would reach its trash_at time. It might be a good idea to tweak the behavior so that if there is an error writing the manifest and runner.logUUID != "", the collection is still updated (pushing trash_at/delete_at forward) but the manifest text is left out of the update.

Update: this seems to be the case:

Indeed, keep was unable to write for over 12 hours. This seems to have been a consequence of how we start/stop keep on the compute nodes for Arvados jobs, combined with a period during which Slurm was unresponsive. Keep was likely stopped prematurely on the compute node and only started again later, too late to avoid the problem. I'll be working on making the way we stop keep more robust to Slurm errors.

We should adjust the behavior as described above so that the log collection doesn't get trashed by accident this way.
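
A minimal sketch of that adjustment, reusing the excerpt above (runner.logUUID comes from the ticket; the rest of saveLogCollection is assumed unchanged): when MarshalManifest fails but a log collection already exists, send the update without manifest_text so trash_at/delete_at still get pushed forward.

    mt, err := runner.LogCollection.MarshalManifest(".")
    updates := arvadosclient.Dict{
        "name": "logs for " + runner.Container.UUID,
    }
    if err == nil {
        // Manifest flushed successfully; include it in the update.
        updates["manifest_text"] = mt
    } else if runner.logUUID == "" {
        // No existing log collection to refresh, so fail as before.
        err = fmt.Errorf("error creating log manifest: %v", err)
        return
    }
    // Whether or not the manifest was written, keep the collection alive.
    if final {
        updates["is_trashed"] = true
    } else {
        exp := time.Now().Add(crunchLogUpdatePeriod * 24)
        updates["trash_at"] = exp
        updates["delete_at"] = exp
    }

With this, even if keep cannot accept writes for an extended period, each periodic save still moves trash_at/delete_at 24 update periods into the future, so a temporary outage no longer leads to the log collection being trashed.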


Subtasks 1 (0 open, 1 closed)

Task #17297: Review 17293-save-logs (Resolved, Tom Clegg, 01/29/2021)
