Bug #17293

Long-running container log got trashed unexpectedly

Added by Peter Amstutz 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: -
Start date: 01/29/2021
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Customer reported that a long-lived (multi-day) workflow failed and was automatically restarted. This turned out to be caused by 404 errors when trying to update the log collection.

It was determined that the log collection had been put in the trash. The trash_at/delete_at times are intended to be pushed into the future each time the log collection is updated, but some interaction appears to keep that from working as intended.

crunchrun.go:

func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) {
...
    mt, err := runner.LogCollection.MarshalManifest(".")
    if err != nil {
        err = fmt.Errorf("error creating log manifest: %v", err)
        return
    }
    updates := arvadosclient.Dict{
        "name":          "logs for " + runner.Container.UUID,
        "manifest_text": mt,
    }
    if final {
        updates["is_trashed"] = true
    } else {
        exp := time.Now().Add(crunchLogUpdatePeriod * 24)
        updates["trash_at"] = exp
        updates["delete_at"] = exp
    }

There were reports of keep write failures. If the manifest write failed on 24 consecutive update attempts, the trash_at time (set crunchLogUpdatePeriod * 24 in the future) would pass without ever being pushed forward, because saveLogCollection returns early when MarshalManifest fails. It might be a good idea to tweak the behavior so that if there is an error writing the manifest and runner.logUUID != "", it proceeds to update the collection anyway but does not include the manifest text in the update; a sketch of this follows.
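
A rough sketch of that adjustment (the DispatcherArvClient Create/Update calls and the reqBody shape are assumptions about the surrounding crunchrun code, not a copy of the actual change): build the updates dict regardless of whether MarshalManifest succeeds, only include manifest_text when it does, and report both errors at the end:

func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) {
    updates := arvadosclient.Dict{
        "name": "logs for " + runner.Container.UUID,
    }
    mt, errMarshal := runner.LogCollection.MarshalManifest(".")
    if errMarshal == nil {
        // Only send the manifest text if it was flushed successfully.
        updates["manifest_text"] = mt
    }
    // Even if flushing the manifest failed, still push the expiry
    // forward (or trash the collection on the final update), so a
    // run of keep write errors can no longer let trash_at lapse.
    if final {
        updates["is_trashed"] = true
    } else {
        exp := time.Now().Add(crunchLogUpdatePeriod * 24)
        updates["trash_at"] = exp
        updates["delete_at"] = exp
    }
    reqBody := arvadosclient.Dict{"collection": updates}
    var errUpdate error
    if runner.logUUID == "" {
        reqBody["ensure_unique_name"] = true
        errUpdate = runner.DispatcherArvClient.Create("collections", reqBody, &response)
    } else {
        errUpdate = runner.DispatcherArvClient.Update("collections", runner.logUUID, reqBody, &response)
    }
    if errUpdate == nil {
        runner.logUUID = response.UUID
    }
    if errMarshal != nil || errUpdate != nil {
        err = fmt.Errorf("error recording logs: %v, %v", errMarshal, errUpdate)
    }
    return
}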

Update: this seems to be the case:

Indeed keep was unable to write for over 12 hours. It seems this was a consequence of how we start/stop keep on the compute nodes for Arvados jobs, combined with a Slurm issue where it was unresponsive for some time. Keep may have been stopped prematurely on the compute node and only restarted later (too late), which caused the write failures. I'll be working on making the way we stop keep more robust to these Slurm errors.

We should adjust the behavior as described above so that the log collection doesn't get trashed by accident this way.


Subtasks

Task #17297: Review 17293-save-logs (Resolved, Tom Clegg)

Associated revisions

Revision 3d1a90b9
Added by Peter Amstutz 3 months ago

Merge branch '17293-save-logs' refs #17293

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

#2 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

#3 Updated by Peter Amstutz 3 months ago

  • Subject changed from Logs of long-running containers get trashed unexpectedly to Long-running container log got trashed unexpectedly

#4 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

Originally I was suspicious of this code, but now I think it is fine:

  def default_trash_interval
    ...

    elsif delete_at_changed? && delete_at >= trash_at
      # Fix delete_at if needed, so it's not earlier than the expiry
      # time on any permission tokens that might have been given out.

      # In any case there are no signatures expiring after now+TTL.
      # Also, if the existing trash_at time has already passed, we
      # know we haven't given out any signatures since then.
      earliest_delete = [
        @validation_timestamp,
        trash_at_was,
      ].compact.min + Rails.configuration.Collections.BlobSigningTTL

      # The previous value of delete_at is also an upper bound on the
      # longest-lived permission token. For example, if TTL=14,
      # trash_at_was=now-7, delete_at_was=now+7, then it is safe to
      # set trash_at=now+6, delete_at=now+8.
      earliest_delete = [earliest_delete, delete_at_was].compact.min

      # If delete_at is too soon, use the earliest possible time.
      if delete_at < earliest_delete
        self.delete_at = earliest_delete
      end
    end
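
To double-check the arithmetic in the comment above, here is the same earliest_delete calculation as a small standalone Go snippet (purely illustrative; the variable names are not from the codebase):

package main

import (
    "fmt"
    "time"
)

// earlier returns the earlier of two times.
func earlier(a, b time.Time) time.Time {
    if a.Before(b) {
        return a
    }
    return b
}

func main() {
    day := 24 * time.Hour
    now := time.Now()

    // Values from the example in the comment: TTL=14 days,
    // trash_at_was=now-7, delete_at_was=now+7.
    ttl := 14 * day
    validationTimestamp := now
    trashAtWas := now.Add(-7 * day)
    deleteAtWas := now.Add(7 * day)

    // No signature expires after min(validation time, old trash_at) + TTL...
    earliestDelete := earlier(validationTimestamp, trashAtWas).Add(ttl) // now+7
    // ...and the old delete_at is also an upper bound on token lifetime.
    earliestDelete = earlier(earliestDelete, deleteAtWas) // still now+7

    // A requested delete_at of now+8 is allowed; now+6 would be bumped up.
    requested := now.Add(8 * day)
    if requested.Before(earliestDelete) {
        requested = earliestDelete
    }
    fmt.Println("effective delete_at:", requested)
}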

#5 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

#6 Updated by Peter Amstutz 3 months ago

  • Assigned To set to Peter Amstutz

#8 Updated by Peter Amstutz 3 months ago

  • Status changed from New to In Progress

#9 Updated by Tom Clegg 3 months ago

LGTM. Just one thought: did you check that the trash/delete times do in fact get pushed ahead in an update even when the update attribute values are all identical to the previous version? (I think they do, but I haven't gone digging for a test to confirm it.)

#10 Updated by Tom Clegg 3 months ago

Tom Clegg wrote:

check that the trash/delete times do in fact get pushed ahead in an update even when the update attribute values are all identical to the previous version?

Duh, sorry, "git diff" just wasn't showing me the lines where crunchrun is explicitly changing the trash/delete attrs; we're not relying on controller/railsapi to extend them implicitly. Clearly it's fine.

#11 Updated by Peter Amstutz 3 months ago

  • Status changed from In Progress to Resolved

#12 Updated by Peter Amstutz about 2 months ago

  • Release set to 37
