Bug #21190

crunch-run abandoned but process still 'Running'

Added by Peter Amstutz 6 months ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Future
Story points:
-

Description

An instance was shut down due to multiple failed probes, but the container that was running on the instance remained in the 'Running' state. Shouldn't the dispatcher have marked the container as cancelled, or rescheduled it?

Nov  9 05:42:18 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceType":"m4large","PID":185885,"Priority":562948253941774988,"State":"Queued","level":"info","msg":"adding container to queue","time":"2023-11-09T05:42:18.948527443Z"}
Nov  9 05:55:33 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","IdleBehavior":"run","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2023-11-09T05:55:33.929679063Z"}
Nov  9 05:57:01 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Command":"systemctl is-system-running","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"running\n","time":"2023-11-09T05:57:01.197430954Z"}
Nov  9 05:57:01 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"level":"info","msg":"crunch-run process started","time":"2023-11-09T05:57:01.649759362Z"}
Nov  9 05:57:01 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"cmd":"sudo sh -c 'set -e; dstdir=\"/tmp/arvados-crunch-run/\"; dstfile=\"/tmp/arvados-crunch-run/crunch-run~493931196c222d127ddac3e798507810\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"493931196c222d127ddac3e798507810","level":"info","msg":"installing runner binary on worker","path":"/tmp/arvados-crunch-run/crunch-run~493931196c222d127ddac3e798507810","time":"2023-11-09T05:57:01.203216535Z"}
Nov  9 05:57:01 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"ProbeStart":"2023-11-09T05:57:01.190511228Z","level":"info","msg":"instance booted; will try probeRunning","time":"2023-11-09T05:57:01.572380813Z"}
Nov  9 05:57:01 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"ProbeStart":"2023-11-09T05:57:01.190511228Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2023-11-09T05:57:01.607380217Z"}
Nov  9 06:23:40 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Command":"sudo /tmp/arvados-crunch-run/crunch-run~493931196c222d127ddac3e798507810 --list","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"error":"dial tcp 172.25.138.199:22: i/o timeout","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2023-11-09T06:23:40.129518363Z"}
Nov  9 06:24:49 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Command":"sudo /tmp/arvados-crunch-run/crunch-run~493931196c222d127ddac3e798507810 --list","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"error":"dial tcp 172.25.138.199:22: i/o timeout","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2023-11-09T06:24:49.340496750Z"}
Nov  9 06:24:49 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","Duration":1034553577012,"Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"Since":"2023-11-09T06:06:34.786359676Z","State":"running","level":"warning","msg":"instance unresponsive, shutting down","time":"2023-11-09T06:24:49.340590651Z"}
Nov  9 06:25:27 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"Address":"172.25.138.199","ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","Instance":"i-0da7c27cae59bb18b","InstanceType":"m4large","PID":185885,"level":"info","msg":"crunch-run process abandoned","time":"2023-11-09T06:25:27.132571823Z"}
Nov  9 06:25:27 ip-172-25-144-184 arvados-dispatch-cloud[185885]: {"ClusterID":"xxxxx","Instance":"i-0da7c27cae59bb18b","PID":185885,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2023-11-09T06:25:27.129985051Z"}
Nov  9 18:10:22 ip-172-25-144-184 arvados-dispatch-cloud[245221]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceType":"m4large","PID":245221,"Priority":562948253941774988,"State":"Running","level":"info","msg":"adding container to queue","time":"2023-11-09T18:10:22.189212178Z"}
Nov  9 19:52:17 ip-172-25-144-184 arvados-dispatch-cloud[251616]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceType":"m4large","PID":251616,"Priority":562948253941774988,"State":"Running","level":"info","msg":"adding container to queue","time":"2023-11-09T19:52:17.781172241Z"}
Nov  9 20:11:47 ip-172-25-144-184 arvados-dispatch-cloud[252730]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceTypes":"m4large, c4.large","PID":252730,"Priority":562948253941774988,"State":"Running","level":"info","msg":"adding container to queue","time":"2023-11-09T20:11:47.677520485Z"}
Nov  9 20:32:09 ip-172-25-144-184 arvados-dispatch-cloud[255676]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceTypes":"c5alarge, c5large, m5alarge, m4large, c4.large","PID":255676,"Priority":562948253941774988,"State":"Running","level":"info","msg":"adding container to queue","time":"2023-11-09T20:32:09.325508159Z"}
Nov  9 20:44:20 ip-172-25-144-184 arvados-dispatch-cloud[258192]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceType":"c5alarge","PID":258192,"Priority":562948253941774988,"State":"Running","level":"info","msg":"adding container to queue","time":"2023-11-09T20:44:20.215070124Z"}
Nov  9 20:44:49 ip-172-25-144-184 arvados-dispatch-cloud[258553]: {"ClusterID":"xxxxx","ContainerUUID":"xxxxx-dz642-m4el5nk6i59n69p","InstanceTypes":"c5alarge, c5large, m5alarge, m4large, c4.large","PID":258553,"Priority":562948253941774988,"State":"Running","level":"info","msg":"adding container to queue","time":"2023-11-09T20:44:49.877242519Z"}
Actions #1

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg 6 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 6 months ago

I can't seem to force it into a cancel state, even as admin:

$ arv container update -u zzzzz-dz642-m4el5nk6i59n69p -c '{"state": "Cancelled"}'
Error: #<ActiveRecord::RecordInvalid: Validation failed: Subrequests cost cannot be modified in state 'Running' (761.9680225033519, 762.0638903677408)> (req-1gksgpn70gx7a1ro3ugi)
Actions #5

Updated by Tom Clegg 6 months ago

source:lib/dispatchcloud/scheduler/sync.go should be taking care of it. Perhaps there are instances that a-d-c can't communicate with or shut down? Is that related to the fact that a-d-c is restarting a lot (judging by the PIDs in the logs)?

                case arvados.ContainerStateRunning:
                        if !running {
                                // No worker reports this container as running.
                                // Cancel only if every worker's state is known;
                                // otherwise it might still be running on a
                                // worker we haven't probed yet.
                                if !anyUnknownWorkers {
                                        go sch.cancel(uuid, "not running on any worker")
                                }
                        } else if !exited.IsZero() && qUpdated.After(exited) {
                                // crunch-run exited, but the queue still says Running.
                                go sch.cancel(uuid, "state=Running after crunch-run exited")
                        } else if ent.Container.Priority == 0 {
                                go sch.kill(uuid, "priority=0")
                        }
Actions #6

Updated by Tom Clegg 6 months ago

Peter Amstutz wrote in #note-4:

Subrequests cost cannot be modified in state 'Running' (761.9680225033519, 762.0638903677408)

Ah. That seems like a good clue.

Actions #7

Updated by Peter Amstutz 6 months ago

Tom Clegg wrote in #note-6:

Peter Amstutz wrote in #note-4:

Subrequests cost cannot be modified in state 'Running' (761.9680225033519, 762.0638903677408)

Ah. That seems like a good clue.

That's the error I get if I use my personal account. If I use the SystemRootToken, I am able to cancel the container.
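
For reference, here is a minimal sketch of the same cancel done through the Go SDK (assuming ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in the environment, with the token set to the cluster's SystemRootToken; this is illustrative only, not the dispatcher's actual code path):

package main

import (
        "fmt"
        "log"

        "git.arvados.org/arvados.git/sdk/go/arvados"
)

func main() {
        // Assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are set, with the
        // token being the cluster's SystemRootToken.
        client := arvados.NewClientFromEnv()
        uuid := "zzzzz-dz642-m4el5nk6i59n69p"
        var ctr arvados.Container
        // PUT the state change the same way the CLI command above does.
        err := client.RequestAndDecode(&ctr, "PUT", "arvados/v1/containers/"+uuid, nil,
                map[string]interface{}{
                        "container": map[string]interface{}{"state": "Cancelled"},
                })
        if err != nil {
                log.Fatal(err)
        }
        fmt.Println(ctr.UUID, ctr.State)
}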

Actions #8

Updated by Tom Clegg 6 months ago

We might need something like this. Haven't come up with a test case yet.

diff --git a/services/api/app/models/container.rb b/services/api/app/models/container.rb
index ee338b81ff..4756bab696 100644
--- a/services/api/app/models/container.rb
+++ b/services/api/app/models/container.rb
@@ -585,7 +585,13 @@ class Container < ArvadosModel
     elsif self.locked_by_uuid && self.locked_by_uuid != current_api_client_authorization.andand.uuid
       # When locked, progress fields cannot be updated by the wrong
       # dispatcher, even though it has admin privileges.
-      permitted = permitted - progress_attrs
+      #
+      # :subrequests_cost is exempt, though: a different admin token
+      # (other than our locker) can cancel one of our child
+      # containers, and when that results in one of our child
+      # container requests being finalized, it should update our
+      # subrequests_cost.
+      permitted = permitted - (progress_attrs - [:subrequests_cost])
     end
     check_update_whitelist permitted
   end
Actions #9

Updated by Tom Clegg 6 months ago

Meanwhile, the fact that a-d-c didn't even log "cancelling container because not running on any worker" suggests something else is wrong.

func (sch *Scheduler) cancel(uuid string, reason string) {
        if !sch.uuidLock(uuid, "cancel") {
                return
        }
        defer sch.uuidUnlock(uuid)
        logger := sch.logger.WithField("ContainerUUID", uuid)
        logger.Infof("cancelling container because %s", reason)
        err := sch.queue.Cancel(uuid)
        if err != nil {
                logger.WithError(err).Print("error cancelling container")
        }
}
Actions #10

Updated by Tom Clegg 6 months ago

I'd be interested to see the last lines logged by each of the 6 PIDs mentioned in the last 6 log entries quoted. Perhaps a-d-c is crashing on a related issue?

Actions #11

Updated by Peter Amstutz 6 months ago

I checked the logs; I think all the restarts were intentional, and it wasn't crashing.

But it makes sense that it would log the container once for each restart, since the first thing it does is look at the queue and get oriented.

Actions #12

Updated by Peter Amstutz 6 months ago

But it should have determined that the container wasn't running anywhere and cancelled it, so it is still a mystery why it didn't do so. Unless it was attempting to cancel it and failing, and we just didn't log the errors.
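
One possibility worth ruling out, going only by the sync.go snippet quoted in #note-5: when any worker is still in an unknown state (for example shortly after a dispatcher restart, before the first probe of each instance has completed), the cancel branch is skipped without logging anything at all. A minimal sketch of that decision, with illustrative names rather than the real identifiers:

// Sketch only: paraphrases the ContainerStateRunning branch quoted in
// #note-5 to highlight the one path that produces no log output.
func wantCancel(runningOnSomeWorker, anyUnknownWorkers bool) (cancel bool, reason string) {
        if runningOnSomeWorker {
                // Still running somewhere; nothing to do.
                return false, ""
        }
        if anyUnknownWorkers {
                // Skipped silently: no "cancelling container because ..."
                // log line and no error is produced on this path.
                return false, ""
        }
        return true, "not running on any worker"
}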

Actions #14

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-17 sprint
Actions #15

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #16

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #17

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #18

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-02-28 sprint to To be scheduled
Actions #19

Updated by Peter Amstutz about 2 months ago

  • Target version changed from To be scheduled to Future