Bug #13982

crunch-dispatch-slurm / Go SDK Dispatcher not able to "let go" of deleted containers

Added by Joshua Randall 4 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

For the past few weeks, I have been running a modified crunch-dispatch-slurm (c-d-s) with additional debug printing as well as the fixes for #13933.

I have recently noticed that sometimes the main dispatch loop appears to get stuck repeatedly trying to "let go" of some containers because they are not returned by a query (i.e. the containers have been deleted).

An excerpt of the log output from my modified c-d-s:

Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() After adding missed containers, have 8 containers todo
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-3m6rxfesjg46r6z" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-a5d4fk5f2q652wh" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-6inq8sqsqesxa65" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-2btvtutchcid052" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-pyg35pf0z752ous" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-5gxl64dp40q2nob" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-9u1kctybowstc3p" not returned by any query, stopping tracking.

then ~75s later:

Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() After adding missed containers, have 8 containers todo
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-pyg35pf0z752ous" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-a5d4fk5f2q652wh" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-9u1kctybowstc3p" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-6inq8sqsqesxa65" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-3m6rxfesjg46r6z" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-2btvtutchcid052" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-5gxl64dp40q2nob" not returned by any query, stopping tracking.

then again 75s later at the end of the next Dispatcher.Run() loop:

Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() After adding missed containers, have 8 containers todo
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-2btvtutchcid052" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-a5d4fk5f2q652wh" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-pyg35pf0z752ous" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-9u1kctybowstc3p" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-5gxl64dp40q2nob" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-6inq8sqsqesxa65" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-3m6rxfesjg46r6z" not returned by any query, stopping tracking.

This appears to happen every time through the loop, so it appears that the stopping tracking is not having any effect.

The relevant section of code in my modified dispatch.go is:

                log.Printf("Dispatcher.Run() After adding missed containers, have %d containers todo", len(todo))

                if !querySuccess {
                        // There was an error in one of the previous queries, we probably
                        // didn't see all the containers we should have, so don't shut down
                        // the missed containers.
                        log.Printf("Dispatcher.Run() failure in query checking for updates to missed")
                        continue
                }

                // Containers that I know about that didn't show up in any
                // query should be let go.
                for uuid, tracker := range todo {
                        log.Printf("Dispatcher.Run() Container %q not returned by any query, stopping tracking.", uuid)
                        tracker.close()
                }

Full crunch-dispatch-logs grepped for one of these eight containers shows that this does indeed happen every time through the loop:

Aug  7 19:00:53 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:00:53 Dispatcher.checkListForUpdates() starting locked/running container eglyx-dz642-hce23ganevtm9h2
Aug  7 19:00:53 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:00:53 Start monitoring container eglyx-dz642-hce23ganevtm9h2 in state "Running" 
Aug  7 19:07:30 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:07:30 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-hce23ganevtm9h2
Aug  7 19:11:23 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:11:23 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:13:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:13:44 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:16:06 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:16:06 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:18:27 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:18:27 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:20:48 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:20:48 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:23:13 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:23:13 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:25:39 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:25:39 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:28:08 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:28:08 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:30:37 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:30:37 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:33:14 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:33:14 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:35:35 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:35:35 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:36:48 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:36:48 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:38:06 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:38:06 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:39:20 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:39:20 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:40:37 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:40:37 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:41:53 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:41:53 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:43:15 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:43:15 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:44:33 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:44:33 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:45:57 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:45:57 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:47:15 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:47:15 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:48:32 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:48:32 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:49:58 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:49:58 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:51:16 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:51:16 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:52:27 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:52:27 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:53:44 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:53:44 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:55:10 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:55:10 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:56:25 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:56:25 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:57:36 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:57:36 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.
Aug  7 19:58:56 arvados-master-eglyx crunch-dispatch-slurm[23753]: 2018/08/07 19:58:56 Dispatcher.Run() Container "eglyx-dz642-hce23ganevtm9h2" not returned by any query, stopping tracking.

All eight containers had a line of the form "debug: runner is handling updates slowly, discarded previous update for ..." before entering this loop, although that is pretty common so I'm not sure if it is relevant.

I would note that our current omnibus workaround for c-d-s issues (calling `systemctl restart crunch-dispatch-slurm` every hour on the hour) is preventing this from becoming a major issue as the new invocation of c-d-s does not seem to be concerned with the containers that the previous incarnation was repeatedly trying to close.

History

#1 Updated by Joshua Randall 4 months ago

Just found crunch-dispatch-slurm with 491 such containers, apparently also never going to let them go until I restarted it.

#2 Updated by Tom Morris 4 months ago

  • Target version set to To Be Groomed

Also available in: Atom PDF