Idea #7475
Closed: [Node manager] Better communication when job is unsatisfiable
Description
When a job cannot be satisfied by node manager, it will be queued forever with no feedback to the user (and almost no feedback to the admin, either). There are two distinct cases:
1) A job's min_nodes request is greater than node manager's configured max_nodes. In this case, node manager silently skips over the job with no feedback as to why no nodes are being started.
2) A job's resource requirements for a single node exceed the available cloud node size. In this case, the only indication this is a problem is a message of "job XXX not satisfiable" in the node manager log (and even then only if debug logging is turned on).
If a job request cannot be satisfied under its current configuration, node manager should have some way of signaling this to the user.
Updated by Peter Amstutz over 9 years ago
- Tracker changed from Bug to Idea
- Description updated (diff)
Updated by Brett Smith over 9 years ago
This can't just be Node Manager's job though, right? The system needs to know what Node Manager is willing to do, but any of these problems can also arise on static clusters that aren't even running Node Manager.
Updated by Peter Amstutz over 9 years ago
Yes, that's true. I think the right long term solution is for crunch v2 to combine the jobs of crunch-dispatch and node manager into one process, because otherwise neither process has quite enough information to be able to tell the user what's actually going on.
In the short term, there's still benefit in making incremental improvements to node manager for cloud installs.
Updated by Tom Clegg almost 9 years ago
It seems like Nodemanager should emit a log (with object_uuid == job uuid) and cancel the job.
If we start telling crunch-dispatch whether nodemanager is running, in cases where nodemanager isn't running, crunch-dispatch could emit a log and cancel the job if it's unsatisfiable with the current set of (alive?) slurm nodes.
Short of running nodemanager on static clusters (add a slurm driver?) it seems like we need the logic in both places if we want to fix the bug in both types of install.
Updated by Tom Clegg over 7 years ago
For crunch2, when node manager is not in use, sbatch rejects unsatisfiable jobs and the user gets an error -- however, crunch-dispatch-slurm will keep retrying forever. This infinite-retry problem will be mostly addressed by #9688, but ideally crunch-dispatch-slurm should also recognize the "unsatisfiable job" error as a non-retryable error, and tell the API server that it won't be re-attempted (if crunch-dispatch-slurm assumes/knows that it is the only dispatcher, it can indicate this by cancelling the container).
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-05-24 sprint to 2017-06-07 sprint
Updated by Lucas Di Pentima over 7 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-06-07 sprint to 2017-06-21 sprint
Updated by Lucas Di Pentima over 7 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Updated by Lucas Di Pentima over 7 years ago
Updates @ 3dad67f27
Test run: https://ci.curoverse.com/job/developer-run-tests/376/
Modified ServerCalculator.servers_for_queue() so that it also returns a dict with information about unsatisfiable jobs that should be cancelled by its caller.
Updated some tests that started failing because of this change.
New tests pending.
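For readers following along, the return shape described above can be sketched roughly as follows. This is not the code from the branch: the queue item keys, the (name, cores) size tuples and the reason strings are all made up for illustration, and only the idea of returning a second dict keyed by uuid comes from the update.

```python
def servers_for_queue(queue, sizes, max_nodes):
    """Return (wanted_sizes, unsatisfiable).

    Hypothetical inputs: queue is a list of dicts with 'uuid',
    'min_nodes' and 'min_cores'; sizes is a list of (name, cores)
    tuples ordered from cheapest to most expensive.
    """
    wanted = []
    unsatisfiable = {}  # uuid -> human-readable reason, for the caller to log
    for job in queue:
        # Pick the cheapest size that has enough cores for one node.
        fit = next((name for name, cores in sizes
                    if cores >= job['min_cores']), None)
        if fit is None:
            unsatisfiable[job['uuid']] = (
                "requirements for a single node exceed the available "
                "cloud node size")
        elif job['min_nodes'] > max_nodes:
            unsatisfiable[job['uuid']] = (
                "min_nodes %d exceeds configured max_nodes %d"
                % (job['min_nodes'], max_nodes))
        else:
            wanted.extend([fit] * job['min_nodes'])
    return wanted, unsatisfiable
```

The caller can then request nodes for the first value and, for each entry in the second, emit a log and cancel the job.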
Updated by Lucas Di Pentima over 7 years ago
New updates at f77d08dd5
Test run: https://ci.curoverse.com/job/developer-run-tests/377/
- Enhanced error checking when trying to emit a log and cancel an unsatisfiable job.
- Added test cases.
Updated by Peter Amstutz over 7 years ago
7475-nodemgr-unsatisfiable-job-comms @ f77d08dd57a1021525717c8669296eb3e463c5f7
- In _got_response, the uuid can be either a job or a container. It needs to look at the type field of the uuid. This is only valid if the uuid is for a job:
self._client.jobs().cancel(uuid=job_uuid).execute()
If the uuid is for a container and self.slurm_queue is true, it should do this:
subprocess.check_call(['scancel', '--name='+uuid])
This may require a stub to ensure that tests don't try to call the real scancel.
I'd like to see an integration test, if it isn't too much work. Upon seeing the log message about an unsatisfiable job/container, it should check that (a) the expected log message was added and (b) the job was cancelled/scancel was called.
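A rough sketch of the uuid-type dispatch suggested in this review, for illustration only. The type codes 8i9sb (job) and dz642 (container) are assumptions here and should be checked against the installation; cancel_job and run are hypothetical injection points so that tests never reach the real API or the real scancel.

```python
import subprocess

JOB_TYPE = '8i9sb'        # assumed uuid type code for jobs
CONTAINER_TYPE = 'dz642'  # assumed uuid type code for containers


def uuid_type(uuid):
    """Return the middle (type) field of a prefix-type-suffix uuid."""
    parts = uuid.split('-')
    if len(parts) != 3:
        raise ValueError("unexpected uuid format: %r" % (uuid,))
    return parts[1]


def cancel_unsatisfiable(uuid, cancel_job, slurm_queue,
                         run=subprocess.check_call):
    """Cancel a job through the API, or a container through scancel.

    cancel_job is a callable taking the job uuid (for example a thin
    wrapper around the jobs().cancel() call quoted above). Returns
    'job', 'container', or None if nothing was done.
    """
    kind = uuid_type(uuid)
    if kind == JOB_TYPE:
        cancel_job(uuid)
        return 'job'
    if kind == CONTAINER_TYPE and slurm_queue:
        run(['scancel', '--name=' + uuid])
        return 'container'
    return None
```

Passing run as a parameter is one way to get the stub mentioned above: a unit test can hand in a function that just records its arguments.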
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-07-05 sprint to 2017-07-19 sprint
Updated by Lucas Di Pentima over 7 years ago
Updates at f507162f3
Test run: https://ci.curoverse.com/job/developer-run-tests/378/
Added support for unsatisfiable containers. Updated unit test to cover both cases.
Pending: integration test.
Updated by Peter Amstutz over 7 years ago
Writing an integration test:
Start by copying "test_single_node_azure".
The format of the test case is (steps, checks, driver, jobs, cloud).
For the first step, instead of set_squeue you'll need a new function like set_queue_unsatisfiable. This should do something like echo '99|100|100|%s|%s' (this would be a job that requests 99 cores).
This function should use update_script to create a stub for scancel. The stub script should do something to record that it was called, like writing a file.
The next line should have a regex to match the error message that node manager puts out when the job can't be satisfied.
This step should call a function that checks the API server logs table to confirm that the right log message was added.
It should also check for the presence of the file that indicates scancel was called. The function is supposed to return 0 for success and 1 for failure.
That's it. You don't need any other steps. For checks (if they match, that is a failure), you might want to have "Cloud node is now paired ..." as a negative check.
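The scancel stub and the file check described above could look something like this. It is a standalone sketch: write_scancel_stub and scancel_was_called are hypothetical helpers, and it does not use the test suite's own update_script, whose signature is not shown in this thread.

```python
import os
import shlex


def write_scancel_stub(bin_dir, marker_path):
    """Create a fake 'scancel' in bin_dir that records its arguments.

    The stub writes whatever arguments it receives to marker_path,
    so the presence of that file shows that scancel was invoked.
    """
    path = os.path.join(bin_dir, 'scancel')
    with open(path, 'w') as f:
        f.write('#!/bin/sh\necho "$@" > %s\n' % shlex.quote(marker_path))
    os.chmod(path, 0o755)
    return path


def scancel_was_called(marker_path):
    """Return 0 if the stub ran (success), 1 otherwise (failure)."""
    return 0 if os.path.exists(marker_path) else 1
```

For the stub to be picked up, bin_dir has to come before the real scancel on the PATH of the node manager process under test.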
Updated by Lucas Di Pentima over 7 years ago
Updates at 7d4a10bcc
Added integration test following the above instructions.
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
Updated by Peter Amstutz over 7 years ago
Start test_hit_quota
test_hit_quota passed
Start test_multiple_nodes
Traceback (most recent call last):
  File "tests/integration_test.py", line 441, in <module>
    main()
  File "tests/integration_test.py", line 431, in main
    code += run_test(t, *tests[t])
  File "tests/integration_test.py", line 244, in run_test
    shutil.rmtree(os.path.dirname(unsatisfiable_job_scancelled))
  File "/usr/lib/python2.7/shutil.py", line 239, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 237, in rmtree
    names = os.listdir(path)
OSError: [Errno 2] No such file or directory: '/tmp/tmp59u2RS'
I think you want global unsatisfiable_job_scancelled and then create the tempdir in run_test().
Updated by Lucas Di Pentima over 7 years ago
Sorry, I thought I tested it before pushing.
Updated at 3e46aaf64
Test run: https://ci.curoverse.com/job/developer-run-tests/406/
Updated by Lucas Di Pentima over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:c0e203e7f3e9e40736eac63cbe440d5e46e379c0.