[Node manager] Better communication when job is unsatisfiable
When a job cannot be satisfied by node manager, it will be queued forever with no feedback to the user (and almost no feedback to the admin, either). There are two distinct cases:
1) A job's
min_nodes request is greater than node manager's configured
max_nodes. In this case, node manager silently skips over the job with no feedback as to why no nodes are being started.
2) A job's resource requirements for a single node exceed the available cloud node size. In this case, the only indication this is a problem is a message of "job XXX not satisfiable" in the node manager log (and even then only if debug logging is turned on).
If a job request cannot be satisfied under its current configuration, node manager should have some way of signaling this to the user.
#3 Updated by Peter Amstutz over 2 years ago
Yes, that's true. I think the right long term solution is for crunch v2 to combine the jobs of crunch-dispatch and node manager into one process, because otherwise neither process has quite enough information to be able to tell the user what's actually going on.
In the short term, there's still benefit in making incremental improvements to node manager for cloud installs.
#4 Updated by Tom Clegg almost 2 years ago
It seems like Nodemanager should emit a log (with object_uuid == job uuid) and cancel the job.
If we start telling crunch-dispatch whether nodemanager is running, in cases where nodemanager isn't running, crunch-dispatch could emit a log and cancel the job if it's unsatisfiable with the current set of (alive?) slurm nodes.
Short of running nodemanager on static clusters (add a slurm driver?) it seems like we need the logic in both places if we want to fix the bug in both types of install.
For crunch2, when node manager is not in use, sbatch rejects unsatisfiable jobs and the user gets an error -- however, crunch-dispatch-slurm will keep retrying forever. This infinite-retry problem will be mostly addressed by #9688, but ideally crunch-dispatch-slurm should also recognize the "unsatisfiable job" error as a non-retryable error, and tell the API server that it won't be re-attempted (if crunch-dispatch-slurm assumes/knows that it is the only dispatcher, it can indicate this by cancelling the container).
#13 Updated by Lucas Di Pentima 7 months ago
Updates @ 3dad67f27
Test run: https://ci.curoverse.com/job/developer-run-tests/376/
ServerCalculator.servers_for_queue() so that it also returns a
dict with information about unsatisfiable jobs that should be cancelled by its caller.
Updated some tests that started failing because of this change.
New tests pending.
#14 Updated by Lucas Di Pentima 7 months ago
New updates at f77d08dd5
Test run: https://ci.curoverse.com/job/developer-run-tests/377/
- Enhanced error checking when trying to emit a log and cancel an unsatisfiable job.
- Added test cases.
#15 Updated by Peter Amstutz 7 months ago
7475-nodemgr-unsatisfiable-job-comms @ f77d08dd57a1021525717c8669296eb3e463c5f7
- In _got_response, the uuid can be either a job or a container. It needs to look at the type field of the uuid. This is only valid if the uuid is for a job:
If the uuid is for a container and
self.slurm_queue is true, it should do this:
This may require a stub to ensure that tests don't try to call the real
I'd like to see an integration test, if it isn't too much work. Upon seeing the log message about an unsatisfiable job/container, it should check that (a) the expected log message was added and (b) the job was cancelled/scancel was called.
#17 Updated by Lucas Di Pentima 7 months ago
Updates at f507162f3
Test run: https://ci.curoverse.com/job/developer-run-tests/378/
Added support for unsatisfiable containers. Updated unit test to cover both cases.
Pending: integration test.
#18 Updated by Peter Amstutz 6 months ago
Writing an integration test:
Start by copying "test_single_node_azure".
The format of the test case is (steps, checks, driver, jobs, cloud).
For the first step, instead of
set_squeue you'll need a new function like
set_queue_unsatisfiable. This should do something like echo '99|100|100|%s|%s' (this would be a job that requests 99 cores).
This function should use
update_script to create a stub for
scancel. The stub script should do something to record that it was called, like writing a file.
The next line should have a regex to match the error message that node manager puts out when the job is can't be satisfied.
This should call a function that checks the API server logs table that the right log message was added.
It should also check for the presence of the file that indicates scancel was called. The function is supposed to return 0 for success and 1 for failure.
That's it. You don't need any other steps. For checks (if they match, that is a failure). You might want to have "Cloud node is now paired ..." as a negative check.
#21 Updated by Peter Amstutz 6 months ago
Start test_hit_quota test_hit_quota passed Start test_multiple_nodes Traceback (most recent call last): File "tests/integration_test.py", line 441, in <module> main() File "tests/integration_test.py", line 431, in main code += run_test(t, *tests[t]) File "tests/integration_test.py", line 244, in run_test shutil.rmtree(os.path.dirname(unsatisfiable_job_scancelled)) File "/usr/lib/python2.7/shutil.py", line 239, in rmtree onerror(os.listdir, path, sys.exc_info()) File "/usr/lib/python2.7/shutil.py", line 237, in rmtree names = os.listdir(path) OSError: [Errno 2] No such file or directory: '/tmp/tmp59u2RS'
I think you want
global unsatisfiable_job_scancelled and then create the tempdir in
#22 Updated by Lucas Di Pentima 6 months ago
Sorry, I thought I tested it before pushing.
Updated at 3e46aaf64
Test run: https://ci.curoverse.com/job/developer-run-tests/406/