Story #7475

[Node manager] Better communication when job is unsatisfiable

Added by Peter Amstutz almost 2 years ago. Updated about 1 month ago.

Status: Resolved
Start date: 07/05/2017
Priority: Normal
Due date: -
Assignee: Lucas Di Pentima
% Done: 100%
Category: -
Target version: 2017-08-02 sprint
Story points: 1.0
Remaining (hours): 0.00
Velocity based estimate: -

Description

When a job cannot be satisfied by node manager, it will be queued forever with no feedback to the user (and almost no feedback to the admin, either). There are two distinct cases:

1) A job's min_nodes request is greater than node manager's configured max_nodes. In this case, node manager silently skips over the job with no feedback as to why no nodes are being started.
2) A job's resource requirements for a single node exceed the available cloud node size. In this case, the only indication this is a problem is a message of "job XXX not satisfiable" in the node manager log (and even then only if debug logging is turned on).

If a job request cannot be satisfied under its current configuration, node manager should have some way of signaling this to the user.


Subtasks

Task #11759: Review 7475-nodemgr-unsatisfiable-job-comms (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Bug #9354: [workbench] make quota errors available to the user (New, 06/06/2016)
Duplicated by Arvados - Feature #9023: If a component is not satisfiable, report and fail the job (Duplicate, 04/20/2016)

Associated revisions

Revision c0e203e7
Added by Lucas Di Pentima about 1 month ago

Merge branch '7475-nodemgr-unsatisfiable-job-comms'
Closes #7475

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz almost 2 years ago

  • Tracker changed from Bug to Story
  • Description updated (diff)

#2 Updated by Brett Smith almost 2 years ago

This can't just be Node Manager's job though, right? The system needs to know what Node Manager is willing to do, but any of these problems can also arise on static clusters that aren't even running Node Manager.

#3 Updated by Peter Amstutz almost 2 years ago

Yes, that's true. I think the right long term solution is for crunch v2 to combine the jobs of crunch-dispatch and node manager into one process, because otherwise neither process has quite enough information to be able to tell the user what's actually going on.

In the short term, there's still benefit in making incremental improvements to node manager for cloud installs.

#4 Updated by Tom Clegg over 1 year ago

It seems like Nodemanager should emit a log (with object_uuid == job uuid) and cancel the job.

If we start telling crunch-dispatch whether nodemanager is running, in cases where nodemanager isn't running, crunch-dispatch could emit a log and cancel the job if it's unsatisfiable with the current set of (alive?) slurm nodes.

Short of running nodemanager on static clusters (add a slurm driver?) it seems like we need the logic in both places if we want to fix the bug in both types of install.

#5 Updated by Tom Morris 5 months ago

  • Target version set to 2017-05-24 sprint

#6 Updated by Tom Morris 5 months ago

  • Story points set to 1.0

#7 Updated by Tom Clegg 5 months ago

For crunch2, when node manager is not in use, sbatch rejects unsatisfiable jobs and the user gets an error -- however, crunch-dispatch-slurm will keep retrying forever. This infinite-retry problem will be mostly addressed by #9688, but ideally crunch-dispatch-slurm should also recognize the "unsatisfiable job" error as a non-retryable error, and tell the API server that it won't be re-attempted (if crunch-dispatch-slurm assumes/knows that it is the only dispatcher, it can indicate this by cancelling the container).

#8 Updated by Tom Morris 5 months ago

  • Target version changed from 2017-05-24 sprint to 2017-06-07 sprint

#9 Updated by Lucas Di Pentima 4 months ago

  • Assignee set to Lucas Di Pentima

#10 Updated by Lucas Di Pentima 4 months ago

  • Target version changed from 2017-06-07 sprint to 2017-06-21 sprint

#11 Updated by Lucas Di Pentima 3 months ago

  • Status changed from New to In Progress

#12 Updated by Lucas Di Pentima 3 months ago

  • Target version changed from 2017-06-21 sprint to 2017-07-05 sprint

#13 Updated by Lucas Di Pentima 3 months ago

Updates @ 3dad67f27
Test run: https://ci.curoverse.com/job/developer-run-tests/376/

Modified ServerCalculator.servers_for_queue() so that it also returns a dict with information about unsatisfiable jobs that should be cancelled by its caller.
Updated some tests that started failing because of this change.
New tests pending.
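A minimal sketch of the shape this change could take. The names and data layout here are simplified stand-ins (the real servers_for_queue works against cloud node size objects and the live job queue), but it shows the idea: return the unsatisfiable jobs alongside the wishlist so the caller can log and cancel them.

```python
# Illustrative sketch only -- MAX_CORES and the job dicts are simplified
# stand-ins for node manager's cloud size configuration and queue entries.
MAX_CORES = 8  # assumed largest available cloud node size

def servers_for_queue(queued_jobs):
    """Return (wishlist, unsatisfiable) where unsatisfiable maps
    job uuid -> reason string, for the caller to log and cancel."""
    wishlist = []
    unsatisfiable = {}
    for job in queued_jobs:
        if job['min_cores'] > MAX_CORES:
            unsatisfiable[job['uuid']] = (
                'requested %d cores but largest node size has %d'
                % (job['min_cores'], MAX_CORES))
        else:
            wishlist.append(job['uuid'])
    return wishlist, unsatisfiable
```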

#14 Updated by Lucas Di Pentima 3 months ago

New updates at f77d08dd5
Test run: https://ci.curoverse.com/job/developer-run-tests/377/

  • Enhanced error checking when trying to emit a log and cancel an unsatisfiable job.
  • Added test cases.

#15 Updated by Peter Amstutz 3 months ago

7475-nodemgr-unsatisfiable-job-comms @ f77d08dd57a1021525717c8669296eb3e463c5f7

  • In _got_response, the uuid can be either a job or a container. It needs to look at the type field of the uuid. This is only valid if the uuid is for a job:
    self._client.jobs().cancel(uuid=job_uuid).execute()

If the uuid is for a container and self.slurm_queue is true, it should do this:

     subprocess.check_call(['scancel', '--name='+uuid])

This may require a stub to ensure that tests don't try to call the real scancel.

I'd like to see an integration test, if it isn't too much work. Upon seeing the log message about an unsatisfiable job/container, it should check that (a) the expected log message was added and (b) the job was cancelled/scancel was called.
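A sketch of the dispatch described above. The type codes ('8i9sb' for jobs, 'dz642' for containers) follow Arvados UUID conventions; the helper names and the client object here are illustrative stubs, not the actual _got_response code.

```python
# Illustrative sketch -- cancel an unsatisfiable job via the API, or an
# unsatisfiable container via scancel, depending on the uuid's type code.
import subprocess

def uuid_type(uuid):
    """Return the 5-character object-type code from an Arvados UUID
    like 'zzzzz-8i9sb-012345678901234', or None if it doesn't parse."""
    parts = uuid.split('-')
    return parts[1] if len(parts) == 3 else None

def cancel_unsatisfiable(uuid, api_client, slurm_queue):
    if uuid_type(uuid) == '8i9sb':        # job: cancel through the API
        api_client.jobs().cancel(uuid=uuid).execute()
    elif slurm_queue:                     # container: cancel via slurm
        subprocess.check_call(['scancel', '--name=' + uuid])
```

In tests, scancel would need a stub so the real command is never invoked.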

#16 Updated by Lucas Di Pentima 3 months ago

  • Target version changed from 2017-07-05 sprint to 2017-07-19 sprint

#17 Updated by Lucas Di Pentima 3 months ago

Updates at f507162f3
Test run: https://ci.curoverse.com/job/developer-run-tests/378/

Added support for unsatisfiable containers. Updated unit test to cover both cases.
Pending: integration test.

#18 Updated by Peter Amstutz 3 months ago

Writing an integration test:

Start by copying "test_single_node_azure".

The format of the test case is (steps, checks, driver, jobs, cloud).

For the first step, instead of set_squeue you'll need a new function like set_queue_unsatisfiable. This should do something like echo '99|100|100|%s|%s' (this would be a job that requests 99 cores).

This function should use update_script to create a stub for scancel. The stub script should do something to record that it was called, like writing a file.

The next line should have a regex to match the error message that node manager puts out when the job can't be satisfied.
This should call a function that checks the API server logs table that the right log message was added.
It should also check for the presence of the file that indicates scancel was called. The function is supposed to return 0 for success and 1 for failure.

That's it; you don't need any other steps. Checks are patterns that count as failures if they match, so you might want to add "Cloud node is now paired ..." as a negative check.
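The stub scripts described above could look roughly like this. The function names (set_queue_unsatisfiable and friends) and file layout are illustrative, not the actual integration_test.py API:

```python
# Illustrative sketch of the integration-test stubs described above.
import os

def set_queue_unsatisfiable(squeue_script_path, job_uuid, nodename):
    # Stub squeue reports one queued job asking for 99 cores -- more
    # than any available node size, so it can never be satisfied.
    with open(squeue_script_path, 'w') as f:
        f.write("#!/bin/sh\necho '99|100|100|%s|%s'\n" % (job_uuid, nodename))
    os.chmod(squeue_script_path, 0o755)

def make_scancel_stub(scancel_script_path, marker_file):
    # The stub records that it was called by creating marker_file,
    # instead of touching the real slurm queue.
    with open(scancel_script_path, 'w') as f:
        f.write("#!/bin/sh\ntouch %s\n" % marker_file)
    os.chmod(scancel_script_path, 0o755)

def check_scancel_called(marker_file):
    # Integration-test check functions return 0 for success, 1 for failure.
    return 0 if os.path.exists(marker_file) else 1
```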

#19 Updated by Lucas Di Pentima 3 months ago

Updates at 7d4a10bcc

Added integration test following the above instructions.

#20 Updated by Lucas Di Pentima 2 months ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint

#21 Updated by Peter Amstutz about 1 month ago

Start test_hit_quota
test_hit_quota passed
Start test_multiple_nodes
Traceback (most recent call last):
  File "tests/integration_test.py", line 441, in <module>
    main()
  File "tests/integration_test.py", line 431, in main
    code += run_test(t, *tests[t])
  File "tests/integration_test.py", line 244, in run_test
    shutil.rmtree(os.path.dirname(unsatisfiable_job_scancelled))
  File "/usr/lib/python2.7/shutil.py", line 239, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 237, in rmtree
    names = os.listdir(path)
OSError: [Errno 2] No such file or directory: '/tmp/tmp59u2RS'

I think you want global unsatisfiable_job_scancelled and then create the tempdir in run_test()

#22 Updated by Lucas Di Pentima about 1 month ago

Sorry, I thought I tested it before pushing.

Updated at 3e46aaf64
Test run: https://ci.curoverse.com/job/developer-run-tests/406/

#23 Updated by Lucas Di Pentima about 1 month ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:c0e203e7f3e9e40736eac63cbe440d5e46e379c0.
