Idea #8000

closed

[Node Manager] Shut down nodes in SLURM 'down' state

Added by Peter Amstutz over 8 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Node Manager
Target version: -
Story points: -

Description

Apparently Node Manager only shuts down nodes that are "idle" in SLURM. If they are "down", they don't get shut down?

2015-12-11_20:41:05.08909 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: CloudNodeListMonitorActor (at 140548410010704) got response with 1 items
2015-12-11_20:41:05.09007 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Registering new cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk
2015-12-11_20:41:05.09273 2015-12-11 20:41:05 pykka[11545] DEBUG: Registered ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09280 2015-12-11 20:41:05 pykka[11545] DEBUG: Starting ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09391 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk suggesting shutdown.
2015-12-11_20:41:05.09584 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f81b0850> subscribed to events for '/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk'
2015-12-11_20:41:05.09804 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk has associated with Arvados node c97qk-7ekkf-tj4hwdsw3yjiyjt
2015-12-11_20:41:05.09921 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk shutdown window open but node busy.
2015-12-11_20:41:05.10064 2015-12-11 20:41:05 arvnodeman.arvados_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f8e11250> subscribed to events for 'c97qk-7ekkf-tj4hwdsw3yjiyjt'
$ arv node get -u c97qk-7ekkf-tj4hwdsw3yjiyjt
{
 "href":"/nodes/c97qk-7ekkf-tj4hwdsw3yjiyjt",
 "kind":"arvados#node",
 "etag":"984qlz3msed6utdnndclhuz0o",
 "uuid":"c97qk-7ekkf-tj4hwdsw3yjiyjt",
 "owner_uuid":"c97qk-tpzed-000000000000000",
 "created_at":"2015-09-09T14:26:19.832861000Z",
 "modified_by_client_uuid":null,
 "modified_by_user_uuid":"c97qk-tpzed-000000000000000",
 "modified_at":"2015-12-11T20:58:01.734010000Z",
 "hostname":"compute0",
 "domain":"c97qk.arvadosapi.com",
 "ip_address":"10.25.64.10",
 "last_ping_at":"2015-12-11T20:58:01.734010000Z",
 "slot_number":0,
 "status":"running",
 "job_uuid":null,
 "crunch_worker_state":"down",
 "properties":{
  "cloud_node":{
   "price":0,
   "size":"Standard_D1" 
  },
  "total_cpu_cores":1,
  "total_ram_mb":3442,
  "total_scratch_mb":51172
 },
 "first_ping_at":"2015-12-08T02:17:01.949316000Z",
 "info":{
  "ec2_instance_id":"/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk",
  "last_action":"Prepared by Node Manager",
  "ping_secret":"35vaizroj3kkoqzm2vad92t6fewg7hbdix8jgj0wpklh3rdo4v",
  "slurm_state":"down" 
 },
 "nameservers":[
  "10.25.0.6" 
 ]
}
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2 drain* compute[2-3]
compute*     up   infinite    252  down* compute[1,4-14,16-255]
compute*     up   infinite      1   idle compute15
compute*     up   infinite      1   down compute0
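
For illustration, here is a minimal sketch of the check being asked for. It is not the actual arvnodeman code: the function names and the two state sets are hypothetical, and the only assumption taken from the record above is that the SLURM state is available under `info.slurm_state`. With an idle-only policy, compute0 is never a shutdown candidate, which matches the "shutdown window open but node busy" line in the log. Adding "down" to the eligible states makes it one.

```python
# Hypothetical sketch, not the real Node Manager implementation.
# IDLE_ONLY models the behavior described in this ticket;
# IDLE_OR_DOWN models the requested behavior.
IDLE_ONLY = frozenset({"idle"})
IDLE_OR_DOWN = frozenset({"idle", "down"})


def slurm_state(node_record):
    """Return the SLURM state stored in an Arvados node record, or None."""
    return node_record.get("info", {}).get("slurm_state")


def shutdown_eligible(node_record, eligible_states=IDLE_ONLY):
    """True if the node's SLURM state is one we are willing to shut down."""
    return slurm_state(node_record) in eligible_states


# Trimmed copy of the `arv node get` record shown above.
compute0 = {
    "uuid": "c97qk-7ekkf-tj4hwdsw3yjiyjt",
    "hostname": "compute0",
    "crunch_worker_state": "down",
    "info": {"slurm_state": "down"},
}

print(shutdown_eligible(compute0))                # False
print(shutdown_eligible(compute0, IDLE_OR_DOWN))  # True
```

A record with no SLURM state at all is treated as not eligible under either policy, so a node that has not reported yet would not be shut down by this check alone.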

Related issues

Related to Arvados - Bug #8799: [Node manager] nodes in slurm drained state are counted as "up" but not candidates for shut down (Resolved, Peter Amstutz, 04/06/2016)
Related to Arvados - Bug #8953: [Node manager] can not shut down nodes anymore (Resolved, Brett Smith, 04/13/2016)