Idea #11139

closed

[Node manager] Expected MemTotal for each cloud node size

Added by Peter Amstutz almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 03/07/2017
Due date:
Story points: 0.5

Description

There's a discrepancy between the RAM figure for a VM that node manager uses to choose what size node to boot for a job, and the amount of memory actually available to the job. If a job's request falls in the "donut hole" between those two values, the job will be unable to run because the request is larger than the memory the node really has, but node manager won't boot up a properly sized node because it believes the existing node size already satisfies the request.

tetron@compute3.c97qk:/usr/local/share/arvados-compute-ping-controller.d$ awk '($1 == "MemTotal:"){print ($2 / 1024)}' </proc/meminfo
3440.54
df -m /tmp | perl -e '
> my $index = index(<>, " 1M-blocks ");
> substr(<>, 0, $index + 10) =~ / (\d+)$/;
> print "$1\n";
> '
51170
tetron@compute3.c97qk:/usr/local/share/arvados-compute-ping-controller.d$ sinfo -n compute3 --format "%c %m %d" 
CPUS MEMORY TMP_DISK
1 3440 51169
>>> szd["Standard_D1_v2"]
<NodeSize: id=Standard_D1_v2, name=Standard_D1_v2, ram=3584 disk=50 bandwidth=0 price=0 driver=Azure Virtual machines ...>
>>> 

For Standard_D1_v2 there is a ~144 MiB discrepancy between the advertised RAM size and the amount of RAM considered available by Linux.

CPUS MEMORY TMP_DISK
2 6968 102344
<NodeSize: id=Standard_D2_v2, name=Standard_D2_v2, ram=7168 disk=100 bandwidth=0 price=0 driver=Azure Virtual machines ...>

For Standard_D2_v2 it is 200 MiB.
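
As a quick check of the arithmetic, the sketch below recomputes the shortfall for the two node sizes from the sinfo and NodeSize output above, and expresses it as a fraction of the advertised RAM. The function and variable names are made up for illustration and are not part of node manager.

```python
# Advertised ("sticker") RAM vs. memory reported by sinfo, in MiB,
# copied from the output quoted above.
sizes = {
    "Standard_D1_v2": (3584, 3440),
    "Standard_D2_v2": (7168, 6968),
}


def shortfall(advertised_mib, actual_mib):
    """Return (missing MiB, missing as a percentage of advertised RAM)."""
    missing = advertised_mib - actual_mib
    return missing, 100.0 * missing / advertised_mib


for name, (advertised, actual) in sizes.items():
    missing, pct = shortfall(advertised, actual)
    print(f"{name}: {missing} MiB short ({pct:.2f}% of advertised)")
```

Both shortfalls come out below 5% of the advertised size, so for these two sizes a 5% reduction would leave the expected RAM at or below what Linux reports.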

Based on discussion: node manager should reduce the RAM size for each node size by 5% from the "sticker value" in the ServerCalculator (jobqueue.py).

The scale factor should be settable in the configuration file.
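
A minimal sketch of how the proposal might look, assuming a scale factor read from an INI-style configuration file and applied to the advertised RAM before matching a job's request to a node size. The section name `Manage`, the option name `node_mem_scaling`, and the helper functions are assumptions for illustration; this is not the actual ServerCalculator code.

```python
import configparser

# Hypothetical configuration fragment; the section and option names
# are assumptions for this sketch.
SAMPLE_CONFIG = """
[Manage]
# Fraction of the advertised RAM expected to be usable by jobs.
node_mem_scaling = 0.95
"""


def load_mem_scaling(text, default=0.95):
    """Read the scale factor from config text, falling back to a default."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getfloat("Manage", "node_mem_scaling", fallback=default)


def scaled_ram(advertised_mib, scaling):
    """Expected usable RAM in MiB after applying the scale factor."""
    return int(advertised_mib * scaling)


def pick_size(sizes, requested_mib, scaling):
    """Return the smallest size whose scaled RAM covers the request, or None."""
    for name, ram in sorted(sizes.items(), key=lambda kv: kv[1]):
        if scaled_ram(ram, scaling) >= requested_mib:
            return name
    return None


# Advertised RAM in MiB, from the NodeSize output above.
SIZES = {"Standard_D1_v2": 3584, "Standard_D2_v2": 7168}

scaling = load_mem_scaling(SAMPLE_CONFIG)
# A 3500 MiB request sits in the "donut hole" for Standard_D1_v2
# (more than the 3440 MiB available, less than the 3584 MiB advertised).
print(pick_size(SIZES, 3500, 1.0))      # unscaled choice
print(pick_size(SIZES, 3500, scaling))  # choice with the scale factor applied
```

With the factor applied, the request that previously matched the too-small node is routed to the next size up.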


Subtasks: 1 (0 open, 1 closed)

Task #11200: Review 11139-nodemanager-mem-scale-factor (Resolved, Peter Amstutz, 03/07/2017)