Feature #10218

closed

[Crunch2] Gather and record cloud/physical node information for each container

Added by Ward Vandewege over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 1.0
Release:
Release relationship: Auto

Description

Crunch2 already computes the runtime_constraints for a container based on defaults and what is in the corresponding container_request object.

It would be useful to have more detail about the physical or cloud node that executes the container. Specifically, in the cloud it would be helpful to know whether this was, for example, a Dv2 node or an A node, because those cores are not equal.

In an on-prem environment (as well as in the cloud, actually), it would be useful to know the CPU model.

Maybe crunch-run can gather some basic facts about the hardware it's running on and record them in the container object for future inspection.

Things to log as a starting point:

crunch-run version

copy of the container record as "container.json"

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo
$ df # for available scratch space

contents of /var/tmp/arv/node/data

Log these using the standard logging structure, à la crunchstat, etc.; a rough sketch of gathering them follows below.
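
For illustration, a minimal sketch (in Go) of gathering the items above and writing them to a single log stream; the logNodeInfo name and the use of os.Stdout are placeholders, not the actual crunch-run code:

package main

import (
    "fmt"
    "io"
    "os"
    "os/exec"
)

// logNodeInfo runs each command and copies its combined stdout/stderr
// into w, prefixed with the command line for readability.
func logNodeInfo(w io.Writer, commands [][]string) {
    for _, args := range commands {
        fmt.Fprintf(w, "$ %v\n", args)
        out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
        w.Write(out)
        if err != nil {
            fmt.Fprintf(w, "(error: %v)\n", err)
        }
    }
}

func main() {
    // In crunch-run the writer would be the node-info log stream;
    // os.Stdout is a stand-in here.
    logNodeInfo(os.Stdout, [][]string{
        {"uname", "-a"},
        {"cat", "/proc/cpuinfo"},
        {"cat", "/proc/meminfo"},
        {"df", "-m"}, // available scratch space
    })
}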


Subtasks 1 (0 open, 1 closed)

Task #11199: Review 10218-record-node-info (Resolved, Tom Clegg, 03/20/2017)

Related issues

Related to Arvados - Idea #7711: [Node Manager] Record a node's cloud cost information in the Arvados node record's properties (Resolved, Tom Clegg, 11/03/2015)
Related to Arvados - Feature #10217: [Crunch1] Log the node properties at the start of a job (Closed)
Is duplicate of Arvados - Feature #8196: [Crunch] Jobs should log the hardware of their compute node(s) (Duplicate, 01/12/2016)
#1

Updated by Ward Vandewege over 7 years ago

  • Tracker changed from Bug to Feature
  • Description updated (diff)
#2

Updated by Tom Clegg over 7 years ago

In crunch-run, if $SLURM_STEP_NODELIST is set, log the output of

sinfo --long --Node --exact --nodes $SLURM_STEP_NODELIST

Example:

Wed Oct 12 16:47:26 2016
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
compute0 1 compute* allocated 20 1:20:1 32177 0 32177 (null) none
compute0 1 crypto allocated 20 1:20:1 32177 0 32177 (null) none
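
For illustration, a minimal sketch of that check-and-log step; os.Stdout stands in for the real crunch-run log stream, and the error handling is only indicative:

package main

import (
    "log"
    "os"
    "os/exec"
)

func main() {
    nodelist := os.Getenv("SLURM_STEP_NODELIST")
    if nodelist == "" {
        return // not running under slurm; nothing to log
    }
    out, err := exec.Command("sinfo", "--long", "--Node", "--exact",
        "--nodes", nodelist).CombinedOutput()
    // Log the output even on error, so partial sinfo output is preserved.
    os.Stdout.Write(out)
    if err != nil {
        log.Printf("error running sinfo: %v", err)
    }
}
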
#3

Updated by Peter Amstutz about 7 years ago

Things to consider logging:

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo

#4

Updated by Tom Morris about 7 years ago

  • Target version set to 2017-03-29 sprint
#5

Updated by Tom Morris about 7 years ago

  • Target version changed from 2017-03-29 sprint to 2017-03-15 sprint
#6

Updated by Tom Morris about 7 years ago

  • Description updated (diff)
  • Story points set to 1.0
#7

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
#8

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
#9

Updated by Peter Amstutz about 7 years ago

  • Assigned To set to Peter Amstutz
#10

Updated by Lucas Di Pentima about 7 years ago

  • Assigned To changed from Peter Amstutz to Lucas Di Pentima
#11

Updated by Lucas Di Pentima about 7 years ago

  • Status changed from New to In Progress
#12

Updated by Lucas Di Pentima about 7 years ago

  • Target version changed from 2017-03-15 sprint to 2017-03-29 sprint
#13

Updated by Lucas Di Pentima about 7 years ago

Updates in branch 10218-record-node-info at b30e81e
Test run: https://ci.curoverse.com/job/developer-run-tests/194/

Logged output of the following commands:

  • cat /proc/cpuinfo
  • cat /proc/meminfo
  • df -m
  • uname -a

...written to the node-info.log file inside the log collection.

Also, the container record is being written as an indented JSON file at container.json inside the log collection.

#14

Updated by Tom Clegg about 7 years ago

Is it necessary to split JSON into multiple lines rather than just writing it as one chunk? If we write it as one chunk, the test could verify that it's valid JSON. Could use something like

// Encode the container record as a single indented JSON document.
enc := json.NewEncoder(logger)
enc.SetIndent("", "    ")
if err := enc.Encode(runner.Container); err != nil {
    return err
}

I think it would be better to do a separate API call to fetch the container record into a map[string]interface{}, rather than using the runner.Container object. As a debugging device, seeing the whole record as delivered by the API server might be more useful than seeing only the parts that the runner.Container object knows how to load.
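
For illustration, a rough sketch of that approach; the rawFetcher interface and FetchRaw method below are placeholders for the client's raw-call capability, and the real method name and signature may differ:

package sketch

import (
    "encoding/json"
    "io"
)

// rawFetcher is a placeholder for whatever client method exposes the raw
// API response (e.g. a CallRaw-style call).
type rawFetcher interface {
    FetchRaw(resource, uuid string) (io.ReadCloser, error)
}

// logContainerRecord fetches the container record as raw JSON, decodes it
// into a generic map, and re-encodes it indented into the log stream.
func logContainerRecord(client rawFetcher, uuid string, logger io.Writer) error {
    reader, err := client.FetchRaw("containers", uuid)
    if err != nil {
        return err
    }
    defer reader.Close() // close only once we know the call succeeded

    // A generic map shows the record exactly as delivered by the API
    // server, not just the fields the runner.Container struct knows about.
    var record map[string]interface{}
    if err := json.NewDecoder(reader).Decode(&record); err != nil {
        return err
    }
    enc := json.NewEncoder(logger)
    enc.SetIndent("", "    ")
    return enc.Encode(record)
}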

I seem to have left a debug printf in my commit 386faadf691e444b71d6c96e7c00792d9a0ba2c7 (oops)

#15

Updated by Lucas Di Pentima about 7 years ago

Updates at 5976c75
Test run: https://ci.curoverse.com/job/developer-run-tests/200/

Now the whole container record is fetched from the API server and saved to the log collection as indented JSON.
Updated tests to reflect the addition of the CallRaw() func to the IArvadosClient interface.
Also removed the debug printf.

#16

Updated by Lucas Di Pentima about 7 years ago

More updates: a54e888
Test run: https://ci.curoverse.com/job/developer-run-tests/202/
Re-ran failed test: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/203/

Added df command executions to record free space and free inodes for / and /tmp.
Combined stdout and stderr when running the commands, for easier debugging.
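
For illustration, a rough sketch of those df invocations with combined stdout/stderr; the exact flags and grouping of paths are illustrative rather than the committed code:

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    for _, args := range [][]string{
        {"df", "-m", "/", os.TempDir()}, // free space in MB
        {"df", "-i", "/", os.TempDir()}, // free inodes
    } {
        fmt.Printf("$ %v\n", args)
        // CombinedOutput interleaves stdout and stderr so errors land in
        // the same log as the normal output.
        out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
        os.Stdout.Write(out)
        if err != nil {
            fmt.Printf("(error: %v)\n", err)
        }
    }
}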

#17

Updated by Tom Clegg about 7 years ago

Any reason not to combine the paths like {"df", "-m", "/", os.TempDir()}?

LogContainerRecord() should call defer reader.Close() after checking errors from CallRaw.

the rest lgtm

#18

Updated by Lucas Di Pentima about 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:dc6c3fccb583ae98eee808addb526c45ebdbf2c6.
