Feature #10218

[Crunch2] Gather and record cloud/physical node information for each container

Added by Ward Vandewege 5 months ago. Updated about 18 hours ago.

Status: Resolved
Start date: 03/20/2017
Priority: Normal
Due date:
Assignee: Lucas Di Pentima
% Done:
Target version: 2017-03-29 sprint
Story points: 1.0
Remaining (hours): 0.00 hour
Velocity based estimate: -
Release: Crunch v2


Crunch2 already computes the runtime_constraints for a container based on defaults and what is in the corresponding container_request object.

It would be useful to have more detail about the physical or cloud node that executes the container. Specifically, in the cloud, it would be helpful to know whether the container ran on a Dv2 node or an A node, for example, because the cores on those node types are not equivalent.

In an on-prem environment (as well as in the cloud, actually), it would be useful to know the cpu model.

Maybe crunch-run can record some basic facts about the hardware it's running on and store them in the container object for future inspection.

Things to log as a starting point:

crunch-run version

copy of the container record as "container.json"

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo
$ df # for available scratch space

contents of /var/tmp/arv/node/data

Logged using standard logging structure a la crunchstat, etc.


Task #11199: Review 10218-record-node-info (In Progress, Tom Clegg)

Related issues

Related to Arvados - Story #7711: [Node Manager] Record a node's cloud cost information in ... Resolved 11/03/2015
Related to Arvados - Feature #10217: [Crunch1] Log the node properties at the start of a job New
Duplicates Arvados - Feature #8196: [Crunch] Jobs should log the hardware of their compute no... Duplicate 01/12/2016

Associated revisions

Revision dc6c3fcc
Added by Lucas Di Pentima about 22 hours ago

Merge branch '10218-record-node-info'
Closes #10218


#1 Updated by Ward Vandewege 5 months ago

  • Tracker changed from Bug to Feature
  • Description updated (diff)

#2 Updated by Tom Clegg 5 months ago

In crunch-run, if $SLURM_STEP_NODELIST is set, log the output of

sinfo --long --Node --exact --nodes $SLURM_STEP_NODELIST


Wed Oct 12 16:47:26 2016
compute0 1 compute* allocated 20 1:20:1 32177 0 32177 (null) none
compute0 1 crypto allocated 20 1:20:1 32177 0 32177 (null) none
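
A minimal Go sketch of the check suggested above, assuming crunch-run would simply shell out to sinfo. The helper name sinfoArgs is hypothetical and is not taken from the crunch-run source; only the environment variable and the sinfo flags come from the comment.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// sinfoArgs (hypothetical helper) returns the sinfo command line for the
// given node list, or nil when the node list is empty, i.e. when the
// process is not running inside a SLURM step.
func sinfoArgs(nodelist string) []string {
	if nodelist == "" {
		return nil
	}
	return []string{"sinfo", "--long", "--Node", "--exact", "--nodes", nodelist}
}

func main() {
	args := sinfoArgs(os.Getenv("SLURM_STEP_NODELIST"))
	if args == nil {
		fmt.Println("SLURM_STEP_NODELIST not set; skipping sinfo")
		return
	}
	// CombinedOutput captures stdout and stderr together, so any sinfo
	// error message ends up in the log as well.
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		fmt.Printf("sinfo failed: %v\n", err)
	}
	fmt.Print(string(out))
}
```

Keeping the argument construction in its own function lets a test cover the "variable not set" case without needing SLURM installed.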

#3 Updated by Peter Amstutz about 1 month ago

Things to consider logging:

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo

#4 Updated by Tom Morris 30 days ago

  • Target version set to 2017-03-29 sprint

#5 Updated by Tom Morris 30 days ago

  • Target version changed from 2017-03-29 sprint to 2017-03-15 sprint

#6 Updated by Tom Morris 30 days ago

  • Description updated (diff)
  • Story points set to 1.0

#7 Updated by Peter Amstutz 29 days ago

  • Description updated (diff)

#8 Updated by Peter Amstutz 29 days ago

  • Description updated (diff)

#9 Updated by Peter Amstutz 22 days ago

  • Assignee set to Peter Amstutz

#10 Updated by Lucas Di Pentima 22 days ago

  • Assignee changed from Peter Amstutz to Lucas Di Pentima

#11 Updated by Lucas Di Pentima 15 days ago

  • Status changed from New to In Progress

#12 Updated by Lucas Di Pentima 8 days ago

  • Target version changed from 2017-03-15 sprint to 2017-03-29 sprint

#13 Updated by Lucas Di Pentima 6 days ago

Updates in branch 10218-record-node-info at b30e81e
Test run: https://ci.curoverse.com/job/developer-run-tests/194/

Logged output of the following commands:

  • cat /proc/cpuinfo
  • cat /proc/meminfo
  • df -m
  • uname -a

...to the node-info.log file inside the log collection.

Also, the container record is now written as an indented JSON file, container.json, inside the log collection.

#14 Updated by Tom Clegg 3 days ago

Is it necessary to split JSON into multiple lines rather than just writing it as one chunk? If we write it as one chunk, the test could verify that it's valid JSON. Could use something like

enc := json.NewEncoder(logger)
enc.SetIndent("", "    ")
err := enc.Encode(runner.Container)

I think it would be better to do a separate API call to fetch the container record into a map[string]interface{}, rather than using the runner.Container object. As a debugging device, seeing the whole record as delivered by the API server might be more useful than seeing only the parts that the runner.Container object knows how to load.

I seem to have left a debug printf in my commit 386faadf691e444b71d6c96e7c00792d9a0ba2c7 (oops)

#15 Updated by Lucas Di Pentima 1 day ago

Updates at 5976c75
Test run: https://ci.curoverse.com/job/developer-run-tests/200/

Now the whole container record is fetched from the API server and saved to the log collection as indented JSON.
Updated tests to reflect the addition of the CallRaw() func to the IArvadosClient interface.
Also, removed the debug printf.

#16 Updated by Lucas Di Pentima about 23 hours ago

More updates: a54e888
Test run: https://ci.curoverse.com/job/developer-run-tests/202/
Re-ran failed test: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/203/

Added df command executions to record free space and free inodes on / and /tmp.
Combined the stdout and stderr output of each command for easier debugging.

#17 Updated by Tom Clegg about 23 hours ago

Any reason not to combine the paths like {"df", "-m", "/", os.TempDir()}?

LogContainerRecord() should call defer reader.Close() after checking errors from CallRaw.

the rest lgtm
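
The ordering asked for in the reader.Close() note can be shown with a short sketch. fetchRaw here is a hypothetical stand-in that returns an (io.ReadCloser, error) pair; it is not the real CallRaw signature.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// fetchRaw is a hypothetical stand-in for an API call that returns a
// response body. On failure it returns a nil reader and an error.
func fetchRaw(fail bool) (io.ReadCloser, error) {
	if fail {
		return nil, errors.New("simulated API error")
	}
	return io.NopCloser(strings.NewReader(`{"uuid":"example-uuid"}`)), nil
}

// readRecord checks the error first and defers Close only once the
// reader is known to be usable.
func readRecord(fail bool) (string, error) {
	reader, err := fetchRaw(fail)
	if err != nil {
		// reader may be nil here, so it must not be closed.
		return "", err
	}
	defer reader.Close()
	body, err := io.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	if body, err := readRecord(false); err == nil {
		fmt.Println(body)
	}
	if _, err := readRecord(true); err != nil {
		fmt.Println("error:", err)
	}
}
```

Deferring Close before the error check would call a method on a nil interface value when the call fails, which panics at function exit.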

#18 Updated by Lucas Di Pentima about 22 hours ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:dc6c3fccb583ae98eee808addb526c45ebdbf2c6.
