Project

General

Profile

Actions

Feature #19744

closed

Run crunchstat automatically from workflow runner

Added by Peter Amstutz almost 2 years ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
CWL
Story points:
-
Release:
Release relationship:
Auto

Description

When a workflow step completes, we want to have access to the performance report and graphs from crunchstat.

I think a relatively quick version of this would be:

  1. Run crunchstat-summary on workflow steps after they finish, render the HTML and add it to the logs collection with a well known name (e.g. crunchstat-results.html)

Files

test-provision-debian11-622.log (1.61 MB) test-provision-debian11-622.log Brett Smith, 03/11/2024 09:12 PM

Subtasks 1 (0 open1 closed)

Task #21500: Review 19744-acr-crunchstatResolvedPeter Amstutz03/19/2024Actions

Related issues

Related to Arvados - Feature #21442: Collection metadata can specify a "primary document" which is displayed by workbench in an iframeNewActions
Related to Arvados - Bug #10531: [crunchstat-summary] HTML plots should be a superset of text format outputResolvedActions
Related to Arvados - Bug #20804: crunchstat-summary should use container logs API, not CollectionReader and logs tableNewActions
Related to Arvados Epics - Idea #16517: Workflow resource usage, runtime, and cost visibility and forecastingIn ProgressActions
Related to Arvados - Bug #21601: fpm virtualenv packages not using branch versions for dependenciesResolvedBrett SmithActions
Actions #1

Updated by Peter Amstutz almost 2 years ago

  • Category set to Workbench2
  • Description updated (diff)
  • Subject changed from Run crunchstat-summary at the end of a workflow run to Display crunchstat metrics in a graph
  • Tracker changed from Bug to Feature
Actions #2

Updated by Peter Amstutz over 1 year ago

  • Release set to 60
Actions #3

Updated by Peter Amstutz 9 months ago

  • Related to Feature #21442: Collection metadata can specify a "primary document" which is displayed by workbench in an iframe added
Actions #4

Updated by Peter Amstutz 9 months ago

  • Release deleted (60)
  • Target version set to Future
  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 9 months ago

  • Related to Bug #10531: [crunchstat-summary] HTML plots should be a superset of text format output added
Actions #6

Updated by Peter Amstutz 9 months ago

  • Related to Bug #20804: crunchstat-summary should use container logs API, not CollectionReader and logs table added
Actions #7

Updated by Peter Amstutz 9 months ago

  • Subject changed from Display crunchstat metrics in a graph to Display crunchstat graph in Workbench
Actions #8

Updated by Peter Amstutz 9 months ago

  • Related to Idea #16517: Workflow resource usage, runtime, and cost visibility and forecasting added
Actions #9

Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
Actions #10

Updated by Peter Amstutz 9 months ago

  • Subject changed from Display crunchstat graph in Workbench to Run crunchstat automatically from workflow runner
Actions #11

Updated by Peter Amstutz 9 months ago

  • Target version changed from Future to Development 2024-02-14 sprint
Actions #12

Updated by Peter Amstutz 9 months ago

  • Assigned To set to Peter Amstutz
  • Category changed from Workbench2 to CWL
Actions #13

Updated by Peter Amstutz 9 months ago

  • Status changed from New to In Progress
Actions #14

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #15

Updated by Sarah Zaranek 9 months ago

To fix the legend issue:

You can add this to options:
labelsSeparateLines: true,

And probably want to make it in its own div -- something like that
labelsDiv: document.getElementById('chart_title_legend')

https://dygraphs.com/options.html

Actions #16

Updated by Sarah Zaranek 9 months ago

Also a range selector -- https://rstudio.github.io/dygraphs/gallery-range-selector.html
might be nice

showRangeSelector: tr

Actions #17

Updated by Peter Amstutz 9 months ago

pirca-xvhdp-td87hdk7h5uk5yi got stuck at the end, not sure why, so there's an issue here.

Actions #18

Updated by Peter Amstutz 9 months ago

19744-acr-crunchstat @ 616d135e77a3e81220c6194494efa718888a9c9c

developer-run-tests: #4056

  • All agreed upon points are implemented / addressed.
    • now generates crunchstat_report.html and adds it to the log collection. see below for details.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • no
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • unit tests passing locally. have done a bunch of manual inspection of the reports to make sure they are styled properly and reporting what I want them to report
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • adds a little bit of overhead to arvados-cwl-runner each time a task completes. However, since arvados-cwl-runner spends most of its time sitting idle, I think this is fine.
  • Considered backwards and forwards compatibility issues between client and server.
    • this branch removes long obsolete "jobs" and "pipeline template" support from crunchstat-summary, since these are not used and are pending final deletion from the API, this will not affect any users
  • Follows our coding standards and GUI style guidelines.
    • report styling is designed to match workbench styling as much as possible.

The initial implementation of this branch went pretty quickly, importing crunchstat from arvados-cwl-runner and running it when a container finished turned out to be fairly straightforward. Having the link show up on workbench depending on whether the report file was present or not was a little bit more work simply because it had to be pulled through a number of layers.

What ended up taking significantly more time was performing a lot of deferred maintenance on crunchstat itself. Because it will be much more accessible to users, I didn't feel like the old minimal HTML report was going to cut it. I ended up making the following changes:

  • Incorporating the text report into the HTML report. This involved refactoring the report generation so that it could incorporated into the HTML template
  • Adding styling so that the report resembles the workbench material UI "paper and card" style
  • Adjusted the layout of the charts so that the legend does not overlap the content
  • Incorporated reporting of the actual, requested, and maximum available CPU and RAM
  • Adjusted the CPU, RAM and keep cache recommendations so they only appear when usage is below 50% (because instances sizes are typically powers of 2, lowering a resource request when utilization is already over 50% is less likely to lead to savings)
  • Fixed a few other false positives, such as warning about missing data when runtime is less than 20 seconds (since it polls every 10s, you may have 1 or 0 data points, so not having data is expected.)
  • When summarizing an entire workflow at once (all steps) I fixed the threading strategy to use the python standard library ThreadPoolExecutor; the way it worked before (creating a new thread each time) was a resource leak
Actions #19

Updated by Lucas Di Pentima 9 months ago

Some comments:

  • Do you think an unforeseen issue with crunchstat-summary could render workflow running inoperable? Maybe adding a --disable-crunchstat option to a-c-r would turn useful for future proofing.
  • arvcontainer.py: Not sure if this naming was on purpose, but I think 'summerizer' is a typo.
  • executor.py: something "value1" or something "value2" could be rewritten to: something in ["value1", "value2"] -- for better readability
  • dygraphs.js: There's commented out code (lines 43-45) that could be removed.
  • summarizer.py: I think the commented out code from lines 533-543 could also be removed.
  • I think that some documentation pages might need updates:
Actions #20

Updated by Peter Amstutz 8 months ago

19744-acr-crunchstat @ 2512a633dac10249c351b474b80807725246144a

developer-run-tests: #4061

Still need to update documentation.

Still want to do a couple of things:

  • I want to surface feedback about under-utilized nodes, but I don't think posting them as warnings (that show up at the top of the process panel) is the best way to do it.
  • I also want to tweak the RAM recommender to identify situations where the RAM request is less than 60% of the assigned instance type and the RAM usage is less than 80% (because lowing the RAM request is likely to make it eligible to run on a smaller node).
Actions #21

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #23

Updated by Peter Amstutz 8 months ago

Lucas Di Pentima wrote in #note-19:

Some comments:

  • Do you think an unforeseen issue with crunchstat-summary could render workflow running inoperable? Maybe adding a --disable-crunchstat option to a-c-r would turn useful for future proofing.

Added explicit enable/disable options.

  • arvcontainer.py: Not sure if this naming was on purpose, but I think 'summerizer' is a typo.

Fixed.

  • executor.py: something "value1" or something "value2" could be rewritten to: something in ["value1", "value2"] -- for better readability

Fixed.

  • dygraphs.js: There's commented out code (lines 43-45) that could be removed.

Fixed.

  • summarizer.py: I think the commented out code from lines 533-543 could also be removed.

Fixed.

Updated.

Also changed: the "recommendations" produced by crunchstat-summary no longer recommend specific values, they just notify that your nodes are under-utilized you when CPU or RAM usage is < 50%

The resource under-utilized notifications for each step are collected and printed out at workflow completion.

Actions #24

Updated by Peter Amstutz 8 months ago

  • Release set to 69
Actions #25

Updated by Lucas Di Pentima 8 months ago

This LGTM, thanks!

Actions #26

Updated by Brett Smith 8 months ago

This merge has broken the build. For example, in arvbox, where we try to build and install Python packages from the sdk service:

ERROR: Cannot install arvados-cwl-runner, arvados-cwl-runner==2.8.0.dev20240307150318, arvados-fuse==2.8.0.dev20240305163122 and arvados-python-client 2.8.0.dev20240229182711 (from /tmp/py_sdist.SSxBkSoN/arvados-python-client-2.8.0.dev20240229182711.tar.gz) because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested arvados-python-client 2.8.0.dev20240229182711 (from /tmp/py_sdist.SSxBkSoN/arvados-python-client-2.8.0.dev20240229182711.tar.gz)
    arvados-cwl-runner 2.8.0.dev20240307150318 depends on arvados-python-client<=2.8.0.dev20240307150318
    arvados-fuse 2.8.0.dev20240305163122 depends on arvados-python-client<=2.8.0.dev20240305163122
    crunchstat-summary 2.7.1 depends on arvados-python-client==2.7.1

In short, anywhere in our stack where we build and install a development version of a-c-r also needs to be taught to build and install a development version of crunchstat-summary along with it.

Actions #27

Updated by Brett Smith 8 months ago

Another instance: when the test-provision jobs build a compute image, that image ends up with the release version of the SDK, rather than the current development one:

Getting the user API TOKEN
No existing token found for user 'admin' (user_uuid: 'dbn11-tpzed-k3l5i4srw411nr9'). Creating token
API TOKEN FOR user 'admin': '5oq292rvp6dwzl9spdtdaktlqrdkuzrprngj0ge100rn8ko8go'.
Switching to user 'admin'
Running test CWL workflow
INFO /usr/bin/cwl-runner 2.8.0.dev20240307150318, arvados-python-client 2.7.1, cwltool 3.1.20230601100705

This causes the workflow to fail when it tries to build a Docker image but does not have the fix from #21417:

2024-03-11 20:52:36 cwltool[41629] ERROR: Workflow error:
"filename '11c94a930eeaab893ef215d4fa917532efe9aaffe4b4589b15e5d81daa82432d.json' not found" 
Traceback (most recent call last):
  File "/usr/lib/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/arvdocker.py", line 141, in arv_docker_get_image
    arvados.commands.keepdocker.main(args, stdout=sys.stderr, install_sig_handlers=False, api=api_client)
  File "/usr/lib/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados/commands/keepdocker.py", line 546, in main
    json_file = image_tar.extractfile(image_tar.getmember(json_filename))
  File "/usr/lib/python3.9/tarfile.py", line 1790, in getmember
    raise KeyError("filename %r not found" % name)
KeyError: "filename '11c94a930eeaab893ef215d4fa917532efe9aaffe4b4589b15e5d81daa82432d.json' not found" 

The difference between the previous error and this one is the previous error tried to install specific versions of sdk/cwl and sdk/python in one pip command, and that was irreconcilable. I'm guessing this stack only tries to install sdk/cwl, possibly after sdk/python, and pip resolves the situation by downgrading the Python SDK.

Actions #28

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #29

Updated by Peter Amstutz 8 months ago

Brett Smith wrote in #note-27:

Another instance: when the test-provision jobs build a compute image, that image ends up with the release version of the SDK, rather than the current development one:

[...]

This causes the workflow to fail when it tries to build a Docker image but does not have the fix from #21417:

[...]

The difference between the previous error and this one is the previous error tried to install specific versions of sdk/cwl and sdk/python in one pip command, and that was irreconcilable. I'm guessing this stack only tries to install sdk/cwl, possibly after sdk/python, and pip resolves the situation by downgrading the Python SDK.

test-provision, to the best of my knowledge, is building from the OS dev packages. So this might be a packaging problem. Still trying to chase this one down.

Actions #30

Updated by Peter Amstutz 8 months ago

So I looked at the dev debian package, and it seems that although it installs the dev version of arvados-cwl-runner, it has 2.7.1 version of both arvados-python-client and crunchstat-summary. Looking fpm_build_virtualenv_worker I don't see anything that ensures that the dev package build pulls in dev versions of the other python modules besides the one being built.

I don't understand why it doesn't raise a version conflict error, though.

Actions #31

Updated by Peter Amstutz 8 months ago

  • Related to Bug #21601: fpm virtualenv packages not using branch versions for dependencies added
Actions #33

Updated by Peter Amstutz 8 months ago

  • Release deleted (69)
Actions #34

Updated by Peter Amstutz 8 months ago

  • Release set to 70
Actions #35

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #36

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Actions #37

Updated by Peter Amstutz 7 months ago

  • Status changed from In Progress to Resolved
Actions #38

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-04-10 sprint
Actions

Also available in: Atom PDF