Feature #19744
closedRun crunchstat automatically from workflow runner
Description
When a workflow step completes, we want to have access to the performance report and graphs from crunchstat.
I think a relatively quick version of this would be:
- Run crunchstat-summary on workflow steps after they finish, render the HTML and add it to the logs collection with a well known name (e.g. crunchstat-results.html)
Files
Updated by Peter Amstutz about 2 years ago
- Category set to Workbench2
- Description updated (diff)
- Subject changed from Run crunchstat-summary at the end of a workflow run to Display crunchstat metrics in a graph
- Tracker changed from Bug to Feature
Updated by Peter Amstutz 11 months ago
- Related to Feature #21442: Collection metadata can specify a "primary document" which is displayed by workbench in an iframe added
Updated by Peter Amstutz 11 months ago
- Release deleted (
60) - Target version set to Future
- Description updated (diff)
Updated by Peter Amstutz 11 months ago
- Related to Bug #10531: [crunchstat-summary] HTML plots should be a superset of text format output added
Updated by Peter Amstutz 11 months ago
- Related to Bug #20804: crunchstat-summary should use container logs API, not CollectionReader and logs table added
Updated by Peter Amstutz 11 months ago
- Subject changed from Display crunchstat metrics in a graph to Display crunchstat graph in Workbench
Updated by Peter Amstutz 11 months ago
- Related to Idea #16517: Workflow resource usage, runtime, and cost visibility and forecasting added
Updated by Peter Amstutz 11 months ago
- Subject changed from Display crunchstat graph in Workbench to Run crunchstat automatically from workflow runner
Updated by Peter Amstutz 11 months ago
- Target version changed from Future to Development 2024-02-14 sprint
Updated by Peter Amstutz 11 months ago
- Assigned To set to Peter Amstutz
- Category changed from Workbench2 to CWL
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Sarah Zaranek 11 months ago
To fix the legend issue:
You can add this to options:
labelsSeparateLines: true,
And probably want to make it in its own div -- something like that
labelsDiv: document.getElementById('chart_title_legend')
Updated by Sarah Zaranek 11 months ago
Also a range selector -- https://rstudio.github.io/dygraphs/gallery-range-selector.html
might be nice
showRangeSelector: tr
Updated by Peter Amstutz 11 months ago
pirca-xvhdp-td87hdk7h5uk5yi got stuck at the end, not sure why, so there's an issue here.
Updated by Peter Amstutz 11 months ago
19744-acr-crunchstat @ 616d135e77a3e81220c6194494efa718888a9c9c
- All agreed upon points are implemented / addressed.
- now generates crunchstat_report.html and adds it to the log collection. see below for details.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- no
- Code is tested and passing, both automated and manual, what manual testing was done is described
- unit tests passing locally. have done a bunch of manual inspection of the reports to make sure they are styled properly and reporting what I want them to report
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- adds a little bit of overhead to arvados-cwl-runner each time a task completes. However, since arvados-cwl-runner spends most of its time sitting idle, I think this is fine.
- Considered backwards and forwards compatibility issues between client and server.
- this branch removes long obsolete "jobs" and "pipeline template" support from crunchstat-summary, since these are not used and are pending final deletion from the API, this will not affect any users
- Follows our coding standards and GUI style guidelines.
- report styling is designed to match workbench styling as much as possible.
The initial implementation of this branch went pretty quickly, importing crunchstat from arvados-cwl-runner and running it when a container finished turned out to be fairly straightforward. Having the link show up on workbench depending on whether the report file was present or not was a little bit more work simply because it had to be pulled through a number of layers.
What ended up taking significantly more time was performing a lot of deferred maintenance on crunchstat itself. Because it will be much more accessible to users, I didn't feel like the old minimal HTML report was going to cut it. I ended up making the following changes:
- Incorporating the text report into the HTML report. This involved refactoring the report generation so that it could incorporated into the HTML template
- Adding styling so that the report resembles the workbench material UI "paper and card" style
- Adjusted the layout of the charts so that the legend does not overlap the content
- Incorporated reporting of the actual, requested, and maximum available CPU and RAM
- Adjusted the CPU, RAM and keep cache recommendations so they only appear when usage is below 50% (because instances sizes are typically powers of 2, lowering a resource request when utilization is already over 50% is less likely to lead to savings)
- Fixed a few other false positives, such as warning about missing data when runtime is less than 20 seconds (since it polls every 10s, you may have 1 or 0 data points, so not having data is expected.)
- When summarizing an entire workflow at once (all steps) I fixed the threading strategy to use the python standard library ThreadPoolExecutor; the way it worked before (creating a new thread each time) was a resource leak
Updated by Lucas Di Pentima 10 months ago
Some comments:
- Do you think an unforeseen issue with
crunchstat-summary
could render workflow running inoperable? Maybe adding a--disable-crunchstat
option to a-c-r would turn useful for future proofing. arvcontainer.py
: Not sure if this naming was on purpose, but I think 'summerizer' is a typo.executor.py
:something "value1" or something "value2"
could be rewritten to:something in ["value1", "value2"]
-- for better readabilitydygraphs.js
: There's commented out code (lines 43-45) that could be removed.summarizer.py
: I think the commented out code from lines 533-543 could also be removed.- I think that some documentation pages might need updates:
- https://doc.arvados.org/v2.7/user/tutorials/wgs-tutorial.html - "Examining a Finished Workflow" could mention this new html report file.
- https://doc.arvados.org/v2.7/user/cwl/crunchstat-summary.html - Could also mention that this is run automatically - The report screenshots might also need an update because of the new styling.
Updated by Peter Amstutz 10 months ago
19744-acr-crunchstat @ 2512a633dac10249c351b474b80807725246144a
Still need to update documentation.
Still want to do a couple of things:
- I want to surface feedback about under-utilized nodes, but I don't think posting them as warnings (that show up at the top of the process panel) is the best way to do it.
- I also want to tweak the RAM recommender to identify situations where the RAM request is less than 60% of the assigned instance type and the RAM usage is less than 80% (because lowing the RAM request is likely to make it eligible to run on a smaller node).
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 10 months ago
19744-acr-crunchstat @ 58942469c2e85ddf9e1b369ea16d7c0c837eec63
Updated by Peter Amstutz 10 months ago
Lucas Di Pentima wrote in #note-19:
Some comments:
- Do you think an unforeseen issue with
crunchstat-summary
could render workflow running inoperable? Maybe adding a--disable-crunchstat
option to a-c-r would turn useful for future proofing.
Added explicit enable/disable options.
arvcontainer.py
: Not sure if this naming was on purpose, but I think 'summerizer' is a typo.
Fixed.
executor.py
:something "value1" or something "value2"
could be rewritten to:something in ["value1", "value2"]
-- for better readability
Fixed.
dygraphs.js
: There's commented out code (lines 43-45) that could be removed.
Fixed.
summarizer.py
: I think the commented out code from lines 533-543 could also be removed.
Fixed.
- I think that some documentation pages might need updates:
- https://doc.arvados.org/v2.7/user/tutorials/wgs-tutorial.html - "Examining a Finished Workflow" could mention this new html report file.
- https://doc.arvados.org/v2.7/user/cwl/crunchstat-summary.html - Could also mention that this is run automatically - The report screenshots might also need an update because of the new styling.
Updated.
Also changed: the "recommendations" produced by crunchstat-summary no longer recommend specific values, they just notify that your nodes are under-utilized you when CPU or RAM usage is < 50%
The resource under-utilized notifications for each step are collected and printed out at workflow completion.
Updated by Brett Smith 10 months ago
This merge has broken the build. For example, in arvbox, where we try to build and install Python packages from the sdk service:
ERROR: Cannot install arvados-cwl-runner, arvados-cwl-runner==2.8.0.dev20240307150318, arvados-fuse==2.8.0.dev20240305163122 and arvados-python-client 2.8.0.dev20240229182711 (from /tmp/py_sdist.SSxBkSoN/arvados-python-client-2.8.0.dev20240229182711.tar.gz) because these package versions have conflicting dependencies. The conflict is caused by: The user requested arvados-python-client 2.8.0.dev20240229182711 (from /tmp/py_sdist.SSxBkSoN/arvados-python-client-2.8.0.dev20240229182711.tar.gz) arvados-cwl-runner 2.8.0.dev20240307150318 depends on arvados-python-client<=2.8.0.dev20240307150318 arvados-fuse 2.8.0.dev20240305163122 depends on arvados-python-client<=2.8.0.dev20240305163122 crunchstat-summary 2.7.1 depends on arvados-python-client==2.7.1
In short, anywhere in our stack where we build and install a development version of a-c-r also needs to be taught to build and install a development version of crunchstat-summary along with it.
Updated by Brett Smith 10 months ago
Another instance: when the test-provision jobs build a compute image, that image ends up with the release version of the SDK, rather than the current development one:
Getting the user API TOKEN No existing token found for user 'admin' (user_uuid: 'dbn11-tpzed-k3l5i4srw411nr9'). Creating token API TOKEN FOR user 'admin': '5oq292rvp6dwzl9spdtdaktlqrdkuzrprngj0ge100rn8ko8go'. Switching to user 'admin' Running test CWL workflow INFO /usr/bin/cwl-runner 2.8.0.dev20240307150318, arvados-python-client 2.7.1, cwltool 3.1.20230601100705
This causes the workflow to fail when it tries to build a Docker image but does not have the fix from #21417:
2024-03-11 20:52:36 cwltool[41629] ERROR: Workflow error: "filename '11c94a930eeaab893ef215d4fa917532efe9aaffe4b4589b15e5d81daa82432d.json' not found" Traceback (most recent call last): File "/usr/lib/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/arvdocker.py", line 141, in arv_docker_get_image arvados.commands.keepdocker.main(args, stdout=sys.stderr, install_sig_handlers=False, api=api_client) File "/usr/lib/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados/commands/keepdocker.py", line 546, in main json_file = image_tar.extractfile(image_tar.getmember(json_filename)) File "/usr/lib/python3.9/tarfile.py", line 1790, in getmember raise KeyError("filename %r not found" % name) KeyError: "filename '11c94a930eeaab893ef215d4fa917532efe9aaffe4b4589b15e5d81daa82432d.json' not found"
The difference between the previous error and this one is the previous error tried to install specific versions of sdk/cwl
and sdk/python
in one pip command, and that was irreconcilable. I'm guessing this stack only tries to install sdk/cwl
, possibly after sdk/python
, and pip resolves the situation by downgrading the Python SDK.
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 10 months ago
Brett Smith wrote in #note-27:
Another instance: when the test-provision jobs build a compute image, that image ends up with the release version of the SDK, rather than the current development one:
[...]
This causes the workflow to fail when it tries to build a Docker image but does not have the fix from #21417:
[...]
The difference between the previous error and this one is the previous error tried to install specific versions of
sdk/cwl
andsdk/python
in one pip command, and that was irreconcilable. I'm guessing this stack only tries to installsdk/cwl
, possibly aftersdk/python
, and pip resolves the situation by downgrading the Python SDK.
test-provision, to the best of my knowledge, is building from the OS dev packages. So this might be a packaging problem. Still trying to chase this one down.
Updated by Peter Amstutz 10 months ago
So I looked at the dev debian package, and it seems that although it installs the dev version of arvados-cwl-runner, it has 2.7.1 version of both arvados-python-client and crunchstat-summary. Looking fpm_build_virtualenv_worker
I don't see anything that ensures that the dev package build pulls in dev versions of the other python modules besides the one being built.
I don't understand why it doesn't raise a version conflict error, though.
Updated by Peter Amstutz 10 months ago
- Related to Bug #21601: fpm virtualenv packages not using branch versions for dependencies added
Updated by Peter Amstutz 10 months ago
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Updated by Peter Amstutz 9 months ago
- Status changed from In Progress to Resolved
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-04-24 sprint to Development 2024-04-10 sprint