Feature #19744
closedRun crunchstat automatically from workflow runner
Description
When a workflow step completes, we want to have access to the performance report and graphs from crunchstat.
I think a relatively quick version of this would be:
- Run crunchstat-summary on workflow steps after they finish, render the HTML and add it to the logs collection with a well known name (e.g. crunchstat-results.html)
Files
Related issues
Updated by Peter Amstutz almost 2 years ago
- Category set to Workbench2
- Description updated (diff)
- Subject changed from Run crunchstat-summary at the end of a workflow run to Display crunchstat metrics in a graph
- Tracker changed from Bug to Feature
Updated by Peter Amstutz 9 months ago
- Related to Feature #21442: Collection metadata can specify a "primary document" which is displayed by workbench in an iframe added
Updated by Peter Amstutz 9 months ago
- Release deleted (
60) - Target version set to Future
- Description updated (diff)
Updated by Peter Amstutz 9 months ago
- Related to Bug #10531: [crunchstat-summary] HTML plots should be a superset of text format output added
Updated by Peter Amstutz 9 months ago
- Related to Bug #20804: crunchstat-summary should use container logs API, not CollectionReader and logs table added
Updated by Peter Amstutz 9 months ago
- Subject changed from Display crunchstat metrics in a graph to Display crunchstat graph in Workbench
Updated by Peter Amstutz 9 months ago
- Related to Idea #16517: Workflow resource usage, runtime, and cost visibility and forecasting added
Updated by Peter Amstutz 9 months ago
- Subject changed from Display crunchstat graph in Workbench to Run crunchstat automatically from workflow runner
Updated by Peter Amstutz 9 months ago
- Target version changed from Future to Development 2024-02-14 sprint
Updated by Peter Amstutz 9 months ago
- Assigned To set to Peter Amstutz
- Category changed from Workbench2 to CWL
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Sarah Zaranek 9 months ago
To fix the legend issue:
You can add this to options:
labelsSeparateLines: true,
And probably want to make it in its own div -- something like that
labelsDiv: document.getElementById('chart_title_legend')
Updated by Sarah Zaranek 9 months ago
Also a range selector -- https://rstudio.github.io/dygraphs/gallery-range-selector.html
might be nice
showRangeSelector: tr
Updated by Peter Amstutz 9 months ago
pirca-xvhdp-td87hdk7h5uk5yi got stuck at the end, not sure why, so there's an issue here.
Updated by Peter Amstutz 9 months ago
19744-acr-crunchstat @ 616d135e77a3e81220c6194494efa718888a9c9c
- All agreed upon points are implemented / addressed.
- now generates crunchstat_report.html and adds it to the log collection. see below for details.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- no
- Code is tested and passing, both automated and manual, what manual testing was done is described
- unit tests passing locally. have done a bunch of manual inspection of the reports to make sure they are styled properly and reporting what I want them to report
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- adds a little bit of overhead to arvados-cwl-runner each time a task completes. However, since arvados-cwl-runner spends most of its time sitting idle, I think this is fine.
- Considered backwards and forwards compatibility issues between client and server.
- this branch removes long obsolete "jobs" and "pipeline template" support from crunchstat-summary, since these are not used and are pending final deletion from the API, this will not affect any users
- Follows our coding standards and GUI style guidelines.
- report styling is designed to match workbench styling as much as possible.
The initial implementation of this branch went pretty quickly, importing crunchstat from arvados-cwl-runner and running it when a container finished turned out to be fairly straightforward. Having the link show up on workbench depending on whether the report file was present or not was a little bit more work simply because it had to be pulled through a number of layers.
What ended up taking significantly more time was performing a lot of deferred maintenance on crunchstat itself. Because it will be much more accessible to users, I didn't feel like the old minimal HTML report was going to cut it. I ended up making the following changes:
- Incorporating the text report into the HTML report. This involved refactoring the report generation so that it could incorporated into the HTML template
- Adding styling so that the report resembles the workbench material UI "paper and card" style
- Adjusted the layout of the charts so that the legend does not overlap the content
- Incorporated reporting of the actual, requested, and maximum available CPU and RAM
- Adjusted the CPU, RAM and keep cache recommendations so they only appear when usage is below 50% (because instances sizes are typically powers of 2, lowering a resource request when utilization is already over 50% is less likely to lead to savings)
- Fixed a few other false positives, such as warning about missing data when runtime is less than 20 seconds (since it polls every 10s, you may have 1 or 0 data points, so not having data is expected.)
- When summarizing an entire workflow at once (all steps) I fixed the threading strategy to use the python standard library ThreadPoolExecutor; the way it worked before (creating a new thread each time) was a resource leak
Updated by Lucas Di Pentima 9 months ago
Some comments:
- Do you think an unforeseen issue with
crunchstat-summary
could render workflow running inoperable? Maybe adding a--disable-crunchstat
option to a-c-r would turn useful for future proofing. arvcontainer.py
: Not sure if this naming was on purpose, but I think 'summerizer' is a typo.executor.py
:something "value1" or something "value2"
could be rewritten to:something in ["value1", "value2"]
-- for better readabilitydygraphs.js
: There's commented out code (lines 43-45) that could be removed.summarizer.py
: I think the commented out code from lines 533-543 could also be removed.- I think that some documentation pages might need updates:
- https://doc.arvados.org/v2.7/user/tutorials/wgs-tutorial.html - "Examining a Finished Workflow" could mention this new html report file.
- https://doc.arvados.org/v2.7/user/cwl/crunchstat-summary.html - Could also mention that this is run automatically - The report screenshots might also need an update because of the new styling.
Updated by Peter Amstutz 8 months ago
19744-acr-crunchstat @ 2512a633dac10249c351b474b80807725246144a
Still need to update documentation.
Still want to do a couple of things:
- I want to surface feedback about under-utilized nodes, but I don't think posting them as warnings (that show up at the top of the process panel) is the best way to do it.
- I also want to tweak the RAM recommender to identify situations where the RAM request is less than 60% of the assigned instance type and the RAM usage is less than 80% (because lowing the RAM request is likely to make it eligible to run on a smaller node).
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 8 months ago
19744-acr-crunchstat @ 58942469c2e85ddf9e1b369ea16d7c0c837eec63
Updated by Peter Amstutz 8 months ago
Lucas Di Pentima wrote in #note-19:
Some comments:
- Do you think an unforeseen issue with
crunchstat-summary
could render workflow running inoperable? Maybe adding a--disable-crunchstat
option to a-c-r would turn useful for future proofing.
Added explicit enable/disable options.
arvcontainer.py
: Not sure if this naming was on purpose, but I think 'summerizer' is a typo.
Fixed.
executor.py
:something "value1" or something "value2"
could be rewritten to:something in ["value1", "value2"]
-- for better readability
Fixed.
dygraphs.js
: There's commented out code (lines 43-45) that could be removed.
Fixed.
summarizer.py
: I think the commented out code from lines 533-543 could also be removed.
Fixed.
- I think that some documentation pages might need updates:
- https://doc.arvados.org/v2.7/user/tutorials/wgs-tutorial.html - "Examining a Finished Workflow" could mention this new html report file.
- https://doc.arvados.org/v2.7/user/cwl/crunchstat-summary.html - Could also mention that this is run automatically - The report screenshots might also need an update because of the new styling.
Updated.
Also changed: the "recommendations" produced by crunchstat-summary no longer recommend specific values, they just notify that your nodes are under-utilized you when CPU or RAM usage is < 50%
The resource under-utilized notifications for each step are collected and printed out at workflow completion.
Updated by Brett Smith 8 months ago
This merge has broken the build. For example, in arvbox, where we try to build and install Python packages from the sdk service:
ERROR: Cannot install arvados-cwl-runner, arvados-cwl-runner==2.8.0.dev20240307150318, arvados-fuse==2.8.0.dev20240305163122 and arvados-python-client 2.8.0.dev20240229182711 (from /tmp/py_sdist.SSxBkSoN/arvados-python-client-2.8.0.dev20240229182711.tar.gz) because these package versions have conflicting dependencies. The conflict is caused by: The user requested arvados-python-client 2.8.0.dev20240229182711 (from /tmp/py_sdist.SSxBkSoN/arvados-python-client-2.8.0.dev20240229182711.tar.gz) arvados-cwl-runner 2.8.0.dev20240307150318 depends on arvados-python-client<=2.8.0.dev20240307150318 arvados-fuse 2.8.0.dev20240305163122 depends on arvados-python-client<=2.8.0.dev20240305163122 crunchstat-summary 2.7.1 depends on arvados-python-client==2.7.1
In short, anywhere in our stack where we build and install a development version of a-c-r also needs to be taught to build and install a development version of crunchstat-summary along with it.
Updated by Brett Smith 8 months ago
Another instance: when the test-provision jobs build a compute image, that image ends up with the release version of the SDK, rather than the current development one:
Getting the user API TOKEN No existing token found for user 'admin' (user_uuid: 'dbn11-tpzed-k3l5i4srw411nr9'). Creating token API TOKEN FOR user 'admin': '5oq292rvp6dwzl9spdtdaktlqrdkuzrprngj0ge100rn8ko8go'. Switching to user 'admin' Running test CWL workflow INFO /usr/bin/cwl-runner 2.8.0.dev20240307150318, arvados-python-client 2.7.1, cwltool 3.1.20230601100705
This causes the workflow to fail when it tries to build a Docker image but does not have the fix from #21417:
2024-03-11 20:52:36 cwltool[41629] ERROR: Workflow error: "filename '11c94a930eeaab893ef215d4fa917532efe9aaffe4b4589b15e5d81daa82432d.json' not found" Traceback (most recent call last): File "/usr/lib/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/arvdocker.py", line 141, in arv_docker_get_image arvados.commands.keepdocker.main(args, stdout=sys.stderr, install_sig_handlers=False, api=api_client) File "/usr/lib/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados/commands/keepdocker.py", line 546, in main json_file = image_tar.extractfile(image_tar.getmember(json_filename)) File "/usr/lib/python3.9/tarfile.py", line 1790, in getmember raise KeyError("filename %r not found" % name) KeyError: "filename '11c94a930eeaab893ef215d4fa917532efe9aaffe4b4589b15e5d81daa82432d.json' not found"
The difference between the previous error and this one is the previous error tried to install specific versions of sdk/cwl
and sdk/python
in one pip command, and that was irreconcilable. I'm guessing this stack only tries to install sdk/cwl
, possibly after sdk/python
, and pip resolves the situation by downgrading the Python SDK.
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 8 months ago
Brett Smith wrote in #note-27:
Another instance: when the test-provision jobs build a compute image, that image ends up with the release version of the SDK, rather than the current development one:
[...]
This causes the workflow to fail when it tries to build a Docker image but does not have the fix from #21417:
[...]
The difference between the previous error and this one is the previous error tried to install specific versions of
sdk/cwl
andsdk/python
in one pip command, and that was irreconcilable. I'm guessing this stack only tries to installsdk/cwl
, possibly aftersdk/python
, and pip resolves the situation by downgrading the Python SDK.
test-provision, to the best of my knowledge, is building from the OS dev packages. So this might be a packaging problem. Still trying to chase this one down.
Updated by Peter Amstutz 8 months ago
So I looked at the dev debian package, and it seems that although it installs the dev version of arvados-cwl-runner, it has 2.7.1 version of both arvados-python-client and crunchstat-summary. Looking fpm_build_virtualenv_worker
I don't see anything that ensures that the dev package build pulls in dev versions of the other python modules besides the one being built.
I don't understand why it doesn't raise a version conflict error, though.
Updated by Peter Amstutz 8 months ago
- Related to Bug #21601: fpm virtualenv packages not using branch versions for dependencies added
Updated by Peter Amstutz 8 months ago
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Updated by Peter Amstutz 7 months ago
- Status changed from In Progress to Resolved
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-04-24 sprint to Development 2024-04-10 sprint