Arvados: Issues
https://dev.arvados.org/ 2022-04-11T20:24:42Z
Arvados - Bug #18990 (New): should reflect the value of TLS/Insecure in the "Get API Token" dialog
https://dev.arvados.org/issues/18990 2022-04-11T20:24:42Z Ward Vandewege <ward@curii.com>
<p>When <code>TLS/Insecure</code> is set to <code>true</code>, the "Get API Token" dialog should say</p>
<pre><code>export ARVADOS_API_HOST_INSECURE=true</code></pre>
<p>and otherwise, it should say</p>
<pre><code>unset ARVADOS_API_HOST_INSECURE</code></pre>
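<p>The fix is a one-line conditional on the cluster config value. A Python sketch of the intended selection logic (the function name is illustrative, not actual workbench2 code, which is TypeScript):</p>

```python
def api_token_dialog_line(tls_insecure: bool) -> str:
    """Return the shell line the "Get API Token" dialog should show,
    depending on the cluster's TLS/Insecure config value."""
    if tls_insecure:
        return "export ARVADOS_API_HOST_INSECURE=true"
    return "unset ARVADOS_API_HOST_INSECURE"
```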
<p>Currently, workbench2 always does the latter.</p>

Arvados - Bug #18936 (New): [api] [controller] remove reader_token support
https://dev.arvados.org/issues/18936 2022-03-25T13:28:26Z Ward Vandewege <ward@curii.com>
<p>Workbench 1 appends the anonymous user token as a "reader token" to each GET request, to make sure that content shared with the anonymous user is available to authenticated users even when it is not explicitly shared with them.</p>
<p>Controller just appends any reader tokens received to the token list for the request.</p>
<p>The Rails API uses reader_tokens for GET requests (services/api/app/controllers/application_controller.rb).</p>
<p>But it also does something else: in services/api/app/middlewares/arvados_api_token.rb, it seems that if the primary session token is not valid, the first working reader token is used instead.</p>
<p>Workbench 2 does not use reader_tokens (which means authenticated users cannot access data only shared with the anonymous user!).</p>
<p>Nothing else in our codebase appears to use reader_tokens.</p>
<p>Our documentation does not mention reader_tokens.</p>
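<p>The middleware fallback described above reduces to "authenticate as the first token that validates". A toy Python model of that behavior (the <code>is_valid</code> callback is hypothetical, standing in for a real database token lookup):</p>

```python
def authenticate(primary_token, reader_tokens, is_valid):
    """Model of the fallback in arvados_api_token.rb: use the primary
    session token if it validates, otherwise the first working reader
    token; return None if nothing validates."""
    for tok in [primary_token] + list(reader_tokens):
        if is_valid(tok):
            return tok
    return None
```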
<p><a class="issue tracker-2 status-1 priority-3 priority-lowest" title="Feature: [config] simplify AnonymousUserToken configuration (New)" href="https://dev.arvados.org/issues/18937">#18937</a> is about simplifying the anonymous token configuration - basically, doing away with the need for an anonymous token at all. Once that is done, we can remove the controller and API code that handles reader_tokens. Maybe log a warning if a reader token is used (though, as long as WB1 is around, that's going to generate a lot of noise in the logs)?</p>

Arvados - Bug #18762 (New): rails background tasks scaling issues
https://dev.arvados.org/issues/18762 2022-02-14T21:08:24Z Ward Vandewege <ward@curii.com>
<p>The Rails API has a few background tasks that should each run in only one place, even when multiple Rails API instances are active.</p>
<p>- ward: fill in which tasks</p>
<p>Just like we did in <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: SweepTrashedObjects scaling issues (Resolved)" href="https://dev.arvados.org/issues/18339">#18339</a>, the existing background tasks in the Rails API code should be put behind a mutex: either move the code into controller, or if that is hard, apply the same solution as in <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: SweepTrashedObjects scaling issues (Resolved)" href="https://dev.arvados.org/issues/18339">#18339</a>.</p>

Arvados - Bug #18671 (New): [go sdk] update documentation
https://dev.arvados.org/issues/18671 2022-01-24T21:09:44Z Ward Vandewege <ward@curii.com>
<p>The documentation at </p>
<pre><code><a class="external" href="https://doc.arvados.org/sdk/go/index.html">https://doc.arvados.org/sdk/go/index.html</a><br /><a class="external" href="https://doc.arvados.org/sdk/go/example.html">https://doc.arvados.org/sdk/go/example.html</a></code></pre>
<p>refers to the old Go SDK. The godoc link for the sdk/go/arvados directory describes the current SDK. We want to move over to an RPC interface as per <a class="issue tracker-2 status-1 priority-4 priority-default" title="Feature: [go sdk] describe + implement desired Go SDK (New)" href="https://dev.arvados.org/issues/18672">#18672</a>.</p>
<ul>
<li>The godoc should be updated/improved to incorporate all the examples from the examples page at our documentation site.</li>
<li>Examples should be added for important features (e.g. the CollectionFileSystem)</li>
<li>The old pages should be removed from the Arvados documentation, with only the godoc link remaining. Any Go programmer should be able to use the Arvados Go SDK with just the godoc page.</li>
</ul>

Arvados - Bug #18618 (New): Reusing workflows/steps is too slow
https://dev.arvados.org/issues/18618 2022-01-07T15:26:37Z Ward Vandewege <ward@curii.com>
<p>Arvados takes too long to figure out if a workflow or step has already been run and can be reused.</p>
<p>A user reported that it can take ~1 minute for that determination to be made.</p>

Arvados - Bug #18393 (New): [workbench2] forces relogin on every new window/tab
https://dev.arvados.org/issues/18393 2021-11-19T14:37:25Z Ward Vandewege <ward@curii.com>
<p>How to reproduce:</p>
<p>1. open a new browser window or tab for <a class="external" href="https://workbench2.ce8i5.arvadosapi.com">https://workbench2.ce8i5.arvadosapi.com</a>. Log in.<br />2. open another browser window or tab for <a class="external" href="https://workbench2.ce8i5.arvadosapi.com">https://workbench2.ce8i5.arvadosapi.com</a>. The login page is shown again.</p>
<p>Observed on ce8i5 which is configured with direct Google authentication, and is the login cluster for a login federation. Relevant config:</p>
<pre>
...
RemoteClusters:
  ce8i5:
    Host: ce8i5.arvadosapi.com
    Proxy: true
    ActivateUsers: true
  tordo:
    Host: tordo.arvadosapi.com
    Proxy: true
    ActivateUsers: true
  9tee4:
    Host: 9tee4.arvadosapi.com
    Proxy: true
    ActivateUsers: true
API:
  MaxTokenLifetime: 24h
Login:
  LoginCluster: ce8i5
  # TokenLifetime: 8h
  Google:
    Enable: true
    AlternateEmailAddresses: true
...
</pre>
<p>Not seeing this on tordo or 9tee4.</p>

Arvados - Bug #18385 (New): arvados-server config-dump | arvados-server config-check -config=- sp...
https://dev.arvados.org/issues/18385 2021-11-16T20:58:23Z Ward Vandewege <ward@curii.com>
<p>This was observed on ce8i5 as part of keepstore-on-the-compute-node testing in <a class="issue tracker-3 status-3 priority-4 priority-default closed" title="Support: Run chr19 WGS test on ce8i5 to test compute-local keepstore (Resolved)" href="https://dev.arvados.org/issues/18320">#18320</a> (the `keepstore.txt` file listed the output below as warnings). We reproduced it with `arvados-server config-dump | arvados-server config-check`:</p>
<pre>
arvados-server config-dump | arvados-server config-check -config=-
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.d2asv4.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.d2asv4.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.e4asv4.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.e4asv4.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.e8asv4.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.e8asv4.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.e2asv4.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.e2asv4.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.a1v2.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.a1v2.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.a2v2.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.a2v2.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.a4v2.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.a4v2.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.d4asv4.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.d4asv4.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.d8asv4.Scratch"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.InstanceTypes.d8asv4.Name"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.Users.NewInactiveUserNotificationRecipients.REDACTED@curii.com"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.Users.NewUserNotificationRecipients.REDACTED@curii.com"
time="2021-11-16T20:57:41Z" level=warning msg="deprecated or unknown config entry: Clusters.ce8i5.Collections.ManagedProperties.responsible_person_uuid.Value"
</pre>
<p>All of these warnings are wrong. The keys are not invalid, and none of the InstanceType definitions have a `Name` or `Scratch` field defined.</p>
<p>Note that running `arvados-server config-check` by itself produces no output, as expected. The problem must be in the output generated by config-dump, e.g. this is the `InstanceTypes` section:</p>
<pre>
InstanceTypes:
a1v2:
AddedScratch: 0
IncludedScratch: 10000000000
Name: a1v2
Preemptible: false
Price: 0.043
ProviderType: Standard_A1_v2
RAM: 2147483648
Scratch: 10000000000
VCPUs: 1
a2v2:
AddedScratch: 0
IncludedScratch: 20000000000
Name: a2v2
Preemptible: false
Price: 0.091
ProviderType: Standard_A2_v2
RAM: 4294967296
Scratch: 20000000000
VCPUs: 2
...
</pre>
<p>The actual config file has:</p>
<pre>
InstanceTypes:
a1v2:
ProviderType: Standard_A1_v2
VCPUs: 1
RAM: 2GiB
IncludedScratch: 10GB
Price: 0.043
a2v2:
ProviderType: Standard_A2_v2
VCPUs: 2
RAM: 4GiB
IncludedScratch: 20GB
Price: 0.091
...
</pre>
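<p>A toy model of the suspected failure mode: config-check appears to flag any key absent from its per-section schema, while config-dump emits computed fields (<code>Name</code>, <code>Scratch</code>) alongside the operator-set ones. The schema set below is an assumption for illustration, not the checker's real schema:</p>

```python
# Keys the checker is assumed to accept for an InstanceTypes entry;
# Name and Scratch are deliberately absent, as if the checker's schema
# dropped them while config-dump still emits them.
KNOWN_KEYS = {"AddedScratch", "IncludedScratch", "Preemptible",
              "Price", "ProviderType", "RAM", "VCPUs"}

def unknown_keys(dumped_entry):
    """Return the keys a strict checker would flag as unknown."""
    return set(dumped_entry) - KNOWN_KEYS

# The a1v2 entry exactly as shown in the config-dump output above.
dumped_a1v2 = {
    "AddedScratch": 0, "IncludedScratch": 10000000000,
    "Name": "a1v2", "Preemptible": False, "Price": 0.043,
    "ProviderType": "Standard_A1_v2", "RAM": 2147483648,
    "Scratch": 10000000000, "VCPUs": 1,
}
```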
<p>Here's the `config-dump` output for the other keys that generate warnings:</p>
<pre>
ManagedProperties:
responsible_person_uuid:
Function: original_owner
Protected: true
Value: null
...
NewInactiveUserNotificationRecipients:
REDACTED@curii.com: {}
NewUserNotificationRecipients:
REDACTED@curii.com: {}
</pre>

Arvados Workbench 2 - Bug #18371 (New): Handle unreachable API server better on startup
https://dev.arvados.org/issues/18371 2021-11-15T19:09:20Z Ward Vandewege <ward@curii.com>

Arvados - Bug #18311 (New): [cwl] test 221 in the 1.2 conformance suite is failing on singularity
https://dev.arvados.org/issues/18311 2021-10-29T15:56:21Z Ward Vandewege <ward@curii.com>
<p>On our singularity clusters (9tee4 and tordo, currently), the failure is (from <a class="external" href="https://ci.arvados.org/view/Arvados%20build%20pipeline/job/run-tests-cwl-suite/37/console">run-tests-cwl-suite #37 console</a>):</p>
<pre>
21:57:52 Test 221 failed: /usr/bin/arvados-cwl-runner --compute-checksum --disable-reuse --eval-timeout 60 --outdir=/tmp/tmpxpfu_s_d --quiet tests/timelimit2-wf.cwl tests/empty.json
21:57:52 Test that workflow level time limit is not applied to workflow execution time
21:57:52 Returned non-zero
21:57:52 ERROR [container timelimit2-wf.cwl] (<a href="https://arvadosapi.com/9tee4-dz642-j9qil6yio4ms1ku">9tee4-dz642-j9qil6yio4ms1ku</a>) error log:
21:57:52
21:57:52 2021-10-29T01:56:31.148851065Z crunch-run Loading Docker image from keep
21:57:52 2021-10-29T01:56:31.641354946Z crunch-run Starting container
21:57:52 2021-10-29T01:56:31.642351277Z crunch-run Waiting for container to finish
21:57:52 2021-10-29T01:56:37.397845753Z stderr INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20211020182823, arvados-python-client 2.3.0.dev20211013202728, cwltool 3.1.20211020155521
21:57:52 2021-10-29T01:56:37.414865353Z stderr INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
21:57:52 2021-10-29T01:56:45.046750708Z stderr INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
21:57:52 2021-10-29T01:56:45.558323744Z stderr INFO Upload local files: "workflow.json"
21:57:52 2021-10-29T01:56:45.758393815Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (<a href="https://arvadosapi.com/9tee4-4zz18-wn85zp8etf12ora">9tee4-4zz18-wn85zp8etf12ora</a>)
21:57:52 2021-10-29T01:56:45.763297415Z stderr INFO Upload local files: "workflow.json"
21:57:52 2021-10-29T01:56:45.887338055Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (<a href="https://arvadosapi.com/9tee4-4zz18-35bm0g8ul56y4no">9tee4-4zz18-35bm0g8ul56y4no</a>)
21:57:52 2021-10-29T01:56:50.325226086Z stderr INFO Using collection cache size 256 MiB
21:57:52 2021-10-29T01:56:50.574797297Z stderr INFO Running inside container <a href="https://arvadosapi.com/9tee4-dz642-j9qil6yio4ms1ku">9tee4-dz642-j9qil6yio4ms1ku</a>
21:57:52 2021-10-29T01:56:50.596940754Z stderr INFO [workflow workflow.json#main] start
21:57:52 2021-10-29T01:56:50.597164015Z stderr INFO [workflow workflow.json#main] starting step step1
21:57:52 2021-10-29T01:56:50.597407019Z stderr INFO [step step1] start
21:57:52 2021-10-29T01:56:50.759170131Z stderr INFO [container step1] <a href="https://arvadosapi.com/9tee4-xvhdp-njas6p46cvav01s">9tee4-xvhdp-njas6p46cvav01s</a> state is Committed
21:57:52 2021-10-29T01:57:38.633382081Z stderr INFO [container step1] <a href="https://arvadosapi.com/9tee4-xvhdp-njas6p46cvav01s">9tee4-xvhdp-njas6p46cvav01s</a> is Final
21:57:52 2021-10-29T01:57:38.720112167Z stderr ERROR [container step1] (<a href="https://arvadosapi.com/9tee4-xvhdp-njas6p46cvav01s">9tee4-xvhdp-njas6p46cvav01s</a>) error log:
21:57:52 2021-10-29T01:57:38.720112167Z stderr
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:20.484569438Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:20.484755292Z crunch-run crunch-run 2.3.0 (go1.17.1) started
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:20.484776349Z crunch-run Executing container '<a href="https://arvadosapi.com/9tee4-dz642-szbmvhrdakmb0ha">9tee4-dz642-szbmvhrdakmb0ha</a>' using singularity runtime
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:20.484839635Z crunch-run Executing on host 'compute1.9tee4.arvadosapi.com'
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:20.604062194Z crunch-run container token "v2/9tee4-gj3su-eaaf5ru6ohnw7d2/36sq5h3y5f0qo1s7qlgzrn9hfn4h851ufstfkpjku5sh76i92j/9tee4-dz642-szbmvhrdakmb0ha"
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:20.604359220Z crunch-run Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.<a href="https://arvadosapi.com/9tee4-dz642-szbmvhrdakmb0ha">9tee4-dz642-szbmvhrdakmb0ha</a>.3491659512/keep4170174168]
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:21.510019953Z crunch-run Fetching Docker image from collection 'd2a6f06e1f8e3e7d72b8ee89622e9f96+261'
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:21.576950500Z crunch-run Using Docker image id "sha256:61064933b465210ec06517e95e64b0909841d4ccf037552266d7079baed45a6e"
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:21.576984357Z crunch-run Loading Docker image from keep
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:22.013919659Z crunch-run Starting container
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:22.014675984Z crunch-run Waiting for container to finish
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:27.015064319Z crunch-run maximum run time exceeded. Stopping container.
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:27.015148637Z crunch-run stopping container
21:57:52 2021-10-29T01:57:38.720112167Z stderr 2021-10-29T01:57:27.123407016Z crunch-run Cancelled
21:57:52 2021-10-29T01:57:38.817582951Z stderr ERROR [step step1] Output is missing expected field file:///var/lib/cwl/workflow.json#main/step1/o
21:57:52 2021-10-29T01:57:38.900999492Z stderr WARNING [step step1] completed permanentFail
21:57:52 2021-10-29T01:57:38.976870679Z stderr INFO [workflow workflow.json#main] completed permanentFail
21:57:52 2021-10-29T01:57:38.976917303Z stderr ERROR Overall process status is permanentFail
21:57:52 2021-10-29T01:57:39.142930716Z stderr INFO Final output collection 42866e7c2f47b2fd38dca68903e59c4b+59 "Output of main (2021-10-29T01:57:39.107Z)" (<a href="https://arvadosapi.com/9tee4-4zz18-yr16og3bk6qwk3q">9tee4-4zz18-yr16og3bk6qwk3q</a>)
21:57:52 2021-10-29T01:57:39.374321052Z stderr WARNING Final process status is permanentFail
21:57:52 2021-10-29T01:57:39.956477426Z crunch-run Complete
21:57:52 ERROR Overall process status is permanentFail
21:57:52 WARNING Final process status is permanentFail
</pre>
<p>And on 9tee4 (from <a class="external" href="https://ci.arvados.org/view/Arvados%20build%20pipeline/job/run-tests-cwl-suite/40/console">run-tests-cwl-suite #40 console</a>):</p>
<pre>
11:24:02 Test 221 failed: /usr/bin/arvados-cwl-runner --compute-checksum --disable-reuse --eval-timeout 60 --outdir=/tmp/tmp_lnu_8e_ --quiet tests/timelimit2-wf.cwl tests/empty.json
11:24:02 Test that workflow level time limit is not applied to workflow execution time
11:24:02 Returned non-zero
11:24:02 ERROR [container timelimit2-wf.cwl] (<a href="https://arvadosapi.com/9tee4-dz642-tggdcuakup1jfq5">9tee4-dz642-tggdcuakup1jfq5</a>) error log:
11:24:02
11:24:02 2021-10-29T15:22:53.175747990Z crunch-run Loading Docker image from keep
11:24:02 2021-10-29T15:22:53.849775581Z crunch-run Starting container
11:24:02 2021-10-29T15:22:53.850902104Z crunch-run Waiting for container to finish
11:24:02 2021-10-29T15:23:05.619135487Z stderr INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20211020182823, arvados-python-client 2.3.0.dev20211013202728, cwltool 3.1.20211020155521
11:24:02 2021-10-29T15:23:05.636148789Z stderr INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
11:24:02 2021-10-29T15:23:14.022887185Z stderr INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
11:24:02 2021-10-29T15:23:14.279142453Z stderr INFO Upload local files: "workflow.json"
11:24:02 2021-10-29T15:23:14.428674656Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (<a href="https://arvadosapi.com/9tee4-4zz18-9zab9j0xsgvigar">9tee4-4zz18-9zab9j0xsgvigar</a>)
11:24:02 2021-10-29T15:23:14.435330631Z stderr INFO Upload local files: "workflow.json"
11:24:02 2021-10-29T15:23:14.571627543Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (<a href="https://arvadosapi.com/9tee4-4zz18-gjzewquov369heg">9tee4-4zz18-gjzewquov369heg</a>)
11:24:02 2021-10-29T15:23:19.004648013Z stderr INFO Using collection cache size 256 MiB
11:24:02 2021-10-29T15:23:19.030485917Z stderr INFO Running inside container <a href="https://arvadosapi.com/9tee4-dz642-tggdcuakup1jfq5">9tee4-dz642-tggdcuakup1jfq5</a>
11:24:02 2021-10-29T15:23:19.052813153Z stderr INFO [workflow workflow.json#main] start
11:24:02 2021-10-29T15:23:19.053081383Z stderr INFO [workflow workflow.json#main] starting step step1
11:24:02 2021-10-29T15:23:19.053338000Z stderr INFO [step step1] start
11:24:02 2021-10-29T15:23:19.230789355Z stderr INFO [container step1] <a href="https://arvadosapi.com/9tee4-xvhdp-qojf0j4ca8lyvnf">9tee4-xvhdp-qojf0j4ca8lyvnf</a> state is Committed
11:24:02 2021-10-29T15:23:55.085180501Z stderr INFO [container step1] <a href="https://arvadosapi.com/9tee4-xvhdp-qojf0j4ca8lyvnf">9tee4-xvhdp-qojf0j4ca8lyvnf</a> is Final
11:24:02 2021-10-29T15:23:55.176969339Z stderr ERROR [container step1] (<a href="https://arvadosapi.com/9tee4-xvhdp-qojf0j4ca8lyvnf">9tee4-xvhdp-qojf0j4ca8lyvnf</a>) error log:
11:24:02 2021-10-29T15:23:55.176969339Z stderr
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:45.345522001Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:45.345759158Z crunch-run crunch-run 2.3.0 (go1.17.1) started
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:45.345779799Z crunch-run Executing container '<a href="https://arvadosapi.com/9tee4-dz642-e1vpz4lmmpzjdr4">9tee4-dz642-e1vpz4lmmpzjdr4</a>' using singularity runtime
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:45.345801566Z crunch-run Executing on host 'compute1.9tee4.arvadosapi.com'
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:45.459713326Z crunch-run container token "v2/9tee4-gj3su-ewn10p08oeckpxj/1eqyw1uioulmxl78olk0i743ipt4nro0neksh32ee48ks8rjux/9tee4-dz642-e1vpz4lmmpzjdr4"
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:45.460034932Z crunch-run Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.<a href="https://arvadosapi.com/9tee4-dz642-e1vpz4lmmpzjdr4">9tee4-dz642-e1vpz4lmmpzjdr4</a>.734327895/keep1570108756]
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:46.365999816Z crunch-run Fetching Docker image from collection 'd2a6f06e1f8e3e7d72b8ee89622e9f96+261'
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:46.430319132Z crunch-run Using Docker image id "sha256:61064933b465210ec06517e95e64b0909841d4ccf037552266d7079baed45a6e"
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:46.430387977Z crunch-run Loading Docker image from keep
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:47.011343535Z crunch-run Starting container
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:47.012153992Z crunch-run Waiting for container to finish
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:52.012532501Z crunch-run maximum run time exceeded. Stopping container.
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:52.012641954Z crunch-run stopping container
11:24:02 2021-10-29T15:23:55.176969339Z stderr 2021-10-29T15:23:52.123251411Z crunch-run Cancelled
11:24:02 2021-10-29T15:23:55.259792247Z stderr ERROR [step step1] Output is missing expected field file:///var/lib/cwl/workflow.json#main/step1/o
11:24:02 2021-10-29T15:23:55.344468592Z stderr WARNING [step step1] completed permanentFail
11:24:02 2021-10-29T15:23:55.423478553Z stderr INFO [workflow workflow.json#main] completed permanentFail
11:24:02 2021-10-29T15:23:55.423519944Z stderr ERROR Overall process status is permanentFail
11:24:02 2021-10-29T15:23:55.572398801Z stderr INFO Final output collection 42866e7c2f47b2fd38dca68903e59c4b+59 "Output of main (2021-10-29T15:23:55.541Z)" (<a href="https://arvadosapi.com/9tee4-4zz18-smx5y5kzo0oyh11">9tee4-4zz18-smx5y5kzo0oyh11</a>)
11:24:02 2021-10-29T15:23:55.804266214Z stderr WARNING Final process status is permanentFail
11:24:02 2021-10-29T15:23:56.277148741Z crunch-run Complete
11:24:02 ERROR Overall process status is permanentFail
11:24:02 WARNING Final process status is permanentFail
</pre>
<p>Compare with ce8i5, where it passes (<a class="external" href="https://ci.arvados.org/view/Arvados%20build%20pipeline/job/run-tests-cwl-suite/38/">run-tests-cwl-suite #38</a>).</p>

Arvados - Bug #18292 (New): [cleanup] remove AssignNodeHostname from the configuration. Also from...
https://dev.arvados.org/issues/18292 2021-10-22T21:24:46Z Ward Vandewege <ward@curii.com>

Arvados - Bug #18278 (New): [k8s] start using an ingress
https://dev.arvados.org/issues/18278 2021-10-19T19:52:53Z Ward Vandewege <ward@curii.com>
<p>As reported in <a class="external" href="https://forum.arvados.org/t/deploy-arvados-on-gke">https://forum.arvados.org/t/deploy-arvados-on-gke</a>, the GKE k8s setup needs some work.</p>

Arvados - Bug #18262 (New): [crunch-run] handle out-of-diskspace on the compute node better
https://dev.arvados.org/issues/18262 2021-10-08T21:38:39Z Ward Vandewege <ward@curii.com>
<p>When a job consumes all available disk space on a compute node, and the node was not started with a particular scratch space requirement (i.e. no extra partition was added), bad things happen because the job fills up the root partition of the node.</p>
<p>In one example today, a workflow filled up the (tiny) root partition, which caused /etc/resolv.conf to be emptied on the next DHCP renew (sigh). That left crunch-run unable to reach the API server and the keepstores, so the container failed with truncated logs and without being explicitly marked as failed. It looked as if crunch-run was crashing until we caught the compute node in the act, which was a bit of a debugging adventure.</p>
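<p>Short of enforcing a per-container quota, a pre-flight free-space check before starting the container would at least fail fast with a clear error instead of corrupting the node. A hedged Python sketch (the function and the reserve threshold are illustrative, not existing crunch-run behavior):</p>

```python
import shutil

def enough_scratch_space(path, needed_bytes, reserve_bytes=1 << 30):
    """Pre-flight check a dispatcher or crunch-run could make before
    starting a container: refuse to run if the filesystem backing
    `path` cannot hold the container's scratch request plus a safety
    reserve for the OS (the 1 GiB default is an arbitrary example)."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes + reserve_bytes
```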
<p>Can we somehow restrict the amount of disk space the container is allowed to use?</p>

Arvados - Bug #18191 (New): [doc] the compute node image doc does not take releases into account
https://dev.arvados.org/issues/18191 2021-09-24T21:48:28Z Ward Vandewege <ward@curii.com>
<p>The page at <a class="external" href="https://doc.arvados.org/install/crunch2-cloud/install-compute-node.html">https://doc.arvados.org/install/crunch2-cloud/install-compute-node.html</a> is source based, and it does not take our releases into account. This means that by default, you get the bleeding edge/dev packages in your freshly baked compute image, which is <strong>not</strong> what you want.</p>
<p>Solution: either add an explicit git checkout command (autogenerated with the correct git branch, using an existing or new variable in the doc build scripts), or package the build script as a proper OS package, versioned like everything else.</p>

Arvados - Bug #18161 (New): [a-d-c] the arvados_dispatchcloud_queue_entries prometheus metric sho...
https://dev.arvados.org/issues/18161 2021-09-16T16:34:47Z Ward Vandewege <ward@curii.com>
<p>The arvados_dispatchcloud_queue_entries metric is implemented in the "container queue" module which knows nothing about instances. It reports the best instance type for a set of resource requirements, based on the current configuration file.</p>
<p>This can cause inaccurate metrics when the node definitions in the configuration file are changed (and a-d-c is restarted) while containers are running. Instead of getting actual data you get aspirational data at this point. Today, a job was started that used 48 m5a.xlarge nodes and then ran into cloud capacity problems (spot). I updated the config file to make m5a.xlarge much more expensive, and restarted a-d-c, which promptly started the rest of the pending containers on m5.xlarge nodes. But the metric now reported 96 containers running on m5.xlarge, instead of the reality, which was 48 on m5a.xlarge and 48 on m5.xlarge.</p>
<p>The `arvados_dispatchcloud_instances_total` metrics (aka `node by state`) are correct in this scenario, and do not need fixing.</p>
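<p>Counting from the scheduler's view of actual workers, rather than re-deriving the cheapest type from the current config, would make the metric robust to config changes. A toy Python model (the data shape is illustrative, not a-d-c's real structures) that reproduces the mixed-fleet scenario above:</p>

```python
from collections import Counter

def queue_entries_by_actual_type(workers):
    """Count running containers by the instance type they are actually
    on, as reported by the worker pool.  `workers` is a list of
    (instance_type, running_container_count) pairs."""
    counts = Counter()
    for instance_type, running in workers:
        counts[instance_type] += running
    return dict(counts)
```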
<p>The `arvados_dispatchcloud_queue_entries` metric should be moved to the scheduler, which knows about queues and workers, and be changed to report actual information.</p>

Arvados - Bug #18101 (New): [a-d-c] [AWS] add option to spin up (spot) instances in more/all avai...
https://dev.arvados.org/issues/18101 2021-09-03T20:02:59Z Ward Vandewege <ward@curii.com>
<p>When using spot instances on AWS, it is common to see a message like this in the a-d-c logs:</p>
<pre>
InsufficientInstanceCapacity: We currently do not have sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
</pre>
<p>Currently, a-d-c requests compute instances in a specific subnet, which is tied to one availability zone; we recommend using the same zone the keepstores run in.</p>
<p>Traffic between availability zones in the same AWS region costs $0.02/GB (cf. <a class="external" href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region">https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region</a>).</p>
<p>Once <a class="issue tracker-6 status-3 priority-4 priority-default closed behind-schedule" title="Idea: Run Keepstore on local compute nodes (Resolved)" href="https://dev.arvados.org/issues/16516">#16516</a> (run Keepstore on the compute node) is implemented, it will be advantageous to configure a cluster on AWS where (spot) instances are requested across multiple (all?) availability zones in a region. When a spot instance runs in a different AZ, there would be an extra cost of $0.02/GB for all traffic to/from the permanent EC2 instances (e.g. API server), but that traffic should be minimal (mostly crunchstat-summary log traffic).</p>
<p>The Arvados configuration should support multiple subnets:</p>
<pre>
CloudVMs:
Driver: ec2
DriverParameters:
SubnetIDs: ['subnet-...', 'subnet-...']
</pre>
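<p>With such a list, the ec2 driver could fall through to the next subnet (and therefore the next AZ) on a capacity error. A hedged sketch of the retry loop (names are illustrative; <code>run_instance</code> stands in for the EC2 RunInstances call the driver makes):</p>

```python
class CapacityError(Exception):
    """Stand-in for AWS's InsufficientInstanceCapacity error."""

def run_instance_any_subnet(subnet_ids, run_instance):
    """Try each configured subnet in order, falling through on
    capacity errors; re-raise the last error if every subnet is
    exhausted (assumes subnet_ids is non-empty)."""
    last_err = None
    for subnet in subnet_ids:
        try:
            return run_instance(subnet)
        except CapacityError as err:
            last_err = err
    raise last_err
```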
<p>Alternatively, it would be nice if we could pass <strong>no</strong> AZ in the request; I'm not sure how that would work in the AWS SDK; presumably you would still have to supply a desired subnet. This needs a bit of investigation.</p>