Arvados: Issues (https://dev.arvados.org/)
Arvados - Bug #21618 (New): cloudtest should give up if test instance disappears from listing bef... (https://dev.arvados.org/issues/21618, 2024-03-25, Tom Clegg <tom@curii.com>)
<p>Currently, if an instance/image has a problem that causes it to shut down before responding to a boot probe, cloudtest keeps probing after the instance disappears from the listing, which is clearly futile.</p>
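<p>A minimal sketch of the intended behavior; <code>listed</code> and <code>probe</code> are hypothetical callbacks standing in for the real cloudtest internals, not the actual API:</p>
<pre>
package cloudtest

import (
	"context"
	"errors"
	"time"
)

// waitBoot polls the boot probe, but gives up as soon as the test
// instance disappears from the cloud listing.
func waitBoot(ctx context.Context, listed func() (bool, error), probe func() bool, interval time.Duration) error {
	for {
		ok, err := listed()
		if err != nil {
			return err
		}
		if !ok {
			// Instance is gone: further probing is futile.
			return errors.New("instance disappeared from listing before boot probe succeeded")
		}
		if probe() {
			return nil // boot probe succeeded
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}
</pre>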
Arvados - Feature #21599 (New): _inspect/requests endpoint should reveal whether each request is ... (https://dev.arvados.org/issues/21599, 2024-03-15, Tom Clegg <tom@curii.com>)

<p>This is a little inconvenient because the queue decision happens lower in the handler stack than the inspector (and we don't want to change that).</p>
<p>We can do something similar to responseLogFieldsContextKey in <a class="source" href="https://dev.arvados.org/projects/arvados/repository/arvados/entry/sdk/go/httpserver/logger.go">source:sdk/go/httpserver/logger.go</a> -- attach an atomic.Value to the request context as it passes through the Inspect handler, have RequestLimiter Store() the queue status there (queue label, time the request was released for processing), and Load() it when generating the _inspect/requests report.</p>
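<p>A sketch of that approach; the names are illustrative, not the actual controller code:</p>
<pre>
package httpserver

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// queueStatusKey parallels responseLogFieldsContextKey in
// sdk/go/httpserver/logger.go.
type queueStatusKey struct{}

// queueStatus is the value the request limiter would Store().
type queueStatus struct {
	Queue    string    // queue label
	Released time.Time // when the request was released for processing
}

// Inspect attaches an empty atomic.Value to the request context so a
// handler lower in the stack can fill it in, without the inspector
// needing to know when (or whether) that happens.
func Inspect(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var qs atomic.Value
		ctx := context.WithValue(r.Context(), queueStatusKey{}, &qs)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// SetQueueStatus would be called by RequestLimiter when it releases a
// request from its queue.
func SetQueueStatus(ctx context.Context, status queueStatus) {
	if v, ok := ctx.Value(queueStatusKey{}).(*atomic.Value); ok {
		v.Store(status)
	}
}

// QueueStatus would be called when generating the _inspect/requests report.
func QueueStatus(ctx context.Context) (queueStatus, bool) {
	v, ok := ctx.Value(queueStatusKey{}).(*atomic.Value)
	if !ok {
		return queueStatus{}, false
	}
	status, ok := v.Load().(queueStatus)
	return status, ok
}
</pre>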
<p>From <a class="issue tracker-2 status-3 priority-4 priority-default closed parent" title="Feature: Go SDK supports local filesystem-backed data cache (Resolved)" href="https://dev.arvados.org/issues/20318#note-19">#20318#note-19</a></p>
<ul>
<li>If the systemd $*_DIRECTORY variable is set, use that.</li>
<li>Otherwise, if the XDG $XDG_*_HOME/$XDG_*_DIR variable is set, use that. (See <a class="issue tracker-6 status-1 priority-4 priority-default" title="Idea: Support XDG base directory envvars throughout the Python SDK (New)" href="https://dev.arvados.org/issues/21020">#21020</a>)</li>
<li>Otherwise, default to current behavior (see the sketch after this list).</li>
<li>Update our systemd unit files to use the *Directory directives.</li>
</ul>
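<p>A minimal sketch of the proposed lookup order, using the cache directory as the example. $CACHE_DIRECTORY is the variable systemd sets for CacheDirectory=, $XDG_CACHE_HOME is the XDG variable; the final fallback path is a placeholder for current behavior:</p>
<pre>
package xdgdirs

import (
	"os"
	"path/filepath"
)

// cacheDir illustrates the proposed resolution order for the cache
// directory; the same pattern would apply to the state, config, and
// runtime directories.
func cacheDir() string {
	// 1. systemd's CacheDirectory= directive sets $CACHE_DIRECTORY.
	if d := os.Getenv("CACHE_DIRECTORY"); d != "" {
		return d
	}
	// 2. XDG base directory spec (see #21020).
	if d := os.Getenv("XDG_CACHE_HOME"); d != "" {
		return filepath.Join(d, "arvados")
	}
	// 3. Fall back to current behavior (placeholder path).
	home, _ := os.UserHomeDir()
	return filepath.Join(home, ".cache", "arvados")
}
</pre>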
<p><a class="external" href="https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#RuntimeDirectory=">https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#RuntimeDirectory=</a></p> Arvados - Bug #21319 (New): Avoid waiting/deadlock when a controller handler performs subrequests...https://dev.arvados.org/issues/213192023-12-27T23:26:44ZTom Cleggtom@curii.comArvados - Bug #21314 (New): a-d-c should cancel a container if it can't be loadedhttps://dev.arvados.org/issues/213142023-12-21T16:55:13ZTom Cleggtom@curii.com
<p>If a container's "mounts" field is invalid, a-d-c logs the error and keeps retrying:</p>
<code class="json syntaxhl"><span class="p">{</span><span class="nl">"ClusterID"</span><span class="p">:</span><span class="s2">"irdev"</span><span class="p">,</span><span class="nl">"ContainerUUID"</span><span class="p">:</span><span class="s2">"<a href="https://arvadosapi.com/xxxxx-dz642-xxxxxxxxxxxxxxx">xxxxx-dz642-xxxxxxxxxxxxxxx</a>"</span><span class="p">,</span><span class="nl">"PID"</span><span class="p">:</span><span class="mi">2037423</span><span class="p">,</span><span class="nl">"error"</span><span class="p">:</span><span class="s2">"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount"</span><span class="p">,</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"warning"</span><span class="p">,</span><span class="nl">"msg"</span><span class="p">:</span><span class="s2">"error getting mounts"</span><span class="p">,</span><span class="nl">"time"</span><span class="p">:</span><span class="s2">"2023-12-13T20:34:41.064140517Z"</span><span class="p">}</span><span class="w">
</span></code>
<p>In this situation, the offending container should be cancelled.</p>
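<p>A sketch of the intended handling, with cut-down stand-ins for the SDK types and a hypothetical cancel callback:</p>
<pre>
package dispatchcloud

import (
	"encoding/json"
	"fmt"
)

// Mount and Container are cut-down stand-ins for the arvados SDK
// types; only the field that triggers the decode error is shown.
type Mount struct {
	Kind string `json:"kind"`
}

type Container struct {
	UUID   string           `json:"uuid"`
	Mounts map[string]Mount `json:"mounts"`
}

// loadOrCancel decodes a container record. A decode error (such as a
// mount given as an array instead of an object) can never succeed on
// retry, so instead of retrying forever, the container is cancelled
// via the hypothetical cancel callback.
func loadOrCancel(raw []byte, uuid string, cancel func(uuid, reason string) error) (*Container, error) {
	var c Container
	if err := json.Unmarshal(raw, &c); err != nil {
		if cerr := cancel(uuid, fmt.Sprintf("invalid container record: %v", err)); cerr != nil {
			return nil, fmt.Errorf("decode failed (%v) and cancel failed (%v)", err, cerr)
		}
		return nil, err
	}
	return &c, nil
}
</pre>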
Arvados - Feature #21279 (New): cloudtest command should test connectivity to crunch-run gateway (https://dev.arvados.org/issues/21279, 2023-12-08, Tom Clegg <tom@curii.com>)

<p>After the instance boots, <code>cloudtest</code> should copy itself (<code>/proc/self/exe</code>) to the compute node and use it to run an HTTP server on a dynamic port, then check that it can connect to that server. This will sometimes detect network/firewall/configuration problems that would prevent controller from connecting to crunch-run's gateway server and break live logs.</p>
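<p>A sketch of the dynamic-port probe; the listen/dial part is the essence, and copying /proc/self/exe to the node is omitted:</p>
<pre>
package cloudtest

import (
	"fmt"
	"net"
	"net/http"
)

// serveProbe starts a trivial HTTP server on a dynamically assigned
// port (the part that would run on the compute node) and returns its
// listen address.
func serveProbe() (string, error) {
	ln, err := net.Listen("tcp", ":0") // kernel picks a free port
	if err != nil {
		return "", err
	}
	go http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	return ln.Addr().String(), nil
}

// checkProbe is the cloudtest-side check: if this request fails, the
// same network/firewall problem would likely prevent controller from
// reaching crunch-run's gateway server (breaking live logs).
func checkProbe(addr string) error {
	resp, err := http.Get("http://" + addr + "/")
	if err != nil {
		return fmt.Errorf("cannot reach probe server: %w", err)
	}
	resp.Body.Close()
	return nil
}
</pre>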
Arvados - Feature #21269 (New): Fix unkeyed struct fields and enable "go vet" checks (https://dev.arvados.org/issues/21269, 2023-12-06, Tom Clegg <tom@curii.com>)

<p>See #21227#note-5 and <a class="changeset" title="21227: Fail tests on 'go vet' problems. ...except &quot;literal uses unkeyed fields&quot;, of which there ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/d7b8f2a876c797c22bcb8594f73624402d758e18">d7b8f2a876</a></p>
<p>Most of the offending code is in test suites.</p>
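<p>For reference, the two literal styles. Note that vet's composites check only flags unkeyed literals of struct types imported from other packages; this self-contained example just shows the contrast:</p>
<pre>
package main

import "fmt"

// mount is a stand-in struct, not an Arvados SDK type.
type mount struct {
	Kind string
	Path string
}

func main() {
	// Unkeyed: order-dependent and fragile; this is the style the
	// ticket wants removed so "go vet" can be fully enabled.
	a := mount{"collection", "/keep"}

	// Keyed: survives field reordering and added fields.
	b := mount{Kind: "collection", Path: "/keep"}

	fmt.Println(a == b) // true
}
</pre>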
Arvados - Bug #21187 (New): a-c-r should detect and warn when arv:IntermediateOutput outputTTL is... (https://dev.arvados.org/issues/21187, 2023-11-09, Tom Clegg <tom@curii.com>)

<p>Currently, if outputTTL is set too low and a workflow tries to use intermediate data after it has already been trashed, a-c-r may read an intermediate collection manifest successfully (before trash time) but then fail to save it later (after trash time) in a combined collection. In that case the user ends up getting a Python stack trace ending in a 403 error (invalid blob signature).</p>
<p>a-c-r should warn the user when the current workflow's running time exceeds outputTTL; this is probably a good indicator that the user should increase outputTTL, even if nothing has actually broken yet.</p>
<p>a-c-r should also report a more helpful error message when it fails to create a collection due to expired blob signatures. This could be done by checking for a 403 error from the create call and/or checking the expiry times (given as hexadecimal Unix times) on the blob signatures in the manifest text.</p>
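<p>A sketch of the expiry check (in Go for consistency with the other sketches here, although a-c-r is Python); it parses the hexadecimal Unix time from the A... permission hint of a signed Keep locator:</p>
<pre>
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// signatureExpiry extracts the expiry time from a signed Keep block
// locator such as
//   d41d8cd98f00b204e9800998ecf8427e+0+A&lt;hexsig&gt;@65f1a2b3
// where the timestamp after "@" is a hexadecimal Unix time.
func signatureExpiry(locator string) (time.Time, bool) {
	for _, hint := range strings.Split(locator, "+") {
		if strings.HasPrefix(hint, "A") {
			if i := strings.IndexByte(hint, '@'); i >= 0 {
				if ts, err := strconv.ParseInt(hint[i+1:], 16, 64); err == nil {
					return time.Unix(ts, 0), true
				}
			}
		}
	}
	return time.Time{}, false
}

func main() {
	t, ok := signatureExpiry("d41d8cd98f00b204e9800998ecf8427e+0+Aabc123@65f1a2b3")
	fmt.Println(t, ok) // expiry parsed from the hex Unix time
}
</pre>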
Arvados - Feature #21175 (New): Do not retry after "unsupported instance type" EC2 errors (https://dev.arvados.org/issues/21175, 2023-11-07, Tom Clegg <tom@curii.com>)

<p>Currently arvados-dispatch-cloud treats "unsupported instance type" as a transient capacity error, and recovers by trying other subnets and other instance types. This works, but generates unnecessary logging noise and API calls by retrying the same instance type (if still needed) after a hold-off period.</p>
<p>This could be improved by having the ec2 driver set an "instance type T unavailable in subnet S" flag for the life of the arvados-dispatch-cloud process and, when that flag is set, skip the EC2 API call and just try the next subnet or return a capacity error.</p>
<p>In the event all configured instance types suitable for a given container are unsupported in all subnets, the current version of a-d-c will wait futilely for them to appear.</p>
<p>This could be improved by having the ec2 driver and worker pool propagate the "permanently unavailable" state back to the scheduler so it can cancel the container.</p>
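<p>A minimal sketch of the proposed process-lifetime flag, with illustrative names rather than the actual ec2 driver types:</p>
<pre>
package ec2

import "sync"

// unavailable remembers "instance type T is unsupported in subnet S"
// for the life of the arvados-dispatch-cloud process.
type unavailable struct {
	mu  sync.Mutex
	set map[[2]string]bool // key: {instanceType, subnetID}
}

// mark records an "unsupported instance type" error for this
// type/subnet pair.
func (u *unavailable) mark(instType, subnet string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.set == nil {
		u.set = map[[2]string]bool{}
	}
	u.set[[2]string{instType, subnet}] = true
}

// skip reports whether the driver should skip the EC2 API call for
// this type/subnet pair and go straight to the next subnet (or
// return a capacity error if no subnets remain).
func (u *unavailable) skip(instType, subnet string) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	return u.set[[2]string{instType, subnet}]
}
</pre>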
Arvados - Bug #21134 (New): Fix proxy error logging in controller's container log handler (https://dev.arvados.org/issues/21134, 2023-10-20, Tom Clegg <tom@curii.com>)

<p>Currently, when controller gets a non-HTTP error while trying to proxy a request to keep-web, the error gets logged using stdlib <code>log.Print()</code> instead of structured logs:</p>
<pre>
{"ClusterID":"2xpu4","PID":22503,"RequestID":"req-1bqnt5n2ozwpcscdjc94","level":"info","msg":"request","remoteAddr":"127.0.0.1:38144","reqBytes":0,"reqForwardedFor":"XXX","reqHost":"2xpu4.arvadosapi.com","reqMethod":"PROPFIND","reqPath":"arvados/v1/container_requests/XXX/log/XXX","reqQuery":"","time":"2023-10-20T14:56:54.866512957Z"}
2023/10/20 14:56:54 http: proxy error: dial tcp 127.0.0.1:9002: connect: connection refused
{"ClusterID":"2xpu4","PID":22503,"RequestID":"req-1bqnt5n2ozwpcscdjc94","level":"info","msg":"response","priority":1,"remoteAddr":"127.0.0.1:38144","reqBytes":0,"reqForwardedFor":"XXX","reqHost":"2xpu4.arvadosapi.com","reqMethod":"PROPFIND","reqPath":"arvados/v1/container_requests/XXX/log/XXX","reqQuery":"","respBody":"","respBytes":0,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2023-10-20T14:56:54.881886673Z","timeToStatus":0.015358,"timeTotal":0.015365,"timeWriteBody":0.000007,"tokenUUIDs":["XXX"]}
</pre>
<p>The error should be returned to the client in the 502 response body and, ideally, in a field in the "response" log entry.</p>
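<p>Go's <code>httputil.ReverseProxy</code> exposes an ErrorHandler hook for exactly this; its default handler logs via the stdlib logger, which matches the output above. A sketch, not the actual controller code:</p>
<pre>
package railsproxy

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy builds a reverse proxy whose ErrorHandler puts the proxy
// error into the 502 response body instead of relying on the default
// handler's bare log.Print-style output.
func newProxy(target *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		// TODO: also attach err to the request's structured log
		// fields (cf. responseLogFieldsContextKey) so it shows up
		// as a field in the "response" entry.
		log.Printf("proxy error: %v", err) // placeholder until structured
		http.Error(w, "bad gateway: "+err.Error(), http.StatusBadGateway)
	}
	return proxy
}
</pre>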
Arvados - Feature #21133 (New): Add diagnostics checks for container log API (https://dev.arvados.org/issues/21133, 2023-10-20, Tom Clegg <tom@curii.com>)

<p>If it runs a container, <code>arvados-server diagnostics</code> should:</p>
<ol>
<li>access the container log endpoint while waiting for the container to finish, to make sure it returns a valid response (due to races, it's not necessarily an error if no actual log data is available before the container finishes, but receiving a 502 error at any point during the container lifecycle, for example, is an error).</li>
<li>access the container log endpoint after the container has finished, to make sure the controller→webdav communication works correctly.</li>
</ol>
<p>It is possible for a cluster to be misconfigured such that logs work only for unfinished containers, or only for finished containers, so <code>diagnostics</code> should do its best to check for both problems.</p>
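<p>A sketch of the check, run once while the container is running and once after it finishes; the URL shape follows the log path seen in the #21134 log excerpt above, and the method/status handling is illustrative:</p>
<pre>
package diagnostics

import (
	"fmt"
	"net/http"
)

// checkLogEndpoint fetches the container log endpoint once and fails
// on a 5xx response. While the container is still running, "no logs
// yet" responses are acceptable; after it finishes, a valid response
// is required.
func checkLogEndpoint(client *http.Client, baseURL, crUUID string, finished bool) error {
	resp, err := client.Get(baseURL + "/arvados/v1/container_requests/" + crUUID + "/log/")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode >= 500:
		// A 502 at any point in the container lifecycle indicates
		// broken controller→webdav proxying.
		return fmt.Errorf("log endpoint returned %s", resp.Status)
	case finished && resp.StatusCode != http.StatusOK:
		return fmt.Errorf("log endpoint returned %s after container finished", resp.Status)
	}
	return nil
}
</pre>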
<p>From <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: Do not treat InsufficientInstanceCapacity as quota error (Resolved)" href="https://dev.arvados.org/issues/20984#note-8">#20984#note-8</a>:</p>
<blockquote>
<p>The current behavior has some very bad failure modes. A user launched a pipeline which asked for a large node (m4.10xlarge) and got InsufficientInstanceCapacity after only 3 instances had been created; this caused the dispatcher to completely stop trying to start nodes and lowered the dynamic max instances down to 3. As a result it became starved because the instances already running were waiting on the worker instance to start, but dispatcher was waiting for an instance to shut down before it would try starting a new one.</p>
<p>Instead of going completely silent on quota error, I think we want to either go back to the old behavior (1 minute quiet period) or implement an exponential back off behavior (wait for 15 seconds, then 30 seconds, then 60 seconds, then 2 minutes, etc). An instance shutdown can still be used as a signal to try starting a new instance if it is in the quiet period, but a quiet period of indefinite length is turning out to be bad behavior -- the correct assumption is that we're sharing the cloud resource with other users and new resources could become available any time without us having to do anything.</p>
</blockquote>
<p>Even after <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: Do not treat InsufficientInstanceCapacity as quota error (Resolved)" href="https://dev.arvados.org/issues/20984">#20984</a> is fixed, a similar situation can still happen with conditions like InsufficientFreeAddressesInSubnet: if the relevant resources are freed up by something other than arvados-dispatch-cloud (or the relevant quota is increased), the current implementation will not notice until an existing instance gets shut down.</p>
<p>To address this, the quota flag should get reset after some time interval (1 minute?) even if no instances have been shut down.</p>
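<p>A sketch of the back-off suggested in the quote (15s, 30s, 60s, 2m, ... with a cap), with instance shutdowns still clearing the hold-off early; names are illustrative:</p>
<pre>
package dispatchcloud

import "time"

// quotaHoldoff replaces the latched quota flag with an exponential
// back-off that expires on its own.
type quotaHoldoff struct {
	next  time.Duration
	until time.Time
}

// onQuotaError starts (or lengthens) the hold-off period.
func (q *quotaHoldoff) onQuotaError(now time.Time) {
	if q.next == 0 {
		q.next = 15 * time.Second
	}
	q.until = now.Add(q.next)
	if q.next < 10*time.Minute {
		q.next *= 2
	}
}

// mayCreate reports whether the dispatcher may try creating
// instances again; the flag now resets on a timer instead of
// latching until an instance shuts down.
func (q *quotaHoldoff) mayCreate(now time.Time) bool {
	return now.After(q.until)
}

// onInstanceShutdown keeps the old behavior: a shutdown signals that
// capacity may have been freed, so clear the hold-off immediately.
func (q *quotaHoldoff) onInstanceShutdown() {
	q.until = time.Time{}
	q.next = 0
}
</pre>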
<p>Part of the original motivation for latching the quota flag was to avoid exhausting lock/unlock cycles. When changing this, make sure the fix in #20457 (don't unlock the next locked container just because cloud is at quota) is still effective.</p>
<p>Currently <a class="source" href="https://dev.arvados.org/projects/arvados/repository/arvados/entry/build/run-tests.sh">source:build/run-tests.sh</a> has a test_sdk/ruby-google-api-client stub, but it doesn't actually test the code in <a class="source" href="https://dev.arvados.org/projects/arvados/repository/arvados/entry/sdk/ruby-google-api-client">source:sdk/ruby-google-api-client</a> like you'd think. In a sense that code gets tested indirectly by being imported by sdk/ruby, but we should also be running its own tests.</p> Arvados - Bug #20804 (New): crunchstat-summary should use container logs API, not CollectionReade...https://dev.arvados.org/issues/208042023-07-31T17:36:21ZTom Cleggtom@curii.com
<p>Currently crunchstat-summary uses CollectionReader to read crunchstat logs for finished containers, and uses the "logs" API for unfinished containers. That "logs" data will soon be unavailable, so crunchstat-summary will lose the ability to make graphs/stats for containers while they are still running.</p>
<p>We can fix this by replacing both the "get crunchstat logs via CollectionReader" and "...via logs API" code paths with a single "get crunchstat logs via container logs API" code path.</p>

Arvados - Feature #20756 (New): Support crunchstat tracking and memory limits with singularity (https://dev.arvados.org/issues/20756, 2023-07-19, Tom Clegg <tom@curii.com>)
<p>Singularity has the ability to put the container in a new cgroup and set resource usage limits. Even without applying any limits, this enables resource usage tracking by crunchstat.</p>
<p><a class="external" href="https://docs.sylabs.io/guides/3.0/user-guide/cgroups.html">https://docs.sylabs.io/guides/3.0/user-guide/cgroups.html</a></p>
<p>The docs say "the <code>--apply-cgroups</code> option can only be used with root privileges" but these tests worked as a non-root user:</p>
<pre>
$ singularity version
3.10.4-dirty
$ singularity exec --apply-cgroups /dev/null docker://debian:12 sleep 600 &
[1] 60133
$ pstree -up | grep sleep
| | `-starter-suid(60133)-+-sleep(60151)
$ cat /proc/60133/cgroup
0::/user.slice/user-1000.slice/session-5424.scope
$ cat /proc/60151/cgroup
0::/user.slice/user-1000.slice/user@1000.service/user.slice/singularity-60151.scope
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/singularity-60151.scope/memory.current
2465792
</pre>
<pre>
$ singularity exec --apply-cgroups <(printf '[memory]\n limit = 5000000\n') docker://debian:12 echo ok
ok
$ singularity exec --apply-cgroups <(printf '[memory]\n limit = 5000\n') docker://debian:12 echo ok
Killed
</pre>
<p>As of <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: Make sure cgroupsV2 works with Arvados (Resolved)" href="https://dev.arvados.org/issues/17244">#17244</a> crunch-run does not correctly identify the pid of a process inside the container when telling crunchstat which process/cgroup to monitor (it returns the pid of the singularity executor wrapper instead). This will also need to be fixed in order for crunchstat to work correctly.</p>