Arvados: Issues
https://dev.arvados.org/ | 2024-03-25T16:52:07Z

Arvados - Bug #21618 (New): cloudtest should give up if test instance disappears from listing bef...
https://dev.arvados.org/issues/21618 | 2024-03-25T16:52:07Z | Tom Clegg (tom@curii.com)
<p>Currently, if an instance/image has a problem that causes it to shut down before responding to a boot probe, cloudtest keeps probing after it disappears, which is clearly futile.</p>

Arvados - Bug #21598 (New): Local keepstore invoked by crunch-run should never do EmptyTrash work
https://dev.arvados.org/issues/21598 | 2024-03-15T18:32:48Z | Tom Clegg (tom@curii.com)
<p>We don't want N compute nodes periodically checking expiry times on all of the trashed blocks on all backend volumes.</p>

Arvados - Bug #21319 (New): Avoid waiting/deadlock when a controller handler performs subrequests...
https://dev.arvados.org/issues/21319 | 2023-12-27T23:26:44Z | Tom Clegg (tom@curii.com)

Arvados - Bug #21314 (New): a-d-c should cancel a container if it can't be loaded
https://dev.arvados.org/issues/21314 | 2023-12-21T16:55:13Z | Tom Clegg (tom@curii.com)
<p>If a container's "mounts" field is invalid, a-d-c logs this, and keeps trying.</p>
<code class="json syntaxhl"><span class="p">{</span><span class="nl">"ClusterID"</span><span class="p">:</span><span class="s2">"irdev"</span><span class="p">,</span><span class="nl">"ContainerUUID"</span><span class="p">:</span><span class="s2">"<a href="https://arvadosapi.com/xxxxx-dz642-xxxxxxxxxxxxxxx">xxxxx-dz642-xxxxxxxxxxxxxxx</a>"</span><span class="p">,</span><span class="nl">"PID"</span><span class="p">:</span><span class="mi">2037423</span><span class="p">,</span><span class="nl">"error"</span><span class="p">:</span><span class="s2">"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount"</span><span class="p">,</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"warning"</span><span class="p">,</span><span class="nl">"msg"</span><span class="p">:</span><span class="s2">"error getting mounts"</span><span class="p">,</span><span class="nl">"time"</span><span class="p">:</span><span class="s2">"2023-12-13T20:34:41.064140517Z"</span><span class="p">}</span><span class="w">
</span></code>
<p>In this situation, the offending container should be cancelled.</p>

Arvados - Bug #21187 (New): a-c-r should detect and warn when arv:IntermediateOutput outputTTL is...
https://dev.arvados.org/issues/21187 | 2023-11-09T19:31:33Z | Tom Clegg (tom@curii.com)
<p>Currently, if outputTTL is set too low and a workflow tries to use intermediate data after it has already been trashed, a-c-r may read an intermediate collection manifest successfully (before trash time) but then fail to save it later (after trash time) in a combined collection. In that case the user ends up getting a Python stack trace ending in a 403 error (invalid blob signature).</p>
<p>a-c-r should warn the user when the current workflow has been running for longer than outputTTL. This is probably a good indicator that the user should increase outputTTL, even if nothing has actually broken yet.</p>
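<p>A minimal sketch of the two checks proposed in this issue: the runtime-versus-outputTTL warning, and the signature expiry check on a manifest. This is not a-c-r code. The function names are hypothetical, the sketch assumes a non-positive outputTTL means "no TTL", and it assumes the permission hint on a signed locator has the form <code>+A&lt;hex signature&gt;@&lt;hex expiry&gt;</code>.</p>

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed shape of a signed locator's permission hint:
# "+A<hex signature>@<hex unix expiry time>".
SIGNATURE_HINT = re.compile(r"\+A[0-9a-f]+@([0-9a-f]+)")


def runtime_exceeds_ttl(started_at, now, output_ttl_seconds):
    """Return True if the workflow has been running longer than outputTTL.

    A non-positive TTL is treated as "no TTL" (an assumption of this sketch).
    """
    if output_ttl_seconds <= 0:
        return False
    return now - started_at > timedelta(seconds=output_ttl_seconds)


def expired_signatures(manifest_text, now):
    """Return the expiry times of all signatures in manifest_text that have passed."""
    expired = []
    for match in SIGNATURE_HINT.finditer(manifest_text):
        # The expiry is a Unix timestamp written in hexadecimal.
        expiry = datetime.fromtimestamp(int(match.group(1), 16), tz=timezone.utc)
        if expiry <= now:
            expired.append(expiry)
    return expired
```

<p>A caller could run <code>expired_signatures()</code> on the manifest text after a 403 from the create call, and report the expiry times instead of a bare stack trace.</p>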
<p>a-c-r should also report a more helpful error message when it fails to create a collection due to expired blob signatures. This could be done by checking for a 403 error from the create call and/or checking the expiry times (given as hexadecimal Unix timestamps) on the blob signatures in the manifest text.</p>

Arvados - Bug #21134 (New): Fix proxy error logging in controller's container log handler
https://dev.arvados.org/issues/21134 | 2023-10-20T15:25:18Z | Tom Clegg (tom@curii.com)
<p>Currently, when controller gets a non-HTTP error while trying to proxy a request to keep-web, the error gets logged using stdlib <code>log.Print()</code> instead of structured logs:</p>
<pre>
{"ClusterID":"2xpu4","PID":22503,"RequestID":"req-1bqnt5n2ozwpcscdjc94","level":"info","msg":"request","remoteAddr":"127.0.0.1:38144","reqBytes":0,"reqForwardedFor":"XXX","reqHost":"2xpu4.arvadosapi.com","reqMethod":"PROPFIND","reqPath":"arvados/v1/container_requests/XXX/log/XXX","reqQuery":"","time":"2023-10-20T14:56:54.866512957Z"}
2023/10/20 14:56:54 http: proxy error: dial tcp 127.0.0.1:9002: connect: connection refused
{"ClusterID":"2xpu4","PID":22503,"RequestID":"req-1bqnt5n2ozwpcscdjc94","level":"info","msg":"response","priority":1,"remoteAddr":"127.0.0.1:38144","reqBytes":0,"reqForwardedFor":"XXX","reqHost":"2xpu4.arvadosapi.com","reqMethod":"PROPFIND","reqPath":"arvados/v1/container_requests/XXX/log/XXX","reqQuery":"","respBody":"","respBytes":0,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2023-10-20T14:56:54.881886673Z","timeToStatus":0.015358,"timeTotal":0.015365,"timeWriteBody":0.000007,"tokenUUIDs":["XXX"]}
</pre>
<p>The error should be returned to the client in the 502 response body and, ideally, in a field in the "response" log entry.</p>

Arvados - Bug #20804 (New): crunchstat-summary should use container logs API, not CollectionReade...
https://dev.arvados.org/issues/20804 | 2023-07-31T17:36:21Z | Tom Clegg (tom@curii.com)
<p>Currently crunchstat-summary uses CollectionReader to read crunchstat logs for finished containers, and uses the "logs" API for unfinished containers. That "logs" data will soon be unavailable, so crunchstat-summary will lose the ability to make graphs/stats for containers while they are still running.</p>
<p>We can fix this by replacing both "get crunchstat logs via CollectionReader" and "...via logs API" code paths with "get crunchstat logs via container logs API".</p>

Arvados - Bug #20646 (New): Explain that on EC2, DeployPublicKey depends on AdminUsername matchin...
https://dev.arvados.org/issues/20646 | 2023-06-15T15:24:28Z | Tom Clegg (tom@curii.com)
<p>Currently the <a href="https://doc.arvados.org/main/install/crunch2-cloud/install-compute-node.html#requirements" class="external">install docs</a> imply that DeployPublicKey only works on Azure. In fact, it also works on AWS, provided DriverParameters.AdminUsername matches the "default username" for the machine image being used, which varies by distribution; see <a class="external" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connection-prereqs.html">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connection-prereqs.html</a>.</p>

Arvados - Bug #20516 (New): Diagnostics command should recommend using cloudtest to diagnose furt...
https://dev.arvados.org/issues/20516 | 2023-05-16T19:19:53Z | Tom Clegg (tom@curii.com)

Arvados - Bug #20317 (New): Searching for a container UUID always shows "{uuid} not available" error
https://dev.arvados.org/issues/20317 | 2023-04-11T18:21:47Z | Tom Clegg (tom@curii.com)
<p>I would expect this to show either a non-CR-specific page about the container, or show the most recent container request associated with it.</p>
<p>Browser debugger shows a request for <code>.../arvados/v1/container_requests?filters=[[requesting_container_uuid,=,{container_uuid}]]</code> which doesn't quite seem right.</p>
<p>(Maybe this just needs to be changed to <code>?filters=[[container_uuid,=,{container_uuid}]]</code> ?)</p>

Arvados - Bug #19965 (New): RailsAPI should either enforce container.log is a PDH, or handle UUID...
https://dev.arvados.org/issues/19965 | 2023-01-20T15:17:10Z | Tom Clegg (tom@curii.com)
<p>In <a class="issue tracker-2 status-3 priority-4 priority-default closed parent" title="Feature: crunch-run commits log collection at container start (Resolved)" href="https://dev.arvados.org/issues/19886">#19886</a> we found that RailsAPI will allow crunch-run to update a container record with a UUID in the log attribute, but it appears the "update container request logs" code will then silently fail because it is assuming the container log attribute is a PDH. As a result, the container request log collections will be stale or empty.</p>
<p>Either the container model should enforce that log can only be set to a PDH (which is currently the only way it's used), or the "copy container log data to container request log collection" code should be updated to work when the container log is a UUID.</p>
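<p>To make the distinction concrete, here is a hedged sketch of a value check that tells a PDH apart from a collection UUID and rejects anything else. It is written in Python for illustration only (RailsAPI itself is not Python), the function name is hypothetical, and the two patterns are assumptions about the formats: 32 hex digits plus <code>+size</code> for a PDH, and a <code>4zz18</code> type infix for a collection UUID.</p>

```python
import re

# Assumed formats (not taken from the RailsAPI source):
#   PDH:             32 hex digits, "+", decimal size
#   collection UUID: 5-char cluster id, "-4zz18-", 15 alphanumeric chars
PDH_RE = re.compile(r"[0-9a-f]{32}\+\d+")
COLLECTION_UUID_RE = re.compile(r"[0-9a-z]{5}-4zz18-[0-9a-z]{15}")


def classify_log_value(value):
    """Return "empty", "pdh", or "uuid"; raise ValueError for any other value."""
    if value is None or value == "":
        return "empty"
    if PDH_RE.fullmatch(value):
        return "pdh"
    if COLLECTION_UUID_RE.fullmatch(value):
        return "uuid"
    raise ValueError(f"log must be a PDH or a collection UUID, got {value!r}")
```

<p>Under the first option the model would accept only "empty" and "pdh"; under the second, the log-copying code would branch on "pdh" versus "uuid".</p>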
<p>Either way, setting log to a non-empty, non-UUID, non-PDH value should be an error.</p>

Arvados - Bug #18586 (New): Remove docs/code for unsupported AsyncPermissionsUpdateInterval
https://dev.arvados.org/issues/18586 | 2021-12-14T16:21:47Z | Tom Clegg (tom@curii.com)
While looking at the documentation for spots that still need tweaking, I think I've found some traces of a feature that is no longer supported, the <code>async</code> flag on the <code>groups</code> API:
<ul>
<li>API documentation mentions it along with the <code>async_permissions_update_interval</code> config knob.</li>
<li>Some test and config code still refers to <code>API.AsyncPermissionsUpdateInterval</code> and its relationship with the previous point.</li>
<li>The groups controller and arvados model still have some code related to the <code>async</code> parameter, but it seems to me that we don't really do anything with it anymore.</li>
</ul>
<p>(copied from <a class="issue tracker-2 status-3 priority-4 priority-default closed parent" title="Feature: Ability to make groups visible to all users (Resolved)" href="https://dev.arvados.org/issues/18277#note-16">#18277#note-16</a>)</p>

Arvados - Bug #18334 (New): Accept release info changes in docker recipes
https://dev.arvados.org/issues/18334 | 2021-11-04T15:02:18Z | Tom Clegg (tom@curii.com)
<p>In some circumstances, "apt-get update" stops working because a newer Debian release has changed the repository's release info (for example, the suite changing from "stable" to "oldstable").</p>
<p>This can break cmd/arvados-package tests.</p>
<pre>
$ docker run --rm -it arvados-package-deps-debian:10 bash
root@7d1560822db7:/# apt-get update
Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Reading package lists... Done
E: Repository 'http://security.debian.org/debian-security buster/updates InRelease' changed its 'Suite' value from 'stable' to 'oldstable'
N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details.
N: Repository 'http://deb.debian.org/debian buster InRelease' changed its 'Version' value from '10.9' to '10.11'
E: Repository 'http://deb.debian.org/debian buster InRelease' changed its 'Suite' value from 'stable' to 'oldstable'
N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details.
E: Repository 'http://deb.debian.org/debian buster-updates InRelease' changed its 'Suite' value from 'stable-updates' to 'oldstable-updates'
N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details.
</pre>
<p>Proposed fix: "apt-get --allow-releaseinfo-change update" in scripts.</p>

Arvados - Bug #18114 (New): [a-d-c] slow down retries when CreateInstance returns non-quota/non-t...
https://dev.arvados.org/issues/18114 | 2021-09-07T18:04:45Z | Tom Clegg (tom@curii.com)
<p>If we get an error from the cloud provider when trying to create an instance, but the error isn't recognized as a quota or rate-limiting error, we retry very quickly, which would be unhelpful for an error like "invalid instance type". We should consider other options, like a per-instance-type quiet period for unrecognized errors (which would probably be a better way to respond to InsufficientInstanceCapacity as well).</p>

Arvados - Bug #17878 (New): [container shell] confusing error "channel 3: bad ext data" when forw...
https://dev.arvados.org/issues/17878 | 2021-07-08T17:21:30Z | Tom Clegg (tom@curii.com)
<p>Currently, the "container shell" port forwarding feature only works on containers with network access enabled via <code>RuntimeConstraints: {API: true}</code> (aside: it would be possible for crunch-run to enable the required networking features on the fly to make it work on containers with <code>API: false</code>, but this is not yet implemented).</p>
<p>When the container does not have networking enabled, "arvados-client shell {uuid} -L1234:localhost:1234" at first appears to work (it doesn't complain or fail), but when a client tries to use this connection, mysterious error messages appear:</p>
<pre>
(on client host)
$ go tool pprof -http :6060 http://localhost:6000/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:6000/debug/pprof/goroutine
http://localhost:6000/debug/pprof/goroutine: Get "http://localhost:6000/debug/pprof/goroutine": EOF
(in arvados-client shell session)
channel 3: bad ext data
</pre>
<p>In this situation where port forwarding isn't supported, the SSH server should print a message on the client's stderr when first setting up the SSH connection with port forwarding enabled, and/or print a more helpful message on the client's stderr when attempting to connect to the forwarded port.</p>