Arvados: Issues https://dev.arvados.org/ https://dev.arvados.org/favicon.ico?1557688842 2024-03-25T16:52:07Z Arvados Redmine
Arvados - Bug #21618 (New): cloudtest should give up if test instance disappears from listing bef... https://dev.arvados.org/issues/21618 2024-03-25T16:52:07Z Tom Clegg tom@curii.com
<p>Currently, if an instance/image has a problem that causes it to shut down before responding to a boot probe, cloudtest keeps probing after the instance disappears from the listing, which is futile.</p>
Arvados - Bug #21617 (In Progress): Timeout error reading content from collection on a remote clu... https://dev.arvados.org/issues/21617 2024-03-25T14:43:50Z Tom Clegg tom@curii.com
In a 3-way federation with login cluster z1111:
<ul>
<li>a collection stored on z1111 can be read from z2222 (e.g., workbench.z2222/collections/z1111-4zz18-...)</li>
<li>a collection stored on z2222 cannot be read from z1111 (timeout)</li>
<li>a collection stored on z2222 cannot be read from z3333 (timeout)</li>
</ul>
<p>It looks like the intermediate cluster's keepstore process cannot retrieve the list of keep services from the cluster where the data is stored ("failed to validate remote token") -- this auto-retries in the background for a while, then eventually blockReadRemote gives up.</p>
<p>Manual testing, with jutro/tordo/pirca playing the roles of z1111/z2222/z3333, indicates the same problem existed before and after <a class="issue tracker-2 status-2 priority-4 priority-default parent" title="Feature: Keepstore can stream GET and PUT requests using keep-gateway API (In Progress)" href="https://dev.arvados.org/issues/2960">#2960</a> was merged and deployed to tordo.</p>
Arvados - Bug #21598 (In Progress): Local keepstore invoked by crunch-run should never do EmptyTr... https://dev.arvados.org/issues/21598 2024-03-15T18:32:48Z Tom Clegg tom@curii.com
<p>We don't want N compute nodes periodically checking expiry times on all of the trashed blocks on all backend volumes.</p>
Arvados - Bug #21417 (Resolved): Stop trying to read image timestamp from docker metadata in arv-... https://dev.arvados.org/issues/21417 2024-01-25T16:44:16Z Tom Clegg tom@curii.com
<p>This part of <a class="source" href="https://dev.arvados.org/projects/arvados/repository/arvados/entry/sdk/python/arvados/commands/keepdocker.py">source:sdk/python/arvados/commands/keepdocker.py</a> should go away so it doesn't crash on new image tarball formats:</p>
<pre>
json_file = image_tar.extractfile(image_tar.getmember(json_filename))
image_metadata = json.loads(json_file.read().decode('utf-8'))
json_file.close()
image_tar.close()
link_base = {'head_uuid': coll_uuid, 'properties': {}}
if 'created' in image_metadata:
    link_base['properties']['image_timestamp'] = image_metadata['created']
</pre>
<p>See <a class="issue tracker-6 status-3 priority-4 priority-default closed" title="Idea: test-provision-debian11 fails loading workflow Docker image (Resolved)" href="https://dev.arvados.org/issues/21408">#21408</a> for example.</p>
<p>(Tom & Peter discussed this offline and concluded that saving the image timestamp is not important enough to justify maintaining the code.)</p>
Arvados - Bug #21379 (Resolved): arv-user-activity crashes on file_download event for remote coll... https://dev.arvados.org/issues/21379 2024-01-12T19:37:43Z Tom Clegg tom@curii.com
<pre>
User activity on pirca between 2024-01-11 05:00 and 2024-01-12 05:00
Traceback (most recent call last):
File "/usr/bin/arv-user-activity", line 8, in <module>
sys.exit(main())
File "/usr/share/python3/dist/python3-arvados-user-activity/lib/python3.7/site-packages/arvados_user_activity/main.py", line 214, in main
getCollectionName(arv, e["properties"].get("collection_uuid"), e["properties"].get("portable_data_hash")),
File "/usr/share/python3/dist/python3-arvados-user-activity/lib/python3.7/site-packages/arvados_user_activity/main.py", line 111, in getCollectionName
u = arv.collections().list(filters=filters, order="created_at", limit=1).execute().get("items")
File "/usr/share/python3/dist/python3-arvados-user-activity/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/share/python3/dist/python3-arvados-user-activity/lib/python3.7/site-packages/googleapiclient/http.py", line 938, in execute
raise HttpError(resp, content, uri=self.uri)
arvados.errors.ApiError: <HttpError 400 when requesting https://pirca.arvadosapi.com/arvados/v1/collections?filters=%5B%5B%22uuid%22%2C+%22%3D%22%2C+%22<a href="https://arvadosapi.com/tordo-4zz18-kaaj8hjcnqb8i0p">tordo-4zz18-kaaj8hjcnqb8i0p</a>%22%5D%5D&order=created_at&limit=1&alt=json returned "cannot execute federated list query unless count=="none"">
</pre>
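<p>Taken together, the error above and the one in the second traceback below suggest that a federated lookup by UUID needs <code>count="none"</code> and must not pass <code>order</code>. A minimal sketch of building such arguments follows. <code>collection_lookup_kwargs</code> is a hypothetical helper, not necessarily the fix that was merged, and it assumes that ordering is unnecessary when filtering on a single UUID.</p>

```python
def collection_lookup_kwargs(collection_uuid):
    """Build keyword arguments for collections().list() for one UUID.

    Hypothetical helper: it only encodes the constraints named in the two
    error messages (count must be "none"; no order parameter).
    """
    return {
        "filters": [["uuid", "=", collection_uuid]],
        # A UUID filter matches at most one collection, so sorting adds nothing.
        "limit": 1,
        "count": "none",
    }


kwargs = collection_lookup_kwargs("tordo-4zz18-kaaj8hjcnqb8i0p")
# The call site would then be: arv.collections().list(**kwargs).execute()
```

<p>Using the same arguments for local and remote UUIDs avoids having to decide which cluster owns the collection before making the call.</p>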
<pre>
User activity on tordo between 2024-01-11 19:33 and 2024-01-12 19:33
Traceback (most recent call last):
File "/tmp/venv/bin/arv-user-activity", line 8, in <module>
sys.exit(main())
File "/tmp/venv/lib/python3.9/site-packages/arvados_user_activity/main.py", line 214, in main
getCollectionName(arv, e["properties"].get("collection_uuid"), e["properties"].get("portable_data_hash")),
File "/tmp/venv/lib/python3.9/site-packages/arvados_user_activity/main.py", line 111, in getCollectionName
u = arv.collections().list(filters=filters, order="created_at", limit=1, count="none").execute().get("items")
File "/tmp/venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/tmp/venv/lib/python3.9/site-packages/googleapiclient/http.py", line 938, in execute
raise HttpError(resp, content, uri=self.uri)
arvados.errors.ApiError: <HttpError 400 when requesting https://tordo.arvadosapi.com/arvados/v1/collections?filters=%5B%5B%22uuid%22%2C+%22%3D%22%2C+%22<a href="https://arvadosapi.com/pirca-4zz18-tsiyvmfkr2gub8w">pirca-4zz18-tsiyvmfkr2gub8w</a>%22%5D%5D&order=created_at&limit=1&count=none&alt=json returned "cannot execute federated list query with limit (1) < nUUIDs (1), offset (0) > 0, or order ([created_at]) parameter">
</pre>
Arvados - Bug #21319 (New): Avoid waiting/deadlock when a controller handler performs subrequests... https://dev.arvados.org/issues/21319 2023-12-27T23:26:44Z Tom Clegg tom@curii.com
Arvados - Bug #21314 (New): a-d-c should cancel a container if it can't be loaded https://dev.arvados.org/issues/21314 2023-12-21T16:55:13Z Tom Clegg tom@curii.com
<p>If a container's "mounts" field is invalid, a-d-c logs this, and keeps trying.</p>
<code class="json syntaxhl"><span class="p">{</span><span class="nl">"ClusterID"</span><span class="p">:</span><span class="s2">"irdev"</span><span class="p">,</span><span class="nl">"ContainerUUID"</span><span class="p">:</span><span class="s2">"<a href="https://arvadosapi.com/xxxxx-dz642-xxxxxxxxxxxxxxx">xxxxx-dz642-xxxxxxxxxxxxxxx</a>"</span><span class="p">,</span><span class="nl">"PID"</span><span class="p">:</span><span class="mi">2037423</span><span class="p">,</span><span class="nl">"error"</span><span class="p">:</span><span class="s2">"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount"</span><span class="p">,</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"warning"</span><span class="p">,</span><span class="nl">"msg"</span><span class="p">:</span><span class="s2">"error getting mounts"</span><span class="p">,</span><span class="nl">"time"</span><span class="p">:</span><span class="s2">"2023-12-13T20:34:41.064140517Z"</span><span class="p">}</span><span class="w">
</span></code>
<p>In this situation, the offending container should be cancelled.</p>
Arvados - Bug #21285 (Resolved): Add MaxGatewayTunnels config, separate from MaxConcurrentRequests https://dev.arvados.org/issues/21285 2023-12-11T17:42:08Z Tom Clegg tom@curii.com
<p>Currently N running containers will start N gateway tunnels that occupy N of the MaxConcurrentRequests slots, even though they don't use the resources that MaxConcurrentRequests is meant to protect (mainly RailsAPI and PostgreSQL). Since each one stays open for the entire duration of the respective container, these tunnel connections can easily consume most/all of the MaxConcurrentRequests slots, leaving none for workbench2 or even other API calls from the containers themselves.</p>
<p>To address this, we should add a separate MaxGatewayTunnels config. Incoming gateway_tunnel requests should not occupy MaxConcurrentRequests slots. After reaching the MaxGatewayTunnels limit, additional gateway_tunnel requests should return 503 immediately rather than wait in a queue. Crunch-run should delay and retry when this happens.</p>
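<p>The "return 503 immediately rather than wait in a queue" behaviour can be illustrated with a non-blocking semaphore. This is a sketch in Python for illustration only, not the controller's implementation; <code>TunnelLimiter</code>, <code>handle_gateway_tunnel</code> and <code>serve_tunnel</code> are hypothetical names.</p>

```python
import threading


class TunnelLimiter:
    """Caps concurrent tunnels without queueing (hypothetical helper)."""

    def __init__(self, max_gateway_tunnels):
        self._slots = threading.BoundedSemaphore(max_gateway_tunnels)

    def try_acquire(self):
        # Non-blocking: returns False immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle_gateway_tunnel(limiter, serve_tunnel):
    """Return 503 at once if the limit is reached, else run the tunnel."""
    if not limiter.try_acquire():
        return 503
    try:
        # The slot stays occupied for as long as the tunnel is open.
        return serve_tunnel()
    finally:
        limiter.release()
```

<p>Because the limiter is separate from whatever guards ordinary API requests, long-lived tunnels never consume the slots that protect RailsAPI and PostgreSQL.</p>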
<p>Nginx and load balancers will be expected to allow {MaxConcurrentRequests + MaxQueuedRequests + MaxGatewayTunnels} concurrent requests. Documentation and installer should be updated accordingly.</p>
Arvados - Bug #21252 (Closed): retryablehttp PR to avoid retrying "net/http: invalid header" https://dev.arvados.org/issues/21252 2023-12-01T16:03:17Z Tom Clegg tom@curii.com
Arvados - Bug #21187 (New): a-c-r should detect and warn when arv:IntermediateOutput outputTTL is... https://dev.arvados.org/issues/21187 2023-11-09T19:31:33Z Tom Clegg tom@curii.com
<p>Currently, if outputTTL is set too low and a workflow tries to use intermediate data after it has already been trashed, a-c-r may read an intermediate collection manifest successfully (before trash time) but then fail to save it later (after trash time) in a combined collection. In that case the user ends up getting a Python stack trace ending in a 403 error (invalid blob signature).</p>
<p>a-c-r should warn the user when the duration the current workflow has been running exceeds outputTTL. This is probably a good indicator that the user should increase outputTTL, even if it hasn't actually broken anything yet.</p>
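<p>The runtime check described above could look roughly like the following. This is a sketch only: <code>output_ttl_warning</code> is a hypothetical function, and it assumes outputTTL is expressed in seconds and that a non-positive value means intermediates are kept indefinitely.</p>

```python
from datetime import datetime, timedelta, timezone


def output_ttl_warning(started_at, now, output_ttl_seconds):
    """Return a warning string if the run has outlasted outputTTL, else None."""
    if output_ttl_seconds <= 0:
        # Assumption: a non-positive TTL means intermediates are never trashed.
        return None
    elapsed = (now - started_at).total_seconds()
    if elapsed <= output_ttl_seconds:
        return None
    return (
        f"workflow has been running for {int(elapsed)}s, longer than "
        f"outputTTL ({output_ttl_seconds}s); intermediate outputs may "
        "already have been trashed"
    )


# Example: a run that started 90 seconds ago with outputTTL set to 60 seconds.
start = datetime(2023, 11, 9, 19, 0, 0, tzinfo=timezone.utc)
message = output_ttl_warning(start, start + timedelta(seconds=90), 60)
```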
<p>a-c-r should also report a more helpful error message when it fails to create a collection due to expired blob signatures. This could be done by checking for a 403 error from the create call and/or checking the expiry times (given as hexadecimal unix times) on the blob signatures in the manifest text.</p>
Arvados - Bug #21184 (Resolved): Fix build pipeline for debian 11 https://dev.arvados.org/issues/21184 2023-11-08T20:19:52Z Tom Clegg tom@curii.com
Arvados - Bug #21169 (Resolved): Fix deprecated ERB usage in account setup email view https://dev.arvados.org/issues/21169 2023-11-02T18:02:45Z Tom Clegg tom@curii.com
<p>As of <a class="issue tracker-6 status-3 priority-4 priority-default closed parent" title="Idea: Support Ubuntu 22.04 LTS (Resolved)" href="https://dev.arvados.org/issues/20846">#20846</a>, testing services/api in Ruby 3 gave the following warnings.</p>
<pre>
/home/tom/arvados/services/api/app/views/user_notifier/account_is_setup.text.erb:5: warning: Passing safe_level with the 2nd argument of ERB.new is de\
precated. Do not use it, and specify other arguments as keyword arguments.
/home/tom/arvados/services/api/app/views/user_notifier/account_is_setup.text.erb:5: warning: Passing trim_mode with the 3rd argument of ERB.new is dep\
recated. Use keyword argument like ERB.new(str, trim_mode: ...) instead.
</pre>
<p>However, if we do the obvious thing:</p>
<pre><code class="diff syntaxhl"><span class="gh">diff --git a/services/api/app/views/user_notifier/account_is_setup.text.erb b/services/api/app/views/user_notifier/account_is_setup.text.erb
index 352ee7754e..e6349922fa 100644
</span><span class="gd">--- a/services/api/app/views/user_notifier/account_is_setup.text.erb
</span><span class="gi">+++ b/services/api/app/views/user_notifier/account_is_setup.text.erb
</span><span class="p">@@ -2,4 +2,4 @@</span>
SPDX-License-Identifier: AGPL-3.0 %>
-<%= ERB.new(Rails.configuration.Users.UserSetupMailText, 0, "-").result(binding) %>
<span class="gi">+<%= ERB.new(Rails.configuration.Users.UserSetupMailText, safe_level: 0, trim_mode: "-").result(binding) %>
</span></code></pre>
<p>The result is:</p>
<pre>
UserNotifierTest#test_account_is_setup = 0.54 s = E
Error:
UserNotifierTest#test_account_is_setup:
ActionView::Template::Error: unknown keyword: :safe_level
app/views/user_notifier/account_is_setup.text.erb:5:in `new'
app/views/user_notifier/account_is_setup.text.erb:5
app/mailers/user_notifier.rb:14:in `account_is_setup'
test/unit/user_notifier_test.rb:36:in `block in <class:UserNotifierTest>'
</pre>
<p>Keeping <code>trim_mode: "-"</code> and removing <code>safe_level: 0</code> makes the errors and warnings go away, but what are the other implications of removing that?</p>
Arvados - Bug #21134 (New): Fix proxy error logging in controller's container log handler https://dev.arvados.org/issues/21134 2023-10-20T15:25:18Z Tom Clegg tom@curii.com
<p>Currently, when controller gets a non-HTTP error while trying to proxy a request to keep-web, the error gets logged using stdlib <code>log.Print()</code> instead of structured logs:</p>
<pre>
{"ClusterID":"2xpu4","PID":22503,"RequestID":"req-1bqnt5n2ozwpcscdjc94","level":"info","msg":"request","remoteAddr":"127.0.0.1:38144","reqBytes":0,"reqForwardedFor":"XXX","reqHost":"2xpu4.arvadosapi.com","reqMethod":"PROPFIND","reqPath":"arvados/v1/container_requests/XXX/log/XXX","reqQuery":"","time":"2023-10-20T14:56:54.866512957Z"}
2023/10/20 14:56:54 http: proxy error: dial tcp 127.0.0.1:9002: connect: connection refused
{"ClusterID":"2xpu4","PID":22503,"RequestID":"req-1bqnt5n2ozwpcscdjc94","level":"info","msg":"response","priority":1,"remoteAddr":"127.0.0.1:38144","reqBytes":0,"reqForwardedFor":"XXX","reqHost":"2xpu4.arvadosapi.com","reqMethod":"PROPFIND","reqPath":"arvados/v1/container_requests/XXX/log/XXX","reqQuery":"","respBody":"","respBytes":0,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2023-10-20T14:56:54.881886673Z","timeToStatus":0.015358,"timeTotal":0.015365,"timeWriteBody":0.000007,"tokenUUIDs":["XXX"]}
</pre>
<p>The error should be returned to the client in the 502 response body and, ideally, in a field in the "response" log entry.</p>
Arvados - Bug #21086 (Resolved): sdk/go/arvados should use TLS certificates from /etc/arvados/ca-... https://dev.arvados.org/issues/21086 2023-10-17T15:46:11Z Tom Clegg tom@curii.com
<p>Move the "choose a cert source" logic from <code>sdk/go/arvadosclient</code> to <code>sdk/go/arvados</code> and make sure both libraries use it.</p>
Arvados - Bug #20804 (New): crunchstat-summary should use container logs API, not CollectionReade... https://dev.arvados.org/issues/20804 2023-07-31T17:36:21Z Tom Clegg tom@curii.com
<p>Currently crunchstat-summary uses CollectionReader to read crunchstat logs for finished containers, and uses the "logs" API for unfinished containers. That "logs" data will soon be unavailable, so crunchstat-summary will lose the ability to make graphs/stats for containers while they are still running.</p>
<p>We can fix this by replacing both "get crunchstat logs via CollectionReader" and "...via logs API" code paths with "get crunchstat logs via container logs API".</p>
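<p>One way to picture the single code path is a reader that takes the log-fetching function as a parameter, so running and finished containers are handled identically. This is a sketch under assumptions: <code>crunchstat_lines</code> and <code>fetch_log_text</code> are hypothetical names, the <code>crunchstat.txt</code> filename is assumed, and the actual container logs API call is deliberately left out.</p>

```python
def crunchstat_lines(fetch_log_text, container_uuid, filename="crunchstat.txt"):
    """Yield crunchstat log lines for a container, running or finished.

    fetch_log_text is a hypothetical callable wrapping the container logs
    API; it takes (container_uuid, filename) and returns the text so far.
    """
    for line in fetch_log_text(container_uuid, filename).splitlines():
        if line.strip():
            yield line


def fake_fetch(container_uuid, filename):
    # Stand-in for the real API call, used only to exercise the reader.
    return "first sample\n\nsecond sample\n"


lines = list(crunchstat_lines(fake_fetch, "xxxxx-dz642-xxxxxxxxxxxxxxx"))
```

<p>Injecting the fetcher keeps the parsing and graphing code independent of how the log text is retrieved.</p>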