Arvados: Issues https://dev.arvados.org/ https://dev.arvados.org/favicon.ico?1557688842 2024-03-27T15:52:03Z Arvados Redmine
Arvados - Bug #21622 (New): Mail delivery failure should not cause API calls to fail https://dev.arvados.org/issues/21622 2024-03-27T15:52:03Z Peter Amstutz peter.amstutz@curii.com
Arvados - Bug #21618 (New): cloudtest should give up if test instance disappears from listing bef... https://dev.arvados.org/issues/21618 2024-03-25T16:52:07Z Tom Clegg tom@curii.com
<p>Currently, if an instance/image has a problem that causes it to shut down before responding to a boot probe, cloudtest keeps probing after the instance disappears from the listing, which is futile.</p>
Arvados - Bug #21617 (In Progress): Timeout error reading content from collection on a remote clu... https://dev.arvados.org/issues/21617 2024-03-25T14:43:50Z Tom Clegg tom@curii.com
In a 3-way federation with login cluster z1111:
<ul>
<li>a collection stored on z1111 can be read from z2222 (e.g., workbench.z2222/collections/z1111-4zz18-...)</li>
<li>a collection stored on z2222 cannot be read from z1111 (timeout)</li>
<li>a collection stored on z2222 cannot be read from z3333 (timeout)</li>
</ul>
<p>It looks like the intermediate cluster's keepstore process cannot retrieve the list of keep services from the cluster where the data is stored ("failed to validate remote token") -- this auto-retries in the background for a while, then eventually blockReadRemote gives up.</p>
<p>Manual testing, with jutro/tordo/pirca playing the roles of z1111/z2222/z3333, indicates the same problem existed before and after <a class="issue tracker-2 status-2 priority-4 priority-default parent" title="Feature: Keepstore can stream GET and PUT requests using keep-gateway API (In Progress)" href="https://dev.arvados.org/issues/2960">#2960</a> was merged and deployed to tordo.</p>
Arvados - Bug #21612 (New): a-c-r with --debug can try to log entire input/output objects, which ... https://dev.arvados.org/issues/21612 2024-03-20T20:22:22Z Brett Smith brett.smith@curii.com
<p>User got this error while running aws-s3-bulk-download.cwl with >6K input URLs, using <code>a-c-r --submit --debug</code>.</p>
<p>I don't think it actually interfered with the workflow's run at all, but it clogs the logs and looks scary.</p>
<p>IMO a-c-r (along with the rest of our code) should not try to log data that can be arbitrarily large.</p>
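<p>One way to keep debug output bounded is to cap the serialized text before it reaches the logger. The following is a minimal sketch under stated assumptions, not a patch for cwltool or a-c-r: the helper names <code>bounded_json</code> and <code>log_output</code>, the logger name, and the 2000-character limit are all hypothetical.</p>

```python
import json
import logging

logger = logging.getLogger("example")  # hypothetical logger name


def bounded_json(obj, limit=2000):
    """Serialize obj to JSON and cap the result at roughly `limit` characters."""
    text = json.dumps(obj, indent=4)
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}... [{omitted} more characters omitted]"


def log_output(name, obj):
    # Skip serialization entirely unless debug logging is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("[%s] produced output %s", name, bounded_json(obj))
```

<p>This sketch still serializes the whole object before truncating it, so it bounds what is written to the log stream but not the work done to build the message.</p>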
<p>Three instances where this came up:</p>
<pre>
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.7/logging/__init__.py", line 1037, in emit
stream.write(msg + self.terminator)
BlockingIOError: [Errno 11] write could not complete without blocking
Call stack:
File "/usr/bin/arvados-cwl-runner", line 8, in <module>
sys.exit(main())
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/__init__.py", line 440, in main
input_required=not workflow_op)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/main.py", line 1302, in main
tool, initialized_job_order_object, runtimeContext, logger=_logger
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 874, in arv_executor
self.start_run(runnable, runtimeContext)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 248, in start_run
self.workflow_eval_lock, self.stop_polling)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/task_queue.py", line 85, in add
task()
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 202, in run
self.output_callback(cast(Optional[CWLObjectType], ev), "success")
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 321, in wrapped_callback
cb(obj, st)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 429, in receive_output
output_callback(output, processStatus)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 564, in receive_output
_logger.debug("[%s] produced output %s", step.name, json_dumps(jobout, indent=4))
</pre>
<pre>--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.7/logging/__init__.py", line 1037, in emit
stream.write(msg + self.terminator)
BlockingIOError: [Errno 11] write could not complete without blocking
Call stack:
File "/usr/bin/arvados-cwl-runner", line 8, in <module>
sys.exit(main())
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/__init__.py", line 440, in main
input_required=not workflow_op)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/main.py", line 1302, in main
tool, initialized_job_order_object, runtimeContext, logger=_logger
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 863, in arv_executor
for runnable in jobiter:
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 175, in job
yield from job.job(builder.job, output_callbacks, runtimeContext)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 821, in job
for newjob in step.iterable:
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 751, in try_make_job
yield from jobs
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 77, in job
yield from self.step.job(joborder, output_callback, runtimeContext)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 462, in job
runtimeContext,
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 175, in job
yield from job.job(builder.job, output_callbacks, runtimeContext)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 821, in job
for newjob in step.iterable:
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 735, in try_make_job
json_dumps(inputobj, indent=4),
</pre>
<pre>--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.7/logging/__init__.py", line 1037, in emit
stream.write(msg + self.terminator)
BlockingIOError: [Errno 11] write could not complete without blocking
Call stack:
File "/usr/bin/arvados-cwl-runner", line 8, in <module>
sys.exit(main())
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/__init__.py", line 440, in main
input_required=not workflow_op)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/main.py", line 1302, in main
tool, initialized_job_order_object, runtimeContext, logger=_logger
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 874, in arv_executor
self.start_run(runnable, runtimeContext)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 248, in start_run
self.workflow_eval_lock, self.stop_polling)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/task_queue.py", line 85, in add
task()
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 202, in run
self.output_callback(cast(Optional[CWLObjectType], ev), "success")
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 321, in wrapped_callback
cb(obj, st)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 429, in receive_output
output_callback(output, processStatus)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 582, in receive_output
self.do_output_callback(final_output_callback)
File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow_job.py", line 541, in do_output_callback
_logger.debug("[%s] outputs %s", self.name, json_dumps(wo, indent=4))
</pre>
Arvados - Bug #21607 (New): arv-mount memory usage grows over time https://dev.arvados.org/issues/21607 2024-03-19T13:15:14Z Peter Amstutz peter.amstutz@curii.com
<p>arv-mount releases metadata (collection and project listings) for files and directories that haven't been used recently to prevent unlimited memory growth.</p>
<p>Ideally it should reach a ceiling and then level off as new stuff replaces the memory used by old stuff. However, in the current version, memory usage still creeps up.</p>
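<p>One way to check whether usage levels off or keeps creeping is to compare heap snapshots taken some time apart with Python's standard <code>tracemalloc</code> module. This is a generic sketch, not arv-mount code: the helper name <code>top_growth</code> is hypothetical, and the list of retained buffers is only a stand-in for a leak.</p>

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the `limit` source lines whose allocated size grew the most."""
    stats = after.compare_to(before, "lineno")
    growing = [s for s in stats if s.size_diff > 0]
    growing.sort(key=lambda s: s.size_diff, reverse=True)
    return growing[:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    retained = [bytes(1024) for _ in range(1000)]  # stand-in for leaked objects
    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
```

<p>Lines that keep appearing at the top across successive snapshots are candidates for objects being held past their intended lifetime.</p>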
<p>arv-mount would benefit from additional debugging and memory profiling to determine if there are objects being held past their intended lifetime.</p>
Arvados - Bug #21601 (In Progress): fpm virtualenv packages not using branch versions for depende... https://dev.arvados.org/issues/21601 2024-03-15T20:38:09Z Peter Amstutz peter.amstutz@curii.com
<p><a class="external" href="https://dev.arvados.org/issues/19744#note-30">https://dev.arvados.org/issues/19744#note-30</a></p>
<p>The python3-arvados-cwl-runner_2.8.0~dev20240314145937-1_amd64.deb package has arvados-python-client 2.7.1 and crunchstat-summary 2.7.1, when it should have the dev versions from the same commit.</p>
<p>I went back and looked at earlier packages: python3-arvados-cwl-runner_2.7.1~rc3-1_amd64.deb has arvados-python-client 2.7.1rc3 (as expected) and python3-arvados-cwl-runner_2.7.0~dev20230908133938-1_amd64.deb has arvados-python-client 2.7.0.dev20230908133938 (also as expected).</p>
<p>My current theory is that this behavior got lost in the changes made in 20846-package-build-fixes, but I need to find out how it worked before.</p>
Arvados - Bug #21598 (In Progress): Local keepstore invoked by crunch-run should never do EmptyTr... https://dev.arvados.org/issues/21598 2024-03-15T18:32:48Z Tom Clegg tom@curii.com
<p>We don't want N compute nodes periodically checking expiry times on all of the trashed blocks on all backend volumes.</p>
Arvados - Bug #21583 (In Progress): Running RailsAPI with Passenger implicitly requires Ruby 3.3 v... https://dev.arvados.org/issues/21583 2024-03-13T11:03:08Z Brett Smith brett.smith@curii.com
<p>Some useful background:</p>
<ul>
<li>base64 has been a default Gem for a while, but it will not be included in Ruby 3.4, and Ruby 3.3 warns you about this.</li>
<li>To make the warning go away, libraries have started declaring a dependency on the base64 gem. The library that's relevant to our story is <a href="https://github.com/lostisland/faraday/commit/ea30bd0b543882f1cf26e75ac4e46e0705fa7e68" class="external">faraday</a>, which is used by our ruby-google-api-client fork.</li>
<li>The current version of base64 is 0.2.0. This version is included in Ruby 3.3. Older versions of Ruby we support have 0.1.1.</li>
</ul>
<p>With this background, when we started using our ruby-google-api-client fork in RailsAPI in <a class="changeset" title="21384: Update arvados-google-api-client in RailsAPI There's no functional need for this. The mai..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cbfdb1b66ab9c1b6e69d1c9cd589633386267177">cbfdb1b66ab9c1b6e69d1c9cd589633386267177</a>, <a class="source" href="https://dev.arvados.org/projects/arvados/repository/arvados/entry/services/api/Gemfile.lock">source:services/api/Gemfile.lock</a> gained a dependency on base64 0.2.0. This works fine in development, because Bundler can load this newer version before any code requires it.</p>
<p>However, Passenger loads the base64 Gem <em>before</em> it starts anything related to your application, including Bundler. Because of this, running RailsAPI behind Passenger with Ruby<3.3 now fails with this error in the Passenger log:</p>
<pre>[ E 2024-03-12 15:12:44.8347 907382/Tf age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/arvados-api/current: The application encountered the following error: You have already activated base64 0.1.1, but your Gemfile requires base64 0.2.0. Since base64 is a default gem, you can either remove your dependency on it or try updating to a newer version of bundler that supports base64 as a default gem. (Gem::LoadError)
</pre>
<p>This is the root cause of <a class="issue tracker-1 status-1 priority-4 priority-default parent" title="Bug: test-provision-ubuntu2004 intermittently times out waiting for the controller to come up (New)" href="https://dev.arvados.org/issues/21524">#21524</a>. Note how test-provision-ubuntu2004 started failing immediately after <a class="changeset" title="21384: Update arvados-google-api-client in RailsAPI There's no functional need for this. The mai..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cbfdb1b66ab9c1b6e69d1c9cd589633386267177">cbfdb1b66ab9c1b6e69d1c9cd589633386267177</a>:</p>
<p><img src="https://dev.arvados.org/attachments/download/3541/clipboard-202403130702-u2cmy.png" alt="" /></p>
<p>What can we do?</p>
<p>There is no single version of base64 that we can lock to that will keep everyone happy. The current lock breaks Ruby<3.3. If we change the lock to base64 0.1.1, we'll break Ruby>=3.3.</p>
<p>We cannot address the problem indirectly by tweaking our faraday dependency. We need version ~>2.8.0 to keep compatibility with the range of Ruby versions we're trying to support, and all those releases declare the base64 dependency.</p>
<p><a href="https://myrtana.sk/articles/my-passenger-was-really-old" class="external">This random blog post with cool styling</a> says you can upgrade Passenger, but note they upgrade Passenger to the version in bookworm, which is the version I'm testing and have reproduced this problem with:</p>
<pre>% apt list --installed '*passenger*'
libnginx-mod-http-passenger/bookworm,now 1:6.0.20-1~bookworm1 amd64 [installed]
passenger-dev/bookworm,now 1:6.0.20-1~bookworm1 amd64 [installed,automatic]
passenger-doc/bookworm,now 1:6.0.20-1~bookworm1 all [installed,automatic]
passenger/bookworm,now 1:6.0.20-1~bookworm1 amd64 [installed,automatic]
</pre>
<p>We could just revert <a class="changeset" title="21384: Update arvados-google-api-client in RailsAPI There's no functional need for this. The mai..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cbfdb1b66ab9c1b6e69d1c9cd589633386267177">cbfdb1b66ab9c1b6e69d1c9cd589633386267177</a>. That has all the downsides implied by the commit message, but it would work.</p>
<p>We can just cheat and remove the lock by hand, but then we have to remember to keep doing that every time we update a RailsAPI gem for as long as we support Ruby<3.3. That sucks. We could write an extremely stupid test to help us remember this I guess.</p>
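<p>A reminder test of that kind could be as small as the sketch below. It is written in Python for illustration rather than in the RailsAPI test suite, and it rests on a stated assumption: that a locked gem shows up in <code>Gemfile.lock</code> as a four-space-indented <code>name (version)</code> line. The helper name <code>locked_version</code> is hypothetical.</p>

```python
import re

# Assumption: a locked gem appears in Gemfile.lock's "specs:" section as a
# line indented by exactly four spaces, for example "    base64 (0.2.0)".
LOCKED_SPEC = re.compile(
    r"^ {4}(?P<name>\S+) \((?P<version>[^)]+)\)$", re.MULTILINE
)


def locked_version(lockfile_text, gem_name):
    """Return the version locked for gem_name, or None if it has no spec entry."""
    for match in LOCKED_SPEC.finditer(lockfile_text):
        if match.group("name") == gem_name:
            return match.group("version")
    return None
```

<p>A test would then read <code>services/api/Gemfile.lock</code> and fail whenever <code>locked_version(text, "base64")</code> is not <code>None</code>.</p>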
<p>🤷</p>
Arvados - Bug #21580 (New): Public Favorites renders inconsistently https://dev.arvados.org/issues/21580 2024-03-12T16:29:20Z Lisa Knox
<p>The Public Favorites view seems to load a different (or incomplete?) list of items on the first load than on the second load.</p>
<ol>
<li>Start from Home Projects</li>
<li>Click "Public Favorites" and note the names in the Data Explorer</li>
<li>Click "Public Favorites" again and the names will be different</li>
</ol>
Arvados - Bug #21575 (New): Project test "creates a project from the context menu in the correct ... https://dev.arvados.org/issues/21575 2024-03-05T18:23:00Z Brett Smith brett.smith@curii.com
<pre> Project tests
✓ creates a new project with multiple properties (13675ms)
✓ creates a project without and with description (17398ms)
1) creates a project from the context menu in the correct subfolder
✓ shows the appropriate buttons in the multiselect toolbar (8915ms)
✓ creates new project on home project and then a subproject inside it (8872ms)
✓ attempts to use a preexisting name creating a project (5101ms)
✓ navigates to the parent project after trashing the one being displayed (4993ms)
✓ resets the search box only when navigating out of the current project (4871ms)
✓ navigates to the root project after trashing the parent of the one being displayed (5556ms)
✓ shows details panel when clicking on the info icon (3084ms)
✓ clears search input when changing project (3421ms)
✓ opens advanced popup for project with username (3320ms)
✓ copies project URL to clipboard (7745ms)
✓ sorts displayed items correctly (7721ms)
Frozen projects
✓ should be able to freeze own project (3351ms)
✓ should not be able to modify items within the frozen project (4503ms)
✓ should be able to freeze not owned project (2892ms)
✓ should be able to unfreeze project if user is an admin (4875ms)
17 passing (2m)
1 failing
1) Project tests
creates a project from the context menu in the correct subfolder:
AssertionError: Timed out retrying after 4000ms: Expected to find content: 'Test project (211370)' within the element: [ <tr.MuiTableRow-root-856.MuiTableRow-hover-858>, 2 more... ] but never did.
at Context.eval (https://127.0.0.1:56979/__cypress/tests?p=cypress/integration/project.spec.js:427:48)
</pre>
Arvados - Bug #21571 (New): Documentation should call it "arv-mount" rather than "FUSE Driver" https://dev.arvados.org/issues/21571 2024-03-04T17:09:03Z Brett Smith brett.smith@curii.com
<p>"FUSE Driver" is a meaningless name to people who don't know what "FUSE" is, which is most people. The documentation should refer to the tool as "arv-mount" as much as possible, since that's a distinctive tool name and more people understand what a "mount" is generally (not a ton more, but still). If necessary the documentation can explain that arv-mount is implemented using FUSE, but that shouldn't be an identifier.</p>
Arvados - Bug #21570 (New): Remove CentOS install instructions/support claims from our documentation https://dev.arvados.org/issues/21570 2024-03-04T16:50:54Z Brett Smith brett.smith@curii.com
<p>As of Arvados 3.0 we no longer support CentOS 7. <a href="https://blog.centos.org/2020/12/future-is-centos-stream/" class="external">Support for CentOS 8 ended in 2021.</a> Basically, anybody using CentOS today is using CentOS Stream, which moves very differently from the RHEL-compatible distros and which we don't build packages for. Accordingly, remove all references to CentOS install support from our documentation.</p>
Arvados - Bug #21568 (In Progress): arv-mount double free or corruption with many concurrent acce... https://dev.arvados.org/issues/21568 2024-03-01T21:34:40Z Brett Smith brett.smith@curii.com
<p>Steps to reproduce:</p>
<ul>
<li>Write Arvados <code>settings.conf</code> pointed at pirca with an admin token.</li>
<li><code>arv-mount --foreground --shared --directory-cache=SIZE MOUNT_PATH</code> - I have been able to reproduce with sizes as low as 2MiB and as high as 1GiB, so I suspect the size doesn't matter.</li>
<li>Start a ~simultaneous <code>ls -lR MOUNT_PATH/SUBDIR</code> process for each <code>SUBDIR</code> under <code>MOUNT_PATH</code> - as I write this, I see 179</li>
</ul>
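<p>The last step above can be sketched roughly as follows. This is not the attached reproduction script, only an illustration of starting one listing process per subdirectory at about the same time; the function name <code>stress_listing</code> is hypothetical.</p>

```python
import os
import subprocess


def stress_listing(mount_path):
    """Run one `ls -lR` per subdirectory of mount_path at roughly the same time.

    Returns a dict mapping each subdirectory name to its exit code.
    """
    procs = {}
    for entry in sorted(os.listdir(mount_path)):
        path = os.path.join(mount_path, entry)
        if os.path.isdir(path):
            # Start every process before waiting on any of them.
            procs[entry] = subprocess.Popen(
                ["ls", "-lR", path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
    return {name: proc.wait() for name, proc in procs.items()}
```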
<p>Reproduction script attached. I was trying to reproduce <a class="issue tracker-1 status-2 priority-4 priority-default parent" title="Bug: arv-mount KeyError during cap_cache - Seemingly lost track of parent inode (In Progress)" href="https://dev.arvados.org/issues/21541">#21541</a>, but instead arv-mount was killed by SIGABRT:</p>
<pre>Mar 01 16:14:23 arv-mount[1131697]: double free or corruption (out)
Mar 01 16:14:23 systemd[2070]: arv-mount-stress-_hkr5ofm.service: Main process exited, code=killed, status=6/ABRT
Mar 01 16:14:23 systemd[2070]: arv-mount-stress-_hkr5ofm.service: Failed with result 'signal'.
</pre>
<p>Not attaching logs because they contain a lot of private data.</p>
Arvados - Bug #21547 (New): return certain database errors as 500 so they can be retried https://dev.arvados.org/issues/21547 2024-02-27T19:19:14Z Peter Amstutz peter.amstutz@curii.com
<p>Certain database errors represent transient errors. We should tell the client to retry the request by returning a 500 internal server error instead of 422 (which is the default behavior).</p>
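<p>The reason the status code matters can be seen from the client side: a typical retry loop retries server errors and treats 4xx responses as final. The following is a minimal sketch of such a policy, not the Arvados SDK's actual retry logic; the set of retryable codes and the name <code>call_with_retries</code> are assumptions.</p>

```python
import time

# Assumption: the client treats these server-side statuses as transient.
RETRYABLE = {500, 502, 503, 504}


def call_with_retries(request, attempts=3, delay=0.0):
    """Call request() until it returns a non-retryable status or attempts run out.

    `request` is any zero-argument callable returning an HTTP status code.
    """
    status = None
    for attempt in range(attempts):
        status = request()
        if status not in RETRYABLE:
            return status
        if attempt < attempts - 1:
            # Exponential backoff between attempts.
            time.sleep(delay * (2 ** attempt))
    return status
```

<p>Under this policy a deadlock reported as 422 is never retried, while the same deadlock reported as 500 gets further attempts.</p>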
<p>#<ActiveRecord::Deadlocked: PG::TRDeadlockDetected: ERROR: deadlock detected></p>
<p>Rationale: The observed deadlocks in Arvados are conflicts between two statements (a lock ordering issue), so unwinding and retrying is a reasonable solution.</p>
<p>#<ActiveRecord::StatementInvalid: PG::UnableToSend></p>
<p>Rationale: It seems this gets thrown when the API server can't connect to the database.</p>
<p>Here's the list of postgres errors known to the PG gem:</p>
<p><a class="external" href="https://github.com/ged/ruby-pg/blob/daec80f91b9519509ca1694a231f11a75cb43f7f/ext/errorcodes.def#L598">https://github.com/ged/ruby-pg/blob/daec80f91b9519509ca1694a231f11a75cb43f7f/ext/errorcodes.def#L598</a></p>
<p><a class="external" href="https://github.com/ged/ruby-pg/blob/daec80f91b9519509ca1694a231f11a75cb43f7f/ext/pg_errors.c#L88">https://github.com/ged/ruby-pg/blob/daec80f91b9519509ca1694a231f11a75cb43f7f/ext/pg_errors.c#L88</a></p>
<p>Some other possible Exceptions to retry:</p>
<p>ConnectionBad<br />ConnectionException<br />ConnectionDoesNotExist<br />ConnectionFailure<br />TooManyConnections<br />CannotConnectNow<br />IdleSessionTimeout<br />ObjectInUse<br />LockNotAvailable<br />AdminShutdown<br />CrashShutdown</p>
<p>(There are a lot of connection-related errors and I don't know the difference between them, but I included them all because they seem very likely to be errors that occur through no fault of the client.)</p>
Arvados - Bug #21541 (In Progress): arv-mount KeyError during cap_cache - Seemingly lost track of... https://dev.arvados.org/issues/21541 2024-02-26T19:01:27Z Brett Smith brett.smith@curii.com
<p>User's arv-mount process crashed with this traceback. Afterward, trying to list files in the mount root returned EIO.</p>
<pre>2024-02-23 23:36:17 arvados.arvados_fuse[2803055] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 327, in catch_exceptions_wrapper
return orig_func(self, *args, **kwargs)
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 570, in lookup
self.inodes.touch(p)
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 276, in touch
self.inode_cache.touch(entry)
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 234, in touch
self.manage(obj)
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 228, in manage
self.cap_cache()
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 212, in cap_cache
self._remove(ent, True)
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 186, in _remove
obj.kernel_invalidate()
File "venv/lib/python3.10/site-packages/arvados_fuse/fusedir.py", line 220, in kernel_invalidate
parent = self.inodes[self.parent_inode]
File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 260, in __getitem__
return self._entries[item]
KeyError: 865
</pre>
<p>This exact same traceback appeared seven times in one second. It's not clear whether that's multiple threads running into the same issue, or the error recurring because of different accesses.</p>
<p>Note that this mount is intentionally accessible to multiple users on the host, so you can assume there was concurrent access. Unfortunately, for the same reason, it's hard to know whether a specific operation caused the error.</p>
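<p>The failure pattern in the traceback, a child entry looking up a parent inode that is no longer in the table, can be reproduced with a toy model. The classes below are hypothetical stand-ins, not the arvados_fuse implementation. The tolerant variant only shows what skipping a missing parent would look like; it may mask the underlying bookkeeping problem rather than fix it.</p>

```python
class Inodes:
    """Toy inode table; a stand-in, not the arvados_fuse class."""

    def __init__(self):
        self._entries = {}

    def __getitem__(self, item):
        return self._entries[item]

    def get(self, item, default=None):
        return self._entries.get(item, default)

    def add(self, inode, entry):
        self._entries[inode] = entry

    def remove(self, inode):
        del self._entries[inode]


class Entry:
    def __init__(self, inodes, inode, parent_inode):
        self.inodes = inodes
        self.inode = inode
        self.parent_inode = parent_inode

    def parent_strict(self):
        # Same kind of lookup as in the traceback: raises KeyError
        # if the parent has already been removed from the table.
        return self.inodes[self.parent_inode]

    def parent_tolerant(self):
        # Hypothetical variant: a missing parent yields None instead.
        return self.inodes.get(self.parent_inode)
```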