Bug #21302
Container created with corrupted mounts
Description
A user's container request failed because the container created for it had corrupted mounts, so Crunch could not dispatch it. There were ~2,250 mounts of individual files. This container was near the end of a workflow, so all of its inputs came from a single collection: the output of a previous step.
The mounts were corrupted in different ways:
- Some had their path changed, so the mount for a given file would be backed by a different file.
- Some became the empty array `[]`.
- Some became the literal object `{"for": "to_hash"}`.
We haven't yet identified a pattern for which mounts got corrupted which way.
We did check the database record and confirmed the mounts were corrupted in the container's mounts column. This suggests that the problem happened during container creation, and isn't an issue with, e.g., serializing the record to JSON.
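For anyone re-checking this later, one way to read the stored value directly, bypassing Rails attribute deserialization (a sketch only; the UUID is a placeholder):

```ruby
# Sketch: fetch the raw mounts column text for a suspect container from a
# Rails console on the API server, so any corruption seen here is known to
# be in the database row itself rather than introduced on load.
raw = ActiveRecord::Base.connection.select_value(
  "SELECT mounts FROM containers WHERE uuid = 'zzzzz-dz642-xxxxxxxxxxxxxxx'"
)
puts raw
```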
See child ticket for supporting documentation.
Updated by Tom Clegg about 1 year ago
- Related to Bug #21314: a-d-c should cancel a container if it can't be loaded added
Updated by Peter Amstutz 10 months ago
- Target version changed from To be scheduled to Future
Updated by Peter Amstutz 3 months ago
This happened again.
syslog:Oct 1 00:00:11 ip-172-25-144-184 arvados-dispatch-cloud[1414811]: {"ClusterID":"xngs1","ContainerUUID":"xxxxx-dz642-9vlpjyz07bs8jpj","PID":1414811,"error":"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount","level":"warning","msg":"error getting mounts","time":"2024-10-01T00:00:11.229977070Z"}

Excerpt from that container's mounts:

"/keep/9675bc74facc27e95dbd682f50014f1c+1654":{
  "kind":"collection",
  "portable_data_hash":"9675bc74facc27e95dbd682f50014f1c+1654"
},
"/keep/9682a92f38b185d61899911be9cae755+1596":[],
"/keep/9694fecee4a8866f68a6378ccc479fbf+940":{
  "for":"to_hash"
},
"/keep/96c1bcaeead7f643209ee3e585d687fd+940":{
  "for":"to_hash"
},
"/keep/96d31dba7b507bc51e83d37b20d57719+1696":[],
"/keep/96d90ebb580218cd2e1ad6d2e5b288de+1762":{
  "kind":"collection",
  "portable_data_hash":"96d90ebb580218cd2e1ad6d2e5b288de+1762"
},
Updated by Peter Amstutz 3 months ago
I'm wondering if there's something weird in the Ruby serialization or deserialization that causes them to think that these particular objects are something else.
If so, it should be possible to reproduce this by submitting a container request whose mounts include examples of the fields that are being corrupted.
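A minimal serialization-only stress test along those lines might look like the sketch below: build a mounts hash with a few thousand single-file collection mounts (the portable data hashes here are fabricated placeholders) and check that an Oj dump/load round trip returns it unchanged. Note this only exercises Oj directly; a real container request also goes through the Rails attribute serializer and Postgres.

```ruby
# Sketch of a round-trip stress test: thousands of /keep/<PDH> collection
# mounts, dumped and reloaded with Oj, then compared entry by entry.
# The PDH values are fabricated placeholders.
require 'oj'
require 'digest'

mounts = {}
2500.times do |i|
  pdh = "#{Digest::MD5.hexdigest(i.to_s)}+1234"
  mounts["/keep/#{pdh}"] = {
    "kind" => "collection",
    "portable_data_hash" => pdh,
  }
end

json = Oj.dump(mounts, mode: :compat)
roundtrip = Oj.load(json, mode: :compat)

bad = roundtrip.select { |k, v| v != mounts[k] }
puts(bad.empty? ? "no corruption detected" : "corrupted entries: #{bad.keys.take(5).inspect}")
```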
Updated by Peter Amstutz 3 months ago
So, simply adding a couple of mounts with the PDH of mounts that were corrupted before doesn't trigger the bug.
The bug has only been observed when there are large numbers of mounts (1000+? 2000+?).
I feel like the `{"for":"to_hash"}` strongly implicates something weird happening in Ruby land; I just can't see that getting added by Go or Postgres.
Updated by Brett Smith 3 months ago
- Target version changed from Future to Development 2024-10-09 sprint
- Assigned To set to Brett Smith
Updated by Brett Smith 3 months ago
- Status changed from New to In Progress
After a handful of attempts, with various tweaks to get as close as possible to the original, I have not been able to reproduce this on pirca (see pirca-j7d0g-042lw4miq9kw5q5).
The user clusters where this has happened are running Ubuntu 20.04, while pirca is running Debian 11. It's possible that the bug only happens with a specific JSON gem and/or C library.
Updated by Brett Smith 3 months ago
`bundle list` gives byte-identical output on both pirca and a user cluster where we saw this bug. Which, good, it's supposed to! That rules out a gem difference (but not a stdlib difference).
On Ubuntu 20.04: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
On pirca via RVM: ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]
Ruby 2.7.1 did include a security fix for the json gem. However, it seems like that fix got backported to Ubuntu: it has the json 2.3.0 gem with the fix.
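The versions that matter are whatever the running API server actually loads; a quick runtime check from a Rails console or `bundle exec irb` would look roughly like this (illustrative, not output captured from either cluster):

```ruby
# Print the interpreter and JSON-related versions actually loaded at runtime.
require 'json'
require 'oj'
puts RUBY_DESCRIPTION   # e.g. distinguishes 2.7.0p0 from 2.7.5p203
puts JSON::VERSION      # json gem version in use
puts Oj::VERSION        # oj gem version in use
puts JSON.generator     # which generator backend json is using (C ext vs pure Ruby)
```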
Neither the json gem nor the oj gem (which should be what `multi_json` selects, according to the documentation) seems to have any external dependencies. ldd output from pirca:
% pwd
/usr/local/rvm/rubies/default/lib/ruby/2.7.0/x86_64-linux/json/ext
% ldd *.so /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so
generator.so:
        linux-vdso.so.1 (0x00007ffe04922000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007fbc29090000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbc28f44000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbc28d70000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbc28d53000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbc28d31000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbc28d27000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fbc28ca4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbc28c9e000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fbc28c63000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbc293ee000)
parser.so:
        linux-vdso.so.1 (0x00007ffdb95b9000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3cba6a0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3cba554000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cba380000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3cba363000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3cba341000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3cba337000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3cba2b4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3cba2ae000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3cba273000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3cba9fc000)
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so:
        linux-vdso.so.1 (0x00007ffcd1bd1000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3fc51fd000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3fc50b1000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3fc4edd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3fc4ec0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3fc4e9e000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3fc4e94000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3fc4e11000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3fc4e0b000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3fc4dd0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3fc55b2000)
Updated by Brett Smith 3 months ago
I am pretty sure we are just using Oj:
- Our Gemfile declares a dependency on `multi_json`, but we don't actually use it anywhere.
- source:services/api/config/initializers/oj_mimic_json.rb does what you'd expect: it makes Oj slot itself in as a JSON gem replacement, including with Rails, by running the incantation documented by Oj for this.
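For reference, the call Oj documents for this kind of takeover is roughly the following (the Arvados initializer presumably does the equivalent; this is not a quote of that file):

```ruby
# Oj's documented Rails integration hook: replaces the default encoder and
# decoder used by both the JSON gem API and ActiveSupport's JSON handling
# with Oj's implementations.
require 'oj'
Oj.optimize_rails
```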
Updated by Brett Smith 3 months ago
Stray thoughts:
- Another difference between the clusters is the PostgreSQL version. Even if the client code is the same, it may go down different code paths based on what it believes the database server supports.
- The container mounts column is declared `serialize :mounts, Hash`. Another possibility is that JSON is working fine, but the bug is in Rails' serialization process. (Again, same version of Rails across clusters, but maybe different code paths.) See the sketch after this list for a way to exercise just that layer.
- Another "simple" possibility is that this bug depends on a race condition and I haven't had the (bad?) luck to trigger it yet.
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Updated by Peter Amstutz 2 months ago
PA to find Oj bug report from 2017 that had kind of similar behavior
Updated by Peter Amstutz 2 months ago
This doesn't describe exactly the same behavior, but it is kind of similar; it is blamed on the 'curb' gem:
https://github.com/ohler55/oj/issues/330
This one is blamed on faulty hardware:
Updated by Brett Smith 2 months ago
Peter Amstutz wrote in #note-14:
This doesn't describe exactly the same behavior but it is kind of similar, it is blamed on the 'curb' gem:
Yeah, this is interesting but it's hard to know what to do with this.
- We don't use the curb gem specifically (it was removed from `Gemfile.lock` in Jan 2020 in 14d53b3386296eb9b4d25c91cd9d40a5dfb52b75).
- It's possible another gem we use is causing a similar bad interaction with GC, but I don't know enough about Ruby internals to know how we'd start investigating that.
- Similarly, I don't know how we'd implement the proposed workaround of disabling GC, since this happens deep in the Rails stack. Maybe we could write our own wrappers around the wrappers that get installed in the "use Oj" initializer (a rough sketch follows below). But that might also have serious performance implications.
If it is something like this, that does suggest there's probably a race condition involved.
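For the record, one shape such a wrapper could take, purely as a sketch to test the GC-interaction hypothesis (`OjGCGuard` is an invented name, not Arvados code, and wrapping `Oj.dump` this way may not cover every entry point the Oj/Rails integration uses):

```ruby
# Sketch: pause GC around Oj serialization to test the GC-interaction
# hypothesis. Not existing Arvados code; performance impact untested.
require 'oj'

module OjGCGuard
  def dump(obj, *args)
    gc_was_enabled = !GC.disable   # GC.disable returns true if GC was already disabled
    super
  ensure
    GC.enable if gc_was_enabled
  end
end

Oj.singleton_class.prepend(OjGCGuard)
```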
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2024-10-23 sprint to Future