Bug #21302
Container created with corrupted mounts
Description
A user's container request failed because the container created for it had corrupted mounts, so Crunch could not dispatch it. There were ~2,250 mounts of individual files. The container was near the end of a workflow, so all of its inputs came from a single collection: the output of a previous step.
The mounts were corrupted in different ways:
- Some had their path changed, so the mount for a given file would be backed by a different file.
- Some became the empty array [].
- Some became the literal object {"for": "to_hash"}.
We haven't yet identified a pattern for which mounts got corrupted which way.
We did check the database record and confirmed the mounts were corrupted in the container's mounts column. This suggests that the problem happened during container creation, and isn't an issue with, e.g., serializing the record to JSON.
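For reference, that check can be repeated from a Rails console on the API server along these lines (a sketch; the container UUID is a placeholder, and the keys of the deserialized mounts may be strings or symbols depending on the coder):

    # bundle exec rails console on the API server
    uuid = "xxxxx-dz642-xxxxxxxxxxxxxxx"   # placeholder: the affected container
    c = Container.find_by(uuid: uuid)
    # What Rails deserialized: entries that aren't well-formed mount hashes.
    bad = c.mounts.reject { |_path, m| m.is_a?(Hash) && (m["kind"] || m[:kind]) }
    # What is actually stored in the mounts column, bypassing deserialization.
    conn = ActiveRecord::Base.connection
    raw = conn.select_value("SELECT mounts FROM containers WHERE uuid = #{conn.quote(uuid)}")
    puts bad.keys.take(10)
    puts raw[0, 500]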
See child ticket for supporting documentation.
Updated by Tom Clegg 10 months ago
- Related to Bug #21314: a-d-c should cancel a container if it can't be loaded added
Updated by Peter Amstutz 8 months ago
- Target version changed from To be scheduled to Future
Updated by Peter Amstutz 13 days ago
This happened again.
    syslog:Oct 1 00:00:11 ip-172-25-144-184 arvados-dispatch-cloud[1414811]: {"ClusterID":"xngs1","ContainerUUID":"xxxxx-dz642-9vlpjyz07bs8jpj","PID":1414811,"error":"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount","level":"warning","msg":"error getting mounts","time":"2024-10-01T00:00:11.229977070Z"}

    "/keep/9675bc74facc27e95dbd682f50014f1c+1654":{
      "kind":"collection",
      "portable_data_hash":"9675bc74facc27e95dbd682f50014f1c+1654"
    },
    "/keep/9682a92f38b185d61899911be9cae755+1596":[],
    "/keep/9694fecee4a8866f68a6378ccc479fbf+940":{
      "for":"to_hash"
    },
    "/keep/96c1bcaeead7f643209ee3e585d687fd+940":{
      "for":"to_hash"
    },
    "/keep/96d31dba7b507bc51e83d37b20d57719+1696":[],
    "/keep/96d90ebb580218cd2e1ad6d2e5b288de+1762":{
      "kind":"collection",
      "portable_data_hash":"96d90ebb580218cd2e1ad6d2e5b288de+1762"
    },
Updated by Peter Amstutz 13 days ago
I'm wondering if there's something weird in the Ruby serialization or deserialization that causes them to think that these particular objects are something else.
If so, it should be possible to reproduce by submitting a container request with mounts that are examples of the fields that are being corrupted.
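For example, mounts mirroring the entries in the log excerpt above would look like this (a sketch; the PDHs are copied from that excerpt, and the rest of the container request still needs to be filled in):

    # Collection mounts in the same shape as the entries in the log excerpt,
    # keyed by /keep/<portable data hash> paths.
    mounts = {
      "/keep/9675bc74facc27e95dbd682f50014f1c+1654" => {
        "kind" => "collection",
        "portable_data_hash" => "9675bc74facc27e95dbd682f50014f1c+1654",
      },
      "/keep/96d90ebb580218cd2e1ad6d2e5b288de+1762" => {
        "kind" => "collection",
        "portable_data_hash" => "96d90ebb580218cd2e1ad6d2e5b288de+1762",
      },
    }
    # ...to be submitted as the "mounts" field of a test container request.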
Updated by Peter Amstutz 13 days ago
So, simply adding a couple of mounts with the PDH of mounts that were corrupted before doesn't trigger the bug.
The bug has only been observed when there are large numbers of mounts (1000+? 2000+?).
I feel like the {"for":"to_hash"}
strongly implicates something weird happening in Ruby land, I just can't see that getting added by Go or Postgres.
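For context on why that string smells like Ruby: to_hash is the method name Ruby itself uses for implicit Hash conversion, so a value containing the literal "to_hash" looks like something inside the Ruby process rather than Go or Postgres. A contrived illustration (not Arvados code):

    class NotReallyAMount
      # Ruby calls to_hash when it implicitly converts an object to a Hash
      # (e.g. via Hash.try_convert or the ** double-splat operator).
      def to_hash
        {"kind" => "collection"}
      end
    end

    Hash.try_convert(NotReallyAMount.new)   # => {"kind"=>"collection"}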
Updated by Brett Smith 12 days ago
- Target version changed from Future to Development 2024-10-09 sprint
- Assigned To set to Brett Smith
Updated by Brett Smith 7 days ago
- Status changed from New to In Progress
After a handful of attempts and various tweaks to try to get as close as possible to the original, I have not been able to reproduce this on pirca (pirca-j7d0g-042lw4miq9kw5q5).
The user clusters where this has happened are running Ubuntu 20.04, while pirca is running Debian 11. It's possible that the bug only happens with a specific JSON gem and/or C library.
Updated by Brett Smith 7 days ago
bundle list gives byte-identical output on both pirca and a user cluster where we saw this bug. Which, good, it's supposed to! That rules out a gem difference (but not a stdlib difference).
On Ubuntu 20.04: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
On pirca via RVM: ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]
Ruby 2.7.1 did include a security fix for the json gem. However, it seems like that fix got backported to Ubuntu: it has the json 2.3.0 gem with the fix.
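A quick way to confirm which interpreter and gem versions are actually loaded on each host (a sketch; run it under the API server's bundler environment so it picks up the vendored gems):

    # Print the interpreter build and the json/oj gem versions in use.
    require "json"
    require "oj"
    puts RUBY_DESCRIPTION
    puts "json #{JSON::VERSION}"
    puts "oj #{Oj::VERSION}"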
Neither the json gem nor the oj gem (which should be what multi_json selects, according to the documentation) seems to have any external dependencies. ldd from pirca:
    % pwd
    /usr/local/rvm/rubies/default/lib/ruby/2.7.0/x86_64-linux/json/ext
    % ldd *.so /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so
    generator.so:
        linux-vdso.so.1 (0x00007ffe04922000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007fbc29090000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbc28f44000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbc28d70000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbc28d53000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbc28d31000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbc28d27000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fbc28ca4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbc28c9e000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fbc28c63000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbc293ee000)
    parser.so:
        linux-vdso.so.1 (0x00007ffdb95b9000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3cba6a0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3cba554000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cba380000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3cba363000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3cba341000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3cba337000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3cba2b4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3cba2ae000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3cba273000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3cba9fc000)
    /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so:
        linux-vdso.so.1 (0x00007ffcd1bd1000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3fc51fd000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3fc50b1000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3fc4edd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3fc4ec0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3fc4e9e000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3fc4e94000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3fc4e11000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3fc4e0b000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3fc4dd0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3fc55b2000)
Updated by Brett Smith 7 days ago
I am pretty sure we are just using Oj:
- Our Gemfile declares a dependency on multi_json, but we don't actually use it anywhere.
- source:services/api/config/initializers/oj_mimic_json.rb does what you'd think: it causes Oj to basically slot itself in as a JSON gem replacement, including with Rails. It runs the incantation documented by Oj for this (roughly the snippet below).
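For reference, Oj's documented way to do that looks roughly like this (a sketch based on Oj's docs, not a verbatim copy of the initializer):

    require "oj"
    # Replace the json gem's generator/parser with Oj, and install Oj's
    # optimized encoders/decoders for the classes Rails serializes.
    Oj.mimic_JSON
    Oj.optimize_rails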
Updated by Brett Smith 7 days ago
Stray thoughts:
Another difference between the clusters is PostgreSQL version. Even if the client code is the same, it may go down different code paths based on what it believes the database server supports.
The container mounts column is declared with serialize :mounts, Hash. Another possibility is that JSON is working fine, but the bug is in Rails' serialization process. (Again, same version of Rails across clusters, but maybe different code paths.)
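One way to poke at that directly would be to round-trip a mounts hash about the size of the failing container through the JSON layer in the API server's environment (so the Oj mimic is active) and look for entries that come back mangled. A sketch with made-up PDHs; note it exercises only the JSON layer, not the full ActiveRecord serialize path:

    require "digest"
    require "json"   # backed by Oj when the mimic initializer has run

    # Build ~2,250 collection mounts, roughly the shape of the failing container's.
    mounts = {}
    2250.times do |i|
      pdh = "#{Digest::MD5.hexdigest(i.to_s)}+1234"   # fake portable data hash
      mounts["/keep/#{pdh}"] = {
        "kind" => "collection",
        "portable_data_hash" => pdh,
      }
    end

    # Repeatedly serialize and parse, checking that every entry survives intact.
    1000.times do
      round_tripped = JSON.parse(JSON.generate(mounts))
      bad = round_tripped.reject { |_, m| m.is_a?(Hash) && m["kind"] == "collection" }
      raise "corrupted mounts: #{bad.keys.first(5)}" unless bad.empty?
    end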
Another "simple" possibility is that this bug depends on a race condition and I haven't had the (bad?) luck to trigger it yet.
Updated by Peter Amstutz 5 days ago
- Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Updated by Peter Amstutz 5 days ago
PA to find Oj bug report from 2017 that had kind of similar behavior
Updated by Peter Amstutz 5 days ago
This doesn't describe exactly the same behavior, but it is kind of similar; it is blamed on the 'curb' gem:
https://github.com/ohler55/oj/issues/330
This one is blamed on faulty hardware: