Bug #21302


Container created with corrupted mounts

Added by Brett Smith 11 months ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
-

Description

A user's container request failed because the container created for it had corrupted mounts, so Crunch could not dispatch it. There were ~2,250 mounts of individual files. This container was near the end of a workflow, so all its inputs came from a single collection, the output of a previous step.

The mounts were corrupted in different ways:

  • Some had their path changed, so the mount for a given file would be backed by a different file.
  • Some became the empty array [].
  • Some became the literal object {"for": "to_hash"}.

We haven't yet identified a pattern for which mounts got corrupted in which way.

We did check the database record and confirmed the mounts were corrupted in the container's mounts column. This suggests that the problem happened during container creation, and isn't an issue with, e.g., serializing the record to JSON.
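
For reference, that kind of check can be sketched from a Rails console on the API server; the UUID below is a placeholder and the shape test is just one way to flag the corrupted entries:

  # Rough sketch (Rails console on the API server); the UUID is a placeholder.
  c = Container.find_by(uuid: "xxxxx-dz642-xxxxxxxxxxxxxxx")
  # Entries that are not hashes with a "kind" key cover the [] and
  # {"for"=>"to_hash"} corruption; swapped paths need a manual comparison
  # against the container request's original mounts.
  bad = c.mounts.reject { |path, mount| mount.is_a?(Hash) && mount.key?("kind") }
  pp bad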

See child ticket for supporting documentation.


Subtasks


Related issues

Related to Arvados - Bug #21314: a-d-c should cancel a container if it can't be loaded (Resolved, Tom Clegg)
Actions #1

Updated by Tom Clegg 11 months ago

  • Related to Bug #21314: a-d-c should cancel a container if it can't be loaded added
Actions #2

Updated by Peter Amstutz 9 months ago

  • Tracker changed from Idea to Bug
Actions #3

Updated by Peter Amstutz 9 months ago

  • Target version changed from To be scheduled to Future
Actions #4

Updated by Peter Amstutz about 2 months ago

This happened again.

syslog:Oct  1 00:00:11 ip-172-25-144-184 arvados-dispatch-cloud[1414811]: {"ClusterID":"xngs1","ContainerUUID":"xxxxx-dz642-9vlpjyz07bs8jpj","PID":1414811,"error":"json: cannot unmarshal array into Go struct field Container.mounts of type arvados.Mount","level":"warning","msg":"error getting mounts","time":"2024-10-01T00:00:11.229977070Z"}

  "/keep/9675bc74facc27e95dbd682f50014f1c+1654":{
   "kind":"collection",
   "portable_data_hash":"9675bc74facc27e95dbd682f50014f1c+1654" 
  },
  "/keep/9682a92f38b185d61899911be9cae755+1596":[],
  "/keep/9694fecee4a8866f68a6378ccc479fbf+940":{
   "for":"to_hash" 
  },
  "/keep/96c1bcaeead7f643209ee3e585d687fd+940":{
   "for":"to_hash" 
  },
  "/keep/96d31dba7b507bc51e83d37b20d57719+1696":[],
  "/keep/96d90ebb580218cd2e1ad6d2e5b288de+1762":{
   "kind":"collection",
   "portable_data_hash":"96d90ebb580218cd2e1ad6d2e5b288de+1762" 
  },
Actions #5

Updated by Peter Amstutz about 2 months ago

I'm wondering if there's something weird in the Ruby serialization or deserialization that causes these particular objects to be turned into something else.

If so, it should be possible to reproduce this by submitting a container request whose mounts are examples of the fields that are being corrupted.

Actions #6

Updated by Peter Amstutz about 2 months ago

So, simply adding a couple of mounts with the PDHs of mounts that were corrupted before doesn't trigger the bug.

The bug seems to have been observed only when there are large numbers of mounts (1000+? 2000+?).

I feel like the {"for":"to_hash"} value strongly implicates something weird happening in Ruby land; I just can't see that getting added by Go or Postgres.
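
One way to poke at the Ruby-land theory in isolation is a sketch like the following; it only exercises Oj in compat mode with ~2,250 fabricated mounts, not the full Rails path, so it's illustrative rather than a faithful reproduction:

  # Illustrative sketch: round-trip a mounts-style hash of ~2,250 entries
  # through Oj and look for entries that changed shape. The PDHs are fake.
  require "oj"
  mounts = {}
  2250.times do |i|
    pdh = format("%032x+%d", i, 1000 + i)
    mounts["/keep/#{pdh}"] = {"kind" => "collection", "portable_data_hash" => pdh}
  end
  restored = Oj.load(Oj.dump(mounts, mode: :compat), mode: :compat)
  corrupted = restored.reject { |path, mount| mount == mounts[path] }
  puts(corrupted.empty? ? "no corruption seen" : corrupted.inspect)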

Actions #7

Updated by Brett Smith about 2 months ago

  • Target version changed from Future to Development 2024-10-09 sprint
  • Assigned To set to Brett Smith
Actions #8

Updated by Brett Smith about 2 months ago

  • Status changed from New to In Progress

After a handful of attempts and various tweaks to try to get as close as possible to the original, I have not been able to reproduce this on pirca (pirca-j7d0g-042lw4miq9kw5q5).

The user clusters where this has happened are running Ubuntu 20.04, while pirca is running Debian 11. It's possible that the bug only happens with a specific JSON gem and/or C library.

Actions #9

Updated by Brett Smith about 1 month ago

bundle list gives byte-identical output on both pirca and a user cluster where we saw this bug. Which, good, it's supposed to! That rules out a gem difference (but not a stdlib difference).

On Ubuntu 20.04: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
On pirca via RVM: ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]

Ruby 2.7.1 did include a security fix for the json gem. However, it seems like that fix got backported to Ubuntu: it has the json 2.3.0 gem with the fix.
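
As a sanity check, the versions actually loaded at runtime (as opposed to what the packages report) can be printed with a couple of lines of Ruby on each cluster:

  # Quick runtime check of the interpreter and json gem actually in use.
  require "json"
  puts RUBY_DESCRIPTION
  puts JSON::VERSION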

Neither the json gem nor the oj gem (which should be what multi_json selects, according to the documentation) seems to have any external dependencies. ldd from pirca:

% pwd
/usr/local/rvm/rubies/default/lib/ruby/2.7.0/x86_64-linux/json/ext
% ldd *.so /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so
generator.so:
        linux-vdso.so.1 (0x00007ffe04922000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007fbc29090000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbc28f44000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbc28d70000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbc28d53000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbc28d31000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbc28d27000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fbc28ca4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbc28c9e000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fbc28c63000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbc293ee000)
parser.so:
        linux-vdso.so.1 (0x00007ffdb95b9000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3cba6a0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3cba554000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3cba380000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3cba363000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3cba341000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3cba337000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3cba2b4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3cba2ae000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3cba273000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3cba9fc000)
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/oj-3.9.2/ext/oj/oj.so:
        linux-vdso.so.1 (0x00007ffcd1bd1000)
        libruby.so.2.7 => /usr/local/rvm/rubies/ruby-2.7.5/lib/libruby.so.2.7 (0x00007f3fc51fd000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3fc50b1000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3fc4edd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f3fc4ec0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3fc4e9e000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3fc4e94000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f3fc4e11000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3fc4e0b000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f3fc4dd0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3fc55b2000)
Actions #10

Updated by Brett Smith about 1 month ago

I am pretty sure we are just using Oj:
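
A quick way to double-check that from a Rails console is to ask multi_json which adapter it picked; the Oj result below is the expectation, not captured output:

  # Sketch: confirm the JSON backend multi_json selected at runtime.
  require "multi_json"
  puts MultiJson.adapter          # expecting MultiJson::Adapters::Oj
  puts Oj::VERSION if defined?(Oj)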

Actions #11

Updated by Brett Smith about 1 month ago

Stray thoughts:

Another difference between the clusters is PostgreSQL version. Even if the client code is the same, it may go down different code paths based on what it believes the database server supports.

The container mounts column is declared serialize :mounts, Hash. Another possibility is that JSON is working fine, but the bug is in Rails' serialization process. (Again, same version of Rails across clusters, but maybe different code paths.)

Another "simple" possibility is that this bug depends on a race condition and I haven't had the (bad?) luck to trigger it yet.
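
To separate those two possibilities, the Rails-side serialization could be exercised on its own. A sketch (Rails console on the API server, reusing the same kind of fabricated mounts hash as in the sketch under #note-6; type_for_attribute and the serialized type's serialize/deserialize are standard ActiveRecord, though this still bypasses the controller layer):

  # Sketch: exercise only ActiveRecord's serialization of the mounts column,
  # with a fabricated mounts hash about the size that triggered the bug.
  big_mounts = (1..2250).each_with_object({}) do |i, h|
    pdh = format("%032x+%d", i, 1000 + i)
    h["/keep/#{pdh}"] = {"kind" => "collection", "portable_data_hash" => pdh}
  end
  type = Container.type_for_attribute("mounts")
  restored = type.deserialize(type.serialize(big_mounts))
  puts(restored == big_mounts ? "round trip OK" : "corruption reproduced")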

Actions #12

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Actions #13

Updated by Peter Amstutz about 1 month ago

PA to find Oj bug report from 2017 that had kind of similar behavior

Actions #14

Updated by Peter Amstutz about 1 month ago

This doesn't describe exactly the same behavior, but it is kind of similar; it is blamed on the 'curb' gem:

https://github.com/ohler55/oj/issues/330

This one is blamed on faulty hardware:

https://github.com/ohler55/oj/issues/892

Actions #15

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-14:

This doesn't describe exactly the same behavior, but it is kind of similar; it is blamed on the 'curb' gem:

https://github.com/ohler55/oj/issues/330

Yeah, this is interesting but it's hard to know what to do with this.

  • We don't use the curb gem specifically (it was removed from Gemfile.lock in Jan 2020 in 14d53b3386296eb9b4d25c91cd9d40a5dfb52b75)
  • It's possible another gem we use is causing a similar bad interaction with GC, but I don't know enough about Ruby internals to know how we'd start investigating that.
  • Similarly, I don't know how we'd implement the proposed workaround of disabling GC since this happens deep in the Rails stack. Maybe we could write our own wrappers around the wrappers that get installed in the "use Oj" initializer. But that might also have serious performance implications.
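
To make that last point concrete, the kind of wrapper I have in mind might look roughly like the following. This is illustrative and untested; GC.disable/GC.enable are standard Ruby, but where to hook it in (and what it costs) are exactly the open questions:

  # Illustrative only: disable GC around Oj.dump and restore it afterwards.
  # Whether this is safe or fast enough deep in the Rails stack is untested.
  module GCSafeOjDump
    def dump(obj, *args)
      already_disabled = GC.disable   # true if GC was already off
      super
    ensure
      GC.enable unless already_disabled
    end
  end
  Oj.singleton_class.prepend(GCSafeOjDump)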

If it is something like this, that does suggest there's probably a race condition involved.

Actions #16

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2024-10-23 sprint to Future