Bug #17983

[arv-mount] 2.2.1 deadlock when --read-write and --collection is specified

Added by Ward Vandewege about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
08/16/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

ward@shell:~$ arv-mount --debug --foreground --read-write --collection pirca-4zz18-nsaus8rzkfeh69m keep/
2021-08-06 14:54:54 arvados.arv-mount[1224] DEBUG: arv-mount debugging enabled
2021-08-06 14:54:54 arvados.arv-mount[1224] INFO: /usr/bin/arv-mount 2.2.1 started
2021-08-06 14:54:54 arvados.arv-mount[1224] INFO: enable write is True
2021-08-06 14:54:54 arvados.arvados_fuse[1224] DEBUG: InodeCache touched inode 1 (size 0) (uuid pirca-4zz18-nsaus8rzkfeh69m) total now 0 (1 entries)
FUSE library version: 2.9.9
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.27
flags=0x003ffffb
max_readahead=0x00020000
   INIT: 7.19
   flags=0x00000021
   max_readahead=0x00020000
   max_write=0x00020000
   max_background=0
   congestion_threshold=0
   unique: 1, success, outsize: 40
unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 1299
   unique: 2, success, outsize: 120
unique: 3, opcode: STATFS (17), nodeid: 1, insize: 40, pid: 1299
   unique: 3, success, outsize: 96
unique: 4, opcode: OPENDIR (27), nodeid: 1, insize: 48, pid: 1311
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: arv-mount opendir: inode 1
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: InodeCache touched inode 1 (size 0) (uuid pirca-4zz18-nsaus8rzkfeh69m) total now 0 (1 entries)
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: Updating collection pirca-4zz18-nsaus8rzkfeh69m inode 1 to record version None
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: <arvados_fuse.fusedir.CollectionDirectory object at 0x7fb99c7c2558> manifest_size 116
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: InodeCache touched inode 1 (size 14848) (uuid pirca-4zz18-nsaus8rzkfeh69m) total now 14848 (1 entries)
   unique: 4, success, outsize: 32
unique: 5, opcode: READDIR (28), nodeid: 1, insize: 80, pid: 1311
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: arv-mount readdir: fh 0 off 0
   unique: 5, success, outsize: 120
unique: 6, opcode: LOOKUP (1), nodeid: 1, insize: 55, pid: 1311
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: InodeCache touched inode 1 (size 14848) (uuid pirca-4zz18-nsaus8rzkfeh69m) total now 14848 (1 entries)
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: arv-mount lookup: parent_inode 1 name 'stderr (1).txt' inode 2
   unique: 6, success, outsize: 144
unique: 7, opcode: READDIR (28), nodeid: 1, insize: 80, pid: 1311
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: arv-mount readdir: fh 0 off 3
   unique: 7, success, outsize: 16
unique: 8, opcode: RELEASEDIR (29), nodeid: 1, insize: 64, pid: 0
2021-08-06 14:56:08 arvados.arvados_fuse[1224] DEBUG: arv-mount release fh 0
   unique: 8, success, outsize: 16
unique: 9, opcode: LOOKUP (1), nodeid: 1, insize: 49, pid: 1313
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: InodeCache touched inode 1 (size 14848) (uuid pirca-4zz18-nsaus8rzkfeh69m) total now 14848 (1 entries)
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: arv-mount lookup: parent_inode 1 name 'testfile' not found
   unique: 9, error: -2 (No such file or directory), outsize: 16
unique: 10, opcode: CREATE (35), nodeid: 1, insize: 65, pid: 1313
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: arv-mount create: parent_inode 1 'testfile' 100644
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: collection notify add <arvados.collection.Collection object at 0x7fb99c6a4b70> testfile <arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: collection notify mod <arvados.collection.Collection object at 0x7fb99c6a4b70> testfile (<arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>, <arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>)
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: InodeCache touched inode 1 (size 14848) (uuid pirca-4zz18-nsaus8rzkfeh69m) total now 14848 (1 entries)
   unique: 10, success, outsize: 160
unique: 11, opcode: FLUSH (25), nodeid: 3, insize: 64, pid: 1313
   unique: 11, success, outsize: 16
unique: 12, opcode: GETXATTR (22), nodeid: 3, insize: 68, pid: 1313
   unique: 12, error: -38 (Function not implemented), outsize: 16
unique: 13, opcode: WRITE (16), nodeid: 3, insize: 112, pid: 1313
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: arv-mount write 1 0 32
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: collection notify write <arvados.collection.Collection object at 0x7fb99c6a4b70> testfile (<arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>, <arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>)
   unique: 13, success, outsize: 24
unique: 14, opcode: FLUSH (25), nodeid: 3, insize: 64, pid: 1313
2021-08-06 14:56:21 arvados.arvados_fuse[1224] DEBUG: collection notify mod <arvados.collection.Collection object at 0x7fb99c6a4b70> testfile (<arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>, <arvados.arvfile.ArvadosFile object at 0x7fb99c82b048>)
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: {'pirca-bi6l4-1cd520a55b71d5a7': {'href': '/keep_services/pirca-bi6l4-1cd520a55b71d5a7', 'kind': 'arvados#keepService', 'etag': '2k0rcryua3dxrzkrar3m1pvyc', 'uuid': 'pirca-bi6l4-1cd520a55b71d5a7', 'owner_uuid': 'pirca-tpzed-000000000000000', 'created_at': '2021-08-02T14:54:54.574032000Z', 'modified_by_client_uuid': None, 'modified_by_user_uuid': 'pirca-tpzed-000000000000000', 'modified_at': '2021-08-02T14:54:54.574032000Z', 'service_host': 'keep0.pirca.arvadosapi.com', 'service_port': 25107, 'service_ssl_flag': False, 'service_type': 'disk', 'read_only': False, '_service_root': 'http://keep0.pirca.arvadosapi.com:25107/'}, 'pirca-bi6l4-629198f41fae2d7c': {'href': '/keep_services/pirca-bi6l4-629198f41fae2d7c', 'kind': 'arvados#keepService', 'etag': '97c5ja2or0op55az5cx3dxaui', 'uuid': 'pirca-bi6l4-629198f41fae2d7c', 'owner_uuid': 'pirca-tpzed-000000000000000', 'created_at': '2021-08-02T14:54:54.574032000Z', 'modified_by_client_uuid': None, 'modified_by_user_uuid': 'pirca-tpzed-000000000000000', 'modified_at': '2021-08-02T14:54:54.574032000Z', 'service_host': 'keep1.pirca.arvadosapi.com', 'service_port': 25107, 'service_ssl_flag': False, 'service_type': 'disk', 'read_only': False, '_service_root': 'http://keep1.pirca.arvadosapi.com:25107/'}}
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: 0837c95a7eebd0844371cda55d082228+32: ['http://keep1.pirca.arvadosapi.com:25107/', 'http://keep0.pirca.arvadosapi.com:25107/']
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: Pool max threads is 2
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: Request: PUT http://keep1.pirca.arvadosapi.com:25107/0837c95a7eebd0844371cda55d082228
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: Request: PUT http://keep0.pirca.arvadosapi.com:25107/0837c95a7eebd0844371cda55d082228
2021-08-06 14:56:21 arvados.keep[1224] INFO: PUT 200: 32 bytes in 147.55630493164062 msec (0.000 MiB/sec)
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: KeepWriterThread <KeepWriterThread(Thread-37, started daemon 140433837463296)> succeeded 0837c95a7eebd0844371cda55d082228+32 http://keep0.pirca.arvadosapi.com:25107/
2021-08-06 14:56:21 arvados.keep[1224] INFO: PUT 200: 32 bytes in 170.5644130706787 msec (0.000 MiB/sec)
2021-08-06 14:56:21 arvados.keep[1224] DEBUG: KeepWriterThread <KeepWriterThread(Thread-36, started daemon 140433845856000)> succeeded 0837c95a7eebd0844371cda55d082228+32 http://keep1.pirca.arvadosapi.com:25107/

The above backtrace was caused by (in a separate shell):

ward@shell:~$ date > keep/testfile

which just hangs.

On latest main with collection versioning disabled, the problem is identical (tested on tordo).

When the `--disable-event-listening` flag is added to `arv-mount`, the deadlock does not happen.

On latest main with collection versioning enabled (e.g. ce8i5), the deadlock is partially masked due to bug #18005.


Subtasks

Task #17987: Review 17983-fuse-unlock-relock ResolvedPeter Amstutz

Associated revisions

Revision b059ddba
Added by Peter Amstutz about 1 month ago

Merge branch '17983-fuse-unlock-relock' into main refs #17983

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Ward Vandewege about 2 months ago

  • Description updated (diff)
  • Subject changed from [arv-mount] deadlock when --read-write and --collection is specified to [arv-mount] 2.2.1 deadlock when --read-write and --collection is specified

#2 Updated by Ward Vandewege about 2 months ago

  • Target version set to 2021-08-18 sprint

#3 Updated by Ward Vandewege about 2 months ago

  • Description updated (diff)

#4 Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Peter Amstutz

#5 Updated by Ward Vandewege about 1 month ago

(moved to #18005)

#6 Updated by Peter Amstutz about 1 month ago

Ward Vandewege wrote:

I am seeing the same "undefined method `each_line` for nil:NilClass on ce8i5 when trying to add a new property to a collection without any properties (I used wb2):

[...]

Huh, weird. Can you dig out a more thorough stack trace?

#7 Updated by Ward Vandewege about 1 month ago

  • Description updated (diff)

#8 Updated by Ward Vandewege about 1 month ago

  • Description updated (diff)

#9 Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress

#10 Updated by Ward Vandewege about 1 month ago

  • Description updated (diff)

#11 Updated by Peter Amstutz about 1 month ago

  • Target version deleted (2021-08-18 sprint)

I haven't totally pinned this down but in general terms here's what I think is happening:

  1. In one thread, it flushes the file.
  2. This releases the llfuse lock and calls collection flush.
  3. the collection lock is acquired
  4. collection flush makes a callback to notify the FUSE object that it changed
  5. the callback tries to re-acquire the llfuse lock
  6. the thread blocks on the llfuse lock

Meanwhile, in another thread

  1. It has the llfuse lock
  2. it tries to get the collection lock
  3. the collection lock is held by the other thread
  4. the thread blocks on the collection lock

The problem is fusedir.py:304

it calls llfuse.lock

It must not do that without releasing the collection lock first.

#12 Updated by Peter Amstutz about 1 month ago

  • Target version set to 2021-08-18 sprint

#13 Updated by Peter Amstutz about 1 month ago

17983-fuse-unlock-relock @ 87b0f117ff1ce917d7301f8a6304e931a189e48d

  • Unlock the collection before trying to acquire llfuse.lock, then re-acquire the lock, so that locking order (llfuse, then collection) is correct

This seems to fix the original deadlock, and I did some addition manual testing and the writable FUSE mount seems quite solid (in a way that it has never been before).

https://ci.arvados.org/view/Developer/job/developer-run-tests/2641/

#14 Updated by Peter Amstutz about 1 month ago

  • Release set to 41

#15 Updated by Lucas Di Pentima about 1 month ago

I've tested it manually and worked for me too. Indeed the solution looks a bit nasty but I can confirm it seems really stable.

LGTM.

#16 Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF