Feature #5778

[FUSE] Support efficient copy at command line

Added by Peter Amstutz over 6 years ago. Updated 2 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
FUSE
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
2.0

Description

Keep can perform efficient copy-on-write of files and directories, but POSIX doesn't provide an API for this. We've decided not to abuse standard hardlinks: while similar (in the "fast copy" sense), hardlinks offer incompatible semantics ("two filenames refer to the same data; writes to either file are reflected in both").

Possible approaches for exposing COW capability through arv-mount:

  • Use the BTRFS clone ioctl() (requires support for handling ioctl() in llfuse). The user can then run cp --reflink.
  • Use the s3fs approach of writing a special xattr to a designated path to request a COW link. The user runs a custom command to communicate with the file system.
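
For reference, the first approach corresponds to the clone request that GNU cp issues for --reflink. The sketch below shows that request from the caller's side only: FICLONE is the generic Linux name for the ioctl that started out as the BTRFS clone ioctl. The helper name clone_or_copy is hypothetical, and whether the request ever reaches a FUSE file system such as arv-mount is exactly the open question in this ticket, so the fallback branch is what one should expect to hit there.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int); same number as BTRFS_IOC_CLONE.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Try a reflink clone of src_path into dst_path; fall back to a byte copy.

    Returns True if the file system performed the clone, False if the
    request was rejected and the data was copied byte by byte instead.
    (Hypothetical helper, Linux only.)
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the file system to make dst share src's data blocks.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # The file system (or a layer in between) does not support cloning.
            shutil.copyfileobj(src, dst)
            return False
```

Either way the destination ends up with the same contents; only the return value tells the caller whether the copy was the cheap, copy-on-write kind.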

Meanwhile, the following workaround is possible without modifying the FUSE driver (and could be provided as a "copy" CLI program):

  • Determine the source and target collections and perform the copy using the Arvados SDK. The results show up in the target directory on refresh.
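
A minimal sketch of such a "copy" program, under several assumptions: both paths sit under the by_id directory of the same arv-mount mount point, the Arvados Python SDK is installed with API credentials configured, and the Collection.copy() and save() calls behave as in the SDK version at hand (worth checking before relying on this). The function names split_mount_path and keep_copy are hypothetical.

```python
import os


def split_mount_path(path, mount_root):
    """Split a path under <mount_root>/by_id/ into (collection_id, inner_path).

    Assumes the default arv-mount layout with a by_id directory.
    """
    by_id = os.path.join(mount_root, "by_id")
    rel = os.path.relpath(os.path.abspath(path), by_id)
    parts = rel.split(os.sep)
    if parts[0] in ("..", ".", ""):
        raise ValueError("%s is not inside a by_id collection" % path)
    return parts[0], "/".join(parts[1:])


def keep_copy(src, dst, mount_root):
    """Copy src to dst through the SDK instead of through the mount."""
    # Imported lazily so the path helper above works without the SDK.
    import arvados.collection

    src_id, src_inner = split_mount_path(src, mount_root)
    dst_id, dst_inner = split_mount_path(dst, mount_root)
    source = arvados.collection.CollectionReader(src_id)
    target = arvados.collection.Collection(dst_id)
    # Only the manifest entries are copied; no file data is read or rewritten.
    target.copy(src_inner, dst_inner, source_collection=source)
    target.save()
```

The target must be addressed by UUID rather than by PDH for save() to update it in place, and the new file appears in the mounted target directory only after arv-mount refreshes that collection.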

History

#1 Updated by Peter Amstutz over 6 years ago

  • Description updated

#2 Updated by Peter Amstutz over 6 years ago

  • Category set to FUSE

#3 Updated by Tom Clegg over 6 years ago

  • Description updated
  • Target version set to Arvados Future Sprints

#4 Updated by Ward Vandewege 3 months ago

  • Target version deleted (Arvados Future Sprints)

#5 Updated by Joshua Randall 2 months ago

For a limited use-case in which you want to use arv-mount to drive the actual copying (i.e. which file(s) to copy from one collection to another), I guess the (partial) workaround might be:
- duplicate input collections using the CLI or SDK into temporary collections
- use arv-mount read-write with the input collections mounted by ID
- mv (rename) the files of interest from the input collections to the output collection
- (optionally) delete the temporary duplicated collections

Does this make sense, or is a simpler workaround possible today?

Another option that could enable this use case without the external duplication step: could arv-mount have some sort of flag that allows renames to succeed against sources in read-only collections (i.e. when the input is specified by PDH)?

Currently an attempt to do that fails with "Operation not permitted". That makes sense, since the PDH mount point is read-only even when using `--read-write`, and that is clearly the correct default behaviour. As a compromise, though, arv-mount could offer an option that lets a user opt in to having an `mv` command succeed against a fundamentally read-only source, without actually modifying that source.

I guess of the other options mentioned in this story, the one that enables `cp --reflink` seems the most user-friendly. Is it possible with llfuse today?

#6 Updated by Peter Amstutz 2 months ago

It's been a rather long time since we looked into this, but the issue at the time was that the request cp --reflink sends to the file system wasn't propagated to FUSE.

I don't know if that was a limitation of the FUSE kernel interface, libfuse, or llfuse (probably not the last one). It is quite possible the situation has improved at some point in the last 5 years.

My preferred solution is still to reinterpret hard link requests as copy-on-write. It seems like a program that relies on POSIX semantics that closely is going to run into other, more fundamental problems running on top of arv-mount before "expects modifications made to a hard-linked file to show up in both files" becomes a problem.

Do you have a use case for this?
