Writable FUSE mount

Problem

To write output to Keep, a running job must currently either a) use the Keep API in the Python SDK or b) write to a staging directory and then use arv-put to upload the contents. Both approaches have disadvantages. Using the Python API directly is only practical if the user script itself is written in Python. Writing to a staging directory requires sufficient scratch space on the compute node and cannot stream the output to Keep servers as it is being written; instead it requires a potentially time-consuming upload step after the actual job is finished.

Goals

  • Accommodate the vast majority of scripts and tools that naturally write their output as regular files
  • Improve performance by asynchronously streaming blocks to Keep as they are written, instead of waiting until the job is finished

Requirements

  • Allow the user to use basic command-line directory tools such as 'mv' and 'rm'
  • Support random-access read-write including r+ and w+, but optimize for once-through streaming writes of large files.

Proposal

We currently support a read-only mount. This streams files from Keep on demand and presents a regular file system interface through FUSE. This has proven to be a key piece of infrastructure that greatly simplifies integration with existing tools that don't natively understand Keep. A corresponding writable mount would allow existing tools to write output to regular files without needing to be aware of Keep, while avoiding the drawbacks of writing to a staging directory. After the tool is finished, the writable mount directory would be committed as an Arvados collection.

Design

See also #3198

  1. arv-mount will implement a simple in-memory read-write file system using Keep as the backing store.
  2. Store metadata (name, directory, size, Keep blocks, ranges) for all files
  3. Users can use 'mv' and 'rm' to manipulate the metadata atomically
    • 'cp' and 'ln' are problematic because Unix doesn't have copy-on-write semantics for hard links; unclear if it is possible to know at the FUSE level that the user is copying a file as opposed to other read/write activity.
    • Content addressing doesn't prevent duplication if repacking files into a new set of blocks results in different block locators.
  4. Writes are recorded sequentially into a buffer block
  5. For each write, a corresponding range is patched into the file manifest (the manifest normalization code already contains most of the logic necessary to do this efficiently)
  6. When the buffer block is filled, a new buffer block is allocated, and the old buffer block is asynchronously uploaded. Metadata is updated with the actual Keep locator of the just-written block.
  7. Commit a new collection to the API server at specified points
    1. on unmount, if modified
    2. on fsync?
    3. autocommit after N seconds (of inactivity?)
  8. Provide several virtual files:
    1. .arv/commit - write a character to this file (or maybe use 'touch') to force commit
    2. .arv/dirty - read 0 or 1 to indicate whether there are uncommitted changes
    3. .arv/hash - read to get the hash of the last committed collection
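
Steps 4 to 6 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the arv-mount implementation: WritableStore, BufferBlock and fake_upload are hypothetical names, the block size is deliberately tiny, the upload is simulated by computing an md5+size style locator locally, and ranges are simply appended (with later ranges winning on read) rather than patched into a normalized manifest.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # tiny, for illustration only; real Keep blocks are far larger


def fake_upload(data):
    """Hypothetical stand-in for a Keep upload; returns an md5+size style locator."""
    return "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))


class BufferBlock:
    def __init__(self, block_id):
        self.block_id = block_id
        self.data = bytearray()
        self.locator = None  # filled in once the (simulated) upload finishes


class WritableStore:
    """Hypothetical in-memory store illustrating sequential buffer-block writes."""

    def __init__(self, upload=fake_upload):
        self.upload = upload
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending = []
        self.blocks = []
        self.files = {}  # name -> [(file_offset, block_id, block_offset, length)]
        self._new_block()

    def _new_block(self):
        self.current = BufferBlock(len(self.blocks))
        self.blocks.append(self.current)

    def _flush(self):
        # Hand the filled block to a worker thread and start a fresh buffer block.
        full = self.current

        def work():
            full.locator = self.upload(bytes(full.data))

        self.pending.append(self.pool.submit(work))
        self._new_block()

    def write(self, name, file_offset, data):
        # Append the bytes to the current buffer block and record one range
        # per chunk, splitting the write wherever a block fills up.
        ranges = self.files.setdefault(name, [])
        while data:
            room = BLOCK_SIZE - len(self.current.data)
            chunk = data[:room]
            ranges.append((file_offset, self.current.block_id,
                           len(self.current.data), len(chunk)))
            self.current.data += chunk
            file_offset += len(chunk)
            data = data[len(chunk):]
            if len(self.current.data) == BLOCK_SIZE:
                self._flush()

    def read(self, name):
        # Replay the ranges in write order so that later writes win.
        ranges = self.files[name]
        size = max((off + n for off, _, _, n in ranges), default=0)
        out = bytearray(size)
        for off, block_id, block_off, n in ranges:
            out[off:off + n] = self.blocks[block_id].data[block_off:block_off + n]
        return bytes(out)

    def commit(self):
        # Flush the partial block, wait for uploads, and return the locators.
        if self.current.data:
            self._flush()
        for future in self.pending:
            future.result()
        self.pending.clear()
        return [b.locator for b in self.blocks if b.locator]


store = WritableStore()
store.write("a.txt", 0, b"hello world, this is keep")
store.write("a.txt", 0, b"HELLO")  # overwrite: the old bytes remain in the block
locators = store.commit()
```

Note that the overwrite does not modify the first block; the new bytes land in the current buffer block and the superseded bytes stay behind as unused data, which is the inefficiency discussed under Assumptions.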

Assumptions

Files are written out once, and a single file is written at a time. If either of these assumptions is violated, the proposed design will be less efficient because either a) blocks will contain garbage data from old writes that have been superseded or b) writes will be interleaved, resulting in a lengthy manifest with many small stream ranges instead of one large range. If (b) is a real problem, it could be addressed by having a separate buffer block for each file.
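
The effect of case (b) can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: count_ranges is a hypothetical helper, every write appends to its file, block boundaries are ignored, and a write is merged into the file's previous range only when it is contiguous with it in the same buffer.

```python
def count_ranges(writes, per_file_buffers):
    """Count stream ranges per file for a sequence of appending writes.

    writes: iterable of (filename, nbytes).
    per_file_buffers: if True, each file gets its own buffer; otherwise all
    files share one buffer, as in the proposed design.
    """
    next_offset = {}  # buffer key -> next free offset in that buffer
    last_end = {}     # filename -> (buffer key, end offset) of its last range
    ranges = {}       # filename -> number of ranges recorded
    for name, nbytes in writes:
        key = name if per_file_buffers else "shared"
        start = next_offset.get(key, 0)
        if last_end.get(name) != (key, start):
            # Not contiguous with this file's previous range: new range needed.
            ranges[name] = ranges.get(name, 0) + 1
        next_offset[key] = start + nbytes
        last_end[name] = (key, start + nbytes)
    return ranges


interleaved = [("a", 4), ("b", 4)] * 3
shared = count_ranges(interleaved, per_file_buffers=False)
separate = count_ranges(interleaved, per_file_buffers=True)
```

With the shared buffer, every interleaved write starts a new range for its file, so the range count grows with the number of writes; with one buffer per file, each file collapses to a single range.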