Writable FUSE mount » History » Version 6

Peter Amstutz, 12/12/2014 04:39 PM

h1. Writable FUSE mount
h2. Problem
When running a job, writing output to Keep currently requires either a) using the Keep API in the Python SDK or b) writing to a staging directory and then using arv-put to upload the contents.  Both have disadvantages: using the Python API directly is only practical if the user script itself is written in Python, while writing to a staging directory requires sufficient scratch space on the compute node and cannot stream output to Keep servers as it is written, instead requiring a potentially time-consuming upload step after the job itself is finished.
h2. Goals
* Accommodate the vast majority of scripts and tools that naturally write their output as regular files
* Improve performance by asynchronously streaming blocks to Keep as they are written, instead of waiting until the job is finished
h2. Requirements
* Allow the user to manage files with basic command line tools such as 'mv' and 'rm'
* Support random-access read-write including r+ and w+, but optimize for once-through streaming writes of large files.
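Both access patterns can be exercised with ordinary file I/O; this hedged sketch uses a temporary local file as a stand-in for a file under the writable mount (the path and contents are illustrative only):

```python
import os
import tempfile

# Once-through streaming write: the common case the design optimizes for.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    for i in range(3):
        f.write("record %d\n" % i)   # appended strictly sequentially
    path = f.name

# Random-access rewrite with r+: must also work, though less efficiently,
# since it overwrites bytes that may already sit in a sealed block.
with open(path, "r+") as f:
    f.seek(0)
    f.write("REWRITE")               # overwrites the first 7 bytes in place

with open(path) as f:
    data = f.read()
os.unlink(path)
print(data.splitlines()[0])          # first line now starts with REWRITE
```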
h2. Proposal
We currently support a read-only mount.  This streams files from Keep on demand and presents a regular file system interface through FUSE.  This has proven to be a key piece of infrastructure that greatly simplifies integration with existing tools that don't natively understand Keep.  A corresponding writable mount would allow existing tools to write output to regular files without needing to be aware of Keep, while avoiding the drawbacks of writing to a staging directory.  After the tool is finished, the writable mount directory would be committed as an Arvados collection.
h2. Design
See also #3198
# arv-mount will implement a simple in-memory read-write file system using Keep as the backing store.
# Store metadata (name, directory, size, Keep blocks, ranges) for all files
# Allow 'mv' and 'rm' to manipulate the metadata atomically
** 'cp' and 'ln' are problematic because Unix doesn't have copy-on-write semantics for hard links; it is unclear whether the FUSE layer can tell that the user is copying a file as opposed to performing other read/write activity.
** Content addressing doesn't prevent duplication if repacking files into a new set of blocks results in different block locators.
# Writes are recorded sequentially into a buffer block 
# For each write, a corresponding range is patched into the file manifest (the manifest normalization code already contains most of the logic necessary to do this efficiently)
# When the buffer block is filled, a new buffer block is allocated, and the old buffer block is asynchronously uploaded.  Metadata is updated with the actual Keep locator of the just-written block.
# Commit a new collection to the API server at specified points
## on unmount, if modified
## on fsync?
## autocommit after N seconds (of inactivity?)
# Provide several virtual files:
## .arv/commit - write a character to this file (or perhaps use 'touch') to force a commit
## .arv/dirty - read 0 or 1 to indicate whether there are uncommitted changes
## .arv/hash - read to get the hash of the last committed collection
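A minimal in-memory sketch of the buffer-block and range-patching steps above.  The class names, the tiny block size, and the simplified locator format are illustrative assumptions, not the real arv-mount code; the "seal" step stands in for the asynchronous upload, and locator hint fields are omitted:

```python
import hashlib

BLOCK_SIZE = 2048  # illustrative; real Keep blocks are up to 64 MiB


class BufferBlock:
    """Accumulates sequential writes until full, then is sealed with a locator."""

    def __init__(self):
        self.data = bytearray()
        self.locator = None          # filled in once the block is "uploaded"

    def append(self, chunk):
        offset = len(self.data)
        self.data.extend(chunk)
        return offset                # where this write landed in the block

    def seal(self):
        # Stand-in for the async upload: a Keep locator is the MD5 hash of
        # the block's contents plus its size in bytes.
        self.locator = "%s+%d" % (
            hashlib.md5(bytes(self.data)).hexdigest(), len(self.data))
        return self.locator


class WritableFile:
    """Tracks which (block, offset, length) ranges make up the file."""

    def __init__(self, name):
        self.name = name
        self.ranges = []             # list of (block, offset_in_block, length)

    def write(self, block, chunk):
        offset = block.append(chunk)
        last = self.ranges[-1] if self.ranges else None
        # Coalesce with the previous range when this write continues it,
        # mirroring what manifest normalization would do.
        if last and last[0] is block and last[1] + last[2] == offset:
            self.ranges[-1] = (block, last[1], last[2] + len(chunk))
        else:
            self.ranges.append((block, offset, len(chunk)))


buf = BufferBlock()
f = WritableFile("output.txt")
for i in range(4):
    f.write(buf, b"chunk %d\n" % i)  # four sequential 8-byte writes
loc = buf.seal()
print(len(f.ranges), loc)            # sequential writes coalesce to one range
```

Because all four writes are sequential, they collapse into a single 32-byte range; the manifest for this file would reference one locator with one range.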
h2. Assumptions
Files are written out once, and a single file is written at a time.  If either of these assumptions is violated, the proposed design will be less efficient because either a) blocks will contain garbage data from old writes that have been superseded, or b) writes will be interleaved, resulting in a lengthy manifest with many small stream ranges instead of one large range.  If (b) turns out to be a real problem, it could be addressed by giving each file its own buffer block.
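The cost of violating the one-file-at-a-time assumption can be shown with a toy simulation (the coalescing rule below is an illustrative stand-in for manifest range patching, not the real code): sequential writes collapse into a single range per file, while interleaved writes to two files sharing one buffer block leave many small ranges.

```python
def patch_ranges(ranges, offset, length):
    """Record an (offset, length) range, coalescing it with the previous
    range when it directly extends it (a toy stand-in for manifest
    normalization)."""
    if ranges and ranges[-1][0] + ranges[-1][1] == offset:
        ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
    else:
        ranges.append((offset, length))


# Case 1: file A written once-through, then file B -- offsets in the shared
# buffer block are contiguous per file, so each file coalesces to one range.
pos = 0
a_seq, b_seq = [], []
for _ in range(4):
    patch_ranges(a_seq, pos, 8); pos += 8
for _ in range(4):
    patch_ranges(b_seq, pos, 8); pos += 8

# Case 2: writes to A and B interleaved in the same buffer block -- each
# file's offsets are non-contiguous, so no coalescing is possible.
pos = 0
a_int, b_int = [], []
for _ in range(4):
    patch_ranges(a_int, pos, 8); pos += 8
    patch_ranges(b_int, pos, 8); pos += 8

print(len(a_seq), len(b_seq), len(a_int), len(b_int))  # 1 1 4 4
```

The interleaved case produces four ranges per file instead of one, which is exactly the manifest bloat described in (b); a per-file buffer block would restore the one-range-per-file result.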