Feature #14538
closed[keep-web] Do not block writes while flushing blocks to Keep
Description
Currently, writing a large file via collectionfs (e.g., webdav PUT) pauses every 64 MiB while a block is written to Keep.
The block flushing should be done in the background and subsequent writes should be allowed to continue.
Updated by Tom Clegg about 6 years ago
- Assigned To set to Tom Clegg
- Target version changed from To Be Groomed to 2018-12-12 Sprint
Updated by Tom Clegg about 6 years ago
- Category set to Keep
- Status changed from New to In Progress
14538-async-write @ 78c18757e42c40178d7a9eaf78f7b6d167bee926
This allows up to 4 concurrent writes per file while writing, and up to 4 concurrent writes per collection when syncing a collection in MarshalManifest.
TODO:- make the limit configurable
- limit concurrency at the filesystem level instead of per-file, to accommodate callers that write multiple large files concurrently
I'm not sure whether this should be held up by one or both of those.
Updated by Tom Clegg about 6 years ago
I'm thinking the default/initial write concurrency limit (flushing finished blocks while writing a file) should be 1 rather than 4. If the Keep side is slower than the incoming data, more concurrency won't necessarily help -- it'll just use more memory.
Updated by Peter Amstutz about 6 years ago
Over all LGTM assuming tests pass (link to jenkins below). Here's some mostly stylistic comments, take what you will.
In dirnode.sync():
- Stylistically, I don't know if flush() really benefits from being declared inline. It could just as easily be a method of dirnode and take throttle as a parameter. In order to follow execution of the code, I have to scroll down to the bottom of the function, and then scroll back up again.
pendingLen
is used before it is assigned. I know this means it has its default value (0) but it would be helpful to make that explicit so it doesn't look like a bug.
In pruneMemSegments():
- This seems redundant:
seg.Len() < maxBlockSize || seg.Len() == 0
Is there a potential race between pruneMemSegments() and sync() ? sync() does not check "flushing" so it seems like it might try to flush a block which is already being flushed. By my reading of the code, it probably is not a disaster if this happens (it would replace one storedSegment with an equivalent one) but worth considering.Now I see it calls waitPrune() before that, so no problem.
I'm thinking the default/initial write concurrency limit (flushing finished blocks while writing a file) should be 1 rather than 4. If the Keep side is slower than the incoming data, more concurrency won't necessarily help -- it'll just use more memory.
I expect the common case is slow clients / fast backend, in which case more threads won't get used. And as you said, fast client / slow backend would result in a pileup. The Python client uses 2 threads by default, which enables it to continue making progress if one write stalls due to a transient failure.
I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).
I submitted a test run here:
https://ci.curoverse.com/view/Developer/job/developer-run-tests/986/
Updated by Tom Clegg about 6 years ago
Peter Amstutz wrote:
- Stylistically, I don't know if flush() really benefits from being declared inline. It could just as easily be a method of dirnode and take throttle as a parameter. In order to follow execution of the code, I have to scroll down to the bottom of the function, and then scroll back up again.
Yes, sync() was getting long. Moved flush to (*dirnode)commitBlock(ctx, throttle, []fnSegmentRef).
pendingLen
is used before it is assigned. I know this means it has its default value (0) but it would be helpful to make that explicit so it doesn't look like a bug.
Done.
- This seems redundant:
seg.Len() < maxBlockSize || seg.Len() == 0
Indeed. Removed the == 0
part.
I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).
This seems unavoidable, since flushing a collection involves producing a manifest which references every segment of every file. Now we iterate twice, but (even without the concurrency being added here) I expect that's still much faster than the ensuing network round-trip with the resulting manifest.
We could stash the last known manifest text and use a "dirty" flag to optimize the no-op case, but it seems out of scope here unless I'm missing something.
test run
Fixed flaky test.
14538-async-write @ a88f7ad9728ee6968367928c6d3d7613bbf290ec https://ci.curoverse.com/view/Developer/job/developer-run-tests/988/
Updated by Peter Amstutz about 6 years ago
Tom Clegg wrote:
I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).
This seems unavoidable, since flushing a collection involves producing a manifest which references every segment of every file. Now we iterate twice, but (even without the concurrency being added here) I expect that's still much faster than the ensuing network round-trip with the resulting manifest.
The Python client is structured a little differently, block operations go through a BlockManager which tracks which blocks are pending, so there's no need to iterate over all the segments when flushing blocks. But this story is about incrementally flushing a PUT of a single large file so it isn't going to be iterating over the manifest except at the end.
We could stash the last known manifest text and use a "dirty" flag to optimize the no-op case, but it seems out of scope here unless I'm missing something.
14538-async-write @ a88f7ad9728ee6968367928c6d3d7613bbf290ec https://ci.curoverse.com/view/Developer/job/developer-run-tests/988/
This LGTM.
Updated by Tom Clegg about 6 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|81b8420a261a095a269a46d965b2fc0ee6ecf793.