Feature #14538

[keep-web] Do not block writes while flushing blocks to Keep

Added by Tom Clegg 18 days ago. Updated 4 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
11/28/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Currently, writing a large file via collectionfs (e.g., webdav PUT) pauses every 64 MiB while a block is written to Keep.

The block flushing should be done in the background and subsequent writes should be allowed to continue.


Subtasks

Task #14548: Review 14538-async-writeResolvedPeter Amstutz

Associated revisions

Revision 81b8420a
Added by Tom Clegg 6 days ago

Merge branch '14538-async-write'

fixes #14538

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 18 days ago

  • Assigned To set to Tom Clegg
  • Target version changed from To Be Groomed to 2018-12-12 Sprint

#2 Updated by Tom Clegg 17 days ago

  • Category set to Keep
  • Status changed from New to In Progress

14538-async-write @ 78c18757e42c40178d7a9eaf78f7b6d167bee926

This allows up to 4 concurrent writes per file while writing, and up to 4 concurrent writes per collection when syncing a collection in MarshalManifest.

TODO:
  • make the limit configurable
  • limit concurrency at the filesystem level instead of per-file, to accommodate callers that write multiple large files concurrently

I'm not sure whether this should be held up by one or both of those.

#3 Updated by Tom Clegg 17 days ago

I'm thinking the default/initial write concurrency limit (flushing finished blocks while writing a file) should be 1 rather than 4. If the Keep side is slower than the incoming data, more concurrency won't necessarily help -- it'll just use more memory.

#4 Updated by Peter Amstutz 17 days ago

Over all LGTM assuming tests pass (link to jenkins below). Here's some mostly stylistic comments, take what you will.

In dirnode.sync():

  • Stylistically, I don't know if flush() really benefits from being declared inline. It could just as easily be a method of dirnode and take throttle as a parameter. In order to follow execution of the code, I have to scroll down to the bottom of the function, and then scroll back up again.
  • pendingLen is used before it is assigned. I know this means it has its default value (0) but it would be helpful to make that explicit so it doesn't look like a bug.

In pruneMemSegments():

  • This seems redundant: seg.Len() < maxBlockSize || seg.Len() == 0
  • Is there a potential race between pruneMemSegments() and sync() ? sync() does not check "flushing" so it seems like it might try to flush a block which is already being flushed. By my reading of the code, it probably is not a disaster if this happens (it would replace one storedSegment with an equivalent one) but worth considering. Now I see it calls waitPrune() before that, so no problem.

I'm thinking the default/initial write concurrency limit (flushing finished blocks while writing a file) should be 1 rather than 4. If the Keep side is slower than the incoming data, more concurrency won't necessarily help -- it'll just use more memory.

I expect the common case is slow clients / fast backend, in which case more threads won't get used. And as you said, fast client / slow backend would result in a pileup. The Python client uses 2 threads by default, which enables it to continue making progress if one write stalls due to a transient failure.

I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).

I submitted a test run here:

https://ci.curoverse.com/view/Developer/job/developer-run-tests/986/

#5 Updated by Tom Clegg 16 days ago

Peter Amstutz wrote:

  • Stylistically, I don't know if flush() really benefits from being declared inline. It could just as easily be a method of dirnode and take throttle as a parameter. In order to follow execution of the code, I have to scroll down to the bottom of the function, and then scroll back up again.

Yes, sync() was getting long. Moved flush to (*dirnode)commitBlock(ctx, throttle, []fnSegmentRef).

  • pendingLen is used before it is assigned. I know this means it has its default value (0) but it would be helpful to make that explicit so it doesn't look like a bug.

Done.

  • This seems redundant: seg.Len() < maxBlockSize || seg.Len() == 0

Indeed. Removed the == 0 part.

I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).

This seems unavoidable, since flushing a collection involves producing a manifest which references every segment of every file. Now we iterate twice, but (even without the concurrency being added here) I expect that's still much faster than the ensuing network round-trip with the resulting manifest.

We could stash the last known manifest text and use a "dirty" flag to optimize the no-op case, but it seems out of scope here unless I'm missing something.

test run

Fixed flaky test.

14538-async-write @ a88f7ad9728ee6968367928c6d3d7613bbf290ec https://ci.curoverse.com/view/Developer/job/developer-run-tests/988/

#6 Updated by Peter Amstutz 13 days ago

Tom Clegg wrote:

I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).

This seems unavoidable, since flushing a collection involves producing a manifest which references every segment of every file. Now we iterate twice, but (even without the concurrency being added here) I expect that's still much faster than the ensuing network round-trip with the resulting manifest.

The Python client is structured a little differently, block operations go through a BlockManager which tracks which blocks are pending, so there's no need to iterate over all the segments when flushing blocks. But this story is about incrementally flushing a PUT of a single large file so it isn't going to be iterating over the manifest except at the end.

We could stash the last known manifest text and use a "dirty" flag to optimize the no-op case, but it seems out of scope here unless I'm missing something.

14538-async-write @ a88f7ad9728ee6968367928c6d3d7613bbf290ec https://ci.curoverse.com/view/Developer/job/developer-run-tests/988/

This LGTM.

#7 Updated by Tom Clegg 6 days ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF