Bug #9701 (closed)

[SDKs] Python SDK Collection class should pack small files into large data blocks

Added by Tom Clegg over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To: Lucas Di Pentima
Category: SDKs
Target version: 2016-10-12 sprint
Story points: 2.0

Description

Background

In Keep, small blocks incur a lot of overhead per data byte. Clients are supposed to mitigate this by writing one large block instead of lots of tiny blocks, even when storing small files. CollectionWriter performs well in this respect but Collection does not.

Test case:
  • 2000 files (filenames = 113K)
  • 38M data
  • Keep volumes are local disks (lower latency than cloud/network-backed keepstores)
code path                               arv-put runtime   manifest size   # data blocks
CollectionWriter (pre-#9463 arv-put)    2 s               137K            1
Collection (post-#9463 arv-put)         44 s              300K            2000

Proposed solution

Add an optional "flush" argument to the close() method (default True). If False is given, don't commit block data right away.

When allocating a new bufferblock, first check whether there is a bufferblock that
  • is uncommitted, and
  • is small enough to accommodate the next file the caller is about to write (or max_block_size÷2 if the next file's size is not known), and
  • only contains data for files that have been closed.

This should achieve good block packing performance, but avoid producing pathological manifests when multiple files are being written concurrently.
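
For illustration, a sketch of the bufferblock-selection heuristic described above. The interface used here (committed(), remaining(), all_owners_closed()) is invented for the example and is not the SDK's actual internals:

    KEEP_BLOCK_SIZE = 64 * 1024 * 1024  # Keep's nominal maximum block size

    def find_reusable_bufferblock(bufferblocks, next_file_size=None):
        """Return an existing bufferblock that the next file can be packed into,
        or None if a new bufferblock should be allocated instead."""
        # If the next file's size is unknown, assume it needs half a block.
        needed = next_file_size if next_file_size is not None else KEEP_BLOCK_SIZE // 2
        for bb in bufferblocks:
            if (not bb.committed()                # still uncommitted
                    and bb.remaining() >= needed  # small enough to fit the next file
                    and bb.all_owners_closed()):  # holds only data from closed files
                return bb
        return None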

Use close(flush=False) in arv-put.
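
For example, arv-put-style code could defer block commits like this (a minimal sketch of the proposed usage; files_to_upload and the collection name are made up for the example):

    import arvados.collection

    coll = arvados.collection.Collection()
    for path, data in files_to_upload:
        writer = coll.open(path, "w")
        writer.write(data)
        # Proposed: leave the underlying bufferblock uncommitted so that data
        # from several small files can be packed into one large Keep block.
        writer.close(flush=False)
    # Saving the collection commits any data that is still buffered.
    coll.save_new(name="packed small files")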

Alternative solution (maybe)

Write lots of small bufferblocks like we do now, but merge them into larger blocks when it's time to commit them. This could handle "small file, large file, small file" writing patterns better, and wouldn't rely on the caller's ability to predict file sizes and communicate them to the SDK. However, it might be more difficult to implement.

Related improvements not included here

Use rolling hashes to choose block transitions.

Add mechanism for a caller to pass in the anticipated size of a file being written, so Collection et al. can make better decisions.


Subtasks 1 (0 open, 1 closed)

Task #10052: Review 9701-collection-pack-small-files-alt (Resolved, Lucas Di Pentima, 08/03/2016)

Related issues

Related to Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplication (Duplicate)
Blocks Arvados - Idea #9463: [SDKs] Change arv-put to use the Collection class under the hood (Resolved, Lucas Di Pentima, 07/11/2016)
#1

Updated by Tom Clegg over 7 years ago

  • Description updated (diff)
#2

Updated by Tom Morris over 7 years ago

  • Target version set to Arvados Future Sprints
  • Story points set to 2.0
#3

Updated by Tom Morris over 7 years ago

  • Target version changed from Arvados Future Sprints to 2016-09-28 sprint
#4

Updated by Lucas Di Pentima over 7 years ago

  • Assigned To set to Lucas Di Pentima
#5

Updated by Lucas Di Pentima over 7 years ago

  • Status changed from New to In Progress
#6

Updated by Radhika Chippada over 7 years ago

  • Target version changed from 2016-09-28 sprint to 2016-10-12 sprint
#7

Updated by Lucas Di Pentima over 7 years ago

Branch 9701-collection-pack-small-files-alt ready for preliminary review @ 480df25dc679998b53f9e7299244ac1ff3f90114

Specific tests still pending.
I just want to check that the general direction this is going in is correct.
Thanks

#8

Updated by Peter Amstutz over 7 years ago

Instead of a "closed" flag on ArvadosFile, there should be a reference count of the number of ArvadosFileWriters that are active.

This docstring is out of sync with the code:

    def set_closed(self):
        """Set current block as pending and closed flag to False""" 
        self._closed = True
        self.parent._my_block_manager().repack_small_blocks()

When ArvadosFile size > config.KEEP_BLOCK_SIZE/2 it should flush on close and skip repack_small_blocks(). Current behavior looks like it will not combine blocks when the owner file is greater than config.KEEP_BLOCK_SIZE/2, but it will not commit those blocks either.

In repack_small_blocks(), I think we can wait until pending_write_size > config.KEEP_BLOCK_SIZE unless force is true (the current behavior looks like it will tend to create a lot of blocks which are just over the config.KEEP_BLOCK_SIZE/2 limit when they could have been packed more.) The loop which packs new blocks should also continue to add data until size+new data > config.KEEP_BLOCK_SIZE.

repack_small_blocks() should take a sync parameter to pass to self.commit_bufferblock() instead of always calling with sync=True.
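
The gist of those thresholds, as a standalone illustration (KEEP_BLOCK_SIZE and plan_repack() below are stand-ins for the example, not the code on the branch):

    KEEP_BLOCK_SIZE = 64 * 1024 * 1024  # stand-in for config.KEEP_BLOCK_SIZE

    def plan_repack(small_block_sizes, force=False):
        """Given sizes of closed-but-uncommitted small blocks, return how many
        of them should be merged into one new block right now."""
        pending_write_size = sum(small_block_sizes)
        # Wait until at least a full block's worth of data is pending,
        # unless the caller is forcing a final flush.
        if pending_write_size < KEEP_BLOCK_SIZE and not force:
            return 0
        packed_size = 0
        count = 0
        for size in small_block_sizes:
            # Keep adding blocks until the next one would overflow the limit.
            if packed_size + size > KEEP_BLOCK_SIZE:
                break
            packed_size += size
            count += 1
        return count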

#9

Updated by Lucas Di Pentima over 7 years ago

Peter Amstutz wrote:

Instead of a "closed" flag on ArvadosFile, there should be a reference count of the number of ArvadosFileWriters that are active.

Added a set() to hold references to the ArvadosFileWriter objects that keep each ArvadosFile object open. I did it this way instead of using a simple counter so that each ArvadosFileWriter can remove itself exactly once.
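
Roughly along these lines, as a self-contained sketch (the classes below are stubs standing in for the SDK's ArvadosFile and ArvadosFileWriter):

    class ArvadosFileWriter(object):
        pass  # stub for the sketch

    class ArvadosFile(object):
        def __init__(self):
            self._writers = set()
            self._closed = False

        def add_writer(self, writer):
            # Validate on the "add" side only.
            if isinstance(writer, ArvadosFileWriter):
                self._writers.add(writer)

        def remove_writer(self, writer):
            # remove() raises KeyError if this writer already removed itself,
            # so each ArvadosFileWriter can only remove itself once.
            self._writers.remove(writer)
            if not self._writers:
                self._closed = True  # last writer gone: the file is closed

        def closed(self):
            return self._closed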

When ArvadosFile size > config.KEEP_BLOCK_SIZE/2 it should flush on close and skip repack_small_blocks(). Current behavior looks like it will not combine blocks when the owner file is greater than config.KEEP_BLOCK_SIZE/2, but it will not commit those blocks either.

In repack_small_blocks(), I think we can wait until pending_write_size > config.KEEP_BLOCK_SIZE unless force is true (the current behavior looks like it will tend to create a lot of blocks which are just over the config.KEEP_BLOCK_SIZE/2 limit when they could have been packed more.) The loop which packs new blocks should also continue to add data until size+new data > config.KEEP_BLOCK_SIZE.

repack_small_blocks() should take a sync parameter to pass to self.commit_bufferblock() instead of always calling with sync=True.

Updated to meet these requests. Also added a test to check small-block packing.

Branch ready at bf6ef98
Test suite running at: https://ci.curoverse.com/job/developer-test-job/206/

#10

Updated by Peter Amstutz over 7 years ago

Why does add_writer() check isinstance(writer, ArvadosFileWriter) but remove_writer() does not?

repack_small_blocks() needs @synchronized.

We can simplify this:

small_blocks = [b for b in self._bufferblocks.values() if b.state() == _BufferBlock.WRITABLE and b.owner and b.owner.closed() and b.owner.size() <= (config.KEEP_BLOCK_SIZE / 2)]

We should be able to make the following assumptions:

  • A WRITABLE block always has an owner (if it doesn't, that's a bug).
  • If the block is WRITABLE and b.owner.closed() is True, we can assume b.owner.size() <= (config.KEEP_BLOCK_SIZE / 2) because otherwise the block would have been put into PENDING state when it was flushed.

So this can be:

small_blocks = [b for b in self._bufferblocks.values() if b.state() == _BufferBlock.WRITABLE and b.owner.closed()]

Change self._bufferblocks = {} to use collections.OrderedDict. This will pack the blocks in the order the buffer blocks were created, rather than the scrambled order when iterating over a regular dict.

This is noticeable in test_only_small_blocks_are_packed_together. The file "count.txt" is written first, but in the manifest "34603011:10:count.txt 34603008:3:foo.txt" we see that "foo.txt" was packed first (34603008 < 34603011). Without a stable packing order, this test could fail randomly.
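
The iteration-order point in miniature (illustrative only):

    import collections

    # A plain dict (under Python 2.7) iterates in arbitrary hash order, while
    # an OrderedDict iterates in insertion order, i.e. the order the buffer
    # blocks were created, which makes the packing order deterministic.
    bufferblocks = collections.OrderedDict()
    bufferblocks["block-for-count.txt"] = "count.txt data"
    bufferblocks["block-for-foo.txt"] = "foo.txt data"
    assert list(bufferblocks) == ["block-for-count.txt", "block-for-foo.txt"]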

In repack_small_blocks(), call self.alloc_bufferblock() instead of creating a new BufferBlock directly.

In repack_small_blocks(), instead of maintaining a separate "size" variable you can use new_bb.write_pointer.

alloc_bufferblock() is using len(self._bufferblocks) to generate bufferblock ids, which seems like a bad idea as it could easily lead to block id reuse (this one is my fault). Change it to generate id with a random number or uuid.uuid4().

In repack_small_blocks(), it should recalculate pending_write_size at the bottom of the loop after self.commit_bufferblock.

#11

Updated by Lucas Di Pentima over 7 years ago

Peter Amstutz wrote:

Why does add_writer() check isinstance(writer, ArvadosFileWriter) but remove_writer() does not?

In a way, that was the idea of keeping object references instead of a counter: do the validations on the "add" side of the code, and every ArvadosFileWriter has its "self" pointer to remove itself from the list. One use I can think of for validations is to log relevant error messages instead of crashing with a KeyError; do you think that would be desirable?

repack_small_blocks() needs @synchronized.

Updated! I hadn't included that because it deadlocked when called from commit_all(), but I moved that call out of the with self.lock: statement, and all is well now.

We can simplify this:
[...]
We should be able to make the following assumptions:
  • A WRITABLE block always has an owner (if it doesn't, that's a bug).
  • If the block is WRITABLE and b.owner.closed() is True, we can assume b.owner.size() <= (config.KEEP_BLOCK_SIZE / 2) because otherwise the block would have been put into PENDING state when it was flushed.

Nice simplification, thanks! I updated as requested and added comments in that region to help future reviewers understand the code.

Change self._bufferblocks = {} to use collections.OrderedDict. This will pack the blocks in the order the buffer blocks were created, rather than the scrambled order when iterating over a regular dict.
This is noticeable in test_only_small_blocks_are_packed_together. The file "count.txt" is written first, but in the manifest "34603011:10:count.txt 34603008:3:foo.txt" we see that "foo.txt" was packed first (34603008 < 34603011). Without a stable packing order, this test could fail randomly.

Updated!

In repack_small_blocks(), call self.alloc_bufferblock() instead of creating a new BufferBlock directly.

This deadlocks because repack_small_blocks() uses the same lock as self.alloc_bufferblock(). The same happens with delete_bufferblock(). I could turn each of those two into a wrapper around a private counterpart, do you agree?

In repack_small_blocks(), instead of maintaining a separate "size" variable you can use new_bb.write_pointer.

Updated.

alloc_bufferblock() is using len(self._bufferblocks) to generate bufferblock ids, which seems like a bad idea as it could easily lead to block id reuse (this one is my fault). Change it to generate id with a random number or uuid.uuid4().

Updated.

In repack_small_blocks(), it should recalculate pending_write_size at the bottom of the loop after self.commit_bufferblock.

pending_write_size is used only once per call. Do you mean adding an outer loop to try to pick up more repackable small blocks after the first pass? I think that's not necessary, because repack_small_blocks() is called every time a file is closed, so there shouldn't be many small blocks with a combined size of more than one KEEP_BLOCK_SIZE hanging around.

Updates @ 31568ea

#12

Updated by Peter Amstutz over 7 years ago

Lucas Di Pentima wrote:

In repack_small_blocks(), call self.alloc_bufferblock() instead of creating a new BufferBlock directly.

This deadlocks because repack_small_blocks() uses the same lock as self.alloc_bufferblock(). The same happens with delete_bufferblock(). I could turn each of those two into a wrapper around a private counterpart, do you agree?

Yes, I think that would be better. It avoids confusion over whether or not repack_small_blocks() is actually doing something different from alloc_bufferblock() and delete_bufferblock().

The rest LGTM.

#13

Updated by Lucas Di Pentima over 7 years ago

Wrapped the unsynchronized versions, _alloc_bufferblock() and _delete_bufferblock(), inside their synchronized public counterparts.
Merged master into branch at: e892c7ee96f28bef7d5b2a9314eb9549ee56634d
Running all tests before merging into master: https://ci.curoverse.com/job/developer-test-job/208/
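
For reference, the wrapping pattern looks roughly like this (a self-contained sketch; synchronized and BlockManagerSketch below stand in for the SDK's decorator and block manager):

    import functools
    import threading
    import uuid

    def synchronized(orig_func):
        # Stand-in for the SDK's @synchronized decorator: take self.lock
        # around the wrapped method.
        @functools.wraps(orig_func)
        def wrapper(self, *args, **kwargs):
            with self.lock:
                return orig_func(self, *args, **kwargs)
        return wrapper

    class BlockManagerSketch(object):
        def __init__(self):
            self.lock = threading.Lock()
            self._bufferblocks = {}

        @synchronized
        def alloc_bufferblock(self):
            # Public entry point: acquires the lock, then delegates.
            return self._alloc_bufferblock()

        def _alloc_bufferblock(self):
            # Private version for callers (such as repack_small_blocks()) that
            # already hold self.lock; re-acquiring a plain Lock here from the
            # same thread would deadlock, which is why this does not lock.
            blockid = str(uuid.uuid4())
            self._bufferblocks[blockid] = bytearray()
            return blockid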

#14

Updated by Lucas Di Pentima over 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:c98596a46dc7c166acb2a26b742a09ae0493224a.
