Bug #6194

[SDK] Support writing more than 64MiB at a time in Python SDK

Added by Peter Amstutz about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: SDKs
Target version:
Start date: 06/01/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Subtasks

Task #6210: Review 6194-python-arvfile-large-write (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #3198: [FUSE] Writable streaming arv-mount (Resolved, 04/14/2015)

Associated revisions

Revision 4b3b3064
Added by Peter Amstutz about 4 years ago

Merge branch '6194-python-arvfile-large-write' closes #6194

History

#1 Updated by Peter Amstutz about 4 years ago

  • Target version changed from Bug Triage to 2015-06-10 sprint

#2 Updated by Peter Amstutz about 4 years ago

  • Project changed from Arvados Private to Arvados
  • Category set to SDKs
  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#3 Updated by Peter Amstutz about 4 years ago

  • Subject changed from [SDK] SUpport writing more than 64MiB at a time in Python SDK to [SDK] Support writing more than 64MiB at a time in Python SDK

#4 Updated by Tom Clegg about 4 years ago

c766dd2

It looks like the second condition will always be True:

            while (n + config.KEEP_BLOCK_SIZE) < len(data):
                self.writeto(offset+n, dataview[n:n + config.KEEP_BLOCK_SIZE].tobytes(), num_retries)
                n += config.KEEP_BLOCK_SIZE
            if n < len(data):
                self.writeto(offset+n, dataview[n:].tobytes(), num_retries)
            return

The last three lines could be just "return self.writeto(offset+n, ....)"

Or, for even less copy & paste, we could just do this, since dataview[n:n+config.KEEP_BLOCK_SIZE] already returns a short segment when appropriate:

            while n < len(data):
                self.writeto(offset+n, dataview[n:n + config.KEEP_BLOCK_SIZE].tobytes(), num_retries)
                n += config.KEEP_BLOCK_SIZE
            return
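The single-loop version works because a Python slice past the end of a sequence returns only what remains rather than raising. This standalone sketch (not code from the branch; BLOCK stands in for config.KEEP_BLOCK_SIZE) demonstrates the clamping behavior:

```python
# Standalone sketch (not from the branch): slicing a memoryview past the
# end returns the remaining bytes, so the single loop handles the final
# short segment without a separate tail case.
data = b"0123456789"
view = memoryview(data)

BLOCK = 4  # stand-in for config.KEEP_BLOCK_SIZE
chunks = []
n = 0
while n < len(data):
    chunks.append(view[n:n + BLOCK].tobytes())
    n += BLOCK

print(chunks)  # [b'0123', b'4567', b'89'] -- last chunk is short
```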

I think it's worthwhile to make the test a bit stronger by using a non-zero offset (if there were an error in offset math, this is the only test that would catch it).

#5 Updated by Peter Amstutz about 4 years ago

Simplified the write loop. Thanks.

I changed the test to do two writes:

text = "0123456789" 
writer.write(text)
text = "0123456789" * 9999999
writer.write(text)

This has an interesting side effect. The first write ends up committed in its own block, because the current logic in writeto() commits the buffer block and starts a new one when (buffered data + new data) is bigger than the block size.

. 781e5e245d69b566979b86e28d23f2c7+10 48dd23ea1645fd47d789804d71b5bb8e+67108864 77c57dc6ac5a10bb2205caaa73187994+32891126 0:100000000:count.txt\n
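The first locator in that manifest is just the MD5 of the 10-byte first write, suffixed with its length (+10), which can be verified directly:

```python
import hashlib

# The Keep locator hash for the first block is the MD5 of its contents.
digest = hashlib.md5(b"0123456789").hexdigest()
print(digest)  # 781e5e245d69b566979b86e28d23f2c7
```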

Changing the behavior of writeto() to split differently should be out of scope for this fix, but I wanted to flag it in case you had thoughts. I should mention that the existing behavior is intentional: when a record or line of text is written in a single write() call, it biases toward not splitting records or lines across block boundaries.
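The packing behavior described above can be modeled with a toy function (a sketch with invented names, not the SDK's actual writeto()): the small first write gets committed as its own block because the large second write would overflow the buffer.

```python
def pack_blocks(writes, block_size):
    """Toy model (not SDK code) of the described writeto() behavior:
    if buffered data plus an incoming write would exceed block_size,
    commit the buffer first, then split the new write at block
    boundaries."""
    blocks = []
    buf = b""
    for data in writes:
        if buf and len(buf) + len(data) > block_size:
            blocks.append(buf)  # commit the current buffer block early
            buf = b""
        buf += data
        while len(buf) >= block_size:
            blocks.append(buf[:block_size])
            buf = buf[block_size:]
    if buf:
        blocks.append(buf)
    return blocks

# A small write followed by a large one reproduces the split above in
# miniature: the small write lands in its own block.
sizes = [len(b) for b in pack_blocks([b"x" * 10, b"y" * 990], 64)]
print(sizes)  # [10, 64, 64, ..., 64, 30]
```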

#6 Updated by Tom Clegg about 4 years ago

Peter Amstutz wrote:

I changed the test to do two writes:

[...]

Hm. The new test supplies a non-zero offset, which is good, but it doesn't seem to test that the offset was actually used: the file is 100000000 bytes long when the test starts, so it will still be 100000000 bytes long if offset is ignored -- for that matter, even if no writing happens at all.

Using "----------" * 3 in the first write(), and asserting that the manifest has the right md5sums after doing the big write, might be a reasonable way to make the test effective...?

Meanwhile, most of the following assertions seem to be just copied from other tests and I'm not sure why they add any value here:

+            self.assertEqual(None, c.manifest_locator())
+            self.assertEqual(True, c.modified())
+            c.save_new("test_write_large")
+            self.assertEqual("zzzzz-4zz18-mockcollection0", c.manifest_locator())
+            self.assertEqual(False, c.modified())

(In particular, do we really need to check that the UUID hasn't changed? I suppose this isn't the right time to wonder why we're using a method called *_locator() to get a UUID...)

This has an interesting side effect. The first write ends up committed in its own block, because the current logic in writeto() commits the buffer block and starts a new one when (buffered data + new data) is bigger than the block size.

Indeed, I'm pretty sure it would be better in the long run to batch the writes according to max block size irrespective of the way the client chunks them on the way into write(). For example, that way "cat <a >b" and "dd if=a of=b" and "egrep . <a >b" would result in the same chunking. But I agree that should go into #3198 or some other story.
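Batching independent of client write sizes could look roughly like this (a sketch with invented names, not a proposed patch): accumulate the byte stream and cut only at block-size boundaries, so the resulting blocks depend on the data alone.

```python
def repack(writes, block_size):
    """Sketch: join all writes into one stream, then cut at block_size
    boundaries, so the blocks depend only on the byte stream, not on
    how the client chunked its write() calls."""
    stream = b"".join(writes)
    return [stream[i:i + block_size]
            for i in range(0, len(stream), block_size)]

# Different chunkings of the same 100 bytes yield identical blocks:
a = repack([b"0123456789"] * 10, 16)
b = repack([b"01234567890123456789"] * 5, 16)
print(a == b)  # True
```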

#7 Updated by Tom Clegg about 4 years ago

Sorry, cancel that... misread the MockApi as being the initial content.

The boilerplate assertions still look kind of superfluous, but that doesn't need to block the bugfix.

LGTM, thanks.

#8 Updated by Peter Amstutz about 4 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

I changed the test to do two writes:

[...]

Hm. The new test supplies a non-zero offset, which is good, but it doesn't seem to test that the offset was actually used: the file is 100000000 bytes long when the test starts, so it will still be 100000000 bytes long if offset is ignored -- for that matter, even if no writing happens at all.

That's not true; the file is zero-length when the test starts. Added an assertion to that effect.

Using "----------" * 3 in the first write(), and asserting that the manifest has the right md5sums after doing the big write, might be a reasonable way to make the test effective...?

Well, the thought was that test_write_large (which writes 1000 bytes 100000 times) and test_large_write (which writes 10 bytes and then 99999990 bytes) would produce the same manifest; then I realized that they don't, because the size of the writes affects how blocks are packed.

Meanwhile, most of the following assertions seem to be just copied from other tests and I'm not sure why they add any value here:

[...]

Those assertions aren't necessary for what we're testing here, so I simplified the test to assert on manifest_text() directly.

Indeed, I'm pretty sure it would be better in the long run to batch the writes according to max block size irrespective of the way the client chunks them on the way into write(). For example, that way "cat <a >b" and "dd if=a of=b" and "egrep . <a >b" would result in the same chunking. But I agree that should go into #3198 or some other story.

I see the argument for consistency in how blocks are packed, independent of the size of writes. We can do that in #3198 (not here).

#9 Updated by Peter Amstutz about 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:4b3b3064b87a07b2ba8035dd5c8f3660dd3b2a67.
