Bug #11510

[SDK] Support writes to offsets beyond end of file

Added by Peter Amstutz 2 months ago. Updated about 1 month ago.

Status:ResolvedStart date:04/20/2017
Priority:NormalDue date:
Assignee:Peter Amstutz% Done:

100%

Category:-
Target version:2017-04-26 sprint
Story points-Remaining (hours)0.00 hour
Velocity based estimate-

Description

Two distinct but closely related issues encountered by a customer:

  1. Using s3.download_fileobj() with an ArvadosFileWriter target: files over some threshold (likely 8MB) are arbitrarily truncated.
  2. Using "aws s3 cp" to download to a writable keep mount: a 15 GB file results in only an 8 MB file appearing it keep(!)

I believe the common theme here is that they both rely on the Python SDK and are using multipart download (request and commit chunks 8 MB at a time) which suggests some issue with the way it writes chunks non-sequentially (possibly relating to seek() or ftruncate()).


Subtasks

Task #11528: Review 11510-sdk-extend-filesResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #10035: [FUSE] Determine why bcl2fastq doesn't work with writable... In Progress 09/13/2016

Associated revisions

Revision b0463827
Added by Peter Amstutz 2 months ago

Merge branch '11510-sdk-extend-files' refs #11510

History

#1 Updated by Peter Amstutz 2 months ago

  • Subject changed from S3 multipart downloads fail truncated writing to keep to S3 multipart download truncated writing to keep
  • Description updated (diff)

#2 Updated by Peter Amstutz 2 months ago

  • Description updated (diff)

#3 Updated by Peter Amstutz 2 months ago

$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("ttt", "wb")
>>> f.seek(10, 1)
>>> f.tell()
10
>>> f.write("blub")
>>> f.close()
$ ls -l ttt
-rw-r--r-- 1 peter peter 14 Apr 17 16:54 ttt

Oh dear.

#4 Updated by Peter Amstutz 2 months ago

So apparently this is POSIX behavior for creating sparse files, and the Arvados Python SDK (and hence FUSE) currently don't support it (both writeto() and truncate() will throw exceptions if you try to increase the size of a file or start writing at a point past the end of the file).

This messes up tools which rely on it this behavior that we need to support. We should implement it.

#5 Updated by Peter Amstutz 2 months ago

  • Subject changed from S3 multipart download truncated writing to keep to [SDK] Support writes to offsets beyond end of file
  • Status changed from New to In Progress
  • Assignee set to Peter Amstutz
  • Target version set to 2017-04-26 sprint

#6 Updated by Tom Clegg 2 months ago

11510-sdk-extend-files @ f5638851ddb5a859a54a16fe963ee98591af17a9
  • remove debugging print statements in arvfile.py → ArvadosFile → writeto()
  • remove unused PADDING_BLOCK_LOCATOR in config.py (or perhaps move it into arvados_testutil and use it in the test cases)
  • I don't have any particular reason to suspect this won't work, and it doesn't become any more necessary now than before, but... it seems like we should test truncating a file in such a way that one or more blocks become unreferenced.
  • test_sparse_write3 might be even more edge-casey if it left out i=3, so self.assertEqual(writer.read(), "000011112222\x00\x00\x00\x004444") ...?
  • test seek to a negative offset
  • test seek past end of file, read (returns ''), seek back to non-sparse portion using SEEK_CUR, check file size did not change
  • test return value of seek() is new file position

#7 Updated by Peter Amstutz 2 months ago

Tom Clegg wrote:

11510-sdk-extend-files @ f5638851ddb5a859a54a16fe963ee98591af17a9
  • remove debugging print statements in arvfile.py → ArvadosFile → writeto()

Fixed.

  • remove unused PADDING_BLOCK_LOCATOR in config.py (or perhaps move it into arvados_testutil and use it in the test cases)

Moved it into a docstring.

  • I don't have any particular reason to suspect this won't work, and it doesn't become any more necessary now than before, but... it seems like we should test truncating a file in such a way that one or more blocks become unreferenced.

Added test_truncate3() which does this.

  • test_sparse_write3 might be even more edge-casey if it left out i=3, so self.assertEqual(writer.read(), "000011112222\x00\x00\x00\x004444") ...?

Added test_sparse_write4() which does this.

  • test seek to a negative offset

test_seek_min_zero() tests this.

  • test seek past end of file, read (returns ''), seek back to non-sparse portion using SEEK_CUR, check file size did not change
  • test return value of seek() is new file position

Also tested in test_truncate3().

Now @ 5180238a10bd15302a1c15b9a428f2fdeeabdf4e

#8 Updated by Tom Clegg 2 months ago

Peter Amstutz wrote:

  • test return value of seek() is new file position

Also tested in test_truncate3().

seek return value is still untested and seems to be incorrect. If I add a check:

Traceback (most recent call last):
  File "/home/tom/src/arvados/sdk/python/tests/test_arvfile.py", line 169, in test_write_to_end
    self.assertEqual(5, writer.seek(5, os.SEEK_SET))
AssertionError: 5 != None

https://docs.python.org/2/library/io.html#io.IOBase.seek

Rest LGTM, thanks

#9 Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF