Project

General

Profile

Actions

Bug #11510

closed

[SDK] Support writes to offsets beyond end of file

Added by Peter Amstutz over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-

Description

Two distinct but closely related issues encountered by a customer:

  1. Using s3.download_fileobj() with an ArvadosFileWriter target: files over some threshold (likely 8MB) are arbitrarily truncated.
  2. Using "aws s3 cp" to download to a writable keep mount: a 15 GB file results in only an 8 MB file appearing it keep(!)

I believe the common theme here is that they both rely on the Python SDK and are using multipart download (request and commit chunks 8 MB at a time) which suggests some issue with the way it writes chunks non-sequentially (possibly relating to seek() or ftruncate()).


Subtasks 1 (0 open1 closed)

Task #11528: Review 11510-sdk-extend-filesResolvedPeter Amstutz04/20/2017Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Bug #10035: [FUSE] Determine why bcl2fastq doesn't work with writable keep mountResolvedPeter Amstutz09/13/2016Actions
Actions #1

Updated by Peter Amstutz over 7 years ago

  • Subject changed from S3 multipart downloads fail truncated writing to keep to S3 multipart download truncated writing to keep
  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 7 years ago

$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("ttt", "wb")
>>> f.seek(10, 1)
>>> f.tell()
10
>>> f.write("blub")
>>> f.close()
$ ls -l ttt
-rw-r--r-- 1 peter peter 14 Apr 17 16:54 ttt

Oh dear.

Actions #4

Updated by Peter Amstutz over 7 years ago

So apparently this is POSIX behavior for creating sparse files, and the Arvados Python SDK (and hence FUSE) currently don't support it (both writeto() and truncate() will throw exceptions if you try to increase the size of a file or start writing at a point past the end of the file).

This messes up tools which rely on it this behavior that we need to support. We should implement it.

Actions #5

Updated by Peter Amstutz over 7 years ago

  • Subject changed from S3 multipart download truncated writing to keep to [SDK] Support writes to offsets beyond end of file
  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz
  • Target version set to 2017-04-26 sprint
Actions #6

Updated by Tom Clegg over 7 years ago

11510-sdk-extend-files @ f5638851ddb5a859a54a16fe963ee98591af17a9
  • remove debugging print statements in arvfile.py → ArvadosFile → writeto()
  • remove unused PADDING_BLOCK_LOCATOR in config.py (or perhaps move it into arvados_testutil and use it in the test cases)
  • I don't have any particular reason to suspect this won't work, and it doesn't become any more necessary now than before, but... it seems like we should test truncating a file in such a way that one or more blocks become unreferenced.
  • test_sparse_write3 might be even more edge-casey if it left out i=3, so self.assertEqual(writer.read(), "000011112222\x00\x00\x00\x004444") ...?
  • test seek to a negative offset
  • test seek past end of file, read (returns ''), seek back to non-sparse portion using SEEK_CUR, check file size did not change
  • test return value of seek() is new file position
Actions #7

Updated by Peter Amstutz over 7 years ago

Tom Clegg wrote:

11510-sdk-extend-files @ f5638851ddb5a859a54a16fe963ee98591af17a9
  • remove debugging print statements in arvfile.py → ArvadosFile → writeto()

Fixed.

  • remove unused PADDING_BLOCK_LOCATOR in config.py (or perhaps move it into arvados_testutil and use it in the test cases)

Moved it into a docstring.

  • I don't have any particular reason to suspect this won't work, and it doesn't become any more necessary now than before, but... it seems like we should test truncating a file in such a way that one or more blocks become unreferenced.

Added test_truncate3() which does this.

  • test_sparse_write3 might be even more edge-casey if it left out i=3, so self.assertEqual(writer.read(), "000011112222\x00\x00\x00\x004444") ...?

Added test_sparse_write4() which does this.

  • test seek to a negative offset

test_seek_min_zero() tests this.

  • test seek past end of file, read (returns ''), seek back to non-sparse portion using SEEK_CUR, check file size did not change
  • test return value of seek() is new file position

Also tested in test_truncate3().

Now @ 5180238a10bd15302a1c15b9a428f2fdeeabdf4e

Actions #8

Updated by Tom Clegg over 7 years ago

Peter Amstutz wrote:

  • test return value of seek() is new file position

Also tested in test_truncate3().

seek return value is still untested and seems to be incorrect. If I add a check:

Traceback (most recent call last):
  File "/home/tom/src/arvados/sdk/python/tests/test_arvfile.py", line 169, in test_write_to_end
    self.assertEqual(5, writer.seek(5, os.SEEK_SET))
AssertionError: 5 != None

https://docs.python.org/2/library/io.html#io.IOBase.seek

Rest LGTM, thanks

Actions #9

Updated by Peter Amstutz over 7 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF