Bug #11510
closed
[SDK] Support writes to offsets beyond end of file
Added by Peter Amstutz over 7 years ago.
Updated over 7 years ago.
Description
Two distinct but closely related issues encountered by a customer:
- Using s3.download_fileobj() with an ArvadosFileWriter target: files over some threshold (likely 8MB) are arbitrarily truncated.
- Using "aws s3 cp" to download to a writable keep mount: a 15 GB file results in only an 8 MB file appearing it keep(!)
I believe the common theme here is that they both rely on the Python SDK and are using multipart download (request and commit chunks 8 MB at a time) which suggests some issue with the way it writes chunks non-sequentially (possibly relating to seek() or ftruncate()).
- Subject changed from S3 multipart downloads fail truncated writing to keep to S3 multipart download truncated writing to keep
- Description updated (diff)
- Description updated (diff)
$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("ttt", "wb")
>>> f.seek(10, 1)
>>> f.tell()
10
>>> f.write("blub")
>>> f.close()
$ ls -l ttt
-rw-r--r-- 1 peter peter 14 Apr 17 16:54 ttt
Oh dear.
So apparently this is POSIX behavior for creating sparse files, and the Arvados Python SDK (and hence FUSE) currently don't support it (both writeto() and truncate() will throw exceptions if you try to increase the size of a file or start writing at a point past the end of the file).
This messes up tools which rely on it this behavior that we need to support. We should implement it.
- Subject changed from S3 multipart download truncated writing to keep to [SDK] Support writes to offsets beyond end of file
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
- Target version set to 2017-04-26 sprint
11510-sdk-extend-files @
f5638851ddb5a859a54a16fe963ee98591af17a9
- remove debugging print statements in arvfile.py → ArvadosFile → writeto()
- remove unused PADDING_BLOCK_LOCATOR in config.py (or perhaps move it into arvados_testutil and use it in the test cases)
- I don't have any particular reason to suspect this won't work, and it doesn't become any more necessary now than before, but... it seems like we should test truncating a file in such a way that one or more blocks become unreferenced.
- test_sparse_write3 might be even more edge-casey if it left out i=3, so
self.assertEqual(writer.read(), "000011112222\x00\x00\x00\x004444")
...?
- test seek to a negative offset
- test seek past end of file, read (returns ''), seek back to non-sparse portion using SEEK_CUR, check file size did not change
- test return value of seek() is new file position
Tom Clegg wrote:
11510-sdk-extend-files @ f5638851ddb5a859a54a16fe963ee98591af17a9
- remove debugging print statements in arvfile.py → ArvadosFile → writeto()
Fixed.
- remove unused PADDING_BLOCK_LOCATOR in config.py (or perhaps move it into arvados_testutil and use it in the test cases)
Moved it into a docstring.
- I don't have any particular reason to suspect this won't work, and it doesn't become any more necessary now than before, but... it seems like we should test truncating a file in such a way that one or more blocks become unreferenced.
Added test_truncate3() which does this.
- test_sparse_write3 might be even more edge-casey if it left out i=3, so
self.assertEqual(writer.read(), "000011112222\x00\x00\x00\x004444")
...?
Added test_sparse_write4() which does this.
- test seek to a negative offset
test_seek_min_zero() tests this.
- test seek past end of file, read (returns ''), seek back to non-sparse portion using SEEK_CUR, check file size did not change
- test return value of seek() is new file position
Also tested in test_truncate3().
Now @ 5180238a10bd15302a1c15b9a428f2fdeeabdf4e
Peter Amstutz wrote:
- test return value of seek() is new file position
Also tested in test_truncate3().
seek return value is still untested and seems to be incorrect. If I add a check:
Traceback (most recent call last):
File "/home/tom/src/arvados/sdk/python/tests/test_arvfile.py", line 169, in test_write_to_end
self.assertEqual(5, writer.seek(5, os.SEEK_SET))
AssertionError: 5 != None
https://docs.python.org/2/library/io.html#io.IOBase.seek
Rest LGTM, thanks
- Status changed from In Progress to Resolved
Also available in: Atom
PDF