Bug #5856

gzip.open(f) sometimes throws struct.error (unpack requires a string argument of length 4) on File handle from CollectionReader

Added by Sarah Guthrie over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
04/29/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

One of the tasks from https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-pfu68urmcrq00mo failed with this error
Three tasks from https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-8sgdxqj8dq8tqxa failed with this error as well

2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr Traceback (most recent call last):
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr File "/tmp/crunch-job/src/crunch_scripts/add_callsets.py", line 234, in <module>
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr for line in callset_fastj_file_handle:
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr File "/usr/lib/python2.7/gzip.py", line 455, in readline
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr c = self.read(readsize)
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr File "/usr/lib/python2.7/gzip.py", line 261, in read
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr self._read(readsize)
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr File "/usr/lib/python2.7/gzip.py", line 325, in _read
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr self._read_eof()
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr File "/usr/lib/python2.7/gzip.py", line 344, in _read_eof
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr isize = read32(self.fileobj) # may exceed 2GB
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr File "/usr/lib/python2.7/gzip.py", line 25, in read32
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr return struct.unpack("<I", input.read(4))[0]
2015-04-29_17:40:17 su92l-8i9sb-bb7pxsb146b8suq 720 182 stderr struct.error: unpack requires a string argument of length 4

Log file: https://workbench.su92l.arvadosapi.com/collections/bb3292a7fb7b482be8dbfb75690f804f+133/su92l-8i9sb-bb7pxsb146b8suq.log.txt?disposition=inline&size=79308197

zcat-ing the file that broke the task from the first pipeline (using keepmount) did not produce any errors


Subtasks

Task #5864: Review 5856-read-exactResolvedPeter Amstutz

Associated revisions

Revision c807aef7 (diff)
Added by Peter Amstutz over 6 years ago

ArvadosFileReader object always tries to return the exact amount of data asked for.

This is to avoid breaking stuff like gzip that assumes it always gets the exact
amount of data unless EOF. (From the Python file docs for read(): Note that
this method may call the underlying C function fread() more than once in an
effort to acquire as close to size bytes as possible.) closes #5856

Revision 4570521e (diff)
Added by Peter Amstutz over 6 years ago

ArvadosFileReader object always tries to return the exact amount of data asked for.

This is to avoid breaking stuff like gzip that assumes it always gets the exact
amount of data unless EOF. (From the Python file docs for read(): Note that
this method may call the underlying C function fread() more than once in an
effort to acquire as close to size bytes as possible.) closes #5856

Revision dffb8b5e
Added by Peter Amstutz over 6 years ago

Merge branch '5856-read-exact' closes #5856

History

#1 Updated by Peter Amstutz over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:c807aef7059b253512ba5da5d535e89e800b9c11.

#2 Updated by Peter Amstutz over 6 years ago

  • Status changed from Resolved to In Progress

#3 Updated by Peter Amstutz over 6 years ago

  • Target version changed from Bug Triage to 2015-05-20 sprint

#4 Updated by Peter Amstutz over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:4570521e9d639806b1f25c93822d5cf0dbfe42d0.

#5 Updated by Peter Amstutz over 6 years ago

  • Status changed from Resolved to In Progress
  • Assigned To set to Peter Amstutz

#6 Updated by Brett Smith over 6 years ago

Reviewing 4570521

I'm not sure it makes much sense for the tests to manipulate sfile.arvadosfile directly. While it's not marked as private, it's not really part of the interface users are expected to use. What do you think of this:

  • Rename test_read_returns_first_block to test_read_block_crossing_behavior in both test case classes. In ArvadosFileReaderTestCase, this would have the code you added for test_read_crosses_blocks. This way the name is sensible in both places and we don't have to try to erase one in the child.
  • test_successive_reads is meant to ensure that state is maintained correctly across calls, so calling the stateless readfrom misses the mark. What if it called sfile.read(4) multiple times and checked for correct results until EOF? That would maintain the spirit of the test, and double-check that block crossing happens correctly.

In the docstring, "upto" should be two words.

Thanks.

#7 Updated by Peter Amstutz over 6 years ago

  • Status changed from In Progress to Resolved
  • Assigned To deleted (Peter Amstutz)

Brett Smith wrote:

Reviewing 4570521

I'm not sure it makes much sense for the tests to manipulate sfile.arvadosfile directly. While it's not marked as private, it's not really part of the interface users are expected to use. What do you think of this:

  • Rename test_read_returns_first_block to test_read_block_crossing_behavior in both test case classes. In ArvadosFileReaderTestCase, this would have the code you added for test_read_crosses_blocks. This way the name is sensible in both places and we don't have to try to erase one in the child.
  • test_successive_reads is meant to ensure that state is maintained correctly across calls, so calling the stateless readfrom misses the mark. What if it called sfile.read(4) multiple times and checked for correct results until EOF? That would maintain the spirit of the test, and double-check that block crossing happens correctly.

Done.

In the docstring, "upto" should be two words.

Fixed.

#8 Updated by Brett Smith over 6 years ago

  • Assigned To set to Peter Amstutz

c42ccd1f is good to merge. Thanks.

Also available in: Atom PDF