Bug #7852

[SDKs] arvados.arvfile.StreamFileReader.readlines() returns empty list after calling .readline()

Added by Joshua Randall over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
SDKs
Target version:
Start date:
03/17/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

If one calls `readline` on an `arvados.arvfile.StreamFileReader` then a subsequent call to `readlines` will return an empty list. My expectation was that the first sfr.readline() call would give me the first line and then the sfr.readlines() call would give me the rest of the lines, but this appears to not work.


Subtasks

Task #8755: Review 7852-readline-cache-fix-wipResolvedPeter Amstutz

Associated revisions

Revision 0c7f9737 (diff)
Added by Brett Smith over 3 years ago

7156/7852: Bugfix _filepos tracking in PySDK readline method.

Make sure self._filepos is always in the right position to return
uncached data before another read method could be called, to avoid
inconsistent results when interleaving those calls with readline
calls.

Closes #7156, #7852.

Revision e66f7687
Added by Brett Smith over 3 years ago

Merge branch '7852-readline-cache-fix-wip'

Closes #7852, #8755.

History

#1 Updated by Joshua Randall over 3 years ago

FWIW, the python built-in docs for file object state that readlines() repeatedly calls readline(), so perhaps a fix is to reimplement it that way rather than via readall().

#2 Updated by Brett Smith over 3 years ago

  • Subject changed from problems with mixing readline / readlines in arvados.arvfile.StreamFileReader to [SDKs] arvados.arvfile.StreamFileReader.readlines() returns empty list after calling .readline()

#3 Updated by Brett Smith over 3 years ago

  • Target version set to Arvados Future Sprints

#4 Updated by Radhika Chippada over 3 years ago

fc308658aeeeb007fa3eb710b4bdc9179a13db10

branch 7852-test-readlines

Added a test that fails when realine and readlines are invoked on the stream reader in that sequence. The readlines call is only fetching the data after the "first block" which was read during the first readline call. It appears that self._readline_cache and self._filepos are playing a role here.

#5 Updated by Radhika Chippada over 3 years ago

d1c1f71d3d2e4170c51855d23557d58322627fd8

Added one more test "test_readline_then_readall" which is failing exactly like the previous readline_the_readlines test, for apparently the same reason.

#6 Updated by Radhika Chippada over 3 years ago

Should we reset self._readline_cache and self._filepos when a readall or a readlines call is made? Looking at the python api documentation, it seems like the readall and readlines calls are expected to read the entire file contents.

https://docs.python.org/2/library/io.html

#7 Updated by Brett Smith over 3 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:0c7f9737786fd64177f8210585ec620402d0d9b6.

#8 Updated by Brett Smith over 3 years ago

7852-readline-cache-fix-wip is up for review. It starts from Radhika's 7852-test-readlines branch, rebases that on top of current master, then adds a commit that fixes the bug and gets the tests passing.

#9 Updated by Brett Smith over 3 years ago

  • Status changed from Resolved to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2016-03-30 sprint
  • Story points set to 0.5

#10 Updated by Peter Amstutz over 3 years ago

Since this bug is about the interaction between readline() and readlines(), I'm looking at the branch now and readlines() doesn't actually use readline(). Worse, it looks like readlines() sucks in the entire file into a buffer before using splitlines() to return a list. I suggest refactoring readlines() to be an iterator over readline().

#11 Updated by Brett Smith over 3 years ago

Peter Amstutz wrote:

Since this bug is about the interaction between readline() and readlines(), I'm looking at the branch now and readlines() doesn't actually use readline().

That's how the bug was reported, but it's worth noting that the bug will strike any time a caller interleaves calls to readline() and any other file read method. This is demonstrated by Radhika's readall test, and noted in the commit message.

Worse, it looks like readlines() sucks in the entire file into a buffer before using splitlines() to return a list. I suggest refactoring readlines() to be an iterator over readline().

readlines() is a standard file method that returns a list, so changing that to an iterator would break the API. Obviously calling it in any non-trivial case is a Bad Idea, but that's for the caller to decide, not us.

Given that API, sucking in whole blocks actually improves the performance of the method, because it avoids the overhead of calling readline() many times when the Keep API currently requires us to fetch whole blocks anyway.

#12 Updated by Brett Smith over 3 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e66f7687796f4ed3513fbc5a469d578ce76334e0.

Also available in: Atom PDF