Bug #7852
closed
[SDKs] arvados.arvfile.StreamFileReader.readlines() returns empty list after calling .readline()
Added by Joshua Randall over 9 years ago.
Updated about 9 years ago.
Description
If one calls `readline` on an `arvados.arvfile.StreamFileReader` then a subsequent call to `readlines` will return an empty list. My expectation was that the first sfr.readline() call would give me the first line and then the sfr.readlines() call would give me the rest of the lines, but this appears to not work.
FWIW, the python built-in docs for file object state that readlines() repeatedly calls readline(), so perhaps a fix is to reimplement it that way rather than via readall().
- Subject changed from problems with mixing readline / readlines in arvados.arvfile.StreamFileReader to [SDKs] arvados.arvfile.StreamFileReader.readlines() returns empty list after calling .readline()
- Target version set to Arvados Future Sprints
fc308658aeeeb007fa3eb710b4bdc9179a13db10
branch 7852-test-readlines
Added a test that fails when realine and readlines are invoked on the stream reader in that sequence. The readlines call is only fetching the data after the "first block" which was read during the first readline call. It appears that self._readline_cache and self._filepos are playing a role here.
Should we reset self._readline_cache and self._filepos when a readall or a readlines call is made? Looking at the python api documentation, it seems like the readall and readlines calls are expected to read the entire file contents.
https://docs.python.org/2/library/io.html
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:0c7f9737786fd64177f8210585ec620402d0d9b6.
7852-readline-cache-fix-wip is up for review. It starts from Radhika's 7852-test-readlines branch, rebases that on top of current master, then adds a commit that fixes the bug and gets the tests passing.
- Status changed from Resolved to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2016-03-30 sprint
- Story points set to 0.5
Since this bug is about the interaction between readline() and readlines(), I'm looking at the branch now and readlines() doesn't actually use readline(). Worse, it looks like readlines() sucks in the entire file into a buffer before using splitlines() to return a list. I suggest refactoring readlines() to be an iterator over readline().
Peter Amstutz wrote:
Since this bug is about the interaction between readline() and readlines(), I'm looking at the branch now and readlines() doesn't actually use readline().
That's how the bug was reported, but it's worth noting that the bug will strike any time a caller interleaves calls to readline() and any other file read method. This is demonstrated by Radhika's readall test, and noted in the commit message.
Worse, it looks like readlines() sucks in the entire file into a buffer before using splitlines() to return a list. I suggest refactoring readlines() to be an iterator over readline().
readlines() is a standard file method that returns a list, so changing that to an iterator would break the API. Obviously calling it in any non-trivial case is a Bad Idea, but that's for the caller to decide, not us.
Given that API, sucking in whole blocks actually improves the performance of the method, because it avoids the overhead of calling readline() many times when the Keep API currently requires us to fetch whole blocks anyway.
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:e66f7687796f4ed3513fbc5a469d578ce76334e0.
Also available in: Atom
PDF