Bug #4661

[SDKs] Python Keep client's retry/rescue should not make an OOM exception look like a Keep problem

Added by Tom Clegg over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: SDKs
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Related issues

Is duplicate of Arvados - Bug #3835: [SDKs] Python and CLI tools should give more helpful error messages after a Keep failure (Resolved, 01/13/2015)

History

#1 Updated by Tom Clegg over 5 years ago

  • Category set to SDKs

Example: 9tee4-8i9sb-n2nvt7slia8m0im

The real problem was that Python ran out of memory while trying to write to Keep (the same job failed several times with max_tasks=20 and succeeded with max_tasks=5), but the log makes it look like a Keep problem. If the "wanted 2 but wrote 1" message propagated the error that caused the second write to fail, this sort of problem would be much easier to diagnose.

2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr Traceback (most recent call last):
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr   File "/tmp/crunch-src/crunch_scripts/addRefMemEff.py", line 143, in <module>
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr     output_id = out.finish()
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 538, in finish
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr     return self._my_keep().put(self.manifest_text())
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 157, in num_retries_setter
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr     return orig_func(self, *args, **kwargs)
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 719, in put
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr     (data_hash, copies, thread_limiter.done()))
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr arvados.errors.KeepWriteError: Write fail for 4ff205e7317925f2f92ee4a7c8bb8980: wanted 2 but wrote 1
2014-11-21_23:02:44 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr 2014/11/21 23:05:24 Error response from daemon: Cannot destroy container 4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898: Driver aufs failed to remove root filesystem 4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898: rename /tmp/docker/aufs/diff/4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898 /tmp/docker/aufs/diff/4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898-removing: device or resource busy
2014-11-21_23:02:45 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr srun: error: compute0: task 0: Exited with exit code 1
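
To illustrate the suggestion above, here is a minimal sketch of a write loop that keeps the exception behind each failed copy and includes it in the "wanted N but wrote M" message. It is not the SDK's code: ReplicaWriteError, put_copies, and the (name, write) writer pairs are hypothetical names made up for this example.

```python
class ReplicaWriteError(Exception):
    """Hypothetical error that records why each failed write attempt failed."""

    def __init__(self, data_hash, wanted, wrote, causes):
        self.data_hash = data_hash
        self.wanted = wanted
        self.wrote = wrote
        self.causes = list(causes)  # (service name, exception) pairs
        detail = "; ".join(
            "{}: {}: {}".format(name, type(exc).__name__, exc)
            for name, exc in self.causes
        ) or "no per-service errors recorded"
        super(ReplicaWriteError, self).__init__(
            "Write fail for {}: wanted {} but wrote {} ({})".format(
                data_hash, wanted, wrote, detail))


def put_copies(data_hash, data, writers, wanted):
    """Try each writer in turn until `wanted` copies succeed.

    `writers` is a list of (name, callable) pairs (an assumption of this
    sketch). Every failure is remembered so the final error can report it.
    """
    wrote = 0
    causes = []
    for name, write in writers:
        if wrote >= wanted:
            break
        try:
            write(data_hash, data)
            wrote += 1
        except Exception as exc:  # MemoryError is a subclass of Exception
            causes.append((name, exc))
    if wrote < wanted:
        raise ReplicaWriteError(data_hash, wanted, wrote, causes)
    return wrote
```

With something like this, a failure caused by running out of memory would show "MemoryError" next to "wanted 2 but wrote 1" instead of leaving only the Keep-level summary. An alternative design would be to let MemoryError propagate immediately rather than recording it and continuing.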

As an aside: keepstore logged a 500 status, probably because it didn't receive the entire data block. Unfortunately, it currently logs only the HTTP status code and the number of bytes in the response, not an error message. It would be nice to fix that too.

#2 Updated by Tom Clegg over 5 years ago

  • Target version deleted (Bug Triage)

#3 Updated by Tom Clegg over 5 years ago

  • Story points deleted (0.5)

#4 Updated by Brett Smith over 5 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:952bfa87465a27f83dca7feca7d369fda4200eb5.
