Bug #4661
Closed: [SDKs] Python Keep client's retry/rescue should not make an OOM exception look like a Keep problem
Status: Resolved
Priority: Normal
Assigned To: -
Category: SDKs
Target version: -
Story points: -
Updated by Tom Clegg about 10 years ago
- Category set to SDKs
Example: 9tee4-8i9sb-n2nvt7slia8m0im
The real problem was that Python ran out of memory while trying to write to Keep (the same job failed several times with max_tasks=20 and succeeded with max_tasks=5), but the log makes it look like a Keep problem. If the "wanted 2 but wrote 1" message propagated the error that caused the second write to fail, this sort of problem would be much easier to diagnose.
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr Traceback (most recent call last):
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr File "/tmp/crunch-src/crunch_scripts/addRefMemEff.py", line 143, in <module>
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr output_id = out.finish()
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 538, in finish
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr return self._my_keep().put(self.manifest_text())
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 157, in num_retries_setter
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr return orig_func(self, *args, **kwargs)
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 719, in put
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr (data_hash, copies, thread_limiter.done()))
2014-11-21_23:02:43 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr arvados.errors.KeepWriteError: Write fail for 4ff205e7317925f2f92ee4a7c8bb8980: wanted 2 but wrote 1
2014-11-21_23:02:44 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr 2014/11/21 23:05:24 Error response from daemon: Cannot destroy container 4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898: Driver aufs failed to remove root filesystem 4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898: rename /tmp/docker/aufs/diff/4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898 /tmp/docker/aufs/diff/4740a8f66ce2a16550e98b480b3bbb3da997ae3391c2c39d40e873190fa0c898-removing: device or resource busy
2014-11-21_23:02:45 9tee4-8i9sb-n2nvt7slia8m0im 2705 13 stderr srun: error: compute0: task 0: Exited with exit code 1
As an aside: keepstore logged 500, probably because it didn't receive the entire data block. Unfortunately it doesn't currently log an error message, just the HTTP status code and the number of bytes in the response. It would be nice to fix that too.
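Back on the Python side, here is a rough sketch of what propagating the underlying error into the "wanted 2 but wrote 1" message might look like. This is not the actual arvados SDK code: `WriteError`, `put_with_copies`, and the `writers` list of callables are all hypothetical names made up for illustration, and the sketch uses Python 3 syntax rather than the Python 2.7 seen in the traceback.

```python
class WriteError(Exception):
    """Hypothetical stand-in for arvados.errors.KeepWriteError."""

    def __init__(self, message, causes=()):
        super().__init__(message)
        # Keep the per-attempt exceptions so callers and logs can see them.
        self.causes = list(causes)

    def __str__(self):
        base = super().__str__()
        if not self.causes:
            return base
        details = "; ".join(
            "{}: {}".format(type(e).__name__, e) for e in self.causes)
        return "{} (underlying errors: {})".format(base, details)


def put_with_copies(data_hash, writers, copies):
    """Call writers until `copies` succeed; explain failures otherwise.

    `writers` is a hypothetical list of zero-argument callables. Each one
    writes a single replica and raises an exception on failure.
    """
    done = 0
    failures = []
    for write in writers:
        if done >= copies:
            break
        try:
            write()
            done += 1
        except Exception as error:  # MemoryError is an Exception subclass
            failures.append(error)
    if done < copies:
        raise WriteError(
            "Write fail for {}: wanted {} but wrote {}".format(
                data_hash, copies, done),
            failures)
    return done


if __name__ == "__main__":
    def ok():
        pass

    def oom():
        raise MemoryError("out of memory")

    try:
        put_with_copies("4ff205e7317925f2f92ee4a7c8bb8980", [ok, oom], 2)
    except WriteError as error:
        # Prints the usual message followed by
        # "(underlying errors: MemoryError: out of memory)"
        print(error)
```

With something like this, the final stderr line would name the `MemoryError` instead of leaving only the replica count. An alternative would be to not catch `MemoryError` at all and let it propagate unchanged; which is better probably depends on how the retry logic is structured.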
Updated by Brett Smith almost 10 years ago
- Status changed from New to Resolved
Applied in changeset arvados|commit:952bfa87465a27f83dca7feca7d369fda4200eb5.