Bug #5515

Job failure due to 'arv-put' 'ConnectionError'?

Added by Abram Connelly over 4 years ago. Updated 5 months ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
03/19/2015
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

Pipeline instance tb05z-d1hrv-ojcxrzlohzyir4r fails with what looks like a 'ConnectionError' raised inside 'arv-put':

2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr Traceback (most recent call last):
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/bin/arv-put", line 4, in <module>
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     main()
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 470, in main
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     path, max_manifest_depth=args.max_manifest_depth)
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 329, in write_directory_tree
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     path, stream_name, max_manifest_depth)
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/collection.py", line 216, in write_directory_tree
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     self.do_queued_work()
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/collection.py", line 144, in do_queued_work
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     self._work_file()
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/collection.py", line 157, in _work_file
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     self.write(buf)
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/collection.py", line 471, in write
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     return super(ResumableCollectionWriter, self).write(data)
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/collection.py", line 227, in write
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     self.flush_data()
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 305, in flush_data
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     super(ArvPutCollectionWriter, self).flush_data()
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/collection.py", line 264, in flush_data
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     copies=self.replication))
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/retry.py", line 157, in num_retries_setter
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     return orig_func(self, *args, **kwargs)
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/keep.py", line 808, in put
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     data_hash, copies, thread_limiter.done()), service_errors, label="service")
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr arvados.errors.KeepWriteError: failed to write d9fcbec13e21983498de7e8a489d89c1 (wanted 2 copies but wrote 1): service http://[keep1.tb05z.arvadosapi.com]:25107/ raised ConnectionError (('Connection aborted.', timeout('timed out',)))
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr Traceback (most recent call last):
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job/src/crunch_scripts/arv-dax", line 154, in <module>
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     outcollection = upload( outdir )
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/tmp/crunch-job/src/crunch_scripts/arv-dax", line 27, in upload
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     pdh = sp.check_output( ["arv-put", "--no-progress", "--portable-data-hash", source_dir ] )
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr   File "/usr/lib/python2.7/subprocess.py", line 544, in check_output
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr     raise CalledProcessError(retcode, cmd, output=output)
2015-03-19_14:57:15 tb05z-8i9sb-vz3vjgv5l05c7w9 31965 25 stderr subprocess.CalledProcessError: Command '['arv-put', '--no-progress', '--portable-data-hash', '/tmp/crunch-job-task-work/compute0.13/output']' returned non-zero exit status 1


Related issues

Related to Arvados - Story #5468: [SDKs] Refactor arv-get/put/copy into the Python "arv" wrapper using common exception-handling and argument parsing (New)

Blocked by Arvados - Bug #5524: [Crunch] Add "arvados.errors.Keep" to list of magic strings in crunch-job that signify transient failure (Resolved)

History

#1 Updated by Tom Clegg over 4 years ago

This would have been interpreted as a temporary failure by crunch-job if:
  1. arv-put caught KeepWriteError and exited 111 (see #5468), and
  2. arv-dax caught CalledProcessError and exited with the same exit code as the called process, and
  3. the shell script calling arv-dax exited $? after an error
Or:
  • crunch-job looked for some magic strings like "arvados.errors.Keep..." and treated them like the existing magic "srun: ..." error strings, counting a subsequent failure as transient. (#5524)
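For illustration, here is a minimal Python sketch of condition 2 and of the magic-string alternative. It is not the actual arv-dax or crunch-job code: the helper names `run_and_propagate` and `looks_transient` are hypothetical, and the marker tuple holds only the "arvados.errors.Keep" prefix mentioned above. The idea is that the wrapper script exits with the child's own status instead of letting CalledProcessError surface as a generic exit 1, so a 111 from arv-put would reach the caller intact.

```python
import subprocess
import sys


def run_and_propagate(cmd):
    """Run cmd and return its stdout.

    If the child exits non-zero, exit this process with the same status
    so a caller can still see the child's original exit code.
    """
    try:
        return subprocess.check_output(cmd)
    except subprocess.CalledProcessError as e:
        sys.stderr.write("%r exited with status %d\n" % (cmd, e.returncode))
        sys.exit(e.returncode)


def looks_transient(stderr_text, markers=("arvados.errors.Keep",)):
    """Return True if any marker string occurs in the captured stderr.

    The marker list is an assumption for this sketch; it is not the list
    that crunch-job actually uses.
    """
    return any(marker in stderr_text for marker in markers)
```

In a script like arv-dax, the `sp.check_output(["arv-put", ...])` call seen in the traceback could be routed through a helper of this kind. The exit-code route only helps if every layer passes the status along, which is why the shell script in condition 3 also has to exit with `$?`.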

#2 Updated by Tom Clegg over 4 years ago

  • Status changed from New to Feedback

#3 Updated by Peter Amstutz over 4 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

#4 Updated by Tom Morris 5 months ago

  • Status changed from Feedback to Closed
  • Target version deleted (Arvados Future Sprints)

Closing as obsolete.
