Bug #8229

Node failure on compute0

Added by Bryan Cosca about 8 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

https://workbench.wx7k5.arvadosapi.com/pipeline_instances/wx7k5-d1hrv-lxz185njaw51kab#Log

2016-01-19_18:22:56 salloc: error: Node failure on compute0
2016-01-19_18:22:56 salloc: Job allocation 5212 has been revoked.
2016-01-19_18:22:56 wx7k5-8i9sb-aokuuio6bbjjk82 24364 1 stderr srun: error: Node failure on compute0
2016-01-19_18:22:56 wx7k5-8i9sb-aokuuio6bbjjk82 24364  backing off node compute0 for 60 seconds
2016-01-19_18:22:56 wx7k5-8i9sb-aokuuio6bbjjk82 24364 1 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-19_18:22:56 wx7k5-8i9sb-aokuuio6bbjjk82 24364 1 child 26911 on compute0.1 exit 0 success=
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364 1 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364 1 failure (#1, temporary) after 88627 seconds
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364 1 task output (0 bytes):
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364  status: 1 done, 0 running, 1 todo
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364  Every node has failed -- giving up
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364  wait for last 0 children to finish
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364  collate
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364  collated output manifest text to send to API server is 0 bytes with access tokens
2016-01-19_18:22:57 wx7k5-8i9sb-aokuuio6bbjjk82 24364  job output d41d8cd98f00b204e9800998ecf8427e+0
2016-01-19_18:23:27 Traceback (most recent call last):
2016-01-19_18:23:27 File "/usr/local/bin/arv-put", line 4, in <module>
2016-01-19_18:23:27 main()
2016-01-19_18:23:27 File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 488, in main
2016-01-19_18:23:27 writer.finish_current_stream()
2016-01-19_18:23:27 File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 318, in finish_current_stream
2016-01-19_18:23:27 self.flush_data()
2016-01-19_18:23:27 File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 310, in flush_data
2016-01-19_18:23:27 super(ArvPutCollectionWriter, self).flush_data()
2016-01-19_18:23:27 File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 264, in flush_data
2016-01-19_18:23:27 copies=self.replication))
2016-01-19_18:23:27 File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
2016-01-19_18:23:27 return orig_func(self, *args, **kwargs)
2016-01-19_18:23:27 File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 1063, in put
2016-01-19_18:23:27 data_hash, copies, thread_limiter.done()), service_errors, label="service")
2016-01-19_18:23:27 arvados.errors.KeepWriteError: failed to write 6afd3a1524d5b92acb906b21ec5aa7da (wanted 2 copies but wrote 0): service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue
2016-01-19_18:23:27 HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue
2016-01-19_18:23:27 HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue
2016-01-19_18:23:27 HTTP/1.1 503 Service Unavailable; service http://keep5.wx7k5.arvadosapi.com:25107/ responded with 0 (28, 'Connection timed out after 2001 milliseconds')
2016-01-19_18:23:27 wx7k5-8i9sb-aokuuio6bbjjk82 24364  log_writer_finish: arv-put exited 1
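
Note on reading the log above: the following is a minimal sketch, not part of the original report, for summarizing two details of the failure. It assumes the KeepWriteError text has had its timestamp prefixes removed and its wrapped lines rejoined, and that the job output locator has the form "MD5 hex digest, plus sign, size in bytes". The names KEEP_ERROR, keep_service_statuses and empty_locator are hypothetical helpers, not Arvados SDK functions.

```python
import hashlib
import re

# KeepWriteError text copied from the log, with timestamps removed and
# wrapped lines rejoined (an assumption about how the message was split).
KEEP_ERROR = (
    "failed to write 6afd3a1524d5b92acb906b21ec5aa7da "
    "(wanted 2 copies but wrote 0): "
    "service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 "
    "HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable; "
    "service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 "
    "HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable; "
    "service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 "
    "HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable; "
    "service http://keep5.wx7k5.arvadosapi.com:25107/ responded with 0 "
    "(28, 'Connection timed out after 2001 milliseconds')"
)

# Matches "service <url> responded with <code>" (lowercase "service" only,
# so "Service Unavailable" is not picked up).
SERVICE_RE = re.compile(r"service (\S+) responded with (\d+)")


def keep_service_statuses(message):
    """Hypothetical helper: map each Keep service URL to its reported status."""
    return {url: int(code) for url, code in SERVICE_RE.findall(message)}


def empty_locator():
    """Hypothetical helper: locator for zero bytes of content (MD5 + size)."""
    return hashlib.md5(b"").hexdigest() + "+0"


if __name__ == "__main__":
    for url, code in sorted(keep_service_statuses(KEEP_ERROR).items()):
        print(url, code)
    print(empty_locator())
```

Read this way, three Keep services answered 503 and keep5 timed out (reported as status 0), so no copies of the log block were written. The job output locator in the log equals the locator of empty content, which is consistent with the 0-byte output manifest reported just before it.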
#1

Updated by Peter Amstutz almost 7 years ago

  • Status changed from New to Closed