Bug #13113
arvados-cwl-runner error while getting output object
Status: Closed
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release:
Release relationship: Auto
Description
We are occasionally (when the API server is under heavy load) getting errors such as this from arvados-cwl-runner (a-c-r):
2018-02-21T11:49:31.321102281Z arvados.cwl-runner INFO: [container haplotype_caller] reused container ncucu-dz642-f0nyoalc70zb00d
2018-02-21T11:49:37.106085474Z arvados.cwl-runner ERROR: [container haplotype_caller] while getting output object: <HttpError 502 when requesting https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/collections/b32ff390527c3570482e73bbbdf8fe3f%2B307?alt=json returned "Bad Gateway">
2018-02-21T11:49:37.106085474Z Traceback (most recent call last):
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/arvcontainer.py", line 281, in done
2018-02-21T11:49:37.106085474Z     outputs = done.done_outputs(self, container, "/tmp", self.outdir, "/keep")
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/done.py", line 53, in done_outputs
2018-02-21T11:49:37.106085474Z     return self.collect_outputs("keep:" + record["output"])
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/cwltool/draft2tool.py", line 519, in collect_output_ports
2018-02-21T11:49:37.106085474Z     if fs_access.exists(custom_output):
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/fsaccess.py", line 123, in exists
2018-02-21T11:49:37.106085474Z     collection, rest = self.get_collection(fn)
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/fsaccess.py", line 82, in get_collection
2018-02-21T11:49:37.106085474Z     return (self.collection_cache.get(pdh), sp[1] if len(sp) == 2 else None)
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/fsaccess.py", line 57, in get
2018-02-21T11:49:37.106085474Z     keep_client=self.keep_client)
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1662, in __init__
2018-02-21T11:49:37.106085474Z     super(CollectionReader, self).__init__(manifest_locator_or_text, *args, **kwargs)
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1259, in __init__
2018-02-21T11:49:37.106085474Z     self._populate()
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1352, in _populate
2018-02-21T11:49:37.106085474Z     self._populate_from_api_server()
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1339, in _populate_from_api_server
2018-02-21T11:49:37.106085474Z     num_retries=self.num_retries))
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/oauth2client/util.py", line 140, in positional_wrapper
2018-02-21T11:49:37.106085474Z     return wrapped(*args, **kwargs)
2018-02-21T11:49:37.106085474Z   File "/usr/lib/python2.7/dist-packages/googleapiclient/http.py", line 840, in execute
2018-02-21T11:49:37.106085474Z     raise HttpError(resp, content, uri=self.uri)
2018-02-21T11:49:37.106085474Z ApiError: <HttpError 502 when requesting https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/collections/b32ff390527c3570482e73bbbdf8fe3f%2B307?alt=json returned "Bad Gateway">
These failures do not appear to be retried by a-c-r. It looks like a-c-r does not pass `num_retries` when it creates a `CollectionReader` in fsaccess, so the 502 is raised on the first attempt: https://github.com/wtsi-hgi/arvados/blob/master/sdk/cwl/arvados_cwl/fsaccess.py#L56-L57
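As a rough illustration of the idea (not the actual patch), a collection cache could forward a retry count to every reader it creates. The class name `RetryingCollectionCache`, the `reader_factory` hook and the default of 4 retries are assumptions made for this sketch; only the `api_client`, `keep_client` and `num_retries` keyword arguments to `CollectionReader` come from the SDK and the traceback above.

```python
import threading


class RetryingCollectionCache(object):
    """Hypothetical sketch of a cache like the one in fsaccess.py.

    Not the real arvados_cwl class; it only shows num_retries being
    forwarded to the reader.
    """

    def __init__(self, api_client, keep_client, num_retries=4,
                 reader_factory=None):
        self.api_client = api_client
        self.keep_client = keep_client
        self.num_retries = num_retries  # default of 4 is an assumption
        # reader_factory is an illustrative hook so the sketch can be
        # exercised without a cluster; None means "use the SDK class".
        self._reader_factory = reader_factory
        self._collections = {}
        self._lock = threading.Lock()

    def _make_reader(self, pdh):
        factory = self._reader_factory
        if factory is None:
            # Imported lazily so the sketch loads without the SDK installed.
            import arvados.collection
            factory = arvados.collection.CollectionReader
        # The point of the sketch: num_retries is passed along, so
        # transient API errors can be retried by the SDK.
        return factory(pdh,
                       api_client=self.api_client,
                       keep_client=self.keep_client,
                       num_retries=self.num_retries)

    def get(self, pdh):
        # Create each reader once and reuse it on later lookups.
        with self._lock:
            if pdh not in self._collections:
                self._collections[pdh] = self._make_reader(pdh)
            return self._collections[pdh]
```

With a cache built this way, a lookup such as `cache.get("b32ff390527c3570482e73bbbdf8fe3f+307")` would construct the reader with the configured retry count instead of the SDK default.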
I will submit a pull request to address this.