Project

General

Profile

Actions

Bug #16132

closed

ce8i5 test timeouts

Added by Peter Amstutz about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-

Description

The run-cwl-test-ce8i5 job runs the CWL test suite against the test cluster ce8i5.

Tests have been failing with timeouts. There is a 900s (15m) limit on tests, so that means it is taking longer than that to either bring up a node or schedule a container.

It is unclear if the issue is Azure delay, arvados-dispatch-cloud delay, some interaction between the two (waiting for something it shouldn't), stalled ssh connections, or what.

One way to begin diagnosing this would be to start collecting prometheus metrics, and see if we can capture the delay on a graph.

Actions #1

Updated by Peter Amstutz about 4 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 4 years ago

  • Status changed from In Progress to New
Actions #3

Updated by Peter Amstutz about 4 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz about 4 years ago

  • Assigned To set to Peter Amstutz
Actions #5

Updated by Peter Amstutz about 4 years ago

ce8i5 has a really tiny quota - 10 cores. The containers that are timing out seem to be hitting the quota, that prevents starting a new compute node to run the container.

We need a way to communicate back to the user when the cluster limited by quota.

Actions #6

Updated by Peter Amstutz about 4 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF