Bug #16132

ce8i5 test timeouts

Added by Peter Amstutz 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

The run-cwl-test-ce8i5 job runs the CWL test suite against the test cluster ce8i5.

Tests have been failing with timeouts. There is a 900s (15m) limit on tests, so that means it is taking longer than that to either bring up a node or schedule a container.

It is unclear if the issue is Azure delay, arvados-dispatch-cloud delay, some interaction between the two (waiting for something it shouldn't), stalled ssh connections, or what.

One way to begin diagnosing this would be to start collecting prometheus metrics, and see if we can capture the delay on a graph.

History

#1 Updated by Peter Amstutz 8 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 8 months ago

  • Status changed from In Progress to New

#3 Updated by Peter Amstutz 8 months ago

  • Description updated (diff)

#4 Updated by Peter Amstutz 8 months ago

  • Assigned To set to Peter Amstutz

#5 Updated by Peter Amstutz 8 months ago

ce8i5 has a really tiny quota - 10 cores. The containers that are timing out seem to be hitting the quota, that prevents starting a new compute node to run the container.

We need a way to communicate back to the user when the cluster limited by quota.

#6 Updated by Peter Amstutz 7 months ago

  • Status changed from New to Resolved

Also available in: Atom PDF