Bug #11097

Updated by Tom Clegg almost 5 years ago

h2. Background

Sometimes, running the same container twice on the same inputs can result in two successes with two different outputs. This can mean a number of things, including:
* undetected failure in one or both cases, perhaps resulting in bogus output
* both outputs are correct, but have non-meaningful differences (like an "output produced at {timestamp}" comment in an output file)

The second case is common in practice.

Currently, the API server disables the container re-use logic entirely when it detects that two re-use candidates produced different outputs. This causes the following undesirable pattern:
# Run container "X" as part of a workflow w1
# Re-use container "X" automatically in subsequent workflows w2..w5, saving time
# Run workflow w4 with re-use disabled, e.g., to get runtime stats or verify reproducibility -- this runs container "X1" which is identical to "X" but produces different (but still correct) output
# Run workflows w5..w9 with re-use enabled
# Oops, even when re-running workflow w5, container "X" is not eligible for reuse ever again, because "X1" exists.
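
The pattern above can be reproduced with a small standalone sketch. It does not use the real API server models: @Candidate@ and @reusable_under_current_check@ are hypothetical names, and the only behavior assumed is the one described above, namely that re-use is refused unless all candidates share a single output.

```ruby
# Hypothetical stand-in for a finished container record (not the Arvados model).
Candidate = Struct.new(:name, :output, :finished_at)

# Sketch of the current behavior: give up on re-use entirely unless
# every candidate produced the same output.
def reusable_under_current_check(candidates)
  distinct_outputs = candidates.map(&:output).uniq
  return nil if distinct_outputs.length != 1
  candidates.first
end

x  = Candidate.new("X",  "out1", 1)
x1 = Candidate.new("X1", "out2", 2)

# While X is the only candidate, it is eligible for re-use.
before = reusable_under_current_check([x])
# Once X1 exists with a different output, nothing is eligible any more.
after = reusable_under_current_check([x, x1])
```

In this sketch @before@ is the "X" record and @after@ is @nil@, which corresponds to the last step of the list: the mere existence of "X1" blocks re-use of "X".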

h2. Desired behavior

Use the oldest matching container whose output and log collections exist, aren't trashed, and are readable by the current user.
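
A minimal sketch of that selection rule, in plain Ruby rather than the actual Rails models. The names @Collection@, @Container@, @usable?@ and @pick_reusable@ are hypothetical, as are the fields; in particular, ordering "oldest" by a @finished_at@ value is an assumption, since the ticket does not say which timestamp applies.

```ruby
# Hypothetical stand-ins; field names are assumptions for illustration only.
Collection = Struct.new(:trashed, :readable_by)
Container  = Struct.new(:name, :finished_at, :output, :log)

# A collection is usable if it exists, is not trashed, and the user can read it.
def usable?(collection, user)
  !collection.nil? && !collection.trashed && collection.readable_by.include?(user)
end

# Pick the oldest candidate whose output and log collections are both usable.
# Returns nil when no candidate qualifies.
def pick_reusable(candidates, user)
  candidates
    .select { |c| usable?(c.output, user) && usable?(c.log, user) }
    .min_by(&:finished_at)
end

ok      = Collection.new(false, ["alice"])
trashed = Collection.new(true,  ["alice"])

x0 = Container.new("X0", 0, trashed, ok) # oldest, but its output is trashed
x  = Container.new("X",  1, ok, ok)
x1 = Container.new("X1", 2, ok, ok)      # newer rerun of the same container

choice_for_alice = pick_reusable([x1, x0, x], "alice")
choice_for_bob   = pick_reusable([x1, x0, x], "bob")
```

Here @choice_for_alice@ is "X": "X0" is skipped because its output is trashed, and "X1" loses to the older "X", so a later rerun does not change which container downstream workflows reuse. @choice_for_bob@ is @nil@ because none of the collections are readable by that user.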

If we used the newest matching container, we would have the following problem:
# Run container X, producing out1
# Run workflows w1..w9 that reuse X and do a lot of downstream work on out1
# Re-run workflows w1..w9 &rarr; lots of reused containers
# Re-run container X1, producing out2
# Re-run workflows w1..w9 &rarr; Arvados chooses X1 now, so all downstream work has to be redone

Using the oldest matching container fixes the problems given above, while admitting the converse problem:
# Run container "X"
# Notice that container "X" exited 0 but produced bogus output because of a bug in the container process or Arvados itself
# Run container again with re-use disabled: "X1" produces correct output
# Run a workflow that makes use of this container
# Oops, the workflow gets the bogus "X" output instead of the newer "X1" output

This is the lesser evil in that re-running the same container -- i.e., without fixing the underlying problem that allowed it to exit 0 with bogus output -- is not a viable solution anyway.

h2. Implementation

Disable this check in source:services/api/app/models/container.rb

<pre><code class="ruby">
if outputs.count.count != 1
Rails.logger.debug("Found #{outputs.count.length} different outputs")