Story New In Progress Resolved Feedback
Sprint Impediments
9433
[OPS] use official repos for docker. Stop packaging docker.io
Nico César
288
3
impediments
-c-a
1
6429
[API] [Crunch2] Implement "containers" and "container requests" tables, models and controllers
Peter Amstutz
47
3
impediments
-c-a
6
10690
"dump config yaml" for Go programs and Rails projects
Tom Clegg
3
3
impediments
-c-a
1
11017
[API Server] Implement Docker version compatibility fallback support
Tom Clegg
3
3
impediments
-c-a
6
9632
[OPS] Upgrade docker to 1.9.1 in all clusters
Nico César
288
3
impediments
-c-a
1
Subject: Reduce amount of parallelism in crunchstat-summary
Tracker ID: Bug
Status: In Progress
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

Currently crunchstat-summary processes all components of a pipeline in parallel. This can mean hundreds of threads all competing for memory and cycles at the same time, leading to memory exhaustion in extreme cases.

We should dial this back to a reasonable number of threads for the machine and workload being processed.
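
A minimal sketch of one way to bound the parallelism, assuming a worker-pool approach; the component list and the per-component summarize function are stand-ins, not the actual crunchstat-summary internals:

    import multiprocessing
    from concurrent.futures import ThreadPoolExecutor

    def summarize_all(components, summarize_one, max_workers=None):
        # Bound concurrency at a small multiple of the CPU count instead
        # of starting one thread per component (which could be hundreds).
        if max_workers is None:
            max_workers = max(1, min(len(components),
                                     multiprocessing.cpu_count() * 2))
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(summarize_one, components))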

10359 Tom Morris (0 hours)
Reduce amount of parallelism in crunchstat-summary
0.5
10379
Review 10359-crunchstat-summary-serial
Tom Morris
388
10359
2
36
-c-a
5
Subject: crunchstat-summary should work on arvados-cwl-runner --submit pipelines
Tracker ID: Bug
Status: In Progress
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

Background/example

For example, if you run crunchstat-summary on e51c5-d1hrv-xkf8r1ycnuxn4gc,

the resulting .txt report covers only the cwl-runner job itself; it should include all of the jobs in the pipeline running under the cwl-runner job.

### Summary for cwl-runner (e51c5-8i9sb-zvoopbdyejvmcpg)
category    metric    task_max    task_max_rate    job_total
blkio:0:0    read    29639    2963.90    29639
blkio:0:0    write    0    0    0
cpu    cpus    1    -    -
cpu    sys    5.83    0.02    5.83
cpu    user    371.69    0.66    371.69
cpu    user+sys    377.52    0.68    377.52
fuseops    read    37    3.70    37
fuseops    write    0    0    0
keepcache    hit    35    3.50    35
keepcache    miss    1    0.10    1
keepcalls    get    36    3.60    36
keepcalls    put    0    0    0
mem    cache    761856    -    -
mem    pgmajfault    0    -    0
mem    rss    1990295552    -    -
net:eth0    rx    84131105    645680.36    84131105
net:eth0    tx    3697604    12749.24    3697604
net:eth0    tx+rx    87828709    655656.96    87828709
net:keep0    rx    41628    4162.80    41628
net:keep0    tx    0    0    0
net:keep0    tx+rx    41628    4162.80    41628
time    elapsed    6939    -    6939
# Number of tasks: 1
# Max CPU time spent by a single task: 377.52s
# Max CPU usage in a single interval: 67.90%
# Overall CPU usage: 5.44%
# Max memory used by a single task: 1.99GB
# Max network traffic in a single task: 0.09GB
# Max network speed in a single interval: 0.66MB/s
# Keep cache miss rate 2.78%
# Keep cache utilization 71.20%
#!! cwl-runner e51c5-8i9sb-zvoopbdyejvmcpg max CPU usage was 68% -- try runtime_constraints "min_cores_per_node":1
#!! cwl-runner e51c5-8i9sb-zvoopbdyejvmcpg max RSS was 1899 MiB -- try runtime_constraints "min_ram_mb_per_node":1945
#!! cwl-runner e51c5-8i9sb-zvoopbdyejvmcpg Keep cache utilization was 71.20% -- try runtime_constraints "keep_cache_mb_per_task":512 (or more)

Implementation

Currently, crunchstat-summary finds child jobs by looking at
  • the "components" field, when processing a pipeline instance
  • "Queued job {uuid}" text in the log messages, when processing a job

Since crunchstat-summary was written, we have added a "components" field to job records, and the CWL runner saves the child job UUIDs there instead of logging them as stderr text. Therefore, crunchstat-summary does not see them.
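
A rough sketch of the extra lookup in Python SDK terms; the helper name is hypothetical and the integration point inside crunchstat-summary is not shown:

    import arvados

    def child_job_uuids(job_uuid, api=None):
        # Hypothetical helper: besides scanning stderr for "Queued job
        # {uuid}" lines, also read the job's "components" field, where
        # arvados-cwl-runner records child job UUIDs.
        api = api or arvados.api('v1')
        job = api.jobs().get(uuid=job_uuid).execute()
        components = job.get('components') or {}
        # components maps step names to job UUIDs ("-8i9sb-" is the job
        # UUID infix); keep only values that look like job UUIDs.
        return [v for v in components.values() if '-8i9sb-' in str(v)]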

10472 Tom Morris (0 hours)
crunchstat-summary should work on arvados-cwl-runner --submit pipelines
0.5
10628
Review 10472-crunchstat-summary-job-components
Tom Morris
388
10472
2
36
-c-a
5
Subject: [arv-put] crash in arvfile on upload NoneType object has no attribute 'closed'
Tracker ID: Bug
Status: In Progress
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

8995M / 387184M 2.3% Traceback (most recent call last):
File "/home/tfmorris/venv/bin/arv-put", line 4, in <module>
main()
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 906, in main
writer.start(save_collection=not(args.stream or args.raw))
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 454, in start
self._local_collection.manifest_text()
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 240, in synchronized_wrapper
return orig_func(self, *args, **kwargs)
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/collection.py", line 934, in manifest_text
self._my_block_manager().commit_all()
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 681, in commit_all
self.repack_small_blocks(force=True, sync=True)
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 240, in synchronized_wrapper
return orig_func(self, *args, **kwargs)
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 574, in repack_small_blocks
small_blocks = [b for b in self._bufferblocks.values() if b.state() == _BufferBlock.WRITABLE and b.owner.closed()]
AttributeError: 'NoneType' object has no attribute 'closed'
Exception in thread Thread-3 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 763, in run
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 481, in _commit_bufferblock_worker
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 1069, in put
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

11002 Lucas Di Pentima (0 hours)
[arv-put] crash in arvfile on upload NoneType object has no attribute 'closed'
11041
Review
Peter Amstutz
47
11002
1
36
-c-a
5
Subject: [Node Manager] [Crunch2] Take queued containers into account when computing how many nodes should be up
Tracker ID: Feature
Status: In Progress
Category: Node Manager
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release: Crunch v2

Add one node to the wishlist for each queued container, just like we currently add one (or more) nodes to the wishlist for queued jobs. While Crunch v2 will support running multiple containers per node, that's less critical in the cloud: as long as we can boot approximately the right size node, there's not too much overhead in just having one node per container. And it's something we can do relatively quickly with the current Node Manager code.

This won't be perfect from a scheduling perspective, especially in the interaction between Crunch v1 and Crunch v2. We expect that Crunch v2 jobs will generally "take priority" over Crunch v1 jobs, because SLURM will dispatch them from its own queue before crunch-dispatch has a chance to look and allocate nodes. We're OK with that limitation for the time being.

Node Manager should get the list of queued containers from SLURM itself, because that's the most direct source of truth about what is waiting to run. Node Manager can get information about the runtime constraints of each container either from SLURM, or from the Containers API.
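
One possible way to read the pending queue directly from SLURM, assuming (as crunch-dispatch-slurm does) that the container UUID is used as the SLURM job name; treat the details as illustrative:

    import subprocess

    def queued_container_uuids():
        # List pending SLURM jobs by job name; crunch-dispatch-slurm
        # submits each container with its UUID as the job name.
        out = subprocess.check_output(
            ['squeue', '--noheader', '--state=PENDING', '--format=%j'],
            universal_newlines=True)
        return [line.strip() for line in out.splitlines() if line.strip()]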

Acceptance criteria:

  • Node Manager can generate a wishlist that is informed by containers in the SLURM queue. (Whether that's the existing wishlist or a new one is an implementation detail, not an acceptance criterion either way.)
  • The node sizes in that wishlist are the smallest able to meet the runtime constraints of the respective containers.
  • The Daemon actor considers these wishlist items when deciding whether or not to boot or shut down nodes, just as it does with the wishlist generated from the job queue today.

Implementation notes:

  • Node Manager will use sinfo to determine node status (alloc/idle/drained/down) instead of using the information from the node table. A Crunch v2 installation won't store node state in the nodes table; other tools such as Workbench will be modified accordingly. (See the sketch below.)
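
A minimal sketch of reading node states straight from sinfo; the format string is one reasonable choice, not necessarily what Node Manager will end up using:

    import subprocess

    def slurm_node_states():
        # Map SLURM node name -> state (e.g. alloc, idle, drain, down),
        # instead of reading state from the Arvados nodes table.
        out = subprocess.check_output(
            ['sinfo', '--noheader', '-o', '%n %t'], universal_newlines=True)
        states = {}
        for line in out.splitlines():
            if line.strip():
                name, state = line.split(None, 1)
                states[name] = state.strip()
        return states
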
6520 Peter Amstutz (0 hours)
[Node Manager] [Crunch2] Take queued containers into account when computing how many nodes should be up
0.5
11141
Review 6520-nodemanager-docs
6520
1
36
-c-a
5
11123
Add node manager to install guide
Peter Amstutz
47
6520
1
36
-c-a
5
11031
Review 6520-nodemanager-crunchv2
Peter Amstutz
47
6520
3
36
-c-a
5
11106
Review 6520-skip-compute0
Peter Amstutz
47
6520
3
36
-c-a
5
11061
crunch-dispatch-slurm running on cloud clusters
Nico César
288
6520
3
36
-c-a
5
Subject: [FUSE] [SDKs] When reading data through Collection et al., signatures should refresh automatically when needed
Tracker ID: Bug
Status: In Progress
Category: FUSE
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

Background

TokenExpiryTest in source:services/fuse/tests/test_mount.py is slow and unreliable because it relies on sleep().

Fix

Mock time.time() instead of sleeping to make time advance.
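
A minimal illustration of the approach, assuming the mock library already used in the FUSE tests; the mount setup and assertions are elided:

    import time
    import mock

    def test_signature_refresh_after_expiry():
        # Advance the clock by patching time.time() instead of sleeping
        # through the real signature lifetime.
        later = time.time() + 3600
        with mock.patch('time.time', return_value=later):
            # ...open a file through the mount here and assert that the
            # block signatures were refreshed before reading...
            pass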

10008 Tom Clegg (0 hours)
[FUSE] [SDKs] When reading data through Collection et al., signatures should refresh automatically when needed
0.5
10330
Review 10008-check-token-exp-on-open
Tom Clegg
3
10008
3
36
-c-a
5
Subject: [API] Reuse containers even when multiple matching containers exist with differing outputs
Tracker ID: Bug
Status: In Progress
Category: API
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

Background

Sometimes, running the same container twice on the same inputs can result in two successes with two different outputs. This can mean a number of things, including
  • undetected failure in one or both cases, perhaps resulting in bogus output
  • both outputs are correct, but have non-meaningful differences (like an "output produced at {timestamp}" comment in an output file)

The second case is common in practice.

Currently, the API server disables the container re-use logic entirely when it detects that two re-use candidates produced different outputs. This causes the following undesirable pattern:
  1. Run container "X" as part of a workflow w1
  2. Re-use container "X" automatically in subsequent workflows w2..w5, saving time
  3. Run workflow w4 with re-use disabled, e.g., to get runtime stats or verify reproducibility -- this runs container "X1" which is identical to "X" but produces different (but still correct) output
  4. Run workflow w5..w9 with re-use enabled
  5. Oops, even when re-running workflow w5, container "X" is not eligible for reuse ever again, because "X1" exists.

Desired behavior

Use the oldest matching container whose output and log collections exist, aren't trashed, and are readable by the current user.

If we used the newest matching container, we would have the following problem:
  1. Run container X, producing out1
  2. Run workflows w1..w9 that reuse X and do a lot of downstream work on out1
  3. Re-run workflows w1..w9 → lots of reused containers
  4. Re-run container X1, producing out2
  5. Re-run workflows w1..w9 → arvados chooses X1 now, so all downstream work has to be redone
Using the oldest matching container fixes the problems given above, while admitting the converse problem:
  1. Run container "X"
  2. Notice that container "X" exited 0 but produced bogus output because of a bug in the container process or Arvados itself
  3. Run container again with re-use disabled: "X1" produces correct output
  4. Run a workflow that makes use of this container
  5. Oops, the workflow gets the bogus "X" output instead of the newer "X1" output

This is the lesser evil in that re-running the same container -- i.e., without fixing the underlying problem that allowed it to exit 0 with bogus output -- is not a viable solution anyway.
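
Not the server-side change itself (that lives in container.rb, below), but a rough illustration of the selection rule in Python SDK terms, assuming collections.get accepts a portable data hash and fails for trashed or unreadable collections:

    import arvados

    def oldest_reusable(candidates, api=None):
        # candidates: matching finished container records. Pick the oldest
        # one whose output and log collections still exist, aren't trashed,
        # and are readable by the current user.
        api = api or arvados.api('v1')
        for c in sorted(candidates, key=lambda c: c['created_at']):
            try:
                api.collections().get(uuid=c['output']).execute()
                api.collections().get(uuid=c['log']).execute()
                return c
            except Exception:
                continue  # trashed or unreadable -- try the next oldest
        return None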

Implementation

Disable this check in source:services/api/app/models/container.rb

    if outputs.count.count != 1
      Rails.logger.debug("Found #{outputs.count.length} different outputs")
11097 Tom Clegg (0 hours)
[API] Reuse containers even when multiple matching containers exist with differing outputs
0.5
11111
Review 11097-reuse-impure
Radhika Chippada
72
11097
1
36
-c-a
5
11140
Update tests
Tom Clegg
3
11097
3
36
-c-a
5
Subject: [Documentation] Document Keep Balance setup in the Install Guide
Tracker ID: Story
Status: In Progress
Category: Documentation
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

It should be as complete as any other page in the install guide. The only caveats are:

  • It should come with a huge unmissable disclaimer at the top that Keep Balance is still being tested.
  • There are only two cases where we think it might be safe:
    • All your Keepstores are backed by their own POSIX filesystem(s)
    • All your Keepstores are backed by shared object storage, one of them has a special service_type, and Data Manager talks only to that one via its corresponding service_type switch
  • It should not be linked from the TOC. Enough people want it that we want a single reference to give to interested deployers, but we don't want to generally advertise it.

Functional requirements:

  • Document how to do a dry run/log-only run first, then how to switch that to actually deleting blocks once you're satisfied with the result.

This is how the datamanager token is generated:

7995 Tom Clegg (0 hours)
[Documentation] Document Keep Balance setup in the Install Guide
1.0
11119
Review 7995-keep-balance-docs
Tom Morris
388
7995
1
36
-c-a
5
11132
Add page to install guide
Tom Clegg
3
7995
3
36
-c-a
5
11133
Set -enable-trash in keepstore docs
Tom Clegg
3
7995
3
36
-c-a
5
11134
Explain limitations re shared volumes and service_type
Tom Clegg
3
7995
3
36
-c-a
5
Subject: [DOC] add cookbook section with code snippets
Tracker ID: Bug
Status: New
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

Add more examples of API calls with real use cases (e.g. users/links/groups, or collections/projects); in other words, add a cookbook section with code snippets to doc.arvados.org.
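
One example of the kind of snippet such a cookbook page might hold (the project UUID is a placeholder): list the ten newest collections in a project with the Python SDK.

    import arvados

    api = arvados.api('v1')

    # Placeholder project UUID -- substitute a real one.
    project_uuid = 'zzzzz-j7d0g-xxxxxxxxxxxxxxx'

    collections = api.collections().list(
        filters=[['owner_uuid', '=', project_uuid]],
        order=['created_at desc'], limit=10).execute()
    for c in collections['items']:
        print('%s %s' % (c['uuid'], c['name']))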

10349 Tom Morris (0 hours)
[DOC] add cookbook section with code snippets
0.5
10381
Review
Nico César
288
10349
1
36
-c-a
5
Subject: [Crunch2] crunch-run: stop the container and fail if arv-mount dies before the container finishes
Tracker ID: Bug
Status: New
Category: Crunch
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release: Crunch v2
10777 Tom Clegg (0 hours)
[Crunch2] crunch-run: stop the container and fail if arv-mount dies before the container finishes
11118
Review
Radhika Chippada
72
10777
1
36
-c-a
5
Subject: arv-mount pathologically slow enumerating files
Tracker ID: Bug
Status: New
Category: Performance
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

It takes 38 minutes (!!) to enumerate 242K files. During this time the arv-mount process is pegged at 100% CPU and no network traffic is occurring. The manifest text totals just 9 MB (9243914 characters), so parsing is running at over 4 min/MB.

$ time find keep/by_id/e51c5-4zz18-l3dq8bw20uwz0qd -print | wc -l
241751
real 38m0.969s
user 0m0.224s
sys 0m0.300s

$ wc *.manifest
2497 252893 9243914 e51c5-4zz18-l3dq8bw20uwz0qd.manifest

10629 Peter Amstutz (0 hours)
arv-mount pathologically slow enumerating files
1.0
11114
Review
Lucas Di Pentima
375
10629
1
36
-c-a
5
Subject: [arvados-ws] Write unit tests
Tracker ID: Feature
Status: New
Category: API
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:
10764 Tom Clegg (0 hours)
[arvados-ws] Write unit tests
11112
unit tests
Tom Clegg
3
10764
1
36
-c-a
5
Subject: [FUSE] Determine why bcl2fastq doesn't work with writable keep mount
Tracker ID: Bug
Status: New
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

I'm not sure there's anything we can do about this other than to make a private copy of blobxfer which plays more nicely with writable keep mounts, so this is primarily a reminder to document the current behavior. The file space preallocation code fragment below fails with an "I/O error" on the file close. The full output is below that.

        if allocatesize > 0:
            filedesc.seek(allocatesize - 1)
            filedesc.write(b'\0')
        filedesc.close()

$ time blobxfer rawseqdata 160819-e00504-0013-ahnht2ccxx keep/home/foo --saskey=$SASKEY --download --remoteresource=. --disable-urllib-warnings
!!! WARNING: DISABLING URLLIB3 WARNINGS !!! =====================================
azure blobxfer parameters [v0.11.4] =====================================
platform: Linux-3.19.0-49-generic-x86_64-with-Ubuntu-14.04-trusty
python interpreter: CPython 2.7.6
package versions: az.common=1.1.4 az.sml=0.20.4 az.stor=0.33.0 crypt=1.5 req=2.11.1
subscription id: None
management cert: None
transfer direction: Azure->local
local resource: keep/home/foo
include pattern: None
remote resource: .
max num of workers: 6
timeout: None
storage account: rawseqdata
use SAS: True
upload as page blob: False
auto vhd->page blob: False
upload to file share: False
container/share name: 160819-e00504-0013-ahnht2ccxx
container/share URI: https://rawseqdata.blob.core.windows.net/160819-e00504-0013-ahnht2ccxx
compute block MD5: False
compute file MD5: True
skip on MD5 match: True
chunk size (bytes): 4194304
create container: False
keep mismatched MD5: False
recursive if dir: True
component strip on up: 1
remote delete: False
collate to: disabled
local overwrite: True
encryption mode: disabled
RSA key file: disabled
RSA key type: disabled =======================================

script start time: 2016-09-13 22:18:11
attempting to copy entire container 160819-e00504-0013-ahnht2ccxx to keep/home/foo
generating local directory structure and pre-allocating space
created local directory: keep/home/foo/HiSeq/160819_E00504_0013_AHNHT2CCXX/Data/Intensities/BaseCalls/L005/C163.1
remote blob: HiSeq/160819_E00504_0013_AHNHT2CCXX/Data/Intensities/BaseCalls/L005/C163.1/s_5_2215.bcl.gz length: 2300498 bytes, md5: DWpLFJdWfz1sdF0LG2bOkg==
Traceback (most recent call last):
File "/home/tfmorris/venv/bin/blobxfer", line 11, in <module>
sys.exit(main())
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/blobxfer.py", line 2525, in main
localfile, blob, False, blobdict[blob])
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/blobxfer.py", line 1858, in generate_xferspec_download
filedesc.close()
IOError: [Errno 5] Input/output error

real 12m16.266s
user 6m56.856s
sys 0m2.144s

10035 Peter Amstutz (0 hours)
[FUSE] Determine why bcl2fastq doesn't work with writable keep mount
1.0
11115
Review
Lucas Di Pentima
375
10035
1
36
-c-a
5
Subject: Improve throughput of crunch-run output-uploading stage using multi-threaded transfers
Tracker ID: Feature
Status: New
Category:
Points: 2.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Radhika Chippada
Project: Arvados
Release:

To improve throughput of crunch-run job output uploading, add support for multi-threaded asynchronous transfers to hide the latency inherent in cloud environments.

Refactoring to support public APIs in the Go SDK is a separate task.
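
crunch-run itself is Go, so the following is only a language-agnostic sketch (written in Python) of the intended pattern: keep several block uploads in flight at once, with put_block standing in for the real Keep client call.

    from concurrent.futures import ThreadPoolExecutor

    def upload_blocks(blocks, put_block, max_workers=4):
        # Keep several PUTs in flight to hide per-request latency,
        # instead of uploading each block serially.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(put_block, blocks))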

11015 Radhika Chippada (0 hours)
Improve throughput of crunch-run output-uploading stage using multi-threaded transfers
2.0
11108
Review
Lucas Di Pentima
375
11015
1
36
-c-a
5
Subject: [Crunch] Set output collection owner_uuid to match job
Tracker ID: Bug
Status: Resolved
Category: Crunch
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

Currently, if a job is started with owner_uuid set to a shared project, the job and its log are shared, but its output is not.

We do this with the log collection

  my $log_coll = api_call(
    "collections/create", ensure_unique_name => 1, collection => {
      manifest_text => $log_manifest,
      owner_uuid => $Job->{owner_uuid},
      name => sprintf("Log from %s job %s", $Job->{script}, $Job->{uuid}),
    });

but we need to do it with the output collection too.

  my $pid = open2($child_out, $child_in, 'python', '-c', q{
import arvados
import sys
print (arvados.api("v1").collections().
       create(body={"manifest_text": sys.stdin.read()}).
       execute(num_retries=int(sys.argv[1]))["portable_data_hash"])
}, retry_count());

source:sdk/cli/bin/crunch-job
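
A rough sketch of what the embedded create call could look like if it mirrored the log collection's attributes; the placeholder values would come from the Job record in crunch-job:

    import arvados
    import sys

    # Placeholders: in crunch-job these come from the Job record.
    job_owner_uuid = 'zzzzz-j7d0g-xxxxxxxxxxxxxxx'
    job_script = 'example-script'
    job_uuid = 'zzzzz-8i9sb-xxxxxxxxxxxxxxx'

    # Create the output collection in the job's project, named like the
    # log collection, instead of leaving it owned by the system user.
    print(arvados.api('v1').collections().create(
        ensure_unique_name=True,
        body={'manifest_text': sys.stdin.read(),
              'owner_uuid': job_owner_uuid,
              'name': 'Output of %s job %s' % (job_script, job_uuid)},
        ).execute(num_retries=2)['portable_data_hash'])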

11121 Lucas Di Pentima (0 hours)
[Crunch] Set output collection owner_uuid to match job
11125
Review 11121-crunch-output-collection-owner
11121
3
36
-c-a
5
Subject: [Crunch2] System-owned container outputs should be garbage-collected
Tracker ID: Story
Status: Resolved
Category: API
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release: Crunch v2

Background

When a container finishes, crunch-run creates an output collection and records its PDH in the container record.

On the API server, the container update triggers a hook that creates one copy of the output collection for each of N container requests that have been waiting on this container.

The original output collection is owned by root, and there is no cleanup process: it never gets trash_at set to now+defaultTrashLifetime, so it stays around indefinitely.

Proposed fix

In crunch-run, create the output collection with trash_at=now.

In the API server hook that creates a collection for each relevant container request, when looking up the container output manifest, make sure to include trashed collections in the search.
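
crunch-run is Go, so this is only a Python SDK sketch of the two API-level pieces, assuming collections.create accepts a trash_at timestamp and collections.list accepts include_trash:

    import arvados
    import datetime

    api = arvados.api('v1')

    # 1) Create the output collection already marked for trash.
    now = datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')
    out = api.collections().create(body={
        'manifest_text': '. d41d8cd98f00b204e9800998ecf8427e+0 0:0:empty.txt\n',
        'trash_at': now,
    }).execute()

    # 2) In the hook that copies the output for each container request,
    #    include trashed collections when looking the manifest up.
    found = api.collections().list(
        filters=[['portable_data_hash', '=', out['portable_data_hash']]],
        include_trash=True).execute()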

9277 Peter Amstutz (0 hours)
[Crunch2] System-owned container outputs should be garbage-collected
0.5
11107
Review 9277-trash-container-outputs
Radhika Chippada
72
9277
3
36
-c-a
5
Subject: [API Server] Implement Docker version compatibility fallback support
Tracker ID: Story
Status: Resolved
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

1. provide a python program that runs arv-keepdocker, looks for pairs like the ones below, and creates a tag link for each pair

wardv@shell.4xphq:~$ arv-keepdocker 
REPOSITORY                      TAG         IMAGE ID      COLLECTION                     CREATED             
arvados/jobs                    c1a8e01539932e2f0153cfb2ffc4eaa2c3dc00f1  sha256:eda7c  4xphq-4zz18-5cn1x9u4ki574i7    Mon Dec 19 16:25:22 2016
arvados/jobs                    c1a8e01539932e2f0153cfb2ffc4eaa2c3dc00f1  decde0035258  4xphq-4zz18-sr3yairajekj11z    Mon Dec 19 16:25:22 2016

2. update API server to check for those tag links when creating a new job or container request, so when someone asks for "decde0035258" or "4xphq-4zz18-sr3yairajekj11z" or {whatever 4xphq-4zz18-sr3yairajekj11z's PDH is}, it gets transparently rewritten to sha256:eda7c or {whatever 4xphq-4zz18-5cn1x9u4ki574i7's PDH is}.

Tag would look something like:

owner_uuid: zzzzz-tpzed-000000000000000
link_class: docker_image_migration
name: {whatever 4xphq-4zz18-sr3yairajekj11z's PDH is}
head_uuid: {whatever 4xphq-4zz18-5cn1x9u4ki574i7's PDH is}
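
A sketch of creating one such migration link with the Python SDK; the image PDHs are placeholders, and the root owner_uuid implies an admin token:

    import arvados

    api = arvados.api('v1')

    # Placeholders for the two image collections' portable data hashes.
    old_image_pdh = 'PDH-of-4xphq-4zz18-sr3yairajekj11z'  # placeholder
    new_image_pdh = 'PDH-of-4xphq-4zz18-5cn1x9u4ki574i7'  # placeholder

    api.links().create(body={'link': {
        'owner_uuid': 'zzzzz-tpzed-000000000000000',
        'link_class': 'docker_image_migration',
        'name': old_image_pdh,
        'head_uuid': new_image_pdh,
    }}).execute()
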
11017 Tom Clegg (0 hours)
[API Server] Implement Docker version compatibility fallback support
1.0
11035
Review 11017-docker-migration
Radhika Chippada
72
11017
3
36
-c-a
5
11090
Add arv-migrate-docker19 program
Tom Clegg
3
11017
3
36
-c-a
5
Subject: crunch doesn't end jobs when their arv-mount dies
Tracker ID: Bug
Status: Feedback
Category: Crunch
Points: 0.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

We have slowly been accumulating a large number of jobs across our cluster that are still "running" although they are not doing anything. This appears to be because their corresponding arv-mount process has died for some reason (if one does a `docker exec` to get into the crunch container, `ls /keep` says "Transport endpoint is not connected"):

crunch@4e50cdfd50db:/tmp/crunch-job$ ls /keep
ls: cannot access /keep: Transport endpoint is not connected

My expectation would be that if arv-mount dies, the crunch container should be destroyed so that the resources can be freed up.

10585 Tom Clegg (0 hours)
crunch doesn't end jobs when their arv-mount dies
0.0
10755
Add test
Tom Clegg
3
10585
3
36
-c-a
5
10725
crunchstat option to kill child when parent dies
Tom Clegg
3
10585
3
36
-c-a
5
10730
Review 10585-crunchstat-lost-parent
Lucas Di Pentima
375
10585
3
36
-c-a
5
1
-c-a
1
impediments
-c-a
February 20, 2017 22:03:32 +0000