Story New In Progress Resolved Feedback Closed
Sprint Impediments
10972
[OPS] Migrate all remaining clusters from Ubuntu 12.04
Javier Bértoli
398
3
impediments
-c-a
6
10980
[OPS] add ubuntu1604 packages
Ward Vandewege
1
3
impediments
-c-a
2
Subject: [FUSE] Hang on simple FUSE operation and when logging in again later
Tracker ID: Bug
Status: In Progress
Category: Performance
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

On a shell node, I changed directory to a ~/keep/home/... project whose entries each had a single subproject, each of which had a single collection, each of which had the file I was looking for one directory down. I then ran:

ls */*/*/SampleSheet.csv

This should have returned the ~20 files I was looking for in under a second. Instead, it hung for hours, and when I lost the SSH connection and tried to log in again later, the login hung while trying to start arv-mount.

11158 Peter Amstutz (0 hours)
[FUSE] Hang on simple FUSE operation and when logging in again later
11191
Review 11158-fuse-projects
Peter Amstutz
47
11158
2
36
-c-a
5
Subject: [Node manager] Better communication when job is unsatisfiable
Tracker ID: Story
Status: In Progress
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

When a job cannot be satisfied by node manager, it will be queued forever with no feedback to the user (and almost no feedback to the admin, either). There are two distinct cases:

1) A job's min_nodes request is greater than node manager's configured max_nodes. In this case, node manager silently skips over the job with no feedback as to why no nodes are being started.
2) A job's resource requirements for a single node exceed the available cloud node size. In this case, the only indication this is a problem is a message of "job XXX not satisfiable" in the node manager log (and even then only if debug logging is turned on).

If a job request cannot be satisfied under its current configuration, node manager should have some way of signaling this to the user.
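
The two cases above could be told apart with a check along these lines. This is only a sketch, not node manager's actual code: the function name, the shape of `node_sizes`, and the wording of the messages are assumptions made for illustration, and the constraint keys are read from a plain dict.

```python
def unsatisfiable_reason(constraints, max_nodes, node_sizes):
    """Return a message explaining why a job can never run, or None if it can.

    constraints: dict of the job's requirements (hypothetical shape)
    max_nodes:   node manager's configured node limit
    node_sizes:  list of dicts like {"name": ..., "cores": ..., "ram_mb": ...}
    """
    # Case 1: the job asks for more nodes than node manager may ever start.
    min_nodes = constraints.get("min_nodes", 1)
    if min_nodes > max_nodes:
        return "min_nodes=%d exceeds configured max_nodes=%d" % (
            min_nodes, max_nodes)

    # Case 2: no single cloud node size is big enough for one node's needs.
    need_cores = constraints.get("min_cores_per_node", 0)
    need_ram = constraints.get("min_ram_mb_per_node", 0)
    for size in node_sizes:
        if size["cores"] >= need_cores and size["ram_mb"] >= need_ram:
            return None  # at least one size fits, so the job is satisfiable

    return "no available node size offers %d cores and %d MB RAM" % (
        need_cores, need_ram)
```

A non-None result is what would need to reach the user, for example as a log entry attached to the job, rather than only the node manager debug log.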

7475 Lucas Di Pentima (0 hours)
[Node manager] Better communication when job is unsatisfiable
1.0
11759
Review 7475-nodemgr-unsatisfiable-job-comms
Tom Clegg
3
7475
2
36
-c-a
5
Subject: [DOC] add cookbook section with code snippets
Tracker ID: Bug
Status: In Progress
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

We need more examples of API calls with real use cases (e.g. users/links/groups, or collections/projects). Let’s add a cookbook section with code snippets to doc.arvados.org.

10349 Tom Morris (0 hours)
[DOC] add cookbook section with code snippets
0.5
10381
Review
Nico César
288
10349
1
36
-c-a
5
Subject: Reduce amount of parallelism in crunchstat-summary
Tracker ID: Bug
Status: In Progress
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

Currently crunchstat-summary processes all components of a pipeline in parallel. This can mean hundreds of threads all competing for memory and cycles at the same time, leading to memory exhaustion in extreme cases.

We should dial this back to a reasonable number of threads for the machine and workload being processed.
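
One way to cap the thread count is a fixed-size worker pool. The sketch below is not crunchstat-summary's code: `summarize_all`, its parameters, and the default of one worker per CPU are assumptions chosen for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def summarize_all(components, summarize, max_workers=None):
    """Run summarize() over every component with a bounded number of threads.

    components:  list of pipeline components (any objects)
    summarize:   callable taking one component and returning its summary
    max_workers: thread limit; defaults to the CPU count (an assumption)
    """
    if max_workers is None:
        # Never start more threads than there are components or CPUs,
        # and always at least one.
        max_workers = min(len(components), os.cpu_count() or 1) or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and queues work beyond the limit
        # instead of starting one thread per component.
        return list(pool.map(summarize, components))
```

With hundreds of components, only `max_workers` of them are in flight at any time, so peak memory scales with the pool size rather than with the pipeline size.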

10359 Tom Morris (0 hours)
Reduce amount of parallelism in crunchstat-summary
0.5
10379
Review 10359-crunchstat-summary-serial
Tom Morris
388
10359
2
36
-c-a
5
Subject: [CWL] arv:RunInSingleContainer should take max() of ResourceRequirements of substeps
Tracker ID: Bug
Status: In Progress
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Jiayong Li
Project: Arvados
Release:

When creating an arv:RunInSingleContainer container, arvados-cwl-runner should look at the substeps to determine the maximum expected resource requirements to run the container.
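
A rough sketch of the max() computation follows. It is not arvados-cwl-runner's implementation. It assumes each substep is a dict whose `requirements` and `hints` are lists, and that the ResourceRequirement values are plain numbers rather than CWL expressions; the function name is made up for illustration.

```python
# Minimum-resource fields defined by the CWL ResourceRequirement class.
RESOURCE_FIELDS = ("coresMin", "ramMin", "tmpdirMin", "outdirMin")


def max_resource_requirements(substeps):
    """Return the per-field maximum over all substeps' ResourceRequirements."""
    combined = {}
    for step in substeps:
        # Look at both requirements and hints of the substep.
        for req in step.get("requirements", []) + step.get("hints", []):
            if req.get("class") != "ResourceRequirement":
                continue
            for field in RESOURCE_FIELDS:
                if field in req:
                    combined[field] = max(combined.get(field, 0), req[field])
    return combined
```

Fields that no substep specifies are left out of the result, so the container would fall back to whatever defaults apply.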

11850 Jiayong Li (0 hours)
[CWL] arv:RunInSingleContainer should take max() of ResourceRequirements of substeps
0.5
11879
Review
Peter Amstutz
47
11850
1
36
-c-a
5
Subject: [Workbench] Remove arv-get file download fallback
Tracker ID: Story
Status: In Progress
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

Currently, if keep-web isn't configured, Workbench will fall back to arv-get and Rails streaming to serve files. However, this is a very bad fallback:

  1. It fails silently if calling arv-get doesn't work
  2. It ties up a workbench worker for the duration of the download
  3. It doesn't report content-length, so user agents are unable to render a progress bar or determine if the entire file was transferred.
  4. It sometimes silently drops out in the middle of downloads
  5. It sometimes consumes a huge amount of RAM, crashing the workbench server.
  6. It can't handle [some?] range requests

Instead:
- Workbench should refuse to start if keep-web is not configured
- Documentation should be updated to emphasize that keep-web is mandatory, and to describe the Workbench config for keep-web and the new Workbench startup failure mode
- Remove the arv-get fallback code and adjust any related tests
- For file downloads, prefer to link directly to keep-web instead of redirecting through workbench (especially useful for sharing links) (done in #8784)

11167 Lucas Di Pentima (0 hours)
[Workbench] Remove arv-get file download fallback
1.0
11937
Review 11167-wb-remove-arvget
Lucas Di Pentima
375
11167
2
36
-c-a
5
Subject: Basic authenticated http health check ("ping") for each system service
Tracker ID: Feature
Status: In Progress
Category: Deployment
Points: 3.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Radhika Chippada
Project: Arvados
Release:
Functional details:
  • respond to “GET /_health/ping” on same addr:port as the microservice’s main http server (or “management” server in nodemanager’s case)
  • expect header “Authorization: Bearer XXX” where XXX is “ManagementToken” in config file (note "Bearer", not "OAuth2")
  • return 404 if configured management token is blank / missing
  • return 401 if Authorization: header is missing
  • return 403 if Authorization: header does not match configured token
  • return JSON, either {"health":"OK"} or {"health":"ERROR"}

It’s OK if the “ping” healthcheck has lots of false-positive potential -- even if “OK” merely means the process is running. The focus here is instrumenting all services, not detecting all failure modes.
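
The response rules listed above can be written as one small decision function. This is a sketch only, independent of any particular service or web framework. The function name and the `healthy` flag are illustrative, and returning HTTP 200 with the JSON body is an assumption, since the status code for a successful ping is not stated above.

```python
import hmac
import json


def ping_response(management_token, authorization_header, healthy=True):
    """Decide the reply to GET /_health/ping.

    management_token:     configured ManagementToken ("" or None if unset)
    authorization_header: value of the Authorization header, or None if absent
    Returns (status_code, body), where body is a JSON string or None.
    """
    if not management_token:
        return 404, None  # token blank / missing in config
    if authorization_header is None:
        return 401, None  # no Authorization header sent
    expected = "Bearer " + management_token
    # Constant-time comparison so the token cannot be guessed byte by byte.
    if not hmac.compare_digest(authorization_header.encode("utf-8"),
                               expected.encode("utf-8")):
        return 403, None  # header present but wrong (including "OAuth2 ...")
    # Assumption: success is reported with HTTP 200.
    return 200, json.dumps({"health": "OK" if healthy else "ERROR"})
```

The order of the checks matters: a service with no configured token answers 404 even to a request that carries a header.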

11906 Radhika Chippada (0 hours)
Basic authenticated http health check ("ping") for each system service
3.0
11978
Review branch 11906-keepstore-ping
Radhika Chippada
72
11906
3
36
-c-a
5
11981
Review branch 11906-api-ping
Lucas Di Pentima
375
11906
3
36
-c-a
5
11987
Review branch 11906-wb-ping
Lucas Di Pentima
375
11906
3
36
-c-a
5
12000
Review 11906-health-check-lib
Radhika Chippada
72
11906
3
36
-c-a
5
11996
Refactor health check handlers into an SDK library
Tom Clegg
3
11906
3
36
-c-a
5
12021
Review branch 11906-keepproxy-ping
Lucas Di Pentima
375
11906
3
36
-c-a
5
Subject: [keep-web] tests hang
Tracker ID: Bug
Status: New
Category: Tests
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

From https://ci.curoverse.com/job/run-tests-remainder/1152/console

             ********** Running services/keep-web tests **********

/var/lib/jenkins/workspace/run-tests-remainder/build/run-tests.sh: line 560: 21662 Terminated              go test ${short:+-short} ${coverflags[@]} "git.curoverse.com/arvados.git/$1" 

       ********** !!!!!! services/keep-web tests FAILED !!!!!! **********

          ********** End of services/keep-web tests (9789s) **********

This could be interacting with a "go test" bug where (despite the 10-minute test timeout) an open pipe to a child process can keep "go test" alive indefinitely (unless output buffering is disabled with go test -v).

12010 Tom Clegg (0 hours)
[keep-web] tests hang
Subject: [Tests] Flaky test in Python SDK tests.test_events.WebsocketTest.test_subscribe_poll
Tracker ID: Bug
Status: New
Category: Tests
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:
======================================================================
ERROR: test_subscribe_poll (tests.test_events.WebsocketTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.gK2qha3T2M/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/data/1/jenkins/workspace/run-tests-remainder/sdk/python/tests/test_events.py", line 93, in test_subscribe_poll
    poll_fallback=0.25, expect_type=arvados.events.PollClient, expected=1)
  File "/data/1/jenkins/workspace/run-tests-remainder/sdk/python/tests/test_events.py", line 75, in _test_subscribe
    log_object_uuids.append(events.get(True, 5)['object_uuid'])
  File "/usr/lib/python2.7/Queue.py", line 176, in get
    raise Empty
Empty

-- https://ci.curoverse.com/job/run-tests-remainder/1159/consoleText

12020 Lucas Di Pentima (0 hours)
[Tests] Flaky test in Python SDK tests.test_events.WebsocketTest.test_subscribe_poll
Subject: [Spike] Investigate CI packaging improvements
Tracker ID: Story
Status: New
Category:
Points: 2.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:
11997 Tom Clegg (0 hours)
[Spike] Investigate CI packaging improvements
2.0
Subject: arv-get should abort on ctrl/C
Tracker ID: Bug
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

1423 MiB / 43967 MiB 3.2%^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
2255 MiB / 43967 MiB 5.1%^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
2319 MiB / 43967 MiB 5.3%^C^C^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^C^C^C^C^C^C^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^C

11519 Lucas Di Pentima (0 hours)
arv-get should abort on ctrl/C
11571
Review
Tom Clegg
3
11519
1
36
-c-a
5
Subject: [FUSE] high memory consumption (possible leak) in long-running arv-mount
Tracker ID: Bug
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

We have a (little used) arv-mount that has been running since 6th September.

It was started with the command line:
`/usr/bin/python2.7 /usr/bin/arv-mount /tmp/keep_jr17`

Since no `--file-cache` or `--directory-cache` options were given, those should have been the defaults of 256MiB and 128MiB. If I start a new arv-mount, also with defaults, and then read some large data through it and exercise some large directories (such as doing a find in `by_tag`), I am able to get memory usage up to 514MB, which seems reasonable.

However, the arv-mount that has been running for the past 77 days is now taking up 15GB of RAM!

I suspect this issue might be related to the increasing memory usage I observed and reported in #10535 when the python SDK test suite got stuck in a tight PollClient loop forever (where "forever" is until it ran the system out of memory).

10584 Peter Amstutz (0 hours)
[FUSE] high memory consumption (possible leak) in long-running arv-mount
11890
Review 10584-fuse-stop-threads
Peter Amstutz
47
10584
3
36
-c-a
5
Subject: Create a CWL stress test for node manager
Tracker ID: Story
Status: New
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

We'd like to have a stress test with high fanout (~100 nodes?) which creates a lot of node manager work and logs how long it takes to spin up all the nodes. Multi-step scatter not needed.

Definition of done includes documentation and testing on test clusters, per usual.

11545 Peter Amstutz (0 hours)
Create a CWL stress test for node manager
1.0
11757
Review
Lucas Di Pentima
375
11545
1
36
-c-a
5
Subject: [Nodemanager] Fix watchdog test
Tracker ID: Bug
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:
11925 Peter Amstutz (0 hours)
[Nodemanager] Fix watchdog test
12001
Review
Lucas Di Pentima
375
11925
1
36
-c-a
5
Subject: Deprecate / remove arv-web
Tracker ID: Bug
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:
11794 Tom Morris (0 hours)
Deprecate / remove arv-web
12003
Review
Radhika Chippada
72
11794
1
36
-c-a
5
Subject: [Crunch2] crunchstat-summary --container UUID should summarize container logs
Tracker ID: Story
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:
11309 Tom Morris (0 hours)
[Crunch2] crunchstat-summary --container UUID should summarize container logs
11894
Review
Tom Clegg
3
11309
1
36
-c-a
5
Subject: Update Pipeline Optimization wiki with CWL/Crunchv2
Tracker ID: Story
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

Update https://dev.arvados.org/projects/arvados/wiki/Pipeline_Optimization to reference CWL instead of pipeline templates.

11285 Tom Morris (0 hours)
Update Pipeline Optimization wiki with CWL/Crunchv2
11392
Review
Bryan Cosca
189
11285
1
36
-c-a
5
Subject: Nginx config should speak JSON when returning its own response for unproxyable API requests
Tracker ID: Bug
Status: New
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Nico César
Project: Arvados
Release:

Also update the install documentation to explain to customers how to do this.

11136 Nico César (0 hours)
Nginx config should speak JSON when returning its own response for unproxyable API requests
1.0
Subject: Deprecate arv-run
Tracker ID: Story
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:
10717 Tom Morris (0 hours)
Deprecate arv-run
12005
Review
Radhika Chippada
72
10717
1
36
-c-a
5
Subject: [SDKs] Write integration test for when arv-put resumes from a cache with expired access tokens
Tracker ID: Story
Status: New
Category: SDKs
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

arv-put does not check the validity of the permission signatures on its blocks when it resumes from cache. When it resumes from a cache with invalid permission signatures, it will continue uploading blocks, but the ultimate collection creation at the end of the process will fail.

Update the Python SDK:

  • Add a head() method to the KeepClient class. It takes a block locator and num_retries as arguments. It sends a HEAD request for that block to all accessible Keep services in the usual rendezvous order. It returns True if one service returns 200 OK, and raises an exception if it never gets that response. It retries following the same logic as the get() method.
  • Add tests for this method that check behavior when:
    • One accessible service returns 200 OK.
    • The last accessible service returns 200 OK.
    • All accessible services return a permanent error.
    • All accessible services return an error. Some of those errors are temporary. On a second request, at least one of the services returns 200 OK.
    • All accessible services return temporary errors, enough to exhaust the number of retries.

Update arv-put:

  • If the cache is otherwise usable (the file list is the same, the files are unchanged, etc.), use KeepClient's new head method to check the first block locator in the manifest in the cache.
    • The first will always be the oldest and most likely to fail, so it is the best one to check. We talked about potentially checking all of them, and that is more thorough, but it's also potentially much more expensive.
    • The rest of the cache checks are much cheaper than the HEAD request. We want to do the HEAD request last, because if the cache is invalid for any other reason, we can notice and restart much faster.
  • If the HEAD request confirms the block locator is still valid, continue from the cache as before.
  • Otherwise, invalidate the cache and start from scratch.
  • Test that arv-put behaves correctly:
    • In existing tests that arv-put resumes from a valid cache, update those tests to simulate a 200 OK response to the HEAD request. The rest of the tested behavior should be preserved as before.
    • Write a test to verify that arv-put invalidates the cache in a situation where the HEAD request fails.
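
The head() behavior and the arv-put cache check described above could be sketched as follows. This is not the Python SDK's KeepClient. The `probe` callable stands in for the real HTTP HEAD request, `BlockNotFound` and `can_resume_from_cache` are hypothetical names, and the rule for which statuses count as temporary is an assumption. Backoff delays between retries are left out.

```python
class BlockNotFound(Exception):
    """Raised when no service ever confirms the block (hypothetical name)."""


def is_temporary(status):
    # Assumption: timeouts, rate limiting and 5xx responses are retryable.
    return status in (408, 429) or status >= 500


def head(locator, services, probe, num_retries=0):
    """Return True if any service answers 200 for locator, else raise.

    services: accessible Keep services, already in rendezvous order
    probe:    callable (service, locator) -> HTTP status code
    """
    candidates = list(services)
    for _ in range(num_retries + 1):
        retry_next = []
        for service in candidates:
            status = probe(service, locator)
            if status == 200:
                return True
            if is_temporary(status):
                retry_next.append(service)  # worth asking again later
        if not retry_next:
            break  # only permanent errors remain, so retrying cannot help
        candidates = retry_next
    raise BlockNotFound(locator)


def can_resume_from_cache(first_locator, services, probe, num_retries=0):
    """arv-put side: run this last, after the cheaper cache checks pass."""
    try:
        return head(first_locator, services, probe, num_retries)
    except BlockNotFound:
        return False  # caller should invalidate the cache
```

Passing `probe` in as a parameter makes the test cases listed above easy to write, since each one is just a fake that returns a scripted sequence of status codes.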
8937 Lucas Di Pentima (0 hours)
[SDKs] Write integration test for when arv-put resumes from a cache with expired access tokens
0.5
12006
Review
8937
1
36
-c-a
5
8979
Review branch: 8937-head-request-in-python-keep-client
Radhika Chippada
72
8937
3
36
-c-a
5
9041
Review branch 8937-arv-put-cache-check
Radhika Chippada
72
8937
3
36
-c-a
5
Subject: [Workbench] Support more filetypes for in-browser display
Tracker ID: Story
Status: Resolved
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Radhika Chippada
Project: Arvados
Release:

We should offer the option for in-browser display for more filetypes including: .cwl, .yaml, .csv, .tsv, .vcf, and .bed.

Some of these can be large, but the user can click the Back button and abort the display if they accidentally start downloading a file which is too big.
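
The check itself amounts to an extension lookup. The sketch below uses Python purely for illustration and is not Workbench code; the function name is hypothetical and the set holds only the extensions named above, not the types Workbench already displays.

```python
import os.path

# Extensions this story proposes adding for in-browser display.
NEW_VIEWABLE_EXTENSIONS = {".cwl", ".yaml", ".csv", ".tsv", ".vcf", ".bed"}


def offers_inline_view(filename):
    """Return True if the file's extension is one of the newly listed types."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in NEW_VIEWABLE_EXTENSIONS
```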

11995 Radhika Chippada (0 hours)
[Workbench] Support more filetypes for in-browser display
0.5
11998
Review branch 11995-collecion-filetypes
Radhika Chippada
72
11995
2
36
-c-a
5
Subject: [FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite
Tracker ID: Bug
Status: Resolved
Category: FUSE
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

Test suite deadlocks frequently if llfuse is upgraded past 0.41.1 (tested: 0.42, 0.42.1, 0.43, 1.1.1)

To reproduce, change llfuse version requirement in services/fuse/setup.py and repeat the fuse tests a few times in a row:

export WORKSPACE=`pwd`
./build/run-tests.sh --temp /tmp/buildtmp --only install
./build/run-tests.sh --temp /tmp/buildtmp --skip-install --only services/fuse --repeat 32

The deadlock seems to happen only while the mount is being shut down. It isn't clear whether the problem could affect real-life usage too, or is specific to the test suite (e.g., combining multiprocessing with threading).

https://pythonhosted.org/llfuse/changes.html#release-0-42-2016-01-30

10805 Tom Clegg (0 hours)
[FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite
1.0
11930
Packaging for custom llfuse 1.2 fork
Nico César
288
10805
1
40
-c-a
5
11827
Review
Lucas Di Pentima
375
10805
1
36
-c-a
5
Subject: Fix failing CWL conformance tests
Tracker ID: Bug
Status: Resolved
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

New CWL conformance tests were introduced which fail on Arvados.

Fix arvados-cwl-runner to pass these tests.

11948 Peter Amstutz (0 hours)
Fix failing CWL conformance tests
11963
Review 11948-cwl-conformance-fix
Peter Amstutz
47
11948
3
36
-c-a
5
July 25, 2017 20:47:24.7244570255279541 +0000