Sprint Impediments
- 12387 [OPS][puppet] refactor arvados-slurm-client / controller modules (Nico César; see #11928)
- 12063 [ELK] add monitoring of e51c5 api vm (Nico César)
- 11572 Separate 'stable/production' from 'dev' package for the Debian/Ubuntu repositories (Javier Bértoli)
- 10980 [OPS] add ubuntu1604 packages (Ward Vandewege)
- 10972 [OPS] Migrate all remaining clusters from Ubuntu 12.04 (Javier Bértoli)
- 12106 [OPS] [Release] create a "promotion" jenkins jobs to move packages from dev to stable (Javier Bértoli; see #11878)

Subject: Healthcheck endpoint aggregator
Tracker ID: Feature
Status: Feedback
Category:
Points: 1.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

12260 Healthcheck endpoint aggregator (Tom Clegg, 0 hours, 1.0 points)
- 12362 Review 12260-system-health (Lucas Di Pentima)

Subject: [crunch2] crunch-dispatch-slurm monitoring too many containers gets 414 error
Tracker ID: Bug
Status: Resolved
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

It seems that the "tracked" list has gotten big enough that passing the list of UUIDs being tracked exceeds the request-URI size limit. This brings crunch-dispatch-slurm to a screeching halt.

Oct 10 09:47:50 crunch-dispatch-slurm: 2017/10/10 09:47:50 Error getting list of containers: "arvados API server error: 414: 414 Request-URI Too Large returned by xxxxxxxxxxxxx.com" 

Possible solutions:

1) Use POST with method="get" so there is no limit on the query size

2) Adjust queries to avoid passing the "tracked" UUID list in the first place.
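
For illustration only (crunch-dispatch-slurm itself is written in Go; the Python SDK, batch size, and selected fields below are assumptions), a minimal sketch of the idea behind either fix: keep any UUID list used in a filter small enough that a request never approaches the URI size limit.

    import arvados

    def list_containers_in_batches(tracked_uuids, batch_size=100):
        # Hypothetical helper: query container state in small batches so a
        # ["uuid", "in", ...] filter never produces an oversized request URI.
        api = arvados.api('v1')
        items = []
        for i in range(0, len(tracked_uuids), batch_size):
            batch = tracked_uuids[i:i + batch_size]
            page = api.containers().list(
                filters=[["uuid", "in", batch]],
                select=["uuid", "state", "priority"],
                limit=len(batch),
            ).execute()
            items.extend(page["items"])
        return items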

Related to: https://support.curoverse.com/rt/Ticket/Display.html?id=512

12446 [crunch2] crunch-dispatch-slurm monitoring too many containers gets 414 error (Peter Amstutz, 0 hours)
- 12458 Review 12446-dispatcher-query (Tom Clegg)

Subject: [CWL] arv:RunInSingleContainer should take max() of ResourceRequirements of substeps
Tracker ID: Bug
Status: In Progress
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Jiayong Li
Project: Arvados
Release:

When creating an arv:RunInSingleContainer container, arvados-cwl-runner should look at the substeps to determine the maximum expected resource requirements needed to run the container.
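
As a rough sketch of the intended computation (not the actual arvados-cwl-runner code; the field list and helper name are illustrative assumptions), taking the per-field maximum over the substeps' ResourceRequirement entries could look like this:

    # Illustrative only: element-wise max over substep resource requirements,
    # using common CWL ResourceRequirement fields.
    RESOURCE_FIELDS = ("coresMin", "coresMax", "ramMin", "ramMax",
                       "tmpdirMin", "tmpdirMax", "outdirMin", "outdirMax")

    def max_resource_requirements(substep_reqs):
        combined = {}
        for req in substep_reqs:
            for field in RESOURCE_FIELDS:
                if field in req:
                    combined[field] = max(combined.get(field, 0), req[field])
        return combined

    # Example: one substep needs more cores, the other more RAM.
    print(max_resource_requirements([
        {"coresMin": 4, "ramMin": 4096},
        {"coresMin": 1, "ramMin": 16384},
    ]))  # {'coresMin': 4, 'ramMin': 16384}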

11850 [CWL] arv:RunInSingleContainer should take max() of ResourceRequirements of substeps (Jiayong Li, 0 hours, 0.5 points)
- 11879 Review (Peter Amstutz)

Subject: Synchronize group membership with external data source
Tracker ID: Feature
Status: In Progress
Category:
Points: 2.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

As a user in a corporate environment, I want to be able to synchronize the users in my Arvados groups with my corporate directory service (ActiveDirectory, LDAP, etc).

This doesn't need to be instantaneous, but can instead be done either periodically on a schedule or on demand. A script-based solution is an acceptable answer.

Groups which get created by this mechanism get tagged so that they're known to be automatically created. Groups are not given any particular permissions when they are created.

Input is a two-column CSV file with one column of group names and one column of user IDs (either username or user email address), plus a command-line flag which controls whether the user ID is a username or an email address. If a user is no longer included in the input file, they get removed from the group membership.

Workbench needs to be changed to not allow admins to modify group membership for synchronized groups.

Tool should report errors for any users who don't have matching user IDs. Groups which don't exist get created and their UUIDs get reported. If an untagged group exists and is also in the input file, a warning is issued.
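
A minimal sketch of the input handling described above, using the Python SDK; the function name, flag handling, and the TODO items are assumptions rather than the final tool's design:

    import csv
    import sys
    import arvados

    def sync_groups(csv_path, user_id_is_email=False):
        # Read a two-column CSV of (group name, user id) and report users
        # that cannot be matched; group creation and membership updates
        # are left as a sketch.
        api = arvados.api('v1')
        lookup_field = "email" if user_id_is_email else "username"
        memberships = {}  # group name -> set of user UUIDs
        with open(csv_path, newline="") as f:
            for group_name, user_id in csv.reader(f):
                found = api.users().list(
                    filters=[[lookup_field, "=", user_id]]).execute()["items"]
                if not found:
                    print("error: no user with %s %r" % (lookup_field, user_id),
                          file=sys.stderr)
                    continue
                memberships.setdefault(group_name, set()).add(found[0]["uuid"])
        # TODO: create missing (tagged) groups, add/remove membership links,
        # and warn about untagged groups that appear in the input file.
        return memberships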

12018 Synchronize group membership with external data source (Lucas Di Pentima, 0 hours, 2.0 points)
- 12264 Review 12018-sync-groups-tool (Tom Clegg)

Subject: [OPS] cgroup_enable=memory swapaccount=1 grub parameters are not being set
Tracker ID: Bug
Status: In Progress
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Javier Bértoli
Project: Arvados
Release:

#9431 suggests that both compute and shell nodes require these parameters set at boot time (in grub) (see also #7386)

cgroup_enable=memory swapaccount=1

But investigating #12305 Tom found that it's not being set at boot time.

Checking the nodes, the parameters are set in the /etc/default/grub file of the compute nodes but not on the shell nodes, nor are they being propagated to any of the /boot/grub/grub.cfg files.

Running update-grub2 does not seem to propagate the change, even though it IS regenerating the config:

compute7.e51c5:~# grep swapacc /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="console=hvc0 cgroup_enable=memory swapaccount=1" 

compute7.e51c5:~# update-grub2 
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.10.0-33-generic
Found initrd image: /boot/initrd.img-4.10.0-33-generic
Found linux image: /boot/vmlinuz-4.4.0-93-generic
Found initrd image: /boot/initrd.img-4.4.0-93-generic
Found linux image: /boot/vmlinuz-4.4.0-92-generic
Found initrd image: /boot/initrd.img-4.4.0-92-generic
done

compute7.e51c5:~# grep swapacc /boot/grub/grub.cfg 
1!compute7.e51c5:~# 

compute7.e51c5:~# ls -l /boot/grub/grub.cfg
-r--r--r-- 1 root root 9844 Sep 22 18:07 /boot/grub/grub.cfg

compute7.e51c5:~# date
Fri Sep 22 18:09:12 UTC 2017
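
For reference, a small verification sketch (an assumed helper, not part of the fix): after editing /etc/default/grub, running update-grub, and rebooting, the running kernel's /proc/cmdline is the authoritative place to confirm the parameters took effect.

    # Hypothetical check: confirm the cgroup/swap-accounting parameters are
    # present in the grub files and in the running kernel's command line.
    REQUIRED = ("cgroup_enable=memory", "swapaccount=1")

    def check(path):
        with open(path) as f:
            text = f.read()
        missing = [p for p in REQUIRED if p not in text]
        print("%-22s %s" % (path, "OK" if not missing else "missing: " + ", ".join(missing)))

    for path in ("/etc/default/grub", "/boot/grub/grub.cfg", "/proc/cmdline"):
        check(path)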

12307 [OPS] cgroup_enable=memory swapaccount=1 grub parameters are not being set (Javier Bértoli, 0 hours)

Subject: Client support for deleting projects
Tracker ID: Story
Status: In Progress
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

12125 Client support for deleting projects (Peter Amstutz, 0 hours)
- 12441 Review arv-mount trashed support
- 12092 [arv-mount] Add support Trashed event for projects (Peter Amstutz)
- 12266 Review 12125-workbench-project-trash (Peter Amstutz)
- 12091 [Workbench] Add Projects tab to trash page (Peter Amstutz)

Subject: [crunch-run] Handle symlinks with absolute paths into output directory
Tracker ID: Bug
Status: In Progress
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

This is suspicious:

2017-08-28T11:09:17.982121623Z CMD: ln -s /var/spool/cwl/STAR-Fusion_outdir/star-fusion.preliminary/star-fusion.filter.intermediates_dir/star-fusion.filtered /var/spool/cwl/STAR-Fusion_outdir/star-fusion.preliminary/star-fusion.fusion_candidates.preliminary.filtered

This seems to be creating a symlink to an absolute path inside the container. However, crunch-run (which collects the outputs) executes outside the container, which means it cannot dereference symlinks to arbitrary paths inside the container. It is already able to handle symlinks to mounted input files, and relative symlinks within the output directory, but doesn't correctly handle this case of a symlink with an absolute path to another file in the output directory. This should be handled correctly.

Currently it looks like putting a symlink foo->/etc/shadow (or ->../../../../../../etc/shadow) will cause crunch-run to store /etc/shadow from the compute node, not the container. This seems bad. Also, it looks like we follow symlinks to files, but not symlinks to dirs, which seems like a confusing rule.
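
To illustrate the distinction being drawn here (a sketch only; crunch-run is written in Go, and the helper below is hypothetical): a symlink whose absolute target resolves inside the output directory can be rewritten as a relative link and collected, while one that escapes the output directory should be rejected rather than silently copying files from the compute node.

    import os

    def classify_symlink(link_path, output_dir):
        # Decide how a symlink found in the output directory should be handled.
        target = os.readlink(link_path)
        if not os.path.isabs(target):
            target = os.path.join(os.path.dirname(link_path), target)
        resolved = os.path.realpath(target)
        out = os.path.realpath(output_dir) + os.sep
        if resolved.startswith(out):
            # Points back into the output directory: store as a relative link.
            return ("relative", os.path.relpath(resolved, os.path.dirname(link_path)))
        # Escapes the output directory (e.g. ../../../etc/shadow): reject it.
        return ("reject", resolved)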

12183 [crunch-run] Handle symlinks with absolute paths into output directory (Peter Amstutz, 0 hours)
- 12205 Review 12183-crunch-run-symlinks (Peter Amstutz)
- 12312 Fix (Peter Amstutz)

Subject: [keep-web] machine-readable file listings
Tracker ID: Feature
Status: Resolved
Category: Keep
Points: 2.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

Currently, keep-web serves human-readable directory listings using an HTML template but does not offer machine-readable listings.

Machine-readable listings will permit clients to browse data stored in Keep without having to parse collections' manifest_text. For example, without this feature, facilitating collection-browsing for Java programs would require porting the manifest-parsing code to Java.

This should be considered a step toward full WebDAV support in keep-web: if possible, the listing API should be compatible with WebDAV clients. Presumably, the easiest path is to implement a webdav.FileSystem backed by Keep, and use a webdav.Handler to serve PROPFIND requests.
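
Once implemented, a client could fetch a machine-readable listing with a plain WebDAV PROPFIND request. A hedged sketch in Python (the hostname, collection path, and token are placeholders, and the exact URL layout is an assumption):

    import requests

    # Placeholder values; substitute a real keep-web host, collection, and token.
    url = "https://collections.example.com/c=zzzzz-4zz18-xxxxxxxxxxxxxxx/"
    resp = requests.request(
        "PROPFIND", url,
        headers={"Depth": "1", "Authorization": "Bearer YOUR_API_TOKEN"})
    resp.raise_for_status()
    print(resp.text)  # multistatus XML listing the collection's contents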

refs

12216 [keep-web] machine-readable file listings (Tom Clegg, 0 hours, 2.0 points)
- 12443 Review 12216-webdav-list (Tom Clegg)

Subject: Federated user identity which works across a network of Arvados clusters
Tracker ID: Story
Status: In Progress
Category:
Points: 2.0
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

Basic elements:
- a single login server which provides authentication for all clusters in the network
- a single user UUID is used across all clusters in the network.

API server needs two additional features:
1. Validate salted token by contacting origin cluster
2. As an origin cluster, validate a received token from a remote cluster

Validation requests return the user record which is used to populate the local user table, along with an expiration time after which revalidation should occur.

Draft: Federated identity

Migration process from local identity to network identity is separate
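
A rough sketch of feature 1 above, as described (a remote cluster validates a salted token by asking the token's origin cluster for the corresponding user record, caching it until an expiration time); the endpoint choice, cache shape, and TTL are assumptions for illustration:

    import time
    import requests

    _user_cache = {}  # salted token -> (user record, expiry timestamp)

    def validate_remote_token(salted_token, origin_api_host, ttl=300):
        # Ask the origin cluster who this token belongs to, caching the
        # result until an assumed expiration time for revalidation.
        cached = _user_cache.get(salted_token)
        if cached and cached[1] > time.time():
            return cached[0]
        resp = requests.get(
            "https://%s/arvados/v1/users/current" % origin_api_host,
            headers={"Authorization": "Bearer %s" % salted_token})
        resp.raise_for_status()
        user = resp.json()  # used to populate the local user table
        _user_cache[salted_token] = (user, time.time() + ttl)
        return user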

11453 Federated user identity which works across a network of Arvados clusters (Tom Clegg, 0 hours, 2.0 points)
- 11874 [Spike] Prototype federated identity
- 12424 Migration process to convert local user IDs to network cluster IDs
- 12440 Review 11453-federated-tokens (Peter Amstutz)
- 12455 Validate v2-format salted tokens (Tom Clegg)

Subject: Update documentation to reflect split of FUSE driver into its own package
Tracker ID: Bug
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Morris
Project: Arvados
Release:

At least the arv-mount tutorial needs updating: https://doc.arvados.org/user/tutorials/tutorial-keep-mount.html
but perhaps there are other places as well.

There also don't appear to be any installation instructions for the FUSE driver on the documentation web site.

12369 Update documentation to reflect split of FUSE driver into its own package (Tom Morris, 0 hours)
- 12439 Review (Peter Amstutz)

Subject: Update to standard libcloud 2.x
Tracker ID: Story
Status: New
Category:
Points: 0.5
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

We want to stop using a private fork of libcloud, but we need the regression in the current version fixed and released first.

PR: https://github.com/apache/libcloud/pull/1110

Libcloud 2.2.1 was released September 21, 2017.

12268 Update to standard libcloud 2.x (Lucas Di Pentima, 0 hours, 0.5 points)
- 12367 Review (Tom Clegg)

Subject: arv-get should abort on ctrl/C
Tracker ID: Bug
Status: New
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

1423 MiB / 43967 MiB 3.2%^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
2255 MiB / 43967 MiB 5.1%^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
2319 MiB / 43967 MiB 5.3%^C^C^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^C^C^C^C^C^C^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^CTraceback (most recent call last):
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 480, in _headerfunction
def _headerfunction(self, header_line):
KeyboardInterrupt
^C
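
The log shows KeyboardInterrupt surfacing inside the Keep client's header callback without stopping the transfer. One blunt way to make Ctrl-C abort promptly, shown here as an assumption rather than the eventual fix, is to install a SIGINT handler in the main thread that terminates the process immediately:

    import os
    import signal
    import sys

    def _abort(signum, frame):
        # os._exit skips normal cleanup but terminates even if worker threads
        # or C-level callbacks would otherwise swallow the interrupt.
        sys.stderr.write("\narv-get: interrupted\n")
        os._exit(1)

    signal.signal(signal.SIGINT, _abort)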

11519 arv-get should abort on ctrl/C (Lucas Di Pentima, 0 hours)
- 11571 Review (Tom Clegg)

Subject: [SDKs] Fix misleading arv-mount/pysdk error messages by removing obsolete "fetch manifest from Keep" code
Tracker ID: Bug
Status: In Progress
Category: SDKs
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Tom Clegg
Project: Arvados
Release:

Current error message:

2017-03-01 17:27:55 arvados.arvados_fuse[10741] ERROR: Error fetching collection '{{PDH}}': Failed to retrieve collection '{{PDH}}' from either API server (<HttpError 404 when requesting https://tb05z.arvadosapi.com/arvados/v1/collections/{{PDH}}?alt=json returned "Path not found">) or Keep ({{PDH}} not found: http://keep0.tb05z.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden

This should be reported as 404 / not found. The 403 part is a red herring.

Manifests are no longer written to Keep, and even if they were, reading without a permission token will never work on most installations, so this fallback seems pointless.

source:sdk/python/arvados/collection.py#L1362

11220 [SDKs] Fix misleading arv-mount/pysdk error messages by removing obsolete "fetch manifest from Keep" code (Tom Clegg, 0 hours)
- 12438 Review 11220-manifest-fetch-error (Lucas Di Pentima)

Subject: crunch-run not waiting for Docker image to finish loading.
Tracker ID: Bug
Status: Resolved
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

Creating the container fails with "No such image", even though loading the Docker image seemingly succeeds:

2017-10-19T16:41:46.528938874Z Fetching Docker image from collection '997e3bc9579153e723b599b300d9123d+342'
2017-10-19T16:41:46.549138991Z Using Docker image id 'sha256:20a23446d37fac4cb2ec6462310aaee7119f4594948765a3397b13f9baec4812'
2017-10-19T16:41:46.549599576Z Loading Docker image from keep
2017-10-19T16:41:49.814403711Z Creating Docker container
2017-10-19T16:41:49.816658048Z While creating container: Error: No such image: sha256:20a23446d37fac4cb2ec6462310aaee7119f4594948765a3397b13f9baec4812

It seems that loading is asynchronous: the ImageLoad method returns after the data has been transmitted but before the layers have been decompressed. We need to wait for a response indicating that loading is finished.
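
crunch-run talks to the Docker engine API from Go, but the same point can be sketched with the Docker SDK for Python: the load call has to consume the daemon's response stream to completion before the image is guaranteed to be usable. The tar path below is a placeholder.

    import docker

    client = docker.from_env()
    with open("image.tar", "rb") as f:  # placeholder: image tar read from Keep
        # images.load() consumes the daemon's entire progress stream before
        # returning, so the image is fully loaded once this call completes.
        images = client.images.load(f)
    print("loaded:", [img.id for img in images])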

12467 crunch-run not waiting for Docker image to finish loading (Peter Amstutz, 0 hours)
- 12468 Review 12467-read-imgload-response (Tom Clegg)

Subject: Honor Retry-After headers on libcloud exceptions
Tracker ID: Bug
Status: New
Category: Node Manager
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Lucas Di Pentima
Project: Arvados
Release:

There seems to be a bug in libcloud that could make Node Manager behave erratically in certain error situations with cloud providers.
As of libcloud 2.2.1, it seems that neither BaseHTTPError nor RateLimitReachedError gets assigned the Retry-After header value, as it was removed from the exception_from_message() call in libcloud/libcloud/common/base.py.

Furthermore, the Retry-After header should accept a date in the future as detailed on https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html (relevant part copied below):

The Retry-After response-header field can be used with a 503 (Service Unavailable) response to indicate how long the service is expected to be unavailable to the requesting client. This field MAY also be used with any 3xx (Redirection) response to indicate the minimum time the user-agent is asked to wait before issuing the redirected request. The value of this field can be either an HTTP-date or an integer number of seconds (in decimal) after the time of the response.

       Retry-After  = "Retry-After" ":" ( HTTP-date | delta-seconds )

Two examples of its use are

       Retry-After: Fri, 31 Dec 1999 23:59:59 GMT
       Retry-After: 120
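
A small parsing sketch that handles both forms (the helper name is an assumption; whether the fix belongs in libcloud or in Node Manager is left open above):

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def retry_after_seconds(header_value):
        # Return the number of seconds to wait for either form of Retry-After.
        try:
            return max(0, int(header_value))            # delta-seconds form
        except ValueError:
            when = parsedate_to_datetime(header_value)  # HTTP-date form
            return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())

    print(retry_after_seconds("120"))
    print(retry_after_seconds("Fri, 31 Dec 1999 23:59:59 GMT"))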

12318 Honor Retry-After headers on libcloud exceptions (Lucas Di Pentima, 0 hours)
- 12442 Review (Peter Amstutz)

Subject: crunch-run memory leak
Tracker ID: Bug
Status: Resolved
Category:
Points:
Estimation (hours):
Spent Time: 0.0
Remaining (hours):
Assignee: Peter Amstutz
Project: Arvados
Release:

Crunch-run loading a 2 GiB Docker image uses 1.5 GiB of RAM, which is enough on a 3.5 GiB node to prevent fork/exec due to memory pressure.

See https://dev.arvados.org/issues/12433#note-3 and note-4

12447 crunch-run memory leak (Peter Amstutz, 0 hours)
- 12449 Review 12447-crunch-run-leak (Peter Amstutz)

October 20, 2017 18:04:01 +0000