Feature #12216

[keep-web] machine-readable file listings

Added by Tom Clegg 3 months ago. Updated 28 days ago.

Status: Resolved
Start date: 10/11/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%

Category: Keep
Target version: 2017-10-25 Sprint
Story points: 2.0
Remaining (hours): 0.00 hour
Velocity based estimate: 0 days

Description

Currently, keep-web serves human-readable directory listings using an HTML template but does not offer machine-readable listings.

Machine-readable listings will permit clients to browse data stored in Keep without having to parse collections' manifest_text. Without them, each client language needs its own manifest parser: for example, to facilitate collection-browsing for Java programs, we would need to port the manifest-parsing code to Java.

This should be considered a step toward full WebDAV support in keep-web: if possible, the listing API should be compatible with WebDAV clients. Presumably, the easiest path is to implement a webdav.FileSystem backed by Keep, and use a webdav.Handler to serve PROPFIND requests.
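To illustrate what a client gains from PROPFIND-based listings, here is a minimal Python sketch that builds a PROPFIND request and parses a WebDAV Multi-Status response into (href, is_dir, size) tuples. The sample XML and the function names are assumptions made for illustration; the exact properties keep-web returns may differ.

```python
import urllib.request
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # WebDAV XML namespace in ElementTree's {uri}tag notation


def build_propfind(url, depth="1"):
    """Build (but do not send) a PROPFIND request for a directory listing."""
    req = urllib.request.Request(url, method="PROPFIND")
    # Depth "1" asks for the directory itself plus its immediate children.
    req.add_header("Depth", depth)
    return req


def parse_multistatus(xml_text):
    """Return (href, is_dir, size) tuples from a 207 Multi-Status body."""
    entries = []
    for resp in ET.fromstring(xml_text).iter(DAV + "response"):
        href = resp.findtext(DAV + "href")
        # A directory is marked by a <collection/> element in its resourcetype.
        is_dir = resp.find(".//" + DAV + "collection") is not None
        size = resp.findtext(".//" + DAV + "getcontentlength")
        entries.append((href, is_dir, int(size) if size else None))
    return entries


# Hypothetical response body, hand-written for this example.
SAMPLE = """<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/dir/</D:href>
    <D:propstat>
      <D:prop><D:resourcetype><D:collection/></D:resourcetype></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>/dir/foo.txt</D:href>
    <D:propstat>
      <D:prop>
        <D:resourcetype/>
        <D:getcontentlength>3</D:getcontentlength>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""
```

With this, a client needs only an HTTP library and an XML parser, with no manifest-parsing code at all.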

refs

Subtasks

Task #12443: Review 12216-webdav-list (Resolved, Tom Clegg)


Related issues

Related to Arvados - Feature #12090: Collections/data access API New 08/08/2017
Related to Arvados - Story #11876: [R SDK] Create a Bioconductor/R SDK New 06/20/2017

Associated revisions

Revision 1b5e5a3e
Added by Tom Clegg 30 days ago

Merge branch '12216-webdav-list'

closes #12216

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 3 months ago

  • Subject changed from [keep-web] send file listings as JSON if requested by client to [keep-web] machine-readable file listings

#2 Updated by Tom Morris 2 months ago

  • Target version set to Arvados Future Sprints
  • Story points set to 2.0

#3 Updated by Peter Amstutz 2 months ago

We should also consider providing an S3-compatible API.

#4 Updated by Tom Morris about 1 month ago

  • Target version changed from Arvados Future Sprints to 2017-10-25 Sprint

#5 Updated by Tom Clegg about 1 month ago

  • Assignee set to Tom Clegg

#6 Updated by Tom Clegg about 1 month ago

  • Status changed from New to In Progress

#7 Updated by Peter Amstutz about 1 month ago

Does this include browsing projects? (Probably not, but for the desktop filesystem mount use case, it probably should). Ideally it would provide the same FS view as arv-mount.

#8 Updated by Tom Clegg about 1 month ago

12216-webdav-list @ a23fa06e9849f2ab76fa271624e22a245c2abc47
  • test case using cadaver client (run-tests.sh now needs "apt install cadaver")
  • manual testing with mount.davfs works (it prompts for user&pass; user can be anything, pass is api token)
Shortcomings/TODO:
  • I gave up trying to make cadaver do http authentication -- I'm guessing the compile-time option to support .netrc is not enabled in the debian package, and it seems to ignore u:p in http://u:p@host:port/
  • "find /mnt" is slow on a collection with many directories. It does one http request per directory instead of using depth>1, and keep-web uses lots of CPU. I think it's doing lots of un-optimized manifest wrangling. My plan is to fix this by caching the http.FileSystem instead of just the collection.
  • There's no new functionality like browsing available collections. You still need to specify the collection ID in the URL in one of the various supported ways. The only difference is that now a webdav client can get the directory listings that used to be available only to a human or html-scraper.
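For clients that don't prompt the way mount.davfs does, the "user can be anything, pass is api token" convention amounts to an ordinary HTTP Basic Authorization header. A minimal Python sketch follows; the function name and the placeholder username are assumptions for illustration, not part of keep-web.

```python
import base64


def basic_auth_header(api_token, username="anyuser"):
    """Return an HTTP Basic Authorization header value.

    The username is a placeholder (the note above says it can be anything);
    the API token goes in the password position.
    """
    cred = f"{username}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(cred).decode("ascii")
```

The returned value would be sent as the Authorization header on each WebDAV request.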

#9 Updated by Lucas Di Pentima about 1 month ago

As far as I can see, this looks good.

I've encountered the cached listing behavior I mentioned on the chat, where a listing gets cached and changes are not reflected. If this is client dependent, maybe it would be safe to force listing cache invalidation to avoid hard to debug issues with webdav clients?

#10 Updated by Tom Clegg about 1 month ago

Lucas Di Pentima wrote:

I've encountered the cached listing behavior I mentioned on the chat, where a listing gets cached and changes are not reflected. If this is client dependent, maybe it would be safe to force listing cache invalidation to avoid hard to debug issues with webdav clients?

I'm guessing you're seeing something like this:
  1. Get directory listing from keep-web → receive version 1
  2. Update collection using REST API → current version is 2
  3. Get directory listing from keep-web → receive version 1, but expect version 2

(Is this a more general issue, or is there also something I'm missing that makes cached webdav directory listings more confusing than cached file content?)

A couple of ideas
  • listen for cache invalidation events, either from arvados-ws or more directly from postgresql
  • option for a separate TTL config for the uuid→pdh cache (could be set to zero, or something else shorter than the pdh->manifest cache TTL)
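The second idea can be sketched as two caches chained together, each with its own TTL. This is a rough Python illustration under assumed names (TTLCache, CollectionCache, lookup_pdh, fetch_manifest) and assumed TTL values, not keep-web's actual code. A short uuid→pdh TTL bounds how long a stale listing can be served, while manifests keyed by pdh can safely be cached for longer because a given pdh always refers to the same content.

```python
import time


class TTLCache:
    """Cache whose entries expire ttl seconds after being loaded."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._items = {}  # key -> (time loaded, value)

    def get(self, key, load):
        now = self.clock()
        hit = self._items.get(key)
        # With ttl == 0 this test never passes, so every call reloads.
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = load(key)
        self._items[key] = (now, value)
        return value


class CollectionCache:
    """Two-level lookup: uuid -> pdh (short TTL), pdh -> manifest (long TTL)."""

    def __init__(self, lookup_pdh, fetch_manifest,
                 uuid_ttl=5, manifest_ttl=300, clock=time.monotonic):
        # lookup_pdh and fetch_manifest are hypothetical callables standing
        # in for API requests; both TTL values here are assumptions.
        self._lookup_pdh = lookup_pdh
        self._fetch_manifest = fetch_manifest
        self._pdh = TTLCache(uuid_ttl, clock)
        self._manifest = TTLCache(manifest_ttl, clock)

    def manifest_for(self, uuid):
        pdh = self._pdh.get(uuid, self._lookup_pdh)
        return self._manifest.get(pdh, self._fetch_manifest)
```

In the three-step scenario above, step 3 returns version 2 as soon as the uuid→pdh entry has expired, regardless of how long manifests themselves stay cached.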

#11 Updated by Tom Clegg about 1 month ago

Some follow-up fixes: 12216-webdav-list @ 337de2e3dfeacc5054cb644513be61f5d35585ae
  • allow Authorization header in cross-origin requests (see commit message 337de2e3d)
  • fix crash on some dir-listing reqs with no trailing slash
  • huge performance improvement in 991d7d796 (webdav does a lot more file-opening than I thought -- before this, we were parsing the whole manifest multiple times for each file returned in a dir listing!)
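The general idea behind avoiding repeated work per listed file can be illustrated with lazy opening: a directory entry carries only the metadata a listing needs, and the expensive load is deferred until the content is actually read. The Python sketch below shows that technique in isolation; it is not a description of what commit 991d7d796 does, and LazyFile, list_dir, and the loader callable are hypothetical names.

```python
class LazyFile:
    """Directory entry whose content is loaded only on first read."""

    def __init__(self, name, size, loader):
        self.name = name
        self.size = size
        self._loader = loader  # hypothetical callable: name -> bytes
        self._data = None

    def read(self):
        # The expensive load happens at most once, and only when needed.
        if self._data is None:
            self._data = self._loader(self.name)
        return self._data


def list_dir(entries, loader):
    """Build listing entries from (name, size) pairs without loading content."""
    return [LazyFile(name, size, loader) for name, size in entries]
```

A listing built this way never calls the loader unless a client goes on to read a file.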

#12 Updated by Lucas Di Pentima about 1 month ago

Latest updates lgtm, lazy file opening is a cool idea!
Regarding cache invalidation, I was seeing something like what you describe: while connected with the cadaver client, I requested a listing, then uploaded something with arv-put and requested a listing again, and got the same output. In my opinion, a uuid->pdh TTL config would be enough, and simpler to implement than an event handler.

#13 Updated by Tom Clegg about 1 month ago

12216-webdav-list @ ec0c244be178aed7af0cf990a256dda557034b68
  • merged master
  • separate TTL for uuid->pdh cache (default 5 seconds)

#14 Updated by Lucas Di Pentima about 1 month ago

Updates at ec0c244be178aed7af0cf990a256dda557034b68 LGTM.
Local keep-web tests didn't complain so I suppose we're not testing those TTLs.
Are the cache parameters configurable via keep-web.yml? They're part of the config struct, but I don't know if they get picked up from the file. If they are configurable, maybe we should document the difference between the two somewhere.

#15 Updated by Anonymous 30 days ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:1b5e5a3ef2c174358693f83849f05ed8276be657.

#16 Updated by Tom Clegg 30 days ago

There's an experiment using browser-side JS to get directory listings:

spike-wb-browse-collection @ 20dbebcdd863589f47bce138418cfcacd5f32b2e
