Feature #12216

[keep-web] machine-readable file listings

Added by Tom Clegg 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 10/11/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 2.0

Description

Currently, keep-web serves human-readable directory listings using an HTML template but does not offer machine-readable listings.

Machine-readable listings will permit clients to browse data stored in Keep without having to parse collections' manifest_text. Without them, supporting collection browsing in Java programs, for example, would require porting the manifest-parsing code to Java.

This should be considered a step toward full WebDAV support in keep-web: if possible, the listing API should be compatible with WebDAV clients. Presumably, the easiest path is to implement a webdav.FileSystem backed by Keep, and use a webdav.Handler to serve PROPFIND requests.
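To illustrate what a machine-readable listing looks like from the client side, here is a minimal sketch that parses a WebDAV 207 Multi-Status body (the response to PROPFIND) into tuples. The `parse_multistatus` helper, the sample XML, and the `/example/...` paths are assumptions made for illustration; they are not taken from keep-web.

```python
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # WebDAV elements live in the "DAV:" XML namespace


def parse_multistatus(xml_text):
    """Return a list of (href, is_dir, size) tuples from a Multi-Status body."""
    entries = []
    root = ET.fromstring(xml_text)
    for response in root.iter(DAV + "response"):
        href = response.findtext(DAV + "href")
        # A directory is reported as <resourcetype><collection/></resourcetype>.
        is_dir = response.find(f".//{DAV}resourcetype/{DAV}collection") is not None
        length = response.findtext(f".//{DAV}getcontentlength")
        size = int(length) if length and not is_dir else None
        entries.append((href, is_dir, size))
    return entries


# Hypothetical response body: one subdirectory and one 12-byte file.
SAMPLE = """<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/example/dir1/</D:href>
    <D:propstat>
      <D:prop><D:resourcetype><D:collection/></D:resourcetype></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>/example/file.txt</D:href>
    <D:propstat>
      <D:prop>
        <D:resourcetype/>
        <D:getcontentlength>12</D:getcontentlength>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

entries = parse_multistatus(SAMPLE)
for href, is_dir, size in entries:
    print("dir " if is_dir else "file", href, size)
```

A client in any language with an XML parser can do the same, which is the point of the feature: no manifest parsing is needed on the client.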

refs

Subtasks

Task #12443: Review 12216-webdav-list (Resolved, Tom Clegg)


Related issues

Related to Arvados - Feature #12090: Collections/data access API (New, 2017-08-08)

Related to Arvados - Story #11876: [R SDK] Create a Bioconductor/R SDK (In Progress, 2017-06-20)

Associated revisions

Revision 1b5e5a3e
Added by Tom Clegg 3 months ago

Merge branch '12216-webdav-list'

closes #12216

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 5 months ago

  • Subject changed from "[keep-web] send file listings as JSON if requested by client" to "[keep-web] machine-readable file listings"

#2 Updated by Tom Morris 5 months ago

  • Target version set to Arvados Future Sprints
  • Story points set to 2.0

#3 Updated by Peter Amstutz 4 months ago

We should also consider providing an S3-compatible API.

#4 Updated by Tom Morris 3 months ago

  • Target version changed from Arvados Future Sprints to 2017-10-25 Sprint

#5 Updated by Tom Clegg 3 months ago

  • Assigned To set to Tom Clegg

#6 Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress

#7 Updated by Peter Amstutz 3 months ago

Does this include browsing projects? Probably not, but for the desktop filesystem mount use case it probably should. Ideally it would provide the same FS view as arv-mount.

#8 Updated by Tom Clegg 3 months ago

12216-webdav-list @ a23fa06e9849f2ab76fa271624e22a245c2abc47
  • test case using cadaver client (run-tests.sh now needs "apt install cadaver")
  • manual testing with mount.davfs works (it prompts for user&pass; user can be anything, pass is api token)
Shortcomings/TODO:
  • I gave up trying to make cadaver do http authentication -- I'm guessing the compile-time option to support .netrc is not enabled in the debian package, and it seems to ignore u:p in http://u:p@host:port/
  • "find /mnt" is slow on a collection with many directories. It does one http request per directory instead of using depth>1, and keep-web uses lots of CPU. I think it's doing lots of un-optimized manifest wrangling. My plan is to fix this by caching the http.FileSystem instead of just the collection.
  • There's no new functionality like browsing available collections. You still need to specify the collection ID in the URL in one of the various supported ways. The only difference is that now a webdav client can get the directory listings that used to be available only to a human or html-scraper.
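For anyone scripting against this instead of using cadaver or mount.davfs, the request described above can be sketched as follows. It builds (but does not send) a PROPFIND request with HTTP Basic authentication where the password is the API token and the user name is arbitrary. The `Depth` header values `0`, `1`, and `infinity` come from RFC 4918; whether keep-web honors anything beyond `1` is not established here. The helper name, host name, and token are hypothetical.

```python
import base64
import urllib.request


def build_propfind_request(url, api_token, depth="1", user="anything"):
    """Build (but do not send) a PROPFIND request for a directory listing."""
    # Basic auth: the user name is ignored, the password carries the API token.
    credentials = f"{user}:{api_token}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        # "0" = the resource itself, "1" = plus immediate children.
        "Depth": depth,
    }
    return urllib.request.Request(url, method="PROPFIND", headers=headers)


# Hypothetical host, path, and token; the collection ID still has to be
# part of the URL in one of the supported ways.
req = build_propfind_request("https://keep-web.example/dir1/", "hypothetical-token")
print(req.get_method(), req.full_url, req.get_header("Depth"))
```

Passing `req` to `urllib.request.urlopen` would send it; the response body is the Multi-Status XML listing.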

#9 Updated by Lucas Di Pentima 3 months ago

As far as I can see, this looks good.

I've encountered the cached listing behavior I mentioned on the chat, where a listing gets cached and changes are not reflected. If this is client dependent, maybe it would be safe to force listing cache invalidation to avoid hard to debug issues with webdav clients?

#10 Updated by Tom Clegg 3 months ago

Lucas Di Pentima wrote:

I've encountered the cached listing behavior I mentioned on the chat, where a listing gets cached and changes are not reflected. If this is client dependent, maybe it would be safe to force listing cache invalidation to avoid hard to debug issues with webdav clients?

I'm guessing you're seeing something like this:
  1. Get directory listing from keep-web → receive version 1
  2. Update collection using REST API → current version is 2
  3. Get directory listing from keep-web → receive version 1, but expect version 2

(Is this a more general issue, or is there also something I'm missing that makes cached webdav directory listings more confusing than cached file content?)

A couple of ideas
  • listen for cache invalidation events, either from arvados-ws or more directly from postgresql
  • option for a separate TTL config for the uuid->pdh cache (could be set to zero, or something else shorter than the pdh->manifest cache TTL)

#11 Updated by Tom Clegg 3 months ago

Some follow-up fixes: 12216-webdav-list @ 337de2e3dfeacc5054cb644513be61f5d35585ae
  • allow Authorization header in cross-origin requests (see commit message 337de2e3d)
  • fix crash on some dir-listing reqs with no trailing slash
  • huge performance improvement in 991d7d796 (webdav does a lot more file-opening than I thought -- before this, we were parsing the whole manifest multiple times for each file returned in a dir listing!)
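As a rough illustration of why deferring file opens helps when a directory listing touches many files, here is a minimal sketch of a lazily opened file object. This shows the general technique only and is an assumption about the shape of the fix, not a description of what commit 991d7d796 actually does; `LazyFile` and `make_opener` are hypothetical names.

```python
import io


class LazyFile:
    """Defer an expensive open (e.g. one that parses a manifest) until data is read."""

    def __init__(self, name, size, opener):
        self.name = name
        self.size = size
        self._opener = opener  # callable returning a readable file-like object
        self._handle = None

    def stat(self):
        # Metadata is answered without opening the underlying data.
        return (self.name, self.size)

    def read(self, n=-1):
        if self._handle is None:
            self._handle = self._opener()  # the expensive step happens once, on demand
        return self._handle.read(n)


opens = []  # records which files were really opened


def make_opener(name):
    def opener():
        opens.append(name)
        return io.BytesIO(b"data for " + name.encode())
    return opener


files = [LazyFile(n, 14, make_opener(n)) for n in ("a.txt", "b.txt", "c.txt")]
listing = [f.stat() for f in files]  # a directory listing needs only metadata
opens_after_listing = len(opens)
content = files[1].read()            # only now is b.txt opened
print(listing, opens_after_listing, opens)
```

The listing touches all three entries but triggers no opens; only the file that is actually read pays the cost.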

#12 Updated by Lucas Di Pentima 3 months ago

Latest updates lgtm, lazy file opening is a cool idea!
Regarding cache invalidation, I was seeing something like what you describe: while connected with the cadaver client, I requested a listing, then uploaded something with arv-put and requested a listing again, and got the same output. In my opinion, a uuid->pdh TTL config would be enough and simpler to implement than an event handler.

#13 Updated by Tom Clegg 3 months ago

12216-webdav-list @ ec0c244be178aed7af0cf990a256dda557034b68
  • merged master
  • separate TTL for uuid->pdh cache (default 5 seconds)
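The idea of two caches with separate TTLs can be sketched like this. A portable data hash identifies fixed content, so the pdh->manifest entry can be kept for a long time, while the uuid->pdh entry must expire quickly to notice collection updates. The 5-second value is the default mentioned above; the 300-second value, the `TTLCache` class, and the fake clock are assumptions for illustration and do not reflect keep-web's actual code or config names.

```python
import time


class TTLCache:
    """Tiny cache whose entries expire ttl seconds after they are loaded."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def get(self, key, load):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]
        value = load(key)  # miss or expired: fetch again (ttl=0 always reloads)
        self._entries[key] = (now + self.ttl, value)
        return value


now = [0.0]                          # fake clock so the example is deterministic
clock = lambda: now[0]
current_pdh = {"uuid-1": "pdh-v1"}   # stands in for the API server
manifests = {"pdh-v1": "manifest v1", "pdh-v2": "manifest v2"}

uuid_to_pdh = TTLCache(ttl=5, clock=clock)         # short: collections change
pdh_to_manifest = TTLCache(ttl=300, clock=clock)   # long: hypothetical value


def listing(uuid):
    pdh = uuid_to_pdh.get(uuid, current_pdh.__getitem__)
    return pdh_to_manifest.get(pdh, manifests.__getitem__)


first = listing("uuid-1")
current_pdh["uuid-1"] = "pdh-v2"     # collection updated via the REST API
stale = listing("uuid-1")            # still within 5 s: old version is served
now[0] = 6.0
fresh = listing("uuid-1")            # uuid->pdh entry expired: new version
print(first, stale, fresh, sep=" | ")
```

This reproduces the three-step scenario from comment #10 and shows that the staleness window is bounded by the shorter TTL.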

#14 Updated by Lucas Di Pentima 3 months ago

Updates at ec0c244be178aed7af0cf990a256dda557034b68 LGTM.
Local keep-web tests didn't complain so I suppose we're not testing those TTLs.
Are the cache parameters configurable via keep-web.yml? They're part of the config struct, but I don't know whether they get picked up from the file. If they're configurable, maybe we should document the difference between the two TTLs somewhere.

#15 Updated by Anonymous 3 months ago

  • Status changed from In Progress to Resolved

Applied in changeset 1b5e5a3ef2c174358693f83849f05ed8276be657.

#16 Updated by Tom Clegg 3 months ago

There's an experiment using browser-side JS to get directory listings:

spike-wb-browse-collection @ 20dbebcdd863589f47bce138418cfcacd5f32b2e
