Bug #17741

Golang SDK CollectionFileSystem uses time.Now() as default Last-Modified for all collections identified by portable data hash

Added by Joshua Randall 4 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

We are using the Golang SDK to back endpoints on an http data-server that we use as part of our internal data services. This (existing) server serves data from posix backing filesystems and we have now added a keep collection backend to it.

For each endpoint that is backed by a keep collection, we create a CollectionFileSystem using the collection portable data hash (PDH) and then wrap it in an http.FileServer and use that as an http.Handler.

This works great - for a single backend server.

However, we typically have four backends.

Having multiple backends results in a problem in which clients (such as Google Chrome) that are requesting byte ranges (by setting the Range header in an XMLHttpRequest) will set the `If-Range` header on every access to a particular resource after the first one, with the value set to the value returned in the `Last-Modified` response header from the first access. Unfortunately, the CollectionFileSystem on each of our four backends essentially always have a slightly different timestamp. This is a problem because when a subsequent request from a client ends up on a different backend from the first request, the http.FileServer inspects the `If-Range` header and decides that the content it has does not match. The result of that is to respond 200 OK rather than the expected 206 Partial Content. This does not go over well with a client that requested 256kB of data and instead gets a 4GB response.

The problem is here:
https://github.com/arvados/arvados/blob/master/sdk/go/arvados/fs_collection.go#L50-L53

When one asks the Arvados API server for a collection identified by portable data hash rather than UUID, it returns a partial record that contains the manifest text but in which most other fields have been set to uninteresting values. The value chosen for `modified_at` is 0. That makes sense to me because any other value that the API server could come up with would be subject to change, and would thus continue to perpetuate this same problem. When a collection is identified by portable data hash, its `Last-Modified` time should be fixed, and for the API server, 0 seems like an appropriate value.

However, the CollectionFileSystem handles the `0` value by replacing it with `time.Now()` which is what causes this problem - since each backend server started up at a slightly different time, each one has a different idea of when this (same) collection was last modified.

A better way to handle a 0 timestamp might be to choose a valid fixed timestamp value, although I could also see an argument that the current behaviour is safer (e.g. because you don't know at that point where this collection is being mounted and it may be that I change portable_data_hash at which point that should result in a change to the Last-Modified time). I would note that it doesn't really matter to us anymore as we now strip off that header from the response and return an ETag instead (which is just the PDH of the collection).

I do think it would be good if this behaviour could be noted somewhere (probably alongside the docs for the CollectionFileSystem), as it was surprising.

I think the current behaviour of CollectionFileSystem is reasonable when collections are identified by UUID, but when identified by PDH the best practice should be to wrap the http.FileServer in a handler that sets an ETag header (to the PDH of the collection) on the ResponseWriter before passing control to the http.FileServer.

Something like this:

  collectionPdh := "887cd41e9c613463eab2f0d885c6dd96+83" 
  arvadosClient := arvados.NewClientFromEnv()
  keepClient, err := keepclient.MakeKeepClient(arvadosClient)
  if err != nil {
    err = fmt.Errorf("Error making keepClient: %s", err.Error())
    return
  }
  collection := &arvados.Collection{}
  err = arvadosClient.Get("collections", collectionPdh, arvadosclient.Dict{}, collection)
  if err != nil {
    err = fmt.Errorf("Error getting collection %s", err.Error())
    return
  }
  collectionFileSystem, err := collection.FileSystem(client, keepClient)
  if err != nil {
    err = fmt.Errorf("Error setting up FileSystem: %s", err.Error())
    return
  }
  collectionFileServer := http.FileServer(collectionFileSystem)
  http.ServeMux.HandleFunc("/path/to/collection", func(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("ETag", fmt.Sprintf("\"%s\"", collectionPdh))
    collectionFileServer.ServeHTTP(w, r)
  }))

As far as I know there is no interface in http.FileSystem that would allow you to specify the ETag, but http.FileServer does (via serveContent) respect an ETag that has been set on the ResponseWriter.Header() before its ServeHTTP method is invoked. (see: https://golang.org/src/net/http/fs.go#L188).

History

#1 Updated by Tom Clegg 4 months ago

  • Description updated (diff)

#2 Updated by Tom Clegg 4 months ago

This is a good way to handle the cache/timestamp issue. (Just to add to the stdlib reference, http.ServeContent documents this: "If the caller has set w's ETag header formatted per RFC 7232, section 2.3, ServeContent uses it to handle requests using If-Match, If-None-Match, or If-Range.")

keep-web (webdav) should set Etag this way too when serving a collection requested by PDH.

Also available in: Atom PDF