Bug #16100

[keep-web] Avoid sniffing for content type when file extension matches a MIME type

Added by Tom Clegg 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 02/14/2020
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5
Release relationship: Auto

Description

Currently, when serving a GET request for a file, the WebDAV service uses the Go standard library's content sniffing feature to guess an appropriate Content-Type if the filename extension is not listed in /etc/mime.types or a small built-in list of extensions. This is unreliable (and not just hypothetically -- users have been surprised by mysteriously broken previews).

For example, if the /etc/mime.types file does not exist, a file called "bmx.txt" containing the text "BMX bikes are awesome.\n" is currently served with Content-Type: image/bmp because the first two bytes "BM" satisfy the signature for a BMP image file, and this causes it to render incorrectly in the browser.
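
The lookup order described above can be sketched in a few lines of Go. This is only an illustration, not keep-web's actual handler code: the contentType helper and the extTypes map are hypothetical names, with the map standing in for /etc/mime.types plus Go's built-in extension list. http.DetectContentType is the standard library's sniffing function.

```go
package main

import (
	"fmt"
	"net/http"
	"path"
)

// contentType mimics the lookup order described in this ticket: consult
// an extension table first, and fall back to content sniffing only when
// the extension is unknown. extTypes is a hypothetical stand-in for
// /etc/mime.types plus Go's built-in list.
func contentType(name string, head []byte, extTypes map[string]string) string {
	if t, ok := extTypes[path.Ext(name)]; ok {
		return t
	}
	// No entry for this extension: guess from the leading bytes.
	return http.DetectContentType(head)
}

func main() {
	body := []byte("BMX bikes are awesome.\n")

	// Simulates a host with no /etc/mime.types: ".txt" is unknown,
	// so the result comes from sniffing the content.
	fmt.Println(contentType("bmx.txt", body, map[string]string{}))

	// Simulates a host where ".txt" is listed: no sniffing happens.
	withTxt := map[string]string{".txt": "text/plain; charset=utf-8"}
	fmt.Println(contentType("bmx.txt", body, withTxt))
}
```

With an empty table the first call should report image/bmp, matching the behavior reported here, because the sniffer sees the leading "BM" bytes. Once ".txt" is listed, the content is never inspected.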

To avoid this problem:

  • Keep-web OS packages should list the package providing /etc/mime.types ("mailcap" on CentOS, "mime-support" on Debian and Ubuntu) as a dependency.

  • At startup, keep-web should check the MIME type for a common extension like .txt that's not in the built-in list, and log a warning if it's missing.
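
A minimal sketch of such a startup check, assuming a hypothetical helper called missingMimeTypes (the real change may be structured differently). It takes the lookup function as a parameter, with the same signature as the standard library's mime.TypeByExtension, so the check can be exercised without depending on whether the host has /etc/mime.types.

```go
package main

import (
	"log"
	"mime"
)

// missingMimeTypes returns the extensions in probe for which lookup
// reports no MIME type. lookup has the same signature as
// mime.TypeByExtension so that a fake can be substituted in tests.
// The function name and shape are assumptions made for illustration.
func missingMimeTypes(lookup func(string) string, probe []string) []string {
	var missing []string
	for _, ext := range probe {
		if lookup(ext) == "" {
			missing = append(missing, ext)
		}
	}
	return missing
}

func main() {
	// ".txt" is assumed to be absent from Go's built-in table, so an
	// empty result here suggests no system MIME database was loaded.
	if missing := missingMimeTypes(mime.TypeByExtension, []string{".txt"}); len(missing) > 0 {
		log.Printf("warning: no MIME type known for %v; is /etc/mime.types installed?", missing)
	}
}
```

Probing an extension that is only defined by the system database is what makes this a useful signal. An extension from the built-in list would resolve even on a host with no MIME database at all.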


Subtasks

Task #16147: Review 16100-mime-types - Resolved - Tom Clegg

Associated revisions

Revision 0a415b6c
Added by Tom Clegg about 2 months ago

Merge branch '16100-mime-types'

fixes #16100

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Michael Crusoe 2 months ago

Tom Clegg wrote:

Observed behavior: A file called "bmx.txt" containing the text "BMX bikes are awesome.\n" is currently served with Content-Type: image/bmp because the first two bytes "BM" satisfy the signature for a BMP image file, and this causes it to render incorrectly in the browser.

FYI, the unix "file" command correctly identifies said file:

$ echo "BMX bikes are awesome.\n" > bmx.txt 
$ file --mime bmx.txt 
bmx.txt: text/plain; charset=us-ascii
$ file --version
file-5.37

#2 Updated by Tom Clegg about 2 months ago

  • Status changed from New to In Progress
  • Description updated (diff)

#4 Updated by Lucas Di Pentima about 2 months ago

Although Jenkins says everything is fine, I ran the services/keep-web tests on my dev VMs (debian9 and debian10) and I'm getting a failure like this:

[...]
{"health":"OK"}
arv-git-httpd pid 11288 ok
{"health":"OK"}
{"health":"OK"}
ws pid 11304 ok
ARVADOS_TEST_PROXY_SERVICES=1
ARVADOS_API_TOKEN=4axaw8zxe0qm22wa6urpp5nskcne8z88cvbupv653y1njyi05h
ARVADOS_CONFIG=/media/psf/arvados/tmp/arvados.yml
ARVADOS_API_HOST=0.0.0.0:45751
ARVADOS_TEST_API_INSTALLED=10501
ARVADOS_TEST_API_HOST=0.0.0.0:54431
ARVADOS_API_HOST_INSECURE=true
======= test services/keep-web
time="2020-02-14T17:52:07-03:00" level=error msg="stat.Size()==3 but only wrote 0 bytes; read(1024) returns 0, GET acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77 failed: [http://localhost:39073/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: Get http://localhost:39073/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: dial tcp [::1]:39073: connect: connection refused http://localhost:33737/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: Get http://localhost:33737/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: dial tcp [::1]:33737: connect: connection refused http://localhost:39073/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: Get http://localhost:39073/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: dial tcp [::1]:39073: connect: connection refused http://localhost:33737/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: Get http://localhost:33737/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: dial tcp [::1]:33737: connect: connection refused http://localhost:39073/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: Get http://localhost:39073/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: dial tcp [::1]:39073: connect: connection refused http://localhost:33737/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: Get http://localhost:33737/acbd18db4cc2f85cedef654fccc4a4d8+3+A2b8e58eafe2fb5db03583e062bd1aa7871103fa8@5e597d77: dial tcp [::1]:33737: connect: connection refused]" 
2020/02/14 17:52:09 authSettings: map[ARVADOS_API_HOST:0.0.0.0:54431 ARVADOS_API_HOST_INSECURE:true ARVADOS_API_TOKEN:4axaw8zxe0qm22wa6urpp5nskcne8z88cvbupv653y1njyi05h]
child.pid is 11450
child.pid is 11462
{"RequestID":"req-1rjf1w6yc4zgsu5kds3s","level":"info","msg":"request","remoteAddr":"127.0.0.1:50260","reqBytes":0,"reqForwardedFor":"","reqHost":"zzzzz-4zz18-znoi8dpi452e9qi.collections.example.com:36683","reqMethod":"GET","reqPath":"testdata.bin","reqQuery":"","time":"2020-02-14T17:52:12.078006301-03:00"}
[...]

I thought I might have a local problem, but current master runs OK.

#5 Updated by Tom Clegg about 2 months ago

Here, I see that error in the logs but it doesn't cause a test failure. The test does a GET request for a file whose content can't be retrieved. The handler only fails after it's returned 200, and the test doesn't check the response body. Changing it to an integration test makes the logged error go away:

16100-mime-types @ 3836d53ef13841dad652e3faeb20660576279afd -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1735/

#6 Updated by Lucas Di Pentima about 2 months ago

Thanks. Locally I was getting a test failure with errorlevel=29 when running the tests like this:

~/arvados/build/run-tests.sh WORKSPACE=~/arvados CONFIGSRC=~/arvados-test-config --temp ~/.cache/arvados-build --only services/keep-web --skip-install

With this last fix the test passes, and it correctly fails if I move /etc/mime.types somewhere else, with the following message:

----------------------------------------------------------------------
FAIL: handler_test.go:919: IntegrationSuite.TestFileContentType

time="2020-02-17T13:23:06.120200890-03:00" level=warning msg="SystemRootToken missing from cluster config, falling back to ARVADOS_API_TOKEN environment variable" 
time="2020-02-17T13:23:06.120236123-03:00" level=warning msg="Services.Controller.ExternalURL missing from cluster config, falling back to ARVADOS_API_HOST(_INSECURE) environment variables" 
handler_test.go:974:
    c.Check(resp.Header().Get("Content-Type"), check.Equals, trial.contentType)
... obtained string = "image/bmp" 
... expected string = "text/plain; charset=utf-8" 

handler_test.go:974:
    c.Check(resp.Header().Get("Content-Type"), check.Equals, trial.contentType)
... obtained string = "image/bmp" 
... expected string = "image/x-ms-bmp" 

So this LGTM, thanks!!

#7 Updated by Anonymous about 2 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
