Project

General

Profile

Actions

Bug #21998

closed

Encoding of URLs in directory listing generated by keep-web may be incorrect

Added by Zoë Ma 7 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Story points:
-
Release:
Release relationship:
Auto

Description

Symptom -- it is possible to craft a path such that on the directory-listing web page generated by keep-web, the links may point to incorrect location.

To trigger,

  1. upload a directory named test - %27 with some content to a collection using arv-put,
  2. navigate to the collection's view in WB2, click on the three-dot menu button in the Details panel (top right), click on "Open with 3rd party client", and find the http(s) URL from the "Windows/Macos" tab,
  3. follow that URL in the browser, and we will arrive at the keep-web listing page, on which the incorrect <a> tag appears. Clicking on the link for test - %27 (or files under it) will incur a 404 Not Found error.

(an example in the playground is at https://pirca-4zz18-w0rawdttbs8qz38.collections.pirca.arvadosapi.com/)

Cause -- this ultimately arises from the way the <a> tags' href attribute is generated by keep-web. Basically, right now it relies on the pretty complicated context-aware automatic substitutions provided by the html template engine. We are putting the decoded path directly into the href of a tag <a href="..."> (see source:services/keep-web/handler.go@7f01a21e3f250027510d16c32acd0c62173f1fec#L756 where the path obtained from the FS is associated with the "Name" field of a template-execution context that feeds into the href value). If we instead put the percent-encoded path there, the "complicated rules" will not be triggered, and only HTML-entity translation will be automatically performed.

Solution -- the solution is to apply the `EscapedPath()` method provided by the `url.URL` type to path levels, to encode the path for the href value to be inserted into the template.

There is a related problem with the automatically generated wget command-line example on the page. The URL argument to that command is neither encoded nor single-quoted. If the user, who might not be experienced enough to fix it on their own, pasted the command into the shell, they might be exposed to risk ranging from annoyance to code execution (albeit indirectly).

A safer way (and nicer to the user) is to percent-encode the URL paths and further replace any remaining single-quote character (with %27), then put the single quotes around the encoded URL. This way the shell sees it as a plain string argument and wget sees it as an unambiguous URL.

Some other minor issues include invalid HTML (non-standard "noshade" attribute in <hr> tag etc.)

I've provided a proposed patch that addresses these issues in services/keep-web/handler.go. Currently, the test cases in handler_tests.go rely on some (pretty complicated) regular expressions to scrape the listing page (example: source:services/keep-web/handler_test.go@7f01a21e3f250027510d16c32acd0c62173f1fec#L934). Theoretically this wouldn't work in general, but the practical problem is that with more complicated test inputs (paths) the regular expressions get increasingly hard to write and decipher.

So at the risk of introducing more utility code and dependencies in the tests, I use the HTML parser from the extension package golang.org/x/net/html to "read" the tags and attribute values (we already depend on other packages under golang.org/x/net). In this way, to test whether a link is correct, we can test whether the <a> tag's text value (anchor text) -- which is the decoded path shown to humans -- can be recovered by the decoding the href attribute value, without having to ever compute or hard-code those (encoded) href values in the tests (because the encoded form is non-unique and implementation-specific, in addition to their complexity and illegibility).

To minimize the HTML parsing utility code in the tests, I added some class or id attributes to certain elements in the template, to make it easier to find them with less logic.

Actions #1

Updated by Zoë Ma 7 months ago

GitHub PR created for this issue: https://github.com/arvados/arvados/pull/252

Actions #2

Updated by Peter Amstutz 6 months ago

  • Target version set to Development 2024-08-07 sprint
Actions #3

Updated by Tom Clegg 6 months ago

  • Assigned To set to Tom Clegg
Actions #4

Updated by Tom Clegg 6 months ago

  • Status changed from New to In Progress

Updated some comments and gofmt. Now at 4e82a1ab11964ba710b4a342a139663cc87c780c.

Tested on tordo.

before: 404 / terribly wrong

https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test%27%22&%23$%2500@%21%3Cpre%3E%3C/test%27%22&#$%00@!%3cpre%3e%3c.jpg
wget --mirror --no-parent --no-host --cut-dirs=1 https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test'"&#$%00@!< pre></

(I inserted a space in < pre above to prevent it from becoming an html tag in the redmine comment)

after: 200

https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test%27%22&%23$%2500@%21%3Cpre%3E%3C/test%27%22&%23$%2500@%21%3Cpre%3E%3C.jpg
$ wget --mirror --no-parent --no-host --cut-dirs=1 'https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test%27%22&%23$%2500@%21%3Cpre%3E%3C/'

21998-zoe-translates-keep-web-directory-listing-encode-url @ 4e82a1ab11964ba710b4a342a139663cc87c780c -- developer-run-tests: #4363

Actions #5

Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved
Actions #6

Updated by Peter Amstutz 5 months ago

  • Release set to 70
Actions

Also available in: Atom PDF