Bug #21998
closedEncoding of URLs in directory listing generated by keep-web may be incorrect
Description
Symptom -- it is possible to craft a path such that on the directory-listing web page generated by keep-web, the links may point to incorrect location.
To trigger,
- upload a directory named
test - %27
with some content to a collection usingarv-put
, - navigate to the collection's view in WB2, click on the three-dot menu button in the Details panel (top right), click on "Open with 3rd party client", and find the http(s) URL from the "Windows/Macos" tab,
- follow that URL in the browser, and we will arrive at the keep-web listing page, on which the incorrect
<a>
tag appears. Clicking on the link fortest - %27
(or files under it) will incur a 404 Not Found error.
(an example in the playground is at https://pirca-4zz18-w0rawdttbs8qz38.collections.pirca.arvadosapi.com/
)
Cause -- this ultimately arises from the way the <a>
tags' href
attribute is generated by keep-web. Basically, right now it relies on the pretty complicated context-aware automatic substitutions provided by the html template engine. We are putting the decoded path directly into the href
of a tag <a href="...">
(see source:services/keep-web/handler.go@7f01a21e3f250027510d16c32acd0c62173f1fec#L756 where the path obtained from the FS is associated with the "Name"
field of a template-execution context that feeds into the href
value). If we instead put the percent-encoded path there, the "complicated rules" will not be triggered, and only HTML-entity translation will be automatically performed.
Solution -- the solution is to apply the `EscapedPath()` method provided by the `url.URL` type to path levels, to encode the path for the href
value to be inserted into the template.
There is a related problem with the automatically generated wget
command-line example on the page. The URL argument to that command is neither encoded nor single-quoted. If the user, who might not be experienced enough to fix it on their own, pasted the command into the shell, they might be exposed to risk ranging from annoyance to code execution (albeit indirectly).
A safer way (and nicer to the user) is to percent-encode the URL paths and further replace any remaining single-quote character (with %27
), then put the single quotes around the encoded URL. This way the shell sees it as a plain string argument and wget
sees it as an unambiguous URL.
Some other minor issues include invalid HTML (non-standard "noshade" attribute in <hr>
tag etc.)
I've provided a proposed patch that addresses these issues in services/keep-web/handler.go
. Currently, the test cases in handler_tests.go
rely on some (pretty complicated) regular expressions to scrape the listing page (example: source:services/keep-web/handler_test.go@7f01a21e3f250027510d16c32acd0c62173f1fec#L934). Theoretically this wouldn't work in general, but the practical problem is that with more complicated test inputs (paths) the regular expressions get increasingly hard to write and decipher.
So at the risk of introducing more utility code and dependencies in the tests, I use the HTML parser from the extension package golang.org/x/net/html
to "read" the tags and attribute values (we already depend on other packages under golang.org/x/net
). In this way, to test whether a link is correct, we can test whether the <a>
tag's text value (anchor text) -- which is the decoded path shown to humans -- can be recovered by the decoding the href
attribute value, without having to ever compute or hard-code those (encoded) href
values in the tests (because the encoded form is non-unique and implementation-specific, in addition to their complexity and illegibility).
To minimize the HTML parsing utility code in the tests, I added some class
or id
attributes to certain elements in the template, to make it easier to find them with less logic.
Updated by Zoë Ma 7 months ago
GitHub PR created for this issue: https://github.com/arvados/arvados/pull/252
Updated by Peter Amstutz 6 months ago
- Target version set to Development 2024-08-07 sprint
Updated by Tom Clegg 6 months ago
- Status changed from New to In Progress
Updated some comments and gofmt. Now at 4e82a1ab11964ba710b4a342a139663cc87c780c.
Tested on tordo.
before: 404 / terribly wrong
https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test%27%22&%23$%2500@%21%3Cpre%3E%3C/test%27%22&#$%00@!%3cpre%3e%3c.jpg wget --mirror --no-parent --no-host --cut-dirs=1 https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test'"&#$%00@!< pre></
(I inserted a space in < pre
above to prevent it from becoming an html tag in the redmine comment)
after: 200
https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test%27%22&%23$%2500@%21%3Cpre%3E%3C/test%27%22&%23$%2500@%21%3Cpre%3E%3C.jpg $ wget --mirror --no-parent --no-host --cut-dirs=1 'https://tordo-4zz18-3xhjf9oyo5r1d75.collections.tordo.arvadosapi.com/test%27%22&%23$%2500@%21%3Cpre%3E%3C/'
21998-zoe-translates-keep-web-directory-listing-encode-url @ 4e82a1ab11964ba710b4a342a139663cc87c780c -- developer-run-tests: #4363
Updated by Tom Clegg 6 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|744d3dd89c7da002c5dbd0566b374e6a7546b58e.