Bug #14482
closed[SDKs] Handle "." empty directory placeholder in manifests in Ruby SDK
Added by Tom Clegg about 6 years ago. Updated over 5 years ago.
Description
- In a stream or file name, "\056" and "." are equivalent
- Example: "0:0:\057" is not a valid file token, because "\057" is an encoding of "/"
- Example: "0:0:foo\057/bar" is not a valid file token, because "\057" is an encoding of "/" and double slashes are not allowed
Rails API manifest validation should accept a file token with size 0 and basename ".".
Ruby SDK (Rails API + Workbench) should not present a file named "." when encountering a file token like "0:0:." or "0:0:\056".
Updated by Tom Clegg about 6 years ago
- Description updated (diff)
- Target version set to Arvados Future Sprints
Updated by Tom Morris almost 6 years ago
- Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Updated by Lucas Di Pentima almost 6 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima almost 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima almost 6 years ago
Updates at 61248e845 - branch 14482-rubysdk-empty-dir
Test run: https://ci.curoverse.com/job/developer-run-tests/1024/
- Updated workbench to not include/count
"."
files on collection detail page. - Updated Ruby SDK's manifest validation to handle escaped chars.
- Added tests
Updated by Tom Clegg almost 6 years ago
It looks like the purpose of the "except" arg to unescape() in Keep::Manifest is to let validate!()
unescape enough to detect "/../", without unescaping to the point that valid tokens look invalid. This seems fragile to me.
Instead, could we have validate!()
pass the unescaped token to STRICT_*_TOKEN_REGEXP, then unescape it fully, then check for stuff like "/../"? I think that would avoid the subtlety of "partially unescaped token".
(One unintended consequence of the "partially unescaped" approach is that the current code incorrectly accepts ". d41d8cd98f00b204e9800998ecf8427e+0 \\040:\\040:foo.txt\n"
as a valid manifest.)
Updated by Lucas Di Pentima almost 6 years ago
Tom Clegg wrote:
[..]
Instead, could we havevalidate!()
pass the unescaped token to STRICT_*_TOKEN_REGEXP, then unescape it fully, then check for stuff like "/../"? I think that would avoid the subtlety of "partially unescaped token".
[...]
If by this you wanted to say "... pass the escaped token to STRICT_*_TOKEN_REGEXP, then unescape it fully...", some tests fail when the manifest includes a valid escaped char like .\057Hello
as a stream name token. Should I adapt STRICT_*_TOKEN_REGEXP to take into consideration this case? I think the cleaner way would be to add the whitespace on the regexes instead of trying to match escaped chars. The token was already splitted by " " so any new whitespace found on the tokens is clearly a result of string unescaping. What do you think?
Updated by Tom Clegg almost 6 years ago
Oops, yes, I did mean escaped (as in "un-unescaped" -- before decoding).
Checking whether a token is valid (e.g., contains no tab chars) has to be done before decoding, otherwise it's too late to tell the difference between the two encodings TAB (disallowed) and \011 (allowed).
Checking whether a filename is valid (stream name is "." or "./whatever", path component cannot be "..") has to be done after decoding, otherwise there are too many ways to spell "foo/../bar" (e.g., "foo\057\056./bar") to accommodate with a reasonable regexp.
Updated by Lucas Di Pentima almost 6 years ago
- Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
Updated by Lucas Di Pentima almost 6 years ago
Tom Clegg wrote:
Checking whether a token is valid (e.g., contains no tab chars) has to be done before decoding, otherwise it's too late to tell the difference between the two encodings TAB (disallowed) and \011 (allowed).
Assuming that STRICT_STREAM_TOKEN_REGEXP
is correct as it is right now in master
, this stream token isn't valid: ".\057Folder"
... because the regex expects a forward slash, not its encoded version. Is that what we want or should I add \057
matching to the strict token regexp?
Updated by Tom Clegg almost 6 years ago
- valid stream name token as it appears in the manifest (no whitespace or control chars)
- valid stream name after decoding (no "..", no trailing slash, must be "." or start with "./", ...)
- valid filename token as it appears in the manifest (\d+:\d+:..., no whitespace or control chars)
- valid filename after decoding (no "..", no trailing slash, ...)
Updated by Tom Clegg almost 6 years ago
Stream token examples
encoding | content | example |
valid | valid | ./foo |
valid | valid | .\057foo |
valid | valid | \056\057foo |
valid | valid | ./\134444 |
invalid | n/a | ./\\444 |
valid | valid | ./\011foo |
valid | invalid | ./\011/.. |
valid | invalid | .foo |
valid | invalid | .\056\057 |
valid | invalid | .\057\056 |
invalid | n/a | ./<tab>foo |
invalid | n/a | ./foo\ |
invalid | n/a | ./foo\r |
invalid | n/a | ./foo\444 |
invalid | n/a | ./foo\888 |
File token examples
encoding | content | example |
valid | valid | 0:0:foo |
valid | valid | 0:0:foo\057bar |
valid | valid | 0:0:\072 |
valid | invalid | 0:0:.. |
valid | invalid | 0:0:\056\056 |
valid | invalid | 0:0:\056\056\057foo |
invalid | n/a | 0\0720\072foo |
invalid | n/a | \060:\060:foo |
Updated by Lucas Di Pentima almost 6 years ago
Updates at 75487f9f9
Test run: https://ci.curoverse.com/job/developer-run-tests/1036/
- Every stream & file token is checked in a 2-stage process, first checking for token format correctness, then un-escaping it and checking the result to see if the names are also acceptable.
- Escaped chars >
\377
aren't allowed. - Added examples in previous comment as test cases.
Updated by Tom Clegg almost 6 years ago
Hm. I still think we should be more strict about escape sequences. I've edited the above table to call "\\444" invalid since according to Keep manifest format backslash is spelled \134
. Could we do this, to accept valid escape sequences and reject all other uses of backslash?
STREAM_TOKEN_REGEXP = /^([^\000-\040\\]|\\[0-3][0-7][0-7])+$/
Added \888, \r, and trailing \ as invalid in note-14.
Updated by Lucas Di Pentima almost 6 years ago
Addressed above suggestions on 9363a29fa
Test run: https://ci.curoverse.com/job/developer-run-tests/1038/
Updated by Tom Clegg almost 6 years ago
LGTM, thanks!
As discussed offline, after this is merged and pushed to rubygems, we should make a script/one-liner that fetches all manifests (including trash and old versions) from a cluster and validates them. This will give us/ops a chance to check whether any existing manifests will become invalid when updating the API server bundle to this Ruby SDK version. If we find any such problems we might consider other options before updating the bundle.
Updated by Lucas Di Pentima almost 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|7337b18bf7b6996a7fe4df0aba5356a03bda452d.
Updated by Lucas Di Pentima almost 6 years ago
Tried the following script with the new Ruby SDK against 400k collections on 4xphq
. Got just 1 invalid collection: 4xphq-4zz18-67yvwc9c4wx75xy
(A test collection created by me when working on the Python SDK side of this story)
require 'arvados'
require 'arvados/keep'
api = Arvados.new
offset = 0
batch_size = 100
invalid = []
while true
begin
req = api.collection.index(
:select => [:uuid, :manifest_text],
:include_trash => true, :include_old_versions => true,
:limit => batch_size, :offset => offset)
rescue
puts invalid
raise
end
req[:items].each do |col|
begin
Keep::Manifest.validate! col[:manifest_text]
rescue
puts "Collection #{col[:uuid]} manifest not valid"
invalid << col[:uuid]
end
end
puts "Checked #{offset} / #{req[:items_available]} - Invalid: #{invalid.size}"
offset += req[:limit]
break if offset > req[:items_available]
end
puts invalid