Bug #14482

[SDKs] Handle "." empty directory placeholder in manifests in Ruby SDK

Added by Tom Clegg 3 months ago. Updated 26 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/07/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Rails API manifest validation should handle escaped characters correctly:
  • In a stream or file name, "\056" and "." are equivalent
  • Example: "0:0:\057" is not a valid file token, because "\057" is an encoding of "/"
  • Example: "0:0:foo\057/bar" is not a valid file token, because "\057" is an encoding of "/" and double slashes are not allowed

Rails API manifest validation should accept a file token with size 0 and basename ".".

Ruby SDK (Rails API + Workbench) should not present a file named "." when encountering a file token like "0:0:." or "0:0:\056".

See Keep manifest format


Subtasks

Task #14665: Review 14482-rubysdk-empty-dirResolvedLucas Di Pentima

Associated revisions

Revision 7337b18b
Added by Lucas Di Pentima about 1 month ago

Merge branch '14482-rubysdk-empty-dir'
Closes #14482

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#2 Updated by Tom Clegg 3 months ago

  • Description updated (diff)
  • Target version set to Arvados Future Sprints

#3 Updated by Tom Morris about 2 months ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint

#4 Updated by Lucas Di Pentima about 2 months ago

  • Assigned To set to Lucas Di Pentima

#5 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from New to In Progress

#6 Updated by Lucas Di Pentima about 2 months ago

Updates at 61248e845 - branch 14482-rubysdk-empty-dir
Test run: https://ci.curoverse.com/job/developer-run-tests/1024/

  • Updated workbench to not include/count "." files on collection detail page.
  • Updated Ruby SDK's manifest validation to handle escaped chars.
  • Added tests

#7 Updated by Tom Clegg about 1 month ago

It looks like the purpose of the "except" arg to unescape() in Keep::Manifest is to let validate!() unescape enough to detect "/../", without unescaping to the point that valid tokens look invalid. This seems fragile to me.

Instead, could we have validate!() pass the unescaped token to STRICT_*_TOKEN_REGEXP, then unescape it fully, then check for stuff like "/../"? I think that would avoid the subtlety of "partially unescaped token".

(One unintended consequence of the "partially unescaped" approach is that the current code incorrectly accepts ". d41d8cd98f00b204e9800998ecf8427e+0 \\040:\\040:foo.txt\n" as a valid manifest.)

#8 Updated by Lucas Di Pentima about 1 month ago

Tom Clegg wrote:

[..]
Instead, could we have validate!() pass the unescaped token to STRICT_*_TOKEN_REGEXP, then unescape it fully, then check for stuff like "/../"? I think that would avoid the subtlety of "partially unescaped token".
[...]

If by this you wanted to say "... pass the escaped token to STRICT_*_TOKEN_REGEXP, then unescape it fully...", some tests fail when the manifest includes a valid escaped char like .\057Hello as a stream name token. Should I adapt STRICT_*_TOKEN_REGEXP to take into consideration this case? I think the cleaner way would be to add the whitespace on the regexes instead of trying to match escaped chars. The token was already splitted by " " so any new whitespace found on the tokens is clearly a result of string unescaping. What do you think?

#9 Updated by Lucas Di Pentima about 1 month ago

An example: 10a159a92

#10 Updated by Tom Clegg about 1 month ago

Oops, yes, I did mean escaped (as in "un-unescaped" -- before decoding).

Checking whether a token is valid (e.g., contains no tab chars) has to be done before decoding, otherwise it's too late to tell the difference between the two encodings TAB (disallowed) and \011 (allowed).

Checking whether a filename is valid (stream name is "." or "./whatever", path component cannot be "..") has to be done after decoding, otherwise there are too many ways to spell "foo/../bar" (e.g., "foo\057\056./bar") to accommodate with a reasonable regexp.

#11 Updated by Lucas Di Pentima about 1 month ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint

#12 Updated by Lucas Di Pentima about 1 month ago

Tom Clegg wrote:

Checking whether a token is valid (e.g., contains no tab chars) has to be done before decoding, otherwise it's too late to tell the difference between the two encodings TAB (disallowed) and \011 (allowed).

Assuming that STRICT_STREAM_TOKEN_REGEXP is correct as it is right now in master, this stream token isn't valid: ".\057Folder" ... because the regex expects a forward slash, not its encoded version. Is that what we want or should I add \057 matching to the strict token regexp?

#13 Updated by Tom Clegg about 1 month ago

I think in order to do this correctly we need separate regexps/checks for
  • valid stream name token as it appears in the manifest (no whitespace or control chars)
  • valid stream name after decoding (no "..", no trailing slash, must be "." or start with "./", ...)
  • valid filename token as it appears in the manifest (\d+:\d+:..., no whitespace or control chars)
  • valid filename after decoding (no "..", no trailing slash, ...)

#14 Updated by Tom Clegg about 1 month ago

Stream token examples

encoding content example
valid valid ./foo
valid valid .\057foo
valid valid \056\057foo
valid valid ./\134444
invalid n/a ./\\444
valid valid ./\011foo
valid invalid ./\011/..
valid invalid .foo
valid invalid .\056\057
valid invalid .\057\056
invalid n/a ./<tab>foo
invalid n/a ./foo\
invalid n/a ./foo\r
invalid n/a ./foo\444
invalid n/a ./foo\888

File token examples

encoding content example
valid valid 0:0:foo
valid valid 0:0:foo\057bar
valid valid 0:0:\072
valid invalid 0:0:..
valid invalid 0:0:\056\056
valid invalid 0:0:\056\056\057foo
invalid n/a 0\0720\072foo
invalid n/a \060:\060:foo

#15 Updated by Lucas Di Pentima about 1 month ago

Updates at 75487f9f9
Test run: https://ci.curoverse.com/job/developer-run-tests/1036/

  • Every stream & file token is checked in a 2-stage process, first checking for token format correctness, then un-escaping it and checking the result to see if the names are also acceptable.
  • Escaped chars >\377 aren't allowed.
  • Added examples in previous comment as test cases.

#16 Updated by Tom Clegg about 1 month ago

Hm. I still think we should be more strict about escape sequences. I've edited the above table to call "\\444" invalid since according to Keep manifest format backslash is spelled \134. Could we do this, to accept valid escape sequences and reject all other uses of backslash?

STREAM_TOKEN_REGEXP = /^([^\000-\040\\]|\\[0-3][0-7][0-7])+$/

Added \888, \r, and trailing \ as invalid in note-14.

#18 Updated by Tom Clegg about 1 month ago

LGTM, thanks!

As discussed offline, after this is merged and pushed to rubygems, we should make a script/one-liner that fetches all manifests (including trash and old versions) from a cluster and validates them. This will give us/ops a chance to check whether any existing manifests will become invalid when updating the API server bundle to this Ruby SDK version. If we find any such problems we might consider other options before updating the bundle.

#19 Updated by Lucas Di Pentima about 1 month ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#20 Updated by Lucas Di Pentima 26 days ago

Tried the following script with the new Ruby SDK against 400k collections on 4xphq. Got just 1 invalid collection: 4xphq-4zz18-67yvwc9c4wx75xy (A test collection created by me when working on the Python SDK side of this story)

require 'arvados'
require 'arvados/keep'

api = Arvados.new
offset = 0
batch_size = 100
invalid = []

while true
    begin
        req = api.collection.index(
            :select => [:uuid, :manifest_text],
            :include_trash => true, :include_old_versions => true,
            :limit => batch_size, :offset => offset)
    rescue
        puts invalid
        raise
    end

    req[:items].each do |col|
        begin
            Keep::Manifest.validate! col[:manifest_text]
        rescue
            puts "Collection #{col[:uuid]} manifest not valid" 
            invalid << col[:uuid]
        end
    end
    puts "Checked #{offset} / #{req[:items_available]} - Invalid: #{invalid.size}" 
    offset += req[:limit]
    break if offset > req[:items_available]
end

puts invalid

Also available in: Atom PDF