Bug #14074
closeda-c-r does not work with input keep locators containing '#' characters
Description
It appears that if an input locator filename contains a '#' character, arvados-cwl-runner (a-c-r) does not handle this properly, logging an error such as `Cannot make job: ../../lib/cwl/cwl.input.json:1:237: Don't know what to do with 'keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6#1/25454_6#1.cram'`
2018-08-18T23:19:58.681473932Z cwltool INFO: /usr/bin/arvados-cwl-runner 1.1.4.20180607205857, arvados-python-client 1.1.4.20180607152043, cwltool 1.0.20180524215209 2018-08-18T23:19:58.685007065Z cwltool INFO: Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main' 2018-08-18T23:21:41.422453390Z cwltool INFO: [workflow workflow.json#main] start 2018-08-18T23:21:41.493540795Z cwltool INFO: [step cram_get_fasta] start 2018-08-18T23:21:41.533163404Z cwltool INFO: [workflow cram_get_fasta] start 2018-08-18T23:21:41.581015548Z cwltool INFO: [step samtools_seq_cache_populate] start 2018-08-18T23:21:45.865025554Z arvados.cwl-runner INFO: [container samtools_seq_cache_populate] reused container eglyx-dz642-oqkg4rvdy6g9298 2018-08-18T23:21:53.454735004Z arvados.cwl-runner INFO: [container samtools_seq_cache_populate] eglyx-xvhdp-gbajxzk1vtcotfy is Final 2018-08-18T23:21:53.737324987Z arvados.cwl-runner INFO: [container samtools_seq_cache_populate] Intermediate output eglyx-4zz18-34nv1m01g1jniq9 (3184b10f225a48f1c79cbb29dc2ab9c7+333031) will be trashed at 2018-08-25 23:21:45 UTC. 2018-08-18T23:21:54.784467630Z cwltool INFO: [step samtools_seq_cache_populate] completed success 2018-08-18T23:21:54.816996868Z cwltool INFO: [step samtools_fastaref] start 2018-08-18T23:21:55.829041349Z cwltool ERROR: Exception on step 'samtools_fastaref' 2018-08-18T23:21:55.829041349Z cwltool ERROR: [step samtools_fastaref] Cannot make job: ../../lib/cwl/cwl.input.json:1:237: Don't know what to do with 'keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6#1/25454_6#1.cram' 2018-08-18T23:21:55.829041349Z cwltool INFO: [workflow cram_get_fasta] completed permanentFail 2018-08-18T23:21:55.829041349Z cwltool WARNING: [step cram_get_fasta] completed permanentFail 2018-08-18T23:21:55.829041349Z cwltool INFO: [workflow workflow.json#main] completed permanentFail 2018-08-18T23:21:55.829041349Z arvados.cwl-runner WARNING: Overall process status is permanentFail 2018-08-18T23:21:56.364305285Z arvados.cwl-runner INFO: Final output collection cf4ca1135b9a6d28f6d8141347be6563+61 "Output of main (2018-08-18T23:21:56.111Z)" (eglyx-4zz18-wpk8eucgv6naecz) 2018-08-18T23:21:57.139935452Z cwltool WARNING: Final process status is permanentFail
Updated by Joshua Randall over 6 years ago
This problem can be avoided by urlencoding the portion of the keep locator after the PDH.
e.g.`keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6#1/25454_6#1.cram` becomes `keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6%231%2F25454_6%231.cram`
Updated by Joshua Randall over 6 years ago
Actually the URL encoding causes a problem later on. This looks like it may have been just a YAML quoting issue, or at least, it appears that the problem can be avoided by quoting the keep URL string in the inputs YAML (although note that is should not be necessary to quote a string just because it has a '#' character in it, as long as the '#' character it not separated by whitespace.
Updated by Joshua Randall over 6 years ago
I've checked the `/var/lib/cwl/cwl.input.json` passed into one of these failing runner containers (i.e. the failure is coming from the a-c-r instance that has been submitted using `--submit`).
In it, the record for one of these failing files looks something like this:
"library_cram": {"secondaryFiles": [{"class": "File", "location": "keep:73d210db5daa1ba0050dfec60ca16626+20156/26298_3#1/26298_3#1.cram.crai", "basename": "26298_3"}], "class": "File", "location": "keep:73d210db5daa1ba0050dfec60ca16626+20156/26298_3#1/26298_3#1.cram", "basename": "26298_3"}
Note that the basename field has been truncated at the `#` character rather than being the full filename.
Updated by Joshua Randall over 6 years ago
Because cwltool parses the location as a URL, the `#` characters are being interpreted as delimiting the fragment identifier from the rest of the URL, and so basename is being set to what it thinks is the path component.
If we urlencode just the `#` characters (but not the `/` path separators) then it gets farther but yields an explicit exception from cwltool `Invalid filename: '26298_3#1.cram' contains illegal characters`
It appears that cwltool explicitly disallows any characters in filenames other than:
ACCEPTLIST_EN_STRICT_RE = re.compile(r"^[a-zA-Z0-9._+-]+$")
I did not realize filenames in CWL were so restrictive - have not found reference to this in the spec, perhaps it is just a cwltool thing.
Updated by Joshua Randall over 6 years ago
Ok, I see now in the spec for File (https://www.commonwl.org/v1.0/CommandLineTool.html#File) it does say that `path` cannot have any shell metacharacters or disallowed IDNA characters. Since `path` is set by the implementation, I didn't think I needed to worry about this.
However, on closer reading it seems that there is a chain of rules that means the implementations hands are tied:
- the final `path` component must match the value of `basename
- if not provided, `basename` must be set based on the `location` field by taking the final path component after parsing as an IRI
This implies that there is a rule that the `location` IRIs final path component must not contain any shell metacharacters or disallowed IDNA characters (even if they are escaped!?!).
I guess the issue here is that it would be better if a-c-r would check its keep URIs to see if this condition is satisfied, and if not to suggest to the user that they set `basename` on the input to a safe value.
Updated by Joshua Randall over 6 years ago
I think that having inputs such as this should work:
library_cram: secondaryFiles: - basename: "26216_2_1.cram.crai" class: File location: "keep:a929679c8e11147216dba8869ae754c1+17972/26216_2%231/26216_2%231.cram.crai" basename: "26216_2_1.cram" class: File location: "keep:a929679c8e11147216dba8869ae754c1+17972/26216_2%231/26216_2%231.cram"
I.e. the path segments in the keep URL are url escaped and the basename is set so that a nice friendly filename can be used inside the container.
However, this fails with a-c-r with a message like this:
2018-08-23T04:52:20.339088302Z cwltool INFO: [step samtools_fastaref] start 2018-08-23T04:52:20.346879754Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0 2018-08-23T04:52:20.380719351Z cwltool ERROR: Unexpected exception 2018-08-23T04:52:20.380719351Z Traceback (most recent call last): 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/cwltool/workflow.py", line 645, in job 2018-08-23T04:52:20.380719351Z **kwargs): 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/cwltool/command_line_tool.py", line 382, in job 2018-08-23T04:52:20.380719351Z builder.pathmapper = self.makePathMapper(reffiles, builder.stagedir, **make_path_mapper_kwargs) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados_cwl/arvtool.py", line 30, in makePathMapper 2018-08-23T04:52:20.380719351Z **kwargs) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados_cwl/pathmapper.py", line 53, in __init__ 2018-08-23T04:52:20.380719351Z super(ArvPathMapper, self).__init__(referenced_files, input_basedir, None) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/cwltool/pathmapper.py", line 218, in __init__ 2018-08-23T04:52:20.380719351Z self.setup(dedup(referenced_files), basedir) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados_cwl/pathmapper.py", line 168, in setup 2018-08-23T04:52:20.380719351Z self.addentry(srcobj, c, ".", remap) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados_cwl/pathmapper.py", line 105, in addentry 2018-08-23T04:52:20.380719351Z c.copy(srcpath, path + "/" + obj["basename"], source_collection=src, overwrite=True) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados/arvfile.py", line 472, in must_be_writable_wrapper 2018-08-23T04:52:20.380719351Z return orig_func(self, *args, **kwargs) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados/arvfile.py", line 273, in synchronized_wrapper 2018-08-23T04:52:20.380719351Z return orig_func(self, *args, **kwargs) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 902, in copy 2018-08-23T04:52:20.380719351Z source_obj, target_dir, target_name = self._get_src_target(source, target_path, source_collection, True) 2018-08-23T04:52:20.380719351Z File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 849, in _get_src_target 2018-08-23T04:52:20.380719351Z raise IOError(errno.ENOENT, "File not found", source) 2018-08-23T04:52:20.380719351Z IOError: [Errno 2] File not found: '25154_8%231/25154_8%231.cram' 2018-08-23T04:52:20.380766018Z cwltool ERROR: [step samtools_fastaref] Cannot make job: [Errno 2] File not found: '25154_8%231/25154_8%231.cram'
It seems that the keep URL is not unescaped before it looks for the files in keep?
Updated by Joshua Randall over 6 years ago
The problem appears to be in CollectionFsAccess get_collection:
def get_collection(self, path): sp = path.split("/", 1) p = sp[0] if p.startswith("keep:") and arvados.util.keep_locator_pattern.match(p[5:]): pdh = p[5:] return (self.collection_cache.get(pdh), sp[1] if len(sp) == 2 else None) else: return (None, path)
I believe both occurrences of `p[5:]` should be replaced with `urllib.unquote(p[5:])`
Updated by Tom Clegg over 6 years ago
Related fix fc582bfe0561092165f45b7b6f4d0025892aebfd merged in #13931
Updated by Tom Clegg over 6 years ago
- Related to Bug #13931: [CWL] size field not accessible on arvados added
Updated by Joshua Randall over 6 years ago
- Status changed from New to Resolved
I can confirm that fc582bfe0561092165f45b7b6f4d0025892aebfd fixes this issue. Keep URI path segments in CWL inputs still need to be URL encoded, and basename must be specified that does not contain disallowed characters.