Bug #14074

a-c-r does not work with input keep locators containing '#' characters

Added by Joshua Randall over 2 years ago. Updated over 2 years ago.

Assigned To:
Target version:
Start date:
Due date:
% Done:


Estimated time:
Story points:


It appears that if an input locator filename contains a '#' character, arvados-cwl-runner (a-c-r) does not handle this properly, logging an error such as `Cannot make job: ../../lib/cwl/cwl.input.json:1:237: Don't know what to do with 'keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6#1/25454_6#1.cram'`

2018-08-18T23:19:58.681473932Z cwltool INFO: /usr/bin/arvados-cwl-runner, arvados-python-client, cwltool 1.0.20180524215209
2018-08-18T23:19:58.685007065Z cwltool INFO: Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
2018-08-18T23:21:41.422453390Z cwltool INFO: [workflow workflow.json#main] start
2018-08-18T23:21:41.493540795Z cwltool INFO: [step cram_get_fasta] start
2018-08-18T23:21:41.533163404Z cwltool INFO: [workflow cram_get_fasta] start
2018-08-18T23:21:41.581015548Z cwltool INFO: [step samtools_seq_cache_populate] start
2018-08-18T23:21:45.865025554Z arvados.cwl-runner INFO: [container samtools_seq_cache_populate] reused container eglyx-dz642-oqkg4rvdy6g9298
2018-08-18T23:21:53.454735004Z arvados.cwl-runner INFO: [container samtools_seq_cache_populate] eglyx-xvhdp-gbajxzk1vtcotfy is Final
2018-08-18T23:21:53.737324987Z arvados.cwl-runner INFO: [container samtools_seq_cache_populate] Intermediate output eglyx-4zz18-34nv1m01g1jniq9 (3184b10f225a48f1c79cbb29dc2ab9c7+333031) will be trashed at 2018-08-25 23:21:45 UTC.
2018-08-18T23:21:54.784467630Z cwltool INFO: [step samtools_seq_cache_populate] completed success
2018-08-18T23:21:54.816996868Z cwltool INFO: [step samtools_fastaref] start
2018-08-18T23:21:55.829041349Z cwltool ERROR: Exception on step 'samtools_fastaref'
2018-08-18T23:21:55.829041349Z cwltool ERROR: [step samtools_fastaref] Cannot make job: ../../lib/cwl/cwl.input.json:1:237: Don't know what to do with 'keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6#1/25454_6#1.cram'
2018-08-18T23:21:55.829041349Z cwltool INFO: [workflow cram_get_fasta] completed permanentFail
2018-08-18T23:21:55.829041349Z cwltool WARNING: [step cram_get_fasta] completed permanentFail
2018-08-18T23:21:55.829041349Z cwltool INFO: [workflow workflow.json#main] completed permanentFail
2018-08-18T23:21:55.829041349Z arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-08-18T23:21:56.364305285Z arvados.cwl-runner INFO: Final output collection cf4ca1135b9a6d28f6d8141347be6563+61 "Output of main (2018-08-18T23:21:56.111Z)" (eglyx-4zz18-wpk8eucgv6naecz)
2018-08-18T23:21:57.139935452Z cwltool WARNING: Final process status is permanentFail

Related issues

Related to Arvados - Bug #13931: [CWL] size field not accessible on arvadosResolved08/06/2018


#1 Updated by Joshua Randall over 2 years ago

This problem can be avoided by urlencoding the portion of the keep locator after the PDH.

e.g.`keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6#1/25454_6#1.cram` becomes `keep:3cfbe520dc6af7b368da3651828014d1+19483/25454_6%231%2F25454_6%231.cram`

#2 Updated by Joshua Randall over 2 years ago

Actually the URL encoding causes a problem later on. This looks like it may have been just a YAML quoting issue, or at least, it appears that the problem can be avoided by quoting the keep URL string in the inputs YAML (although note that is should not be necessary to quote a string just because it has a '#' character in it, as long as the '#' character it not separated by whitespace.

#3 Updated by Joshua Randall over 2 years ago

I've checked the `/var/lib/cwl/cwl.input.json` passed into one of these failing runner containers (i.e. the failure is coming from the a-c-r instance that has been submitted using `--submit`).

In it, the record for one of these failing files looks something like this:

"library_cram": {"secondaryFiles": [{"class": "File", "location": "keep:73d210db5daa1ba0050dfec60ca16626+20156/26298_3#1/26298_3#1.cram.crai", "basename": "26298_3"}], "class": "File", "location": "keep:73d210db5daa1ba0050dfec60ca16626+20156/26298_3#1/26298_3#1.cram", "basename": "26298_3"}

Note that the basename field has been truncated at the `#` character rather than being the full filename.

#4 Updated by Joshua Randall over 2 years ago

Because cwltool parses the location as a URL, the `#` characters are being interpreted as delimiting the fragment identifier from the rest of the URL, and so basename is being set to what it thinks is the path component.

If we urlencode just the `#` characters (but not the `/` path separators) then it gets farther but yields an explicit exception from cwltool `Invalid filename: '26298_3#1.cram' contains illegal characters`

It appears that cwltool explicitly disallows any characters in filenames other than:

ACCEPTLIST_EN_STRICT_RE = re.compile(r"^[a-zA-Z0-9._+-]+$")

I did not realize filenames in CWL were so restrictive - have not found reference to this in the spec, perhaps it is just a cwltool thing.

#5 Updated by Joshua Randall over 2 years ago

Ok, I see now in the spec for File (https://www.commonwl.org/v1.0/CommandLineTool.html#File) it does say that `path` cannot have any shell metacharacters or disallowed IDNA characters. Since `path` is set by the implementation, I didn't think I needed to worry about this.

However, on closer reading it seems that there is a chain of rules that means the implementations hands are tied:
- the final `path` component must match the value of `basename
- if not provided, `basename` must be set based on the `location` field by taking the final path component after parsing as an IRI

This implies that there is a rule that the `location` IRIs final path component must not contain any shell metacharacters or disallowed IDNA characters (even if they are escaped!?!).

I guess the issue here is that it would be better if a-c-r would check its keep URIs to see if this condition is satisfied, and if not to suggest to the user that they set `basename` on the input to a safe value.

#6 Updated by Joshua Randall over 2 years ago

I think that having inputs such as this should work:

  - basename: "26216_2_1.cram.crai" 
    class: File
    location: "keep:a929679c8e11147216dba8869ae754c1+17972/26216_2%231/26216_2%231.cram.crai" 
  basename: "26216_2_1.cram" 
  class: File
  location: "keep:a929679c8e11147216dba8869ae754c1+17972/26216_2%231/26216_2%231.cram" 

I.e. the path segments in the keep URL are url escaped and the basename is set so that a nice friendly filename can be used inside the container.

However, this fails with a-c-r with a message like this:

2018-08-23T04:52:20.339088302Z cwltool INFO: [step samtools_fastaref] start
2018-08-23T04:52:20.346879754Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
2018-08-23T04:52:20.380719351Z cwltool ERROR: Unexpected exception
2018-08-23T04:52:20.380719351Z Traceback (most recent call last):
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/cwltool/workflow.py", line 645, in job
2018-08-23T04:52:20.380719351Z     **kwargs):
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/cwltool/command_line_tool.py", line 382, in job
2018-08-23T04:52:20.380719351Z     builder.pathmapper = self.makePathMapper(reffiles, builder.stagedir, **make_path_mapper_kwargs)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/arvtool.py", line 30, in makePathMapper
2018-08-23T04:52:20.380719351Z     **kwargs)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/pathmapper.py", line 53, in __init__
2018-08-23T04:52:20.380719351Z     super(ArvPathMapper, self).__init__(referenced_files, input_basedir, None)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/cwltool/pathmapper.py", line 218, in __init__
2018-08-23T04:52:20.380719351Z     self.setup(dedup(referenced_files), basedir)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/pathmapper.py", line 168, in setup
2018-08-23T04:52:20.380719351Z     self.addentry(srcobj, c, ".", remap)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados_cwl/pathmapper.py", line 105, in addentry
2018-08-23T04:52:20.380719351Z     c.copy(srcpath, path + "/" + obj["basename"], source_collection=src, overwrite=True)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados/arvfile.py", line 472, in must_be_writable_wrapper
2018-08-23T04:52:20.380719351Z     return orig_func(self, *args, **kwargs)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados/arvfile.py", line 273, in synchronized_wrapper
2018-08-23T04:52:20.380719351Z     return orig_func(self, *args, **kwargs)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 902, in copy
2018-08-23T04:52:20.380719351Z     source_obj, target_dir, target_name = self._get_src_target(source, target_path, source_collection, True)
2018-08-23T04:52:20.380719351Z   File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 849, in _get_src_target
2018-08-23T04:52:20.380719351Z     raise IOError(errno.ENOENT, "File not found", source)
2018-08-23T04:52:20.380719351Z IOError: [Errno 2] File not found: '25154_8%231/25154_8%231.cram'
2018-08-23T04:52:20.380766018Z cwltool ERROR: [step samtools_fastaref] Cannot make job: [Errno 2] File not found: '25154_8%231/25154_8%231.cram'

It seems that the keep URL is not unescaped before it looks for the files in keep?

#7 Updated by Joshua Randall over 2 years ago

The problem appears to be in CollectionFsAccess get_collection:

    def get_collection(self, path):
        sp = path.split("/", 1)
        p = sp[0]
        if p.startswith("keep:") and arvados.util.keep_locator_pattern.match(p[5:]):
            pdh = p[5:]
            return (self.collection_cache.get(pdh), sp[1] if len(sp) == 2 else None)
            return (None, path)

I believe both occurrences of `p[5:]` should be replaced with `urllib.unquote(p[5:])`

#9 Updated by Tom Clegg over 2 years ago

  • Related to Bug #13931: [CWL] size field not accessible on arvados added

#10 Updated by Joshua Randall over 2 years ago

  • Status changed from New to Resolved

I can confirm that fc582bfe0561092165f45b7b6f4d0025892aebfd fixes this issue. Keep URI path segments in CWL inputs still need to be URL encoded, and basename must be specified that does not contain disallowed characters.

Also available in: Atom PDF