Story #9397

[Crunch2] Support prepopulating the output directory - CWL InitialWorkDirRequirement

Added by Brett Smith about 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/31/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

This is needed to support the CWL InitialWorkDirRequirement which is needed for full CWL compliance Crunch1 feature parity.

Minimum required functionality

When a container's output_path is a tmp mount backed by local disk, this output directory can be pre-populated with content from existing collections.
  • Initial content is specified in the container request by mounting collections at mount points that are subdirectories of output_path.
  • Mount points underneath output_path must not have "writable":true -- if any of them do, the API refuses to create/update the container request, and (just in case the API does not catch this problem) crunch-run fails the container.
  • When the container starts, these existing collections and files are readable at the specified mount points.
  • When the container finishes, the mounted collections/files are included in the output collection at the specified mount points. IOW, the container's output is equal to what the container sees in output_path just before it exits. Except: If a mount has "exclude_from_output":true then it is omitted from the container's output collection.
  • If a process in the container tries to modify, remove, or rename these mount points or anything underneath them, the operation fails and the container output is unaffected (as are the underlying collections used to pre-populate).

Implementation

  1. In crunchrun.go SetupMounts(), sort the keys in "runner.Container.Mounts" such that parents are processed before children (e.g., alphanumerically or by length).
  2. The inconsistency between the crunch-run and the spec noted in #note-10 needs to be fixed to follow the spec
  3. In crunchrun.go CaptureOutput(), after getting manifestText, go through runner.Container.Mounts and search for mount points beginning with runner.Container.OutputPath that do not have "exclude_from_output":true.
  4. For each such file and directory, add the relevant manifest fragment to the container output manifest, modifying stream/file names as needed.

The last one may be the most complicated part of the ticket just due to the fact that there is much less infrastructure for manipulating collections in Go compared to the Python SDK.


Subtasks

Task #11018: Add support for InitialWorkDirRequirement to arvados-cwl-runnerResolvedPeter Amstutz

Task #11051: Review 9397-cwl-initialworkdir-crunchv2ResolvedPeter Amstutz

Task #11028: Update documentationResolvedRadhika Chippada

Task #10817: Review 9397-prepopulate-output-directory-paResolvedRadhika Chippada

Task #11072: Review 9397-go-manifestResolvedTom Clegg


Related issues

Related to Arvados - Bug #9674: [CWL] InitialWorkDirRequirement not working as expectedResolved07/27/2016

Related to Arvados - Story #7582: [CWL] binary run-command shim for CWLResolved10/16/2015

Associated revisions

Revision 80689aac
Added by Peter Amstutz over 2 years ago

Merge branch '9397-go-manifest' refs #9397

Revision 2a469c48
Added by Peter Amstutz over 2 years ago

Merge branch '9397-prepopulate-output-directory' refs #9397

Revision 5763c671
Added by Peter Amstutz over 2 years ago

Merge branch '9397-cwl-initialworkdir-crunchv2' closes #9397

History

#1 Updated by Brett Smith about 3 years ago

  • Description updated (diff)

#2 Updated by Tom Morris almost 3 years ago

  • Story points set to 2.0

Use case is to support modify in place and output to the same directory as input e.g. samtools index.

Mount has a collection which crunchrun uses the files from to either copy or make symlinks to in the output directory based on a writable flag. Extend documentation to cover new feature. Mount temp gains new fields with portable data hash for collection and boolean write-required flag.

#3 Updated by Tom Morris almost 3 years ago

  • Target version set to Arvados Future Sprints

#4 Updated by Tom Morris over 2 years ago

  • Subject changed from [Crunch2] Support prepopulating the output directory to [Crunch2] Support prepopulating the output directory - CWL InitialWorkDirRequirement
  • Description updated (diff)

#5 Updated by Tom Morris over 2 years ago

  • Description updated (diff)

Much more important than "full CWL compliance" is Crunch1 feature parity. This is a feature which is implemented in Crunch1 and is missing from Crunch2.

#6 Updated by Peter Amstutz over 2 years ago

Proposed extension to "tmp" mounts. Support a "contents" dict which lists files or directories to be added to the "tmp" mount. Create symlinks by default, unless the name begins with "*" in which case it should be copied instead.

"mounts": {
  "/tmp/outputdir": {
    "kind": "tmp",
    "contents": {
      "foo": "/keep/abc+123/foo",
      "bar/baz" "/keep/abc+123/foo",
      "quux": "*/keep/abc+123/foo",
    }
}

Will result in:

/tmp/outputdir/foo     (symlink to /keep/abc+123/foo, readonly)
/tmp/outputdir/bar     (regular directory, writable)
/tmp/outputdir/bar/baz (symlink to /keep/abc+123/foo, readonly)
/tmp/outputdir/quux    (normal file copied from /keep/abc+123/foo, writable)

When constructing the output collection, identify symlinks which point to keep and copy them efficiently via copying block locators from source manifest text.

#7 Updated by Peter Amstutz over 2 years ago

The most minimal version of this to get to crunch1 parity is to not support copies and not try to identify symlinks, so it uploads the entire output directory, even the files already in keep.

#8 Updated by Tom Clegg over 2 years ago

I think "quux": "*/keep/abc+123/foo" is too surprising/unguessable as a way to express "copy instead of symlink". I would prefer to do something more structured.

Two lines of thought:

1. "Simple" and "extended" versions:

"mounts": {
  "/tmp/outputdir": {
    "kind": "tmp",
    "contents": {
      "foo": "/keep/abc+123/foo",
      "bar/baz" "/keep/abc+123/foo",
      "quux": ["symlink", "/keep/abc+123/foo"],
    }
}

2. Separate "symlink" mount type:

"mounts": {
  "/tmp/outputdir": {
    "kind": "tmp" 
  },
  "/tmp/outputdir/foo": {
    "kind": "symlink",
    "target": "/keep/abc+123/foo" 
  },
  "/tmp/outputdir/bar/baz": {
    "kind": "symlink",
    "target": "/keep/abc+123/foo" 
  },
  "/tmp/outputdir/quux": {
    "kind": "collection_file",
    "portable_data_hash": "abc+123",
    "path": "/foo" 
  }
}

#9 Updated by Peter Amstutz over 2 years ago

"mounts": {
  "/tmp/outputdir": {
    "kind": "tmp" 
  },
  "/tmp/outputdir/foo": {
    "kind": "collection",
    "portable_data_hash": "abc+123/foo" 
  },
  "/tmp/outputdir/bar/baz": {
    "kind": "collection",
    "portable_data_hash": "abc+123/foo" 
  },
  "/tmp/outputdir/quux": {
    "kind": "collection",
    "portable_data_hash": "abc+123/foo" 
    "writable": true
  }
}

Keep references like this can be added as regular bind mounts.

When collecting the output, crunch-run scans read-only mounts under the output directory and efficiently adds them to the output collection.

Mounts marked as "writable: true" should be copied from keep into the output directory as a regular file. Since they are regular files, they will be uploaded as normal output.

#10 Updated by Peter Amstutz over 2 years ago

I found a discrepancy between crunch-run and the spec. The spec says to do this:

  "/tmp/outputdir/foo": {
    "kind": "collection",
    "portable_data_hash": "abc+123" 
    "path": "/foo" 
  }

However crunch-run is expecting this:

  "/tmp/outputdir/foo": {
    "kind": "collection",
    "portable_data_hash": "abc+123/foo" 
  }

#11 Updated by Tom Clegg over 2 years ago

I don't think we should overload "writeable":true to also mean "include in output if target is inside output mount"... Better if it always only means "writing to this path will change the underlying collection" -- fundamentally incompatible with "portable_data_hash".

"portable_data_hash":"x","path":"/y" seems better to me than "portable_data_hash":"x/y" (in which case the spec is fine and we should fix the software).

Containers API seems undecided about whether a single file from a collection is "kind":"collection" or "kind":"collection_file". I think sticking with "collection" is fine. It's hard to tell from a distance whether "/foo" is a file or a directory, and it's common enough for tools to handle both equally well, so it makes some sense to specify both the same way. It also means the only special syntactic case to explain is "no path given = entire collection".

#12 Updated by Tom Clegg over 2 years ago

Consider mounting collection A at /outputdir/foo, where collection A has two files "baz" and "qux", and running a command "rm /outputdir/foo/baz".

There are several possible outcomes, any of which might be what the requester wants:
  1. Output consists of just "/foo/qux". Also, the "baz" file is removed from collection A.
  2. Output consists of just "/foo/qux". Collection A is unaltered.
  3. Output has "/foo/baz" and "/foo/qux". The "rm" command fails because collection A is mounted readonly.
  4. Output has no "/foo" at all.
Also consider mounting a file "bar" from collection B at /outputdir/bar, and running a command that modifies /outputdir/bar. Possible outcomes:
  1. Output has modified data. Also, collection B itself is modified.
  2. Output has modified data. Collection B is unaltered.
  3. Output has unmodified data. Modification operation fails.
  4. Output has no "/bar" at all. Modification operation fails.
  5. Output has no "/bar" at all. Modification operation succeeds.
Also consider mounting a file "waz" from collection C at /outputdir/waz, and running a command like "mv waz waz2 && head waz2 > waz". Possible outcomes:
  1. Output has the truncated file "/waz".
  2. Output has no "/waz".
  3. Output has the original "/waz". "mv" fails.
Also consider mounting a file "quux" from collection D at /outputdir/quux, and running "rm quux". Possible outcomes:
  1. Output has "/quux".
  2. Output has no "/quux".
  3. Output has "/quux". "rm" fails.

The "writable" flag has already been assigned to mean "do collections A and B get altered?". (Repeating my comment above for convenience) I don't think we should overload it to mean other things.

If we restrict ourselves to defining the writable=false cases for now, a lot of the complications/questions don't have to be sorted out.

It seems like the main questions left are
  • How does a container specify whether the output should include the unmodified "/bar" file?
  • Is the "waz" example supported, or does the requester have to mount at "waz2" instead if it wants to do this?
  • Is the "quux" example supported, or does the requester have to use "cp" in order to decide at runtime whether to include quux?

Suggestion: an "exclude" flag that can be used on mounts underneath output_path (the default should be "include" because the most obvious behavior is "output of container == final contents of output dir as seen from inside the container").

"mounts":{
  "/outputdir":{...},
  "/outputdir/bar":{
    ...,
    "writable":false,
    "exclude_from_output":false
  }
  "/outputdir/quux":{
    ...,
    "writable":false,
    "exclude_from_output":true
  }
}

#13 Updated by Tom Morris over 2 years ago

  • Assigned To set to Peter Amstutz
  • Target version changed from Arvados Future Sprints to 2017-01-18 sprint

#14 Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)

#15 Updated by Tom Clegg over 2 years ago

As an alternative to teaching docker how to mount things the way we want, we might be able to
  • make use of (and possibly improve) arv-mount's ability to mount collections at specified paths
  • expose a mechanism in arv-mount for efficiently copying data between collections (e.g., "copy entire collection A into a subdirectory of writable collection B") using the magic .arvados#collection file or something similar.

#16 Updated by Peter Amstutz over 2 years ago

We're using staged output in production, so a solution that only works for arv-mount output isn't immediately useful.

If you attempt to use "mv" or "rm" on a file that is bind mounted into a container, it fails with "Device or resource busy". This means the "waz" and "quux" cases are not supported.

An apparent implementation detail of how docker handles volume mounts, if you mount a directory from the host file system, and then mount a file to be inside that directory, it will create a zero length stand-in file from the perspective of the host system but is linked to the source file inside the container. Writes to the file inside the container will affect the source file, not the stand-in file. The stand-in file remains after the container is terminated.

#17 Updated by Tom Clegg over 2 years ago

  • Description updated (diff)

#18 Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2017-01-18 sprint to 2017-02-01 sprint

#19 Updated by Peter Amstutz over 2 years ago

  • Assigned To changed from Peter Amstutz to Radhika Chippada

#20 Updated by Radhika Chippada over 2 years ago

1. Is the kind going to be collection always for these mount points underneath the output_dir?

2. What would be the metadata fragment to be added to container output manifest text for the following?

  "/tmp/outputdir/foo": {
    "kind": "collection",
    "portable_data_hash": "887cd41e9c613463eab2f0d885c6dd96+83" 
    "path": "/foo” 
  }

The manifest_text for that pdh being:

  "manifest_text": ". c2251981055c321c8449d1ab6ff14a3d+34+Abb8d657b743e44acc100b109373d876c41aef683@5898b8df 0:12:alice.txt 12:10:bob.txt 22:12:carol.txt\n",

Should this result in adding the following to the manifest text?

  "manifest_text": "... /tmp/outputdir/foo c2251981055c321c8449d1ab6ff14a3d+34+Abb8d657b743e44acc100b109373d876c41aef683@5898b8df 0:12:alice.txt 12:10:bob.txt 22:12:carol.txt\n",

3. I am assuming the following needs to be done: for each of the mount point underneath the output_dir
  • if exclude_from_output != true
    • get the collection for it's portable_data_hash
    • add the manifest text from the collection to the container output manifest

4. What exactly is "path": "/foo” here? That is, would there be a need to deal with "portions" or "subdirs" in the pdh? For example,

  "/tmp/outputdir/foo": {
    "kind": "collection",
    "portable_data_hash": "cdfbe2e823222d26483d52e5089d553c+175" 
    "path": "/carol” 
  }

Here the manifest text for cdfbe2e823222d26483d52e5089d553c+175 is:

  "manifest_text": "./alice 03032680d3fa0561ef4f85071140861e+13+A6c10ae2b309e6acfb51d4b991c5cfe8d8b42ad9e@5898c718 0:13:hello.txt\n./bob d820b9df970e1b498e7723c50b107e1b+11+A365a027194381147b586222de074ce942c651aa0@5898c718 0:11:hello.txt\n./carol cf72b172ff969250ae14a893a6745440+13+A923f92a1c5d3e4b2fd02ada088e13b60dae1f218@5898c718 0:13:hello.txt\n",

Can the user configure to add only the "carol" subdir fragment to container output?

#21 Updated by Tom Clegg over 2 years ago

Radhika Chippada wrote:

1. Is the kind going to be collection always for these mount points underneath the output_dir?

Either collection or collection_file. Yes. (We can handle other kinds in the future, but for now, a container request should fail validation if it tries to mount other kinds under outputdir.)

2. What would be the metadata fragment to be added to container output manifest text for the following?

The "path" entry in the mount spec refers to a path in the mounted collection itself, so "foo" would be ENOENT in that case. But here is a modification that should work:

"/tmp/outputdir/foo": {
  "kind": "collection",
  "portable_data_hash": "887cd41e9c613463eab2f0d885c6dd96+83" 
}

+

(887cd)
"manifest_text": ". c2251981055c321c8449d1ab6ff14a3d+34+Abb8d657b743e44acc100b109373d876c41aef683@5898b8df 0:12:alice.txt 12:10:bob.txt 22:12:carol.txt\n",

=

(output)
"manifest_text": "./foo c2251981055c321c8449d1ab6ff14a3d+34+Abb8d657b743e44acc100b109373d876c41aef683@5898b8df 0:12:alice.txt 12:10:bob.txt 22:12:carol.txt\n",
3. I am assuming the following needs to be done: for each of the mount point underneath the output_dir
  • if exclude_from_output != true
    • get the collection for it's portable_data_hash
    • add the manifest text from the collection to the container output manifest

...with the appropriate modifications to change paths, like "/foo" in the above example, yes.

4. What exactly is "path": "/foo” here? That is, would there be a need to deal with "portions" or "subdirs" in the pdh?
...
Can the user configure to add only the "carol" subdir fragment to container output?

Yes.

"/tmp/outputdir/foo/CAROL": {
  "kind": "collection",
  "portable_data_hash": "887cd41e9c613463eab2f0d885c6dd96+83",
  "path": "/carol.txt" 
}

+

(887cd)
"manifest_text": ". c2251981055c321c8449d1ab6ff14a3d+34+Abb8d657b743e44acc100b109373d876c41aef683@5898b8df 0:12:alice.txt 12:10:bob.txt 22:12:carol.txt\n",

=

(output)
"manifest_text": "./foo c2251981055c321c8449d1ab6ff14a3d+34+Abb8d657b743e44acc100b109373d876c41aef683@5898b8df 22:12:CAROL\n",

#22 Updated by Radhika Chippada over 2 years ago

  • Status changed from New to In Progress

#23 Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to New

Notes:

crunchrun.go:670

You can add "/" to the end of OutputPath, eliminates the need for the check on line 677

bindSuffix := strings.TrimPrefix(bind, runner.Container.OutputPath+"/")

Line 681

This is wrong. The "exclude_from_output" flag should be part of the "Mount" struct (defined in sdk/go/arvados/container.go:21), not found in the "Content" field.

        jsondata, err := json.Marshal(mnt.Content)
        if err != nil {
            return fmt.Errorf("While marshal of mount content: %v", err)
        }
        var content map[string]interface{}
        err = json.Unmarshal(jsondata, &content)
        if err != nil {
            return fmt.Errorf("While unmarshal of mount content: %v", err)
        }

        if content["exclude_from_output"] == true {
            continue
        }

The function getCollectionManifestForPath should be refactored to use the existing manifest manipulation functions in sdk/go/manifest/manifest.go. The functionality of extracting manifest text for a single file should be refactored to the SDK.

Also, both of these are legal manifests, although only the first one is considered "normalized form":

./subdir1 3e426d509afffb85e06c4c96a7c15e91+27+Aa124ac75e5168396cabcdefghij6419876543234@569fa8c4 0:9:file1_in_subdir1.txt
. 3e426d509afffb85e06c4c96a7c15e91+27+Aa124ac75e5168396cabcdefghij6419876543234@569fa8c4 0:9:subdir1/file1_in_subdir1.txt

Tests of the merge code need both cases.

We'll need to do integration testing to ensure that the resulting container actually functions as intended, but we can do that in the 2nd task (a-c-r support).

#24 Updated by Radhika Chippada over 2 years ago

  • Target version changed from 2017-02-01 sprint to 2017-02-15 sprint

#25 Updated by Radhika Chippada over 2 years ago

  • Story points changed from 2.0 to 1.0

#26 Updated by Radhika Chippada over 2 years ago

You can add "/" to the end of OutputPath, eliminates the need for the check on line 677

Updated

Line 681 -- This is wrong. The "exclude_from_output" flag should be part of the "Mount" struct (defined in sdk/go/arvados/container.go:21), not found in the "Content" field.

Corrected this

The function getCollectionManifestForPath should be refactored to use the existing manifest manipulation functions in sdk/go/manifest/manifest.go. The functionality of extracting manifest text for a single file should be refactored to the SDK.

Added manifest normalization to manifest.go and used it

Also, both of these are legal manifests, although only the first one is considered "normalized form": "./subdir1 3e426d509afffb85e06c4c96a7c15e91+27+Aa124ac75e5168396cabcdefghij6419876543234@569fa8c4 0:9:file1_in_subdir1.txt" and ". 3e426d509afffb85e06c4c96a7c15e91+27+Aa124ac75e5168396cabcdefghij6419876543234@569fa8c4 0:9:subdir1/file1_in_subdir1.txt"

Added additional logic to expect denormalized form as well and added test. This may not be what you had in mind. If so, let's discuss and probably do it as a separate ticket in the interest of supporting the happy path (normalized form) asap.

#27 Updated by Radhika Chippada over 2 years ago

  • Status changed from New to In Progress

#28 Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to New

Okay, I think this will unblock you:

  1. Merge 9397-go-manifest (it is a separate branch, I'm going to ask Tom for a review)
  2. In crunchrun.go getCollectionManifestForPath(),
    manifestText := manifest.NormalizedManifestForPath(mnt.Path)
    

    Becomes
    manifestText := manifest.ManifestForPath(mnt.Path, bindSuffix)
    

    and you can delete the rest of the function.
  3. With the combined manifestText, before calling collection create, call manifest.NormalizeManifest()

#29 Updated by Peter Amstutz over 2 years ago

arvados-cwl-runner support in 9397-cwl-initialworkdir-crunchv2

#30 Updated by Tom Clegg over 2 years ago

9397-go-manifest @ ab6a70e86dd041f3b4da167c59e3e91309f14365

Functional
  • Is this a panic (i.e., could only happen if the code is buggy)? It looks like a corner to hide subtle bugs in.
    • +               // Binary search to determine first block in the stream
      +               i := FirstBlock(s.BlockOffsets, wantPos)
      +               if i == -1 {
      +                       // error
      +                       break
      +               }
      
  • ditto "shouldn't happen" comment a few lines down
  • in arvadostest.PathologicalManifest test case, seems like 0:0:zero@9 would be more "normalized" than 9:0:zero@9
  • should test some stuff like ManifestForPath("foo/.//bar/../baz/", "/waz/")
Code
  • avoid stutter in method names, e.g., (*Manifest)SegmentedManifest()
  • should use https://golang.org/pkg/path/#Clean and Dir and Base instead of custom SplitPath() and leading-"./" and trailing-slash checks.
  • StreamIter() always reads from manifests, so it always returns "." or "./foo", never "foo" -- so SegmentManifest() shouldn't need to check this, right?
  • there's growing confusion in the identifiers here about what a "manifest" is. Is it a lump of text, or a struct? Might help to say "Text" or "ManifestText" when referring to the lump of text.
  • (*SegmentedStream)NormalizeStream() could lose some (*stream)[] stuff if it didn't use a pointer receiver
  • instead of putting i+=1 at the bottom of the loop in sendFileSegmentIterByName, suggest "for ; i < len(s.Blocks); i++ {"
  • (*Manifest)NormalizeManifest() seems to amount to m.SegmentManifest().ManifestForPath(".", ".") -- if so, it should just do that instead of copying the code.

Many new exported functions and types here -- seems like the public API could be just (*Manifest)NormalizedText(path, relocate string) and everything else should be private?

#31 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

9397-go-manifest @ ab6a70e86dd041f3b4da167c59e3e91309f14365

Functional
  • Is this a panic (i.e., could only happen if the code is buggy)? It looks like a corner to hide subtle bugs in.
    • [...]

It can happen if wantPos refers to a location past the end of the stream. sendFileSegmentIterByName doesn't currently have a way to signal errors, and a panic would only kill the goroutine (and possibly hang the program if the channel isn't closed).

  • ditto "shouldn't happen" comment a few lines down

Yes, that would only happen on a legitimate bug, so I debated removing the check entirely. If we keep the check, the same considerations apply.

  • in arvadostest.PathologicalManifest test case, seems like 0:0:zero@9 would be more "normalized" than 9:0:zero@9

I had a similar thought. I the current behavior is that zero-length files reference the zero length block, which (in this case) appears at stream offset 9. I agree it would be better normalization to put them at the beginning of the stream and filter out the zero-length block (except in the special case of a stream consisting only of zero-length files). I will compare the behavior with the Python SDK; the most important thing is that they behave the same way.

In sendFileSegmentIterByName is there any actual benefit in sending the empty block?

        if wantLen == 0 {
            ch <- &FileSegment{Locator: "d41d8cd98f00b204e9800998ecf8427e+0", Offset: 0, Len: 0}
            continue
        }

  • should test some stuff like ManifestForPath("foo/.//bar/../baz/", "/waz/")

Sure.

Code
  • avoid stutter in method names, e.g., (*Manifest)SegmentedManifest()

Are you suggesting it should just be called Segment() ?

Sure.

  • StreamIter() always reads from manifests, so it always returns "." or "./foo", never "foo" -- so SegmentManifest() shouldn't need to check this, right?

I see that parseManifestStream() exits early for streams with badly formed names, however it looks like StreamIter() doesn't check m.Err and still returns them (!?)

  • there's growing confusion in the identifiers here about what a "manifest" is. Is it a lump of text, or a struct? Might help to say "Text" or "ManifestText" when referring to the lump of text.

It is sort of tricky. I'm avoiding using "Collection" because that implies an Arvados database record.

  • (*SegmentedStream)NormalizeStream() could lose some (*stream)[] stuff if it didn't use a pointer receiver

Sounds good.

  • instead of putting i+=1 at the bottom of the loop in sendFileSegmentIterByName, suggest "for ; i < len(s.Blocks); i++ {"

Will do.

  • (*Manifest)NormalizeManifest() seems to amount to m.SegmentManifest().ManifestForPath(".", ".") -- if so, it should just do that instead of copying the code.

Good point.

Many new exported functions and types here -- seems like the public API could be just (*Manifest)NormalizedText(path, relocate string) and everything else should be private?

I think we want to expose SegmentedManifest so the caller can avoid re-parsing the manifest for every operation. Unless you think we should hide it as an implementation optimization inside the Manifest struct.

The FirstBlock function could be used to get the starting file segment for random-access file operations.

#32 Updated by Peter Amstutz over 2 years ago

On further consultation, the Python SDK does turn it into 0:0:zero@9 so that does count as a Go SDK bug. Either sendFileSegmentIterByName shouldn't return zero-length file segments, or SegmentManifest() should filter them out.

#33 Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

It can happen if wantPos refers to a location past the end of the stream. sendFileSegmentIterByName doesn't currently have a way to signal errors, and a panic would only kill the goroutine (and possibly hang the program if the channel isn't closed).

Panic kills the whole program, not just the goroutine (assuming no recover() of course).

Let's pick one:
  1. it's possible for a caller to request something we can't do and therefore we need a way to return an error to the caller
  2. this condition indicates our code is internally confused and is crashing so we should panic()

If there's an edge case where it's expected to happen, can we check for that case explicitly before calling FirstBlock()?

  • ditto "shouldn't happen" comment a few lines down

Yes, that would only happen on a legitimate bug, so I debated removing the check entirely. If we keep the check, the same considerations apply.

I say keep the check and call panic() instead of carrying on with undefined behavior. Risk of brokenness is equal, but panic makes the brokenness much easier to track down.

  • in arvadostest.PathologicalManifest test case, seems like 0:0:zero@9 would be more "normalized" than 9:0:zero@9

I had a similar thought. I the current behavior is that zero-length files reference the zero length block, which (in this case) appears at stream offset 9. I agree it would be better normalization to put them at the beginning of the stream and filter out the zero-length block (except in the special case of a stream consisting only of zero-length files). I will compare the behavior with the Python SDK; the most important thing is that they behave the same way.

We should probably achieve that by fixing whichever one is wrong, even if it's Python. :)

(Does anything actually depend on both SDKs normalizing the same way? Having the same data+filenames doesn't guarantee having the same block hashes so normalization can only go so far anyway.)

In sendFileSegmentIterByName is there any actual benefit in sending the empty block?

I think that's the only way for a caller to tell the difference between "zero byte file" and "file not found".

FileSegmentIterByName seems to be a fundamentally awkward API. :\

  • avoid stutter in method names, e.g., (*Manifest)SegmentedManifest()

Are you suggesting it should just be called Segment() ?

Sorry, bad example, this is probably more of an example of "manifest text" vs. "manifest" vs. "collection" confusion.

But (*SegmentedStream)NormalizeStream() and (*Manifest)NormalizeManifest() ...

  • StreamIter() always reads from manifests, so it always returns "." or "./foo", never "foo" -- so SegmentManifest() shouldn't need to check this, right?

I see that parseManifestStream() exits early for streams with badly formed names, however it looks like StreamIter() doesn't check m.Err and still returns them (!?)

Yeah, StreamIter has no other way to report errors, so it's up to whoever consumes StreamIter()'s channel to check Err on each ManifestStream. BlockIterWithDuplicates() does this. :\ again.

  • there's growing confusion in the identifiers here about what a "manifest" is. Is it a lump of text, or a struct? Might help to say "Text" or "ManifestText" when referring to the lump of text.

It is sort of tricky. I'm avoiding using "Collection" because that implies an Arvados database record.

Hm. I would say the database record is just one of the many internal representations of a collection -- a collection is a set of files.

So far, to the extent we're paying attention, "collection" is a set of files, which can be represented various different ways, while "manifest" is a hunk of text with a specific format that can be passed to/from the API server in API requests/responses etc.

Many new exported functions and types here -- seems like the public API could be just (*Manifest)NormalizedText(path, relocate string) and everything else should be private?

I think we want to expose SegmentedManifest so the caller can avoid re-parsing the manifest for every operation. Unless you think we should hide it as an implementation optimization inside the Manifest struct.

I'd like to minimize the public API. If we're going to expose new public APIs we should make good ones. I'm not convinced this is a useful level of abstraction to provide in the long term. Public API should probably be more like http.Filesystem and all this manifest and locator stuff should be hidden so nobody has to care too much when we make "manifest v2 JSON" etc.

The FirstBlock function could be used to get the starting file segment for random-access file operations.

Could be, but isn't. Public API is a commitment. If it's private we don't have to bikeshed/document/maintain it and we don't care whether it works outside the context we're using it in.

#34 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

If there's an edge case where it's expected to happen, can we check for that case explicitly before calling FirstBlock()?

I say keep the check and call panic() instead of carrying on with undefined behavior. Risk of brokenness is equal, but panic makes the brokenness much easier to track down.

File segments are now checked in parseManifestStream, will return an invalid stream if a file segment extends past the end of the stream. sendFileSegmentIterByName panics if a file segment position is invalid/comes before current block.

  • in arvadostest.PathologicalManifest test case, seems like 0:0:zero@9 would be more "normalized" than 9:0:zero@9

Fixed by filtering out zero-length blocks.

We should probably achieve that by fixing whichever one is wrong, even if it's Python. :)

However, in a related issue, I noticed the Python SDK does have an issue where the escaped character "\141" in the manifest, which is normal printable ASCII 'a', does not get un-escaped (however the Go SDK does the right thing).

Yeah, StreamIter has no other way to report errors, so it's up to whoever consumes StreamIter()'s channel to check Err on each ManifestStream. BlockIterWithDuplicates() does this. :\ again.

segment() now skips streams with errors.

  • there's growing confusion in the identifiers here about what a "manifest" is. Is it a lump of text, or a struct? Might help to say "Text" or "ManifestText" when referring to the lump of text.

NormalizeManifest -> NormalizedText

I'd like to minimize the public API. If we're going to expose new public APIs we should make good ones. I'm not convinced this is a useful level of abstraction to provide in the long term. Public API should probably be more like http.Filesystem and all this manifest and locator stuff should be hidden so nobody has to care too much when we make "manifest v2 JSON" etc.

Ok, I NormalizedText() and ManifestTextForPath() are now the only new public functions.

Updated 9397-go-manifest @ 2b78340

#35 Updated by Tom Clegg over 2 years ago

ManifestTextForPath does two things at once, one of which we already have a separate function for. How about:

func (*Manifest) Extract(src, dst string) *Manifest {

Then callers can do

m = m.Extract("foo", "bar")
log.Print(m.Extract("", "baz").NormalizedText())

or even

log.Print(m.Extract("foo", "baz/bar").Text)

I think this would be an easy change, and it would make it possible to fix the "ignored error" bug ("segment() now skips streams with errors"), at least for callers who are conscientious enough to check m.Err after doing an Extract.

m = m.Extract("foo", "bar")
if m.Err != nil {
        log.Fatal(m.Err)
}
log.Print(m.Text)

BlockOffsets shouldn't be exported.

Can you add a NormalizedText test that exercises this bug before fixing it?

firstBlock([]uint64{1,2,3,4}, 3)
=> -1
firstBlock([]uint64{1,2,3,4,5,6}, 4)
=> -1

#36 Updated by Tom Clegg over 2 years ago

If we have Extract() then we don't especially need NormalizedText() at all -- m.Extract(...).Text would give us what we care about for this story, right?

#37 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

I think this would be an easy change, and it would make it possible to fix the "ignored error" bug ("segment() now skips streams with errors"), at least for callers who are conscientious enough to check m.Err after doing an Extract.

Refactored into Extract() method, updated tests accordingly, including tests that manifest errors are reported.

BlockOffsets shouldn't be exported.

Fixed.

Can you add a NormalizedText test that exercises this bug before fixing it?

That one is embarrassing. I didn't put the curly braces in the right place when porting the function from Python (I got confused by indentation when cutting and pasting.) Added a test for firstBlock() and fixed it.

Now @ 4c081dd00f65f1e5a8e0cea34276d60ecbb49f40

#38 Updated by Tom Clegg over 2 years ago

d95e6df LGTM thanks!

#39 Updated by Peter Amstutz over 2 years ago

  • Assigned To changed from Radhika Chippada to Peter Amstutz

#40 Updated by Peter Amstutz over 2 years ago

  • Status changed from New to In Progress

#41 Updated by Radhika Chippada over 2 years ago

Peter: branch 9397-prepopulate-output-directory-pa LGTM. Please make sure to review and make any amendments to the documentation updates I made before merging into master. Thanks.

#42 Updated by Radhika Chippada over 2 years ago

Branch 9397-cwl-initialworkdir-crunchv2 @ 8c286d1

  • In test_container.py, should the path be “bar”, not “foo” in the following?
    +            "hints": [{
    +                "class": "InitialWorkDirRequirement",
    +                "listing": [{
    +                    "class": "File",
    +                    "basename": "foo",
    +                    "location": "keep:99999999999999999999999999999995+99/bar" 
    +                },
    
+                    '/var/spool/cwl/foo': {
+                        'kind': 'collection',
+                        'path': 'foo',
+                        'portable_data_hash': '99999999999999999999999999999996+99'
+                    },
  • Would it make sense to also include paths with subdirectories in the test? For example, "location": "keep:99999999999999999999999999999995+99/bar/filename“ and “location": "keep:99999999999999999999999999999997+99/subdir” ?
+            "hints": [{
+                "class": "InitialWorkDirRequirement",
+                "listing": [{
+                    "class": "File",
+                    "basename": "foo",
+                    "location": "keep:99999999999999999999999999999995+99/bar" 
+                },
+                {
+                    "class": "Directory",
+                    "basename": "foo2",
+                    "location": "keep:99999999999999999999999999999997+99" 
+                }]
+            }],

#43 Updated by Peter Amstutz over 2 years ago

Radhika Chippada wrote:

Branch 9397-cwl-initialworkdir-crunchv2 @ 8c286d1

  • In test_container.py, should the path be “bar”, not “foo” in the following?

That's on purpose. If "basename" is different from the name at the end of the "location" URI, then the file appears under "basename". This is testing that behavior.

  • Would it make sense to also include paths with subdirectories in the test? For example, "location": "keep:99999999999999999999999999999995+99/bar/filename“ and “location": "keep:99999999999999999999999999999997+99/subdir” ?

Added tests for subdirs. Thanks.

I don't think so. This is a standard CWL feature that already existed for crunch v1, and the Arvados specific documentation doesn't say anything about it.

#44 Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:5763c67176e8e34656cd96881074777b14b2dc4a.

Also available in: Atom PDF