Bug #14770

Registered workflows fail to run due to spurious --collection-keep-cache=256

Added by Tom Morris 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/30/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

Registering a workflow using arvados-cwl-runner --create-workflow and then attempting to run it results in:

arvados-cwl-runner: error: unrecognized arguments: --collection-keep-cache=256

I can't find any indication that arvados-cwl-runner ever supported such a command switch. The only mention in the code base appears to be in `arvados.apps/workbench/app/controllers/work_units_controller.rb` introduced in eb58fd94 on 2018-11-27:

       attrs['command'] = ["arvados-cwl-runner",
                         "--local",
                          "--api=containers",
                           "--project-uuid=#{params['work_unit']['owner_uuid']}",
                           "--collection-keep-cache=#{keep_cache}",
                           "/var/lib/cwl/workflow.json#main",
                           "/var/lib/cwl/cwl.input.json"]

This works in 1.2.0 (3c85ce36c1188474b6b94ad534daf51cb1497274)
but fails with 1.3.0 (fdf6033451b12862a8305f4e3a4b06fbbdd498e1)


Subtasks

Task #14790: Review 14770-wb-acr-bad-paramResolvedPeter Amstutz

Associated revisions

Revision c5a89fc4
Added by Lucas Di Pentima 7 months ago

Merge branch '14770-wb-acr-bad-param'
Closes #14770

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Tom Morris 7 months ago

  • Target version set to 2019-02-13 Sprint

#2 Updated by Lucas Di Pentima 7 months ago

  • Status changed from New to In Progress
  • Assigned To set to Lucas Di Pentima

#3 Updated by Lucas Di Pentima 7 months ago

It seems that the correct param should be --collection-cache-size

#4 Updated by Tom Morris 7 months ago

  • Target version changed from 2019-02-13 Sprint to 2019-01-30 Sprint

Pulling onto this sprint since Lucas has already started working on it.

#5 Updated by Lucas Di Pentima 7 months ago

Update at d162b10b0 - branch 14770-wb-acr-bad-param
Test run: https://ci.curoverse.com/job/developer-run-tests/1039/

Fixed the param passed to ACR from workbench.

#6 Updated by Tom Morris 7 months ago

  • Release set to 21

#7 Updated by Lucas Di Pentima 7 months ago

  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint

#8 Updated by Peter Amstutz 7 months ago

Did you test this manually?

#9 Updated by Lucas Di Pentima 7 months ago

I did not because it was pretty clear to me what is the problem.
Nevertheless I've just tried it on arvbox and it failed.

Used the example at https://www.commonwl.org/user_guide/02-1st-example/, submitted the workflow executing:

arvados-cwl-runner --create-workflow --api containers 1st-tool.cwl echo-job.yml

...then went to workbench, and tried to run the created workflow and got the same failure. Here's the entire log:

2019-01-30T17:57:59.005019513Z crunch-run dev started
2019-01-30T17:57:59.005053232Z Executing container '1q71a-dz642-8bepebacm6m21dw'
2019-01-30T17:57:59.005093889Z Executing on host 'e72df3831fbd'
2019-01-30T17:57:59.117924767Z Fetching Docker image from collection '99dc77297fbd39babf7587efd0d71f36+345'
2019-01-30T17:57:59.147511628Z Using Docker image id 'sha256:ee4130288b739498d319ac83dd62a714296199f6b7d4383354161fb6336e6a46'
2019-01-30T17:57:59.278644642Z Docker image is available
2019-01-30T17:57:59.299341161Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id /tmp/crunch-run.1q71a-dz642-8bepebacm6m21dw.040935147/keep976348494]
2019-01-30T17:57:59.118078358Z notice: reading stats from /sys/fs/cgroup/cpuacct/cgroup.procs
2019-01-30T17:57:59.118210109Z notice: reading stats from /sys/fs/cgroup/memory/memory.stat
2019-01-30T17:57:59.118458028Z mem 274558976 cache 21878 pgmajfault 3125178368 rss
2019-01-30T17:57:59.118494020Z notice: reading stats from /sys/fs/cgroup/cpuacct/cpuacct.stat
2019-01-30T17:57:59.118580221Z notice: reading stats from /sys/fs/cgroup/cpuset/cpuset.cpus
2019-01-30T17:57:59.118601306Z cpu 1736.5700 user 605.6800 sys 4 cpus
2019-01-30T17:57:59.118673946Z notice: reading stats from /sys/fs/cgroup/blkio/blkio.io_service_bytes
2019-01-30T17:57:59.118722647Z blkio:8:0 122073088 write 3577901056 read
2019-01-30T17:57:59.119331685Z net:eth0 3212642 tx 193065701 rx
2019-01-30T17:57:59.119337804Z net:docker0 0 tx 0 rx
Host Information
Linux e72df3831fbd 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
CPU Information
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 158
model name    : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
cpu MHz        : 2807.998
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 22
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase avx2 invpcid rdseed clflushopt flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 5615.99
clflush size    : 64
cache_alignment    : 64
address sizes    : 39 bits physical, 48 bits virtual
power management:
processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 158
model name    : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
cpu MHz        : 2807.998
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
apicid        : 1
initial apicid    : 1
fpu        : yes
fpu_exception    : yes
cpuid level    : 22
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase avx2 invpcid rdseed clflushopt flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 5615.99
clflush size    : 64
cache_alignment    : 64
address sizes    : 39 bits physical, 48 bits virtual
power management:
processor    : 2
vendor_id    : GenuineIntel
cpu family    : 6
model        : 158
model name    : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
cpu MHz        : 2807.998
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 2
cpu cores    : 4
apicid        : 2
initial apicid    : 2
fpu        : yes
fpu_exception    : yes
cpuid level    : 22
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase avx2 invpcid rdseed clflushopt flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 5615.99
clflush size    : 64
cache_alignment    : 64
address sizes    : 39 bits physical, 48 bits virtual
power management:
processor    : 3
vendor_id    : GenuineIntel
cpu family    : 6
model        : 158
model name    : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
cpu MHz        : 2807.998
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 3
cpu cores    : 4
apicid        : 3
initial apicid    : 3
fpu        : yes
fpu_exception    : yes
cpuid level    : 22
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase avx2 invpcid rdseed clflushopt flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 5615.99
clflush size    : 64
cache_alignment    : 64
address sizes    : 39 bits physical, 48 bits virtual
power management:
Memory Information
MemTotal:        4050520 kB
MemFree:          104796 kB
MemAvailable:     348848 kB
Buffers:           41316 kB
Cached:           430836 kB
SwapCached:        33864 kB
Active:          2759944 kB
Inactive:         968636 kB
Active(anon):    2646300 kB
Inactive(anon):   663692 kB
Active(file):     113644 kB
Inactive(file):   304944 kB
Unevictable:         128 kB
Mlocked:             128 kB
SwapTotal:       1400828 kB
SwapFree:         439240 kB
Dirty:              2696 kB
Writeback:             0 kB
AnonPages:       3231508 kB
Mapped:           241320 kB
Shmem:             53260 kB
Slab:             108112 kB
SReclaimable:      57160 kB
SUnreclaim:        50952 kB
KernelStack:       13088 kB
PageTables:        50396 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3426088 kB
Committed_AS:    8244960 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      167872 kB
DirectMap2M:     4026368 kB
Disk Space
Filesystem     1M-blocks  Used Available Use% Mounted on
none               58758 29852     25900  54% /
none               58758 29852     25900  54% /
Disk INodes
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
none           3826368 1361133 2465235   36% /
none           3826368 1361133 2465235   36% /
{
    "auth_uuid": "1q71a-gj3su-ov79kuuxw2nvqqz",
    "command": [
        "arvados-cwl-runner",
        "--local",
        "--api=containers",
        "--project-uuid=1q71a-tpzed-v0o0a46htvwy37d",
        "--collection-keep-cache=256",
        "/var/lib/cwl/workflow.json#main",
        "/var/lib/cwl/cwl.input.json" 
    ],
    "container_image": "99dc77297fbd39babf7587efd0d71f36+345",
    "created_at": "2019-01-30T17:57:56.869829000Z",
    "cwd": "/var/spool/cwl",
    "environment": {},
    "etag": "4d71pdc11hvfj5jq490qa85v1",
    "exit_code": null,
    "finished_at": null,
    "href": "/containers/1q71a-dz642-8bepebacm6m21dw",
    "kind": "arvados#container",
    "locked_by_uuid": "1q71a-gj3su-1c3q91f062k68fi",
    "log": null,
    "modified_at": "2019-01-30T17:57:57.863375000Z",
    "modified_by_client_uuid": "1q71a-ozdt8-tz3hl42ox5ksyv7",
    "modified_by_user_uuid": "1q71a-tpzed-000000000000000",
    "mounts": {
        "/var/lib/cwl/cwl.input.json": {
            "content": {
                "message": "Hello world!" 
            },
            "kind": "json" 
        },
        "/var/lib/cwl/workflow.json": {
            "content": {
                "$graph": [
                    {
                        "baseCommand": "echo",
                        "class": "CommandLineTool",
                        "cwlVersion": "v1.0",
                        "id": "#main",
                        "inputs": [
                            {
                                "default": "Hello world!",
                                "id": "#main/message",
                                "inputBinding": {
                                    "position": 1
                                },
                                "type": "string" 
                            }
                        ],
                        "outputs": []
                    }
                ],
                "cwlVersion": "v1.0" 
            },
            "kind": "json" 
        },
        "/var/spool/cwl": {
            "kind": "collection",
            "writable": true
        },
        "stdout": {
            "kind": "file",
            "path": "/var/spool/cwl/cwl.output.json" 
        }
    },
    "output": null,
    "output_path": "/var/spool/cwl",
    "owner_uuid": "1q71a-tpzed-000000000000000",
    "priority": 1124351035770240,
    "progress": null,
    "runtime_auth_scopes": [
        "all" 
    ],
    "runtime_constraints": {
        "API": true,
        "keep_cache_ram": 268435456,
        "ram": 1073741824,
        "vcpus": 1
    },
    "runtime_status": {},
    "runtime_user_uuid": "1q71a-tpzed-v0o0a46htvwy37d",
    "scheduling_parameters": {},
    "started_at": null,
    "state": "Locked",
    "uuid": "1q71a-dz642-8bepebacm6m21dw" 
}
2019-01-30T17:58:00.348578626Z Creating Docker container
2019-01-30T17:58:00.608178060Z Attaching container streams
2019-01-30T17:58:01.536004876Z notice: reading stats from /sys/fs/cgroup/cpuacct/docker/fa32906a46e92a275a55058b0e59e03f8e417c767fb14d131edc14ef9a861b67/cgroup.procs
2019-01-30T17:58:01.536034423Z notice: monitoring temp dir /tmp/crunch-run.1q71a-dz642-8bepebacm6m21dw.040935147
2019-01-30T17:58:01.536073241Z notice: reading stats from /sys/fs/cgroup/memory/docker/fa32906a46e92a275a55058b0e59e03f8e417c767fb14d131edc14ef9a861b67/memory.stat
2019-01-30T17:58:01.536296339Z mem 28672 cache 1 pgmajfault 745472 rss
2019-01-30T17:58:01.536334304Z notice: reading stats from /sys/fs/cgroup/cpuacct/docker/fa32906a46e92a275a55058b0e59e03f8e417c767fb14d131edc14ef9a861b67/cpuacct.stat
2019-01-30T17:58:01.536363962Z notice: reading stats from /sys/fs/cgroup/cpuset/docker/fa32906a46e92a275a55058b0e59e03f8e417c767fb14d131edc14ef9a861b67/cpuset.cpus
2019-01-30T17:58:01.536380576Z cpu 0.0000 user 0.0000 sys 4 cpus
2019-01-30T17:58:01.536396875Z notice: reading stats from /sys/fs/cgroup/blkio/docker/fa32906a46e92a275a55058b0e59e03f8e417c767fb14d131edc14ef9a861b67/blkio.io_service_bytes
2019-01-30T17:58:01.536424292Z blkio:8:0 0 write 16384 read
2019-01-30T17:58:01.536505593Z net:eth0 3225951 tx 193066295 rx
2019-01-30T17:58:01.536516550Z net:docker0 0 tx 0 rx
2019-01-30T17:58:01.536528821Z statfs 27157434368 available 31301435392 used 61612142592 total
2019-01-30T17:58:01.035332136Z Starting Docker container id 'fa32906a46e92a275a55058b0e59e03f8e417c767fb14d131edc14ef9a861b67'
2019-01-30T17:58:02.098692346Z Waiting for container to finish
2019-01-30T17:58:03.879214047Z usage: arvados-cwl-runner [-h] [--basedir BASEDIR] [--outdir OUTDIR]
2019-01-30T17:58:03.879214047Z                           [--eval-timeout EVAL_TIMEOUT]
2019-01-30T17:58:03.879214047Z                           [--print-dot | --version | --validate]
2019-01-30T17:58:03.879214047Z                           [--verbose | --quiet | --debug] [--metrics]
2019-01-30T17:58:03.879214047Z                           [--tool-help] [--enable-reuse | --disable-reuse]
2019-01-30T17:58:03.879214047Z                           [--project-uuid UUID] [--output-name OUTPUT_NAME]
2019-01-30T17:58:03.879214047Z                           [--output-tags OUTPUT_TAGS]
2019-01-30T17:58:03.879214047Z                           [--ignore-docker-for-reuse]
2019-01-30T17:58:03.879214047Z                           [--submit | --local | --create-template | --create-workflow | --update-workflow UUID]
2019-01-30T17:58:03.879214047Z                           [--wait | --no-wait]
2019-01-30T17:58:03.879214047Z                           [--log-timestamps | --no-log-timestamps]
2019-01-30T17:58:03.879214047Z                           [--api {jobs,containers}] [--compute-checksum]
2019-01-30T17:58:03.879214047Z                           [--submit-runner-ram SUBMIT_RUNNER_RAM]
2019-01-30T17:58:03.879214047Z                           [--submit-runner-image SUBMIT_RUNNER_IMAGE]
2019-01-30T17:58:03.879214047Z                           [--always-submit-runner]
2019-01-30T17:58:03.879214047Z                           [--submit-request-uuid UUID | --submit-runner-cluster CLUSTER_ID]
2019-01-30T17:58:03.879214047Z                           [--collection-cache-size COLLECTION_CACHE_SIZE]
2019-01-30T17:58:03.879214047Z                           [--name NAME] [--on-error {stop,continue}]
2019-01-30T17:58:03.879214047Z                           [--enable-dev] [--storage-classes STORAGE_CLASSES]
2019-01-30T17:58:03.879214047Z                           [--intermediate-output-ttl N] [--priority PRIORITY]
2019-01-30T17:58:03.879214047Z                           [--thread-count THREAD_COUNT]
2019-01-30T17:58:03.879214047Z                           [--http-timeout HTTP_TIMEOUT]
2019-01-30T17:58:03.879214047Z                           [--trash-intermediate | --no-trash-intermediate]
2019-01-30T17:58:03.879214047Z                           workflow ...
2019-01-30T17:58:03.879214047Z arvados-cwl-runner: error: unrecognized arguments: --collection-keep-cache=256
2019-01-30T17:58:04.167031084Z Container exited with code: 2
2019-01-30T17:58:04.287473388Z Complete
2019-01-30T17:58:04.762585270Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.1q71a-dz642-8bepebacm6m21dw.040935147/keep976348494]
2019-01-30T17:58:05.700347731Z crunch-run finished

#10 Updated by Lucas Di Pentima 7 months ago

For the completeness' sake, I've manually tested the fix and it worked.

#11 Updated by Peter Amstutz 7 months ago

Lucas Di Pentima wrote:

For the completeness' sake, I've manually tested the fix and it worked.

Thanks. It is always good to do at least a manual test even for "obvious" fixes because you never know when you overlook some trivial syntax error :-)

Looks good to me, please merge.

#12 Updated by Lucas Di Pentima 7 months ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF