Idea #17857
Run WGS variant calling workflow on 9tee4 with LSF and Singularity enabled
Added by Peter Amstutz over 3 years ago. Updated about 3 years ago.
Updated by Peter Amstutz over 3 years ago
- Related to Idea #16304: LSF support added
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Updated by Peter Amstutz over 3 years ago
- Related to Idea #16305: Singularity support added
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Updated by Ward Vandewege over 3 years ago
Just trying to run LSF on 9tee4 is a bit of an adventure. Apparently there are a number of environment variables required, which I have set via /etc/arvados/environment (included via our systemd unit file). I had to set:
PATH="/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/etc:/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin/:/usr/bin:$PATH"
LSF_SERVERDIR=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/etc
LSF_LIBDIR=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/lib
LD_LIBRARY_PATH=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/lib
LSF_BINDIR=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin
LSF_ENVDIR=/opt/lsf/conf
Note the explicit inclusion of /usr/bin in PATH: I don't know what PATH is set to by default, but sudo lives in /usr/bin, so I had to add that directory.
I think this (the setting of these environment variables, including PATH) needs to be added to the documentation on the LSF installation page.
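For context, a minimal sketch of how an environment file like this typically gets hooked in via systemd (the drop-in path and unit name here are assumptions for illustration, not taken from the 9tee4 setup):

    # Hypothetical drop-in for the dispatcher unit; adjust paths as needed.
    mkdir -p /etc/systemd/system/arvados-dispatch-lsf.service.d
    cat > /etc/systemd/system/arvados-dispatch-lsf.service.d/env.conf <<'EOF'
    [Service]
    # Load the LSF environment variables listed above.
    EnvironmentFile=/etc/arvados/environment
    EOF
    systemctl daemon-reload
    systemctl restart arvados-dispatch-lsf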
Another thing that should probably be addressed:
Aug 19 16:59:58 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"warning","msg":"error locking container 9tee4-dz642-8zrkblhz10cwic8: arvados API server error: #\u003cArvadosModel::LockFailedError: cannot lock when Locked\u003e (req-bodo2l4psl3cg6caxvw3) (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com","time":"2021-08-19T16:59:58.042099184Z"}
and
Aug 19 17:11:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"warning","msg":"error locking container 9tee4-dz642-brng6597b8ies8w: arvados API server error: #\u003cArvadosModel::LockFailedError: cannot lock when Locked\u003e (req-1dcoj7l8ffh661gcotiu) (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com","time":"2021-08-19T17:11:46.480336333Z"}
Aug 19 17:11:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Submitting container 9tee4-dz642-brng6597b8ies8w to LSF","time":"2021-08-19T17:11:46.524634299Z"}
Aug 19 17:11:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-J\" \"9tee4-dz642-brng6597b8ies8w\" \"-R\" \"rusage[mem=1792MB:tmp=512MB] affinity[core(1)]\"] script \"#!/bin/sh\\nexec 'crunch-run' '--runtime-engine=docker' '-cgroup-parent-subsystem=memory' '9tee4-dz642-brng6597b8ies8w'\\n\"","time":"2021-08-19T17:11:46.524816973Z"}
Aug 19 17:11:46 9tee4.arvadosapi.com sudo[27663]: root : TTY=unknown ; PWD=/ ; USER=crunch ; COMMAND=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin//bsub -J 9tee4-dz642-brng6597b8ies8w -R rusage[mem=1792MB:tmp=512MB] affinity[core(1)]
Aug 19 17:11:46 9tee4.arvadosapi.com sudo[27663]: pam_unix(sudo:session): session opened for user crunch by (uid=0)
Aug 19 17:11:46 9tee4.arvadosapi.com sudo[27663]: pam_unix(sudo:session): session closed for user crunch
Aug 19 17:11:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"bsub finished","stdout":"Job \u003c1141\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-08-19T17:11:46.585577181Z"}
Aug 19 17:11:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Start monitoring container 9tee4-dz642-brng6597b8ies8w in state \"Locked\"","time":"2021-08-19T17:11:46.585661035Z"}
Aug 19 17:11:51 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"container 9tee4-dz642-brng6597b8ies8w job disappeared from LSF queue","time":"2021-08-19T17:11:51.632859295Z"}
Aug 19 17:11:51 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Done monitoring container 9tee4-dz642-brng6597b8ies8w","time":"2021-08-19T17:11:51.723647734Z"}
I'm getting quite a few of those in the a-d-l output.
Yeah something is broken here. I submitted a new diag job and it took minutes for arvados-dispatch-lsf to start printing stuff about it:
ward@shell.9tee4:~$ arv get 9tee4-dz642-3d5szoghfu4wm4l
{
 "href":"/containers/9tee4-dz642-3d5szoghfu4wm4l",
 "kind":"arvados#container",
 "etag":"c6imq2mm3ht2plwouotyxu7hb",
 "uuid":"9tee4-dz642-3d5szoghfu4wm4l",
 "owner_uuid":"9tee4-tpzed-000000000000000",
 "created_at":"2021-08-19T17:13:52.744996000Z",
 "modified_by_client_uuid":"9tee4-ozdt8-g11xkkfp6uj7axl",
 "modified_by_user_uuid":"9tee4-tpzed-000000000000000",
 "modified_at":"2021-08-19T17:14:51.749059000Z",
 "command":[
  "arvados-cwl-runner",
  "--enable-reuse",
  "--local",
  "--api=containers",
  "--no-log-timestamps",
  "--disable-validate",
  "--disable-color",
  "--eval-timeout=20",
  "--thread-count=4",
  "--collection-cache-size=256",
  "--on-error=continue",
  "/var/lib/cwl/workflow.json#main",
  "/var/lib/cwl/cwl.input.json"
 ],
 "container_image":"669164db5df9d83162fb74b72ec39521+261",
 "cwd":"/var/spool/cwl",
 "environment":{},
 "exit_code":null,
 ...
Aug 19 17:14:45 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"warning","msg":"error locking container 9tee4-dz642-3d5szoghfu4wm4l: arvados API server error: #\u003cArvadosModel::LockFailedError: cannot lock when Locked\u003e (req-q1eyr98j4bpj1vqoktrt) (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com","time":"2021-08-19T17:14:45.953611127Z"}
Aug 19 17:14:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Submitting container 9tee4-dz642-3d5szoghfu4wm4l to LSF","time":"2021-08-19T17:14:46.000793134Z"}
Aug 19 17:14:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-J\" \"9tee4-dz642-3d5szoghfu4wm4l\" \"-R\" \"rusage[mem=1792MB:tmp=512MB] affinity[core(1)]\"] script \"#!/bin/sh\\nexec 'crunch-run' '--runtime-engine=docker' '-cgroup-parent-subsystem=memory' '9tee4-dz642-3d5szoghfu4wm4l'\\n\"","time":"2021-08-19T17:14:46.000929006Z"}
Aug 19 17:14:46 9tee4.arvadosapi.com sudo[30064]: root : TTY=unknown ; PWD=/ ; USER=crunch ; COMMAND=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin//bsub -J 9tee4-dz642-3d5szoghfu4wm4l -R rusage[mem=1792MB:tmp=512MB] affinity[core(1)]
Aug 19 17:14:46 9tee4.arvadosapi.com sudo[30064]: pam_unix(sudo:session): session opened for user crunch by (uid=0)
Aug 19 17:14:46 9tee4.arvadosapi.com sudo[30064]: pam_unix(sudo:session): session closed for user crunch
Aug 19 17:14:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"bsub finished","stdout":"Job \u003c1146\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-08-19T17:14:46.048139004Z"}
Aug 19 17:14:46 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Start monitoring container 9tee4-dz642-3d5szoghfu4wm4l in state \"Locked\"","time":"2021-08-19T17:14:46.048234727Z"}
Aug 19 17:14:48 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"container 9tee4-dz642-3d5szoghfu4wm4l job disappeared from LSF queue","time":"2021-08-19T17:14:48.632636136Z"}
Aug 19 17:14:48 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Done monitoring container 9tee4-dz642-3d5szoghfu4wm4l","time":"2021-08-19T17:14:48.723767220Z"}
Ah, whoops, I still had c-d-s running as well. We should really detect that somehow.
Okay, now LSF is trying to run my diag job, but it errors out immediately, and there are no logs at the container/CR level:
Aug 19 17:19:47 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Submitting container 9tee4-dz642-x6pyy2uun6nsd92 to LSF","time":"2021-08-19T17:19:47.292209164Z"}
Aug 19 17:19:47 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-J\" \"9tee4-dz642-x6pyy2uun6nsd92\" \"-R\" \"rusage[mem=1792MB:tmp=512MB] affinity[core(1)]\"] script \"#!/bin/sh\\nexec 'crunch-run' '--runtime-engine=docker' '-cgroup-parent-subsystem=memory' '9tee4-dz642-x6pyy2uun6nsd92'\\n\"","time":"2021-08-19T17:19:47.292378272Z"}
Aug 19 17:19:47 9tee4.arvadosapi.com sudo[813]: root : TTY=unknown ; PWD=/ ; USER=crunch ; COMMAND=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin//bsub -J 9tee4-dz642-x6pyy2uun6nsd92 -R rusage[mem=1792MB:tmp=512MB] affinity[core(1)]
Aug 19 17:19:47 9tee4.arvadosapi.com sudo[813]: pam_unix(sudo:session): session opened for user crunch by (uid=0)
Aug 19 17:19:47 9tee4.arvadosapi.com sudo[813]: pam_unix(sudo:session): session closed for user crunch
Aug 19 17:19:47 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"bsub finished","stdout":"Job \u003c1155\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-08-19T17:19:47.351514377Z"}
Aug 19 17:19:47 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Start monitoring container 9tee4-dz642-x6pyy2uun6nsd92 in state \"Locked\"","time":"2021-08-19T17:19:47.351621125Z"}
Aug 19 17:19:53 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"container 9tee4-dz642-x6pyy2uun6nsd92 job disappeared from LSF queue","time":"2021-08-19T17:19:53.632135634Z"}
Aug 19 17:19:53 9tee4.arvadosapi.com arvados-dispatch-lsf[9981]: {"PID":9981,"level":"info","msg":"Done monitoring container 9tee4-dz642-x6pyy2uun6nsd92","time":"2021-08-19T17:19:53.757735375Z"}
Okay, I have no idea where the stdout/stderr for these jobs is going, so I'm stuck.
Updated by Ward Vandewege over 3 years ago
Hum. I've switched to Singularity, same result:
Aug 19 17:59:25 9tee4.arvadosapi.com arvados-dispatch-lsf[12239]: {"PID":12239,"level":"info","msg":"Submitting container 9tee4-dz642-oswa66y75f33qd4 to LSF","time":"2021-08-19T17:59:25.882555047Z"}
Aug 19 17:59:25 9tee4.arvadosapi.com arvados-dispatch-lsf[12239]: {"PID":12239,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-J\" \"9tee4-dz642-oswa66y75f33qd4\" \"-R\" \"rusage[mem=1792MB:tmp=512MB] affinity[core(1)]\"] script \"#!/bin/sh\\nexec 'crunch-run' '--runtime-engine=singularity' '-cgroup-parent-subsystem=memory' '9tee4-dz642-oswa66y75f33qd4'\\n\"","time":"2021-08-19T17:59:25.882748567Z"}
Aug 19 17:59:25 9tee4.arvadosapi.com sudo[12872]: root : TTY=unknown ; PWD=/ ; USER=crunch ; COMMAND=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin//bsub -J 9tee4-dz642-oswa66y75f33qd4 -R rusage[mem=1792MB:tmp=512MB] affinity[core(1)]
Aug 19 17:59:25 9tee4.arvadosapi.com sudo[12872]: pam_unix(sudo:session): session opened for user crunch by (uid=0)
Aug 19 17:59:25 9tee4.arvadosapi.com sudo[12872]: pam_unix(sudo:session): session closed for user crunch
Aug 19 17:59:25 9tee4.arvadosapi.com arvados-dispatch-lsf[12239]: {"PID":12239,"level":"info","msg":"bsub finished","stdout":"Job \u003c1193\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-08-19T17:59:25.968487331Z"}
Aug 19 17:59:25 9tee4.arvadosapi.com arvados-dispatch-lsf[12239]: {"PID":12239,"level":"info","msg":"Start monitoring container 9tee4-dz642-oswa66y75f33qd4 in state \"Locked\"","time":"2021-08-19T17:59:25.968575377Z"}
Aug 19 17:59:29 9tee4.arvadosapi.com arvados-dispatch-lsf[12239]: {"PID":12239,"level":"info","msg":"container 9tee4-dz642-oswa66y75f33qd4 job disappeared from LSF queue","time":"2021-08-19T17:59:29.278079381Z"}
Aug 19 17:59:29 9tee4.arvadosapi.com arvados-dispatch-lsf[12239]: {"PID":12239,"level":"info","msg":"Done monitoring container 9tee4-dz642-oswa66y75f33qd4","time":"2021-08-19T17:59:29.352937440Z"}
Updated by Peter Amstutz over 3 years ago
- Assigned To changed from Ward Vandewege to Tom Clegg
- Status changed from In Progress to New
Updated by Tom Clegg over 3 years ago
Added to Containers section of config:
LSF:
  BsubArgumentsList: ["-e", "/tmp/crunch-run.err", "-o", "/tmp/crunch-run.out"]
(-e "Appends the standard error output of the job to the specified file path." ... -o same thing for stdout.)
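For reference, that snippet sits under the cluster's Containers section; a sketch of the relevant fragment of /etc/arvados/config.yml (the surrounding nesting is the standard Arvados config layout, shown here just for orientation):

    Clusters:
      9tee4:
        Containers:
          LSF:
            # Capture crunch-run's stderr/stdout on the compute node.
            BsubArgumentsList: ["-e", "/tmp/crunch-run.err", "-o", "/tmp/crunch-run.out"]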
Ran a job.
Looked at /tmp/crunch-run.err on compute0. Lots of this:
/home/crunch/.lsbatch/1629991275.1228: 8: /home/crunch/.lsbatch/1629991275.1228: cat: not found
/home/crunch/.lsbatch/1629991275.1228: 14: /home/crunch/.lsbatch/1629991275.1228: chmod: not found
/home/crunch/.lsbatch/1629991275.1228: 15: /home/crunch/.lsbatch/1629991275.1228: /home/crunch/.lsbatch/1629991275.1228.shell: Permission denied
Something horribly wrong with PATH?
Updated by Tom Clegg over 3 years ago
It looks like the systemd EnvironmentFile feature does not support interpolation like PATH="...:$PATH".
I updated /etc/arvados/environment to this
PATH="/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/etc:/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin/:/usr/local/bin:/usr/bin:/bin"
And my container ran.
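Putting the two findings together, the working /etc/arvados/environment ends up as below (the fixed PATH from this note plus the LSF variables from the earlier comment; one assignment per line, no interpolation):

    PATH="/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/etc:/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin/:/usr/local/bin:/usr/bin:/bin"
    LSF_SERVERDIR=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/etc
    LSF_LIBDIR=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/lib
    LD_LIBRARY_PATH=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/lib
    LSF_BINDIR=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin
    LSF_ENVDIR=/opt/lsf/conf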
Starting arvados-dispatch-lsf...
{"Listen":"[::]:9009","PID":18731,"Service":"arvados-dispatch-lsf","URL":"http://0.0.0.0:9009/","level":"info","msg":"listening","time":"2021-08-26T17:31:42.636462805Z"}
Started arvados-dispatch-lsf.
{"PID":18731,"level":"info","msg":"Submitting container 9tee4-dz642-n3r1abawm3d7lug to LSF","time":"2021-08-26T17:31:58.719867230Z"}
{"PID":18731,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-e\" \"/tmp/crunch-run.err\" \"-o\" \"/tmp/crunch-run.out\" \"-J\" \"9tee4-dz642-n3r1abawm3d7lug
root : TTY=unknown ; PWD=/ ; USER=crunch ; COMMAND=/opt/lsf/10.1/linux2.6-glibc2.3-x86_64/bin//bsub -e /tmp/crunch-run.err -o /tmp/crunch-run.out -J 9tee4-dz642-n3r1abawm3d7lug -R rusage
pam_unix(sudo:session): session opened for user crunch by (uid=0)
pam_unix(sudo:session): session closed for user crunch
{"PID":18731,"level":"info","msg":"bsub finished","stdout":"Job \u003c1241\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-08-26T17:31:58.773572895Z"}
{"PID":18731,"level":"info","msg":"Start monitoring container 9tee4-dz642-n3r1abawm3d7lug in state \"Locked\"","time":"2021-08-26T17:31:58.773692745Z"}
{"PID":18731,"level":"info","msg":"container 9tee4-dz642-n3r1abawm3d7lug changed state from Locked to Running","time":"2021-08-26T17:32:46.206450531Z"}
{"PID":18731,"level":"info","msg":"container 9tee4-dz642-n3r1abawm3d7lug is done","time":"2021-08-26T17:32:51.255502723Z"}
{"PID":18731,"level":"info","msg":"Bkill(1241)","time":"2021-08-26T17:32:51.650453074Z"}
{"PID":18731,"level":"info","msg":"Bkill(1241)","time":"2021-08-26T17:32:56.650658120Z"}
{"PID":18731,"level":"info","msg":"Done monitoring container 9tee4-dz642-n3r1abawm3d7lug","time":"2021-08-26T17:33:01.651183840Z"}
{"PID":18731,"level":"info","msg":"container 9tee4-dz642-n3r1abawm3d7lug job disappeared from LSF queue","time":"2021-08-26T17:33:01.651157981Z"}
Updated by Tom Clegg over 3 years ago
The "-o" and "-e" flags append output to those files in /tmp, so they will eventually fill up the filesystem. I changed them to "-oo" and "-eo" ("overwrite") so there's still some opportunity to debug. Should we recommend something like this for production systems, or even do it by default?
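With the overwrite variants, the config fragment becomes:

    BsubArgumentsList: ["-eo", "/tmp/crunch-run.err", "-oo", "/tmp/crunch-run.out"]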
Updated by Tom Clegg over 3 years ago
Ward Vandewege wrote:
Ah, whoops, I still had c-d-s running as well. We should really detect that somehow.
Added #18071
Updated by Carlos Fenoy over 3 years ago
Tom Clegg wrote:
The "-o" and "-e" flags append output to those files in /tmp, so they will eventually fill up the filesystem. I changed them to "-oo" and "-eo" ("overwrite") so there's still some opportunity to debug. Should we recommend something like this for production systems, or even do it by default?
If you're using systemd tmp cleanup, you can probably use %J in the output and error parameters to have the file created with the job ID in the name. That way the tmp cleanup will take care of removing old log files.
-o /tmp/crunch-run.%J.out -e /tmp/crunch-run.%J.err
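In BsubArgumentsList form that would look something like the following (LSF substitutes %J with the job ID, so each job writes its own pair of files and the regular /tmp aging can reap them):

    BsubArgumentsList: ["-e", "/tmp/crunch-run.%J.err", "-o", "/tmp/crunch-run.%J.out"]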
Updated by Tom Clegg over 3 years ago
- Assigned To changed from Tom Clegg to Ward Vandewege
Updated by Ward Vandewege over 3 years ago
All right, next try:
compute1.9tee4:~# cat /tmp/crunch-run.err
2021/08/31 19:34:39 crunch-run 2.3.0~dev20210830172933 (go1.16.3) started
2021/08/31 19:34:39 error reading "/etc/arvados/ca-certificates.crt": open /etc/arvados/ca-certificates.crt: permission denied
9tee4-dz642-i85l43j745uwq9r 2021-08-31T19:34:39.393536297Z Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
9tee4-dz642-i85l43j745uwq9r 2021-08-31T19:34:39.393708481Z crunch-run 2.3.0~dev20210830172933 (go1.16.3) started
9tee4-dz642-i85l43j745uwq9r 2021-08-31T19:34:39.393740726Z Executing container '9tee4-dz642-i85l43j745uwq9r'
9tee4-dz642-i85l43j745uwq9r 2021-08-31T19:34:39.393773730Z Executing on host 'compute1.9tee4.arvadosapi.com'
2021/08/31 19:34:39 error reading "/etc/arvados/ca-certificates.crt": open /etc/arvados/ca-certificates.crt: permission denied
9tee4-dz642-i85l43j745uwq9r 2021-08-31T19:34:39.532756229Z crunch-run finished
2021/08/31 19:34:39 9tee4-dz642-i85l43j745uwq9r: dispatch error detected: container "9tee4-dz642-i85l43j745uwq9r" has state "Running"
The "/etc/arvados" directory is not typically readable by the crunch user, and it should not be. I've temporarily changed it to 755, and tried again:
compute1.9tee4:~# cat /tmp/crunch-run.err
2021/08/31 19:38:35 crunch-run 2.3.0~dev20210830172933 (go1.16.3) started
9tee4-dz642-3fblxkmhh3bs27z 2021-08-31T19:38:35.408102191Z Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
9tee4-dz642-3fblxkmhh3bs27z 2021-08-31T19:38:35.408294715Z crunch-run 2.3.0~dev20210830172933 (go1.16.3) started
9tee4-dz642-3fblxkmhh3bs27z 2021-08-31T19:38:35.408319085Z Executing container '9tee4-dz642-3fblxkmhh3bs27z'
9tee4-dz642-3fblxkmhh3bs27z 2021-08-31T19:38:35.408356480Z Executing on host 'compute1.9tee4.arvadosapi.com'
9tee4-dz642-3fblxkmhh3bs27z 2021-08-31T19:38:35.550646257Z crunch-run finished
2021/08/31 19:38:35 9tee4-dz642-3fblxkmhh3bs27z: dispatch error detected: container "9tee4-dz642-3fblxkmhh3bs27z" has state "Running"
That's not super helpful. The output file:
compute1.9tee4:~# cat /tmp/crunch-run.out
Sender: LSF System <ward@compute1.9tee4.arvadosapi.com>
Subject: Job 1526: <9tee4-dz642-3fblxkmhh3bs27z> in cluster <9tee4> Exited

Job <9tee4-dz642-3fblxkmhh3bs27z> was submitted from host <9tee4.arvadosapi.com> by user <crunch> in cluster <9tee4> at Tue Aug 31 19:38:32 2021
Job was executed on host(s) <compute1.9tee4.arvadosapi.com>, in queue <normal>, as user <crunch> in cluster <9tee4> at Tue Aug 31 19:38:33 2021
</home/crunch> was used as the home directory.
</> was used as the working directory.
Started at Tue Aug 31 19:38:33 2021
Terminated at Tue Aug 31 19:38:36 2021
Results reported at Tue Aug 31 19:38:36 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/sh
exec 'crunch-run' '--runtime-engine=singularity' '-cgroup-parent-subsystem=memory' '9tee4-dz642-3fblxkmhh3bs27z'
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time               : 0.94 sec.
Max Memory             : 19 MB
Average Memory         : 19.00 MB
Total Requested Memory : 1792.00 MB
Delta Memory           : 1773.00 MB
Max Swap               : -
Max Processes          : 3
Max Threads            : 8
Run time               : 6 sec.
Turnaround time        : 4 sec.

The output (if any) follows:

PS: Read file </tmp/crunch-run.err> for stderr output of this job.
Looking in the logs for that container, it seems there are a few problems here:
2021-08-31T19:38:34.515989997Z 2021-08-31 19:38:34 arvados.events[11568] WARNING: Failed to connect to websockets on wss://ws.9tee4.arvadosapi.com/websocket
2021-08-31T19:38:34.516081593Z 2021-08-31 19:38:34 arvados.events[11568] WARNING: Falling back to polling after websocket error: Invalid response status: b'502' b'Bad Gateway'
2021-08-31T19:38:33.945886904Z Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
2021-08-31T19:38:33.946147307Z crunch-run 2.3.0~dev20210830172933 (go1.16.3) started
2021-08-31T19:38:33.946165578Z Executing container '9tee4-dz642-3fblxkmhh3bs27z'
2021-08-31T19:38:33.946199043Z Executing on host 'compute0.9tee4.arvadosapi.com'
2021-08-31T19:38:34.060837819Z container token "v2/9tee4-gj3su-fj7lt87j2dpumqk/4wvktzjdy06ds9m4w3prqmvj8kjf9sg58r6cwrl9ldjg4w7gn8/9tee4-dz642-3fblxkmhh3bs27z"
2021-08-31T19:38:34.061402246Z Running [arv-mount --foreground --allow-other --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.9tee4-dz642-3fblxkmhh3bs27z.092670893/keep116139816]
2021-08-31T19:38:34.566733220Z Fetching Docker image from collection 'bc563f6936293fe5e270e45f261b4ae6+261'
2021-08-31T19:38:34.664994438Z Using Docker image id "sha256:8def795c5bfc75c4db818b859ba4dd8d6465c07b46d7fc8e9c9fa53172f955b1"
2021-08-31T19:38:34.665057505Z Loading Docker image from keep
2021-08-31T19:38:35.320980789Z Starting container
2021-08-31T19:38:35.322152762Z Waiting for container to finish
2021-08-31T19:38:37.383305765Z caught signal: terminated
2021-08-31T19:38:37.383360502Z stopping container
2021-08-31T19:38:37.521524788Z error in CaptureOutput: error creating output collection: arvados API server error: //railsapi.internal/arvados/v1/collections: 401 Unauthorized: Not logged in (req-1ehxhz9n0dz971hh7zlp) (401: 401 Unauthorized) returned by 9tee4.arvadosapi.com
2021-08-31T19:38:37.540459288Z Cancelled
2021-08-31T19:38:37.679458370Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-3fblxkmhh3bs27z.092670893/keep116139816]
2021-08-31T19:38:38.735741353Z caught signal: terminated
2021-08-31T19:38:38.735829553Z stopping container
2021-08-31T19:38:40.233902460Z caught signal: terminated
2021-08-31T19:38:40.233978985Z stopping container
2021-08-31T19:38:41.736482842Z caught signal: terminated
2021-08-31T19:38:41.736569449Z stopping container
2021-08-31T19:38:43.234006227Z caught signal: terminated
2021-08-31T19:38:43.234076022Z stopping container
2021-08-31T19:38:44.734565999Z caught signal: terminated
2021-08-31T19:38:44.734632292Z stopping container
2021-08-31T19:38:46.233633706Z caught signal: terminated
2021-08-31T19:38:46.233700716Z stopping container
2021-08-31T19:38:47.735409189Z caught signal: terminated
2021-08-31T19:38:47.735477843Z stopping container
2021-08-31T19:38:49.235291190Z caught signal: terminated
2021-08-31T19:38:49.235367826Z stopping container
2021-08-31T19:38:50.736262165Z caught signal: terminated
2021-08-31T19:38:50.736343160Z stopping container
2021-08-31T19:38:52.233867003Z caught signal: terminated
2021-08-31T19:38:52.233945412Z stopping container
2021-08-31T19:38:53.734046453Z caught signal: terminated
2021-08-31T19:38:53.734160096Z stopping container
2021-08-31T19:38:55.235477343Z caught signal: terminated
2021-08-31T19:38:55.235546007Z stopping container
2021-08-31T19:38:56.678199057Z error in UpdateContainerFinal: arvados API server error: Log cannot be modified in state 'Cancelled' (nil, "7d62500b46f12d7729d063789e9e88e9+221") (req-f6gxd3fhod79jhkmtnxp) (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com
2021-08-31T19:38:56.678312309Z crunch-run finished
Problems/bugs:
- an issue contacting websockets
- /etc/arvados/ca-certificates.crt => when /etc/arvados is not accessible crunch-run aborts
- stdout/stderr from crunch-run (the contents of /tmp/crunch-run.err and /tmp/crunch-run.out) is not being saved in the container log (cf. the error in the container log about permissions)
- the stdout/stderr from the actual job is not being saved either; it's not in /tmp/crunch-run.err or /tmp/crunch-run.out -- where does LSF send this? (see the bpeek sketch after this list)
- the job didn't run successfully (why?)
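On the question of where LSF keeps a job's output: for a job that is still running, LSF's bpeek command displays the stdout/stderr captured so far. A quick sketch (job ID 1155 is just borrowed from the log excerpt above):

    bpeek 1155      # show output accumulated so far for LSF job 1155
    bpeek -f 1155   # follow the output as it is produced, tail -f style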
Okay, I had duelling dispatchers again after the reboot earlier today. Sigh. I re-ran and the diag job completed (9tee4-xvhdp-cnl3mlv5mr1q68t). These are the problems that remain:
- an issue contacting websockets => caused by an unrelated misconfiguration, fixed now
- /etc/arvados/ca-certificates.crt issue: when /etc/arvados is not accessible crunch-run prints a warning that looks like an error (fixed in d026d60cc8b9f1117e815d8fb76d4901ca6eaf26)
- stdout (and stderr?) from crunch-run (the contents of /tmp/crunch-run.err and /tmp/crunch-run.out) not being saved in the container log => apart from the LSF metainformation about the job, this is already stored in the container logs. It's not trivial to get the metainformation, so we're going to punt on it for now.
Updated by Ward Vandewege over 3 years ago
Okay, time to run the big workflow. I'm preparing the files in https://workbench.9tee4.arvadosapi.com/projects/9tee4-j7d0g-arsujrb4tphsvor.
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-09-01 sprint to 2021-09-15 sprint
Updated by Ward Vandewege over 3 years ago
Ward Vandewege wrote:
Okay, time to run the big workflow. I'm preparing the files in https://workbench.9tee4.arvadosapi.com/projects/9tee4-j7d0g-arsujrb4tphsvor.
The run at https://workbench.9tee4.arvadosapi.com/container_requests/9tee4-xvhdp-3byn9auoavhoexj completed without error, albeit very slowly.
cluster | issue | container request | configuration | result
9tee4 | #17857 | 9tee4-xvhdp-3byn9auoavhoexj | LSF + Singularity | runtime 19h58m (2h56m queued); 2d2h4m of node allocation time (2.5⨯ scaling)
tordo | #17755 | tordo-xvhdp-4dc27lh513rh7ai | cloud + Singularity | runtime 1h14m (1h34m queued); 5h16m of node allocation time (4.3⨯ scaling)
9tee4 | #18096 | 9tee4-xvhdp-o0l519uaanemolg | Slurm + Singularity | runtime 3h24m (1m queued); 12h34m of node allocation time (3.7⨯ scaling)
9tee4 | #17857 | 9tee4-xvhdp-yesk0ae2huvdfvs | LSF + Singularity (after perf fix, cf. note 27) | runtime 3h36m (2m queued); 12h17m of node allocation time (3.4⨯ scaling)
9tee4 only has 2 compute nodes, each with 20 cores, but they are very slow (AMD Opteron(tm) Processor 6348). I do not believe the slowness is related to LSF or Singularity per se. There is some curious unevenness to the processing speed; see e.g. haplotypecaller_5, which completed much faster than the other copies of that step. The job on Tordo ran on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, apparently. The latter has single-thread performance that is almost 50% better (cf. https://www.cpubenchmark.net/compare/AMD-Opteron-6348-vs-Intel-Xeon-Platinum-8259CL/1903vs3671). 9tee4 is also running a very old kernel, which I'm sure doesn't help in terms of performance. I assume the bio-informatics tools may also be optimized for newer CPUs (maybe they use newer CPU features when available).
Update in light of #18096: something else is going on here. Slurm + Singularity runs this workflow much faster on the same hardware (e.g. the bwamem-samtools-view steps complete in ~12-14 minutes, as opposed to 2-3 hours under LSF). I did notice that while the LSF job was running, the bwa process never used more than 100% CPU (despite being started with `-t 16` and the load average being 16), while under Slurm the bwa job uses 1600% CPU as expected.
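A quick way to check for that kind of pinning on the compute node (a generic Linux diagnostic, not something we ran as part of this ticket):

    # Inspect the CPU affinity mask of the running bwa process (pgrep -n picks the newest).
    pid=$(pgrep -n bwa)
    taskset -cp "$pid"
    # If the allowed-CPU list shows a single core, that would explain bwa
    # never exceeding ~100% CPU despite -t 16.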
The 9tee4-xvhdp-o0l519uaanemolg job completed much faster in clock time, partly because Slurm ran multiple jobs per machine. Our current LSF configuration apparently does not allow that.
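For the record, in stock LSF the number of concurrent job slots per host is controlled in lsb.hosts via the MXJ column; a sketch, assuming the default configuration layout (not our actual 9tee4 config):

    # Fragment of lsb.hosts: MXJ "!" means one job slot per core, so multiple
    # jobs can run per host. Apply changes with "badmin reconfig".
    Begin Host
    HOST_NAME   MXJ   r1m   pg   ls   tmp   DISPATCH_WINDOW
    default     !     ()    ()   ()   ()    ()
    End Host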
Updated by Ward Vandewege over 3 years ago
- Status changed from In Progress to Resolved
Updated by Ward Vandewege over 3 years ago
- Related to Feature #18096: Run WGS variant calling workflow on 9tee4 with Slurm and Singularity enabled added
Updated by Ward Vandewege over 3 years ago
- Status changed from Resolved to In Progress
Re-opening this in light of the performance data from #18096, where Slurm + Singularity ran the Chromosome 19 job much faster than LSF + Singularity. Will update https://dev.arvados.org/issues/17857#note-22 with the details.
Updated by Ward Vandewege over 3 years ago
- Subject changed from Run WGS variant calling workflow on 9tee4 with LSF and singularity enabled to Run WGS variant calling workflow on 9tee4 with LSF and Singularity enabled
Updated by Ward Vandewege over 3 years ago
We tracked the slowness down to the way we invoked bsub: the 'affinity[core(%d)]' argument we were passing is not what we want; instead we need to pass '-n %d' without affinity to make the appropriate number of cores available to the job. See 9db9c191df88e0b7f9eef8a8bef044d7e046767f on branch 17857-lsf-cpus.
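Concretely, the generated bsub invocation changes roughly like this (container UUID and resource values borrowed from the log excerpts above; for a 16-CPU container the %d becomes 16):

    # Before: CPU request expressed as an affinity requirement
    sudo -E -u crunch bsub -J 9tee4-dz642-n3r1abawm3d7lug -R "rusage[mem=1792MB:tmp=512MB] affinity[core(1)]"
    # After (17857-lsf-cpus): CPU request expressed as a slot count
    sudo -E -u crunch bsub -n 1 -J 9tee4-dz642-n3r1abawm3d7lug -R "rusage[mem=1792MB:tmp=512MB]"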
Testing again now on 9tee4:
$ arvados-cwl-runner --disable-reuse --project-uuid 9tee4-j7d0g-uefv7xq2hlwyph5 ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf-chr19.yml
INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210908184327, arvados-python-client 2.3.0.dev20210908184327, cwltool 3.0.20210319143721
INFO Resolved './cwl/wgs-processing-wf.cwl' to 'file:///home/ward/arvados-tutorial/WGS-processing/cwl/wgs-processing-wf.cwl'
INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
INFO Upload local files: "tail.html" "generatereport.py" "head.html"
INFO Using collection 4fce132bc4ae9b4cc844115246a6bd41+175 (9tee4-4zz18-2x31x08s4433o0n)
INFO Using collection cache size 256 MiB
INFO [container WGS processing workflow scattered over samples] submitted container_request 9tee4-xvhdp-yesk0ae2huvdfvs
INFO Monitor workflow progress at https://workbench2.9tee4.arvadosapi.com/processes/9tee4-xvhdp-yesk0ae2huvdfvs
...
Updated by Tom Clegg over 3 years ago
17857-lsf-cpus @ 9db9c191df88e0b7f9eef8a8bef044d7e046767f -- developer-run-tests: #2682
Updated by Ward Vandewege over 3 years ago
Tom Clegg wrote:
17857-lsf-cpus @ 9db9c191df88e0b7f9eef8a8bef044d7e046767f -- developer-run-tests: #2682
LGTM, thanks! The job finished in the expected amount of time now, I have updated note #22.
Updated by Ward Vandewege over 3 years ago
- Status changed from In Progress to Resolved