Bug #19702

closed

singularity failure "plugin type="portmap" failed (add): netplugin failed with no error message: signal: killed"

Added by Peter Amstutz 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 11/08/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

https://workbench2.tordo.arvadosapi.com/processes/tordo-xvhdp-u55x8qxyai66qu3

2022-11-04T01:36:42.720240040Z using local keepstore process (pid 31086) at http://10.253.254.237:36637
2022-11-04T01:36:43.527944057Z gateway server listening at 10.253.254.237:37517
2022-11-04T01:36:43.529210341Z crunch-run 2.5.0~dev20221031202240 (go1.17.7) started
2022-11-04T01:36:43.529730722Z crunch-run process has uid=0(root) gid=0(root) groups=0(root)
2022-11-04T01:36:50.403375515Z Using FUSE mount: /usr/bin/arv-mount 2.5.0.dev20220908150551
2022-11-04T01:37:00.964489187Z Using container runtime: singularity-ce version 3.9.9
2022-11-04T01:37:00.965279062Z Executing container: tordo-dz642-2jn257pktac5pds
2022-11-04T01:37:00.965437914Z Executing on host 'ip-10-253-254-237'
2022-11-04T01:37:01.072429734Z container token "v2/tordo-gj3su-pss9lb029q7r5tk/16ci5mwyml0v1tgs0bnpq05kyr3ax5wlhg71jdkuzf1nmw8zjs/tordo-dz642-2jn257pktac5pds" 
2022-11-04T01:37:01.073177130Z Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2jn257pktac5pds.4195713021/keep3363330801]
2022-11-04T01:37:01.980388274Z Fetching Docker image from collection 'f8d78c661b100d071829d0600e01d2a6+513'
2022-11-04T01:37:02.088701070Z Using Docker image id "sha256:fb0ac87078b3916df22a477743a911c933447bc6ed6310af48dcc3cad3c5c815" 
2022-11-04T01:37:02.088734688Z Loading Docker image from keep
2022-11-04T01:37:02.507284910Z building singularity image
2022-11-04T01:37:02.508061814Z [singularity build /tmp/crunch-run.tordo-dz642-2jn257pktac5pds.4195713021/keep3363330801/by_uuid/tordo-4zz18-y71qluoxmymynse/image.sif docker-archive:///tmp/crunch-run-singularity-2503404092/image.tar]
2022-11-04T01:38:45.488012829Z INFO:    Starting build...
2022-11-04T01:38:45.488012829Z Getting image source signatures
2022-11-04T01:38:45.488012829Z Copying blob sha256:2141d9a2bb10152a46970ba69da724943d79d19bc0cd194945cc4ec2d1bc4ae2
2022-11-04T01:38:45.488012829Z Copying blob sha256:270a8dc08c4bb67a19b86398d4cfee8cdfcc344f7f3af88362a0a5eedfb5d2f9
2022-11-04T01:38:45.488012829Z Copying blob sha256:6be90f1a2d3f1eb115203b6adb2ce1014fab9a9f8f1b2afa31343397063603d3
2022-11-04T01:38:45.488012829Z Copying blob sha256:2761f8a9e627669ad97308c19bfb1dc2069a585c2614e83f220daf2dcef7c67e
2022-11-04T01:38:45.488012829Z Copying blob sha256:2761f8a9e627669ad97308c19bfb1dc2069a585c2614e83f220daf2dcef7c67e
2022-11-04T01:38:45.488012829Z Copying blob sha256:fa83b8d448a9dc6ab0ace6ee87bc8fd7ad2afc48536bad3722f198ce2f761872
2022-11-04T01:38:45.488012829Z Copying blob sha256:6483dc4da598b97363463ffda4351a39938c5dd1a7da7f7624dd57d3c6e50340
2022-11-04T01:38:45.488012829Z Copying blob sha256:c6f565b0be6987fcf58b3cc5466c25daf5b4ff9e0729a6194c4d7312577eb1a0
2022-11-04T01:38:45.488012829Z Copying blob sha256:7a51c5dbb21b520720e67b568c5da49bf1fa76af11f84ce5dcb2ae2c4e2714c1
2022-11-04T01:38:45.488012829Z Copying blob sha256:2edb48854a1856466fddbfaa009706ea3b86119977d2d87847d14dd3abe90657
2022-11-04T01:38:45.488012829Z Copying blob sha256:d0b9905f86257c13c172ba5dfef25db80eea39d6b1a5df4897b742e0a82a71ea
2022-11-04T01:38:45.488012829Z Copying config sha256:33b0fcf52b3adeb6a3ffb8d19414dc601ff893e862c1fd0f6819d7b32ccf8aad
2022-11-04T01:38:45.488012829Z Writing manifest to image destination
2022-11-04T01:38:45.488012829Z Storing signatures
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:32  info unpack layer: sha256:c719853e88efcc312969f220cd8e62ed9c46449a6bf5a7f3a3fa7dd403390aa6
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:33  info unpack layer: sha256:f12b85199b52ac3a1df407f52e4ca01b65d205852457d62d56e2504bb9db79e8
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:5cc050ed8d38cfaa70b4510dca7867744d2c1003dc43e98413bb96ade4803d7a
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:3ba58afa464a775d93de58a18d2a684b6a9eb3b830123c595aec9ce9277f9423
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:3ba58afa464a775d93de58a18d2a684b6a9eb3b830123c595aec9ce9277f9423
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:a4c191a15cf848288a39f0182b45ac9a11fdee6f6b741b3426fa3ca813888090
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:6d16249d98c1bc9ba8f3c2cf97cb56612c89963043f6f9f702e7b9b1c3a7081a
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:19f6b11482751f58dcde924f562d84e77925464dc00dce9b5e6daf0540441c02
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:40  info unpack layer: sha256:5fe831bf67816b7d24c7df6c4424f6b68b5499a5b56621138feba1cbeb71dc25
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:41  info unpack layer: sha256:6b53f2346bd04c68c8024bbef8b511e5aa9d59dfdd75280e338ec57786c05368
2022-11-04T01:38:45.488012829Z 2022/11/04 01:37:41  info unpack layer: sha256:70ce1999407af4e1f02c3c4a3b4c43d958f5eb8cea82d6eff244243b891f8e8a
2022-11-04T01:38:45.488012829Z INFO:    Creating SIF file...
2022-11-04T01:38:45.488012829Z INFO:    Build complete: /tmp/crunch-run.tordo-dz642-2jn257pktac5pds.4195713021/keep3363330801/by_uuid/tordo-4zz18-y71qluoxmymynse/image.sif
2022-11-04T01:38:45.759894753Z Starting container
2022-11-04T01:38:45.761689735Z Waiting for container to finish
2022-11-04T01:38:58.798455389Z FATAL:   container creation failed: plugin type="portmap" failed (add): netplugin failed with no error message: signal: killed
2022-11-04T01:38:58.813852446Z Container exited with status code 255 (signal -1)
2022-11-04T01:38:59.011943529Z Complete


Subtasks 1 (0 open, 1 closed)

Task #19714: Review 19702-memory-overhead | Resolved | Peter Amstutz | 11/08/2022

Actions #1

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg 3 months ago

I don't see any great clues here.

"signal: killed" might mean OOM while setting up the container. Perhaps 2 GB RAM is not enough for singularity to work reliably while other node-startup things are happening, and ReserveExtraRAM needs to be increased?

Actions #4

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote in #note-3:

> I don't see any great clues here.
>
> "signal: killed" might mean OOM while setting up the container. Perhaps 2 GB RAM is not enough for singularity to work reliably while other node-startup things are happening, and ReserveExtraRAM needs to be increased?

In theory, you just merged a feature that should be recording that information?

Actions #5

Updated by Tom Clegg 3 months ago

indeed

2022-11-04T01:38:45.760663239Z mem 100184064 cache 2277 pgmajfault 1048678400 rss
2022-11-04T01:38:51.073166536Z procmem 778432512 arv-mount 43061248 crunch-run 289300480 keepstore
2022-11-04T01:38:55.761151799Z mem 152002560 cache 2475 pgmajfault 1208025088 rss

778432512+43061248+289300480+1208025088 = 2318819328 > 2006636k

Subtracting the requested keep_cache_ram (268435456) from arv-mount+crunch-run+keepstore, we have

(778432512-268435456)+43061248+289300480 = 842358784
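The two sums above can be reproduced with a short sketch. The byte counts are copied from the crunchstat lines quoted in this note; the variable and function names are made up for illustration, and reading "2006636k" as KiB is an assumption.

```python
# Byte counts copied from the "procmem" and "mem ... rss" lines quoted above.
PROCMEM = {
    "arv-mount": 778432512,
    "crunch-run": 43061248,
    "keepstore": 289300480,
}
MEM_RSS = 1208025088          # rss from the 01:38:55 "mem" line
KEEP_CACHE_RAM = 268435456    # requested keep_cache_ram (256 MiB)
LIMIT_BYTES = 2006636 * 1024  # "2006636k", assuming k means KiB

MIB = 1024 * 1024


def total_usage(procmem, mem_rss):
    """Sum of the three helper processes plus the reported rss."""
    return sum(procmem.values()) + mem_rss


def overhead_excluding_cache(procmem, keep_cache_ram):
    """Helper-process memory left after subtracting the requested cache."""
    return sum(procmem.values()) - keep_cache_ram


total = total_usage(PROCMEM, MEM_RSS)
overhead = overhead_excluding_cache(PROCMEM, KEEP_CACHE_RAM)

print(f"total    {total} bytes ({total / MIB:.0f} MiB)")
print(f"limit    {LIMIT_BYTES} bytes ({LIMIT_BYTES / MIB:.0f} MiB)")
print(f"overhead {overhead} bytes ({overhead / MIB:.0f} MiB)")
print("over limit" if total > LIMIT_BYTES else "within limit")
```

The overhead figure works out to roughly 803 MiB, which is the number the suggestions below are trying to cover.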

Perhaps
  • Default ReserveExtraRAM should increase from 256 MiB to 550 MiB
  • ChooseInstanceType should add ((NBuffers * 64 MiB) + 200 MiB) * 1.1 when LocalKeepBlobBuffersPerVCPU>0, instead of just NBuffers*64 (adding some for non-buffer memory use, and 10% for GOGC=10)
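As a rough sketch of the second suggestion, the current and proposed keepstore allowances can be compared side by side. This is not the actual ChooseInstanceType code; the function names are hypothetical, NBuffers is taken as a plain integer parameter, and rounding down to whole bytes is an assumption.

```python
MIB = 1024 * 1024


def current_keepstore_allowance(n_buffers: int) -> int:
    """Current rule as described above: NBuffers * 64 MiB, in bytes."""
    return n_buffers * 64 * MIB


def proposed_keepstore_allowance(n_buffers: int) -> int:
    """Proposed rule: ((NBuffers * 64 MiB) + 200 MiB) * 1.1, in bytes.

    The extra 200 MiB covers non-buffer memory use and the 10% covers
    GOGC=10. Integer arithmetic is used so the result is exact, rounded
    down to a whole byte.
    """
    return (n_buffers * 64 * MIB + 200 * MIB) * 11 // 10


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        cur = current_keepstore_allowance(n)
        new = proposed_keepstore_allowance(n)
        print(f"NBuffers={n}: current {cur / MIB:.1f} MiB, proposed {new / MIB:.1f} MiB")
```

With the proposed rule, even a single buffer reserves about 290 MiB for keepstore, which is close to the 289300480 bytes observed in the procmem line.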
Actions #6

Updated by Tom Clegg 3 months ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #7

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote in #note-6:

> 19702-memory-overhead @ 9cd2fc2cd84000e706d73d1ff8316ce46b1be54d -- developer-run-tests: #3359

I'm wondering if there's something about the conversion from Docker to SIF that is leaving arv-mount with a larger than normal footprint.

Keepstore having 200 MiB of overhead before accounting for buffers seems high, although the numbers are the numbers.

Does this mean we can't run on 2 GiB nodes any more?

Otherwise this LGTM.

Actions #8

Updated by Tom Clegg 3 months ago

I agree, we should be able to make those numbers lower.

> Does this mean we can't run on 2 GiB nodes any more?

I suppose so, if the container requests more than 1 GiB of RAM + arv-mount cache.

Actions #9

Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Resolved
Actions #10

Updated by Peter Amstutz about 2 months ago

  • Release set to 47