Feature #15370

closed

[arvados-dispatch-cloud] loopback driver

Added by Tom Clegg over 5 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

The loopback driver implements cloud.Driver by presenting a fake cloud in which:
  • Create() succeeds once, but fails with a quota error if the caller tries to create multiple instances
  • Instances() returns the instance that was created, if any
  • Destroy() makes the instance disappear from the next Instances() result
  • Instance address points to an SSH server (brought up by the driver) that accepts the dispatcher's key and executes shell commands
Using the loopback driver will involve some special configuration:
  • If InstanceTypes is empty, it is automatically configured with a single instance type, with the host's RAM/CPU specs
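
The Create/Instances/Destroy behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the real cloud.Driver interface: the type and method names (LoopbackSet, Instance, ErrQuota), the simplified signatures, and the placeholder address are all assumptions made for the example, and the SSH server the driver brings up is not modeled.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQuota is a hypothetical stand-in for the quota error returned when
// the caller tries to create more than one instance.
var ErrQuota = errors.New("loopback: quota exceeded: only one instance allowed")

// Instance is a simplified, hypothetical stand-in for a cloud instance.
type Instance struct {
	ID      string
	Address string // placeholder for the address of the driver's SSH server
}

// LoopbackSet is a hypothetical fake cloud holding at most one instance.
type LoopbackSet struct {
	mtx      sync.Mutex
	instance *Instance
}

// Create succeeds once, then fails with ErrQuota until the existing
// instance has been destroyed.
func (s *LoopbackSet) Create() (*Instance, error) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.instance != nil {
		return nil, ErrQuota
	}
	s.instance = &Instance{ID: "loopback-0", Address: "127.0.0.1:2222"}
	return s.instance, nil
}

// Instances returns the instance that was created, if any.
func (s *LoopbackSet) Instances() []*Instance {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.instance == nil {
		return nil
	}
	return []*Instance{s.instance}
}

// Destroy makes the instance disappear from the next Instances() result.
func (s *LoopbackSet) Destroy(inst *Instance) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.instance == inst {
		s.instance = nil
	}
}

func main() {
	set := &LoopbackSet{}
	first, err := set.Create()
	fmt.Println("first create:", first.ID, err)
	_, err = set.Create()
	fmt.Println("second create:", err)
	set.Destroy(first)
	fmt.Println("instances after destroy:", len(set.Instances()))
}
```

Returning a quota error on the second Create() lets the dispatcher treat the single local host like a cloud account that is already at capacity, so it does not need a special case for the loopback driver.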

When combined with #14922 this should make crunch-dispatch-local redundant.

This will also facilitate an arvados-dispatch-cloud integration test that uses the real crunch-run program instead of a stub. This might involve a few other changes, like a configurable location for lockfiles.

It's okay that this will be useless (other than single-container test cases) until #14922 is implemented, because it will also make #14922 easier to test.


Subtasks 2 (0 open, 2 closed)

Task #19060: Review 15370-loopback-dispatchcloud (Resolved, Ward Vandewege, 05/19/2022)
Task #19138: Review 15370-install-docker (Resolved, Ward Vandewege, 05/17/2022)

Related issues 3 (2 open, 1 closed)

Related to Arvados - Feature #14922: Run multiple containers concurrently on a single cloud VM (New)
Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Related to Arvados - Idea #18973: Test combinations of federation scenarios (New)
Actions #1

Updated by Tom Clegg over 5 years ago

  • Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added
Actions #2

Updated by Tom Clegg over 5 years ago

  • Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #3

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from To Be Groomed to 2021-03-31 sprint
Actions #4

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2021-03-31 sprint to 2021-04-14 sprint
Actions #5

Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-04-14 sprint to 2021-05-26 sprint
Actions #7

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-05-26 sprint to 2021-07-07 sprint
Actions #8

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-07-07 sprint to 2021-07-21 sprint
Actions #9

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Actions #10

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Actions #11

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Actions #12

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (2021-09-01 sprint)
Actions #13

Updated by Peter Amstutz over 2 years ago

  • Target version set to 2022-04-27 Sprint
Actions #14

Updated by Peter Amstutz over 2 years ago

  • Related to Idea #18973: Test combinations of federation scenarios added
Actions #16

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2022-04-27 Sprint to 2022-05-11 sprint
Actions #17

Updated by Peter Amstutz over 2 years ago

  • Assigned To set to Tom Clegg
Actions #18

Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress
Actions #19

Updated by Tom Clegg over 2 years ago

  • Description updated (diff)

15370-loopback-dispatchcloud @ 34b13b1b9cc34661bf0c6774105ae03b412cbbdb -- developer-run-tests: #3085

(tests are failing because CI image doesn't have rsync)

Actions #20

Updated by Tom Clegg over 2 years ago

  • Description updated (diff)
Actions #22

Updated by Tom Clegg over 2 years ago

  • Target version changed from 2022-05-11 sprint to 2022-05-25 sprint
Actions #24

Updated by Tom Clegg over 2 years ago

Now tests are failing because the CI image doesn't have docker, so "arv-keepdocker" doesn't work.

Added docker install recipe to arvados-server install

15370-loopback-dispatchcloud @ f07c059fca954e4d001cbf1cb36c845be9d884dd

Actions #28

Updated by Tom Clegg over 2 years ago

Actions #29

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

15370-install-docker @ 663f3742a80b1b236d727d2d27068d03a37b4469

LGTM thanks!

Actions #30

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

15370-loopback-dispatchcloud @ 731c5e81f5aedc82d03786670610bde68bba27c7 -- developer-run-tests: #3146

I updated the jenkins satellite image to incorporate the changes from main, which means docker should now be present. Running these tests again:

developer-run-tests: #3153

That failed because the jenkins user can't access Docker. I pushed the update that adds docker to the 'test' image, and gives the jenkins user access to Docker, and rebuilt the image once more.

developer-run-tests: #3154

Some different failures here:

developer-run-tests-remainder: #3301 /consoleFull

time="2022-05-20T18:30:33.430743543Z" level=error msg=failed error="Error response from daemon: client version 1.40 is too new. Maximum supported API version is 1.39" 
14:30:33 exit status 1
14:30:33 
14:30:33 ----------------------------------------------------------------------
14:30:33 FAIL: build_test.go:27: BuildSuite.TestBuildAndInstall
14:30:33 
14:30:33 build_test.go:47:
14:30:33     c.Check(err, check.IsNil)
14:30:33 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc00000e0c0), Stderr:[]uint8(nil)} ("exit status 1")
14:30:33 
14:30:33 build_test.go:50:
14:30:33     c.Assert(err, check.IsNil)
14:30:33 ... value *fs.PathError = &fs.PathError{Op:"stat", Path:"/tmp/check-5577006791947779410/0/arvados-server-easy_1.2.3~rc4_amd64.deb", Err:0x2} ("stat /tmp/check-5577006791947779410/0/arvados-server-easy_1.2.3~rc4_amd64.deb: no such file or directory")
14:30:33 
14:30:33 OOPS: 0 passed, 1 FAILED
14:30:33 --- FAIL: Test (0.72s)

Hmm, we were using an old Buster base image. I've bumped it to the latest and am re-building the image now. Maybe that will fix the version difference? I've also added user_allow_other to /etc/fuse.conf in the image, which should fix the other problem in lib/crunchrun tests:

packer-build-jenkins-image-arvados-tests: #82

Here we go:

developer-run-tests: #3155

Actions #31

Updated by Tom Clegg over 2 years ago

I suspect adding a non-functional docker may have broken the main branch build by failing tests that were skipped when there was no docker in PATH. Might need to c.Skip() the arvados-package test for the time being. Checking:

15370-docker-tests @ 36cfafd6e7eae2784c22aefdd9df26783412d42a -- developer-run-tests: #3156

Looks like the same applies to integrationSuite.TestRunTrivialContainerWithLocalKeepstore in lib/crunchrun.

Actions #32

Updated by Tom Clegg over 2 years ago

15370-docker-tests @ 6caeb0768adabd32b50cc2ca6eb49d162745c4b0 -- developer-run-tests: #3157

(only wb1 integration tests failed there)

Actions #33

Updated by Tom Clegg over 2 years ago

Merged main:

15370-loopback-dispatchcloud @ 3fa6aa4043286ad61e5f29c136d3cc2942e8750d -- developer-run-tests: #3158

Looks like I have some more work to do.

Actions #34

Updated by Tom Clegg over 2 years ago

  • Target version changed from 2022-05-25 sprint to 2022-06-08 sprint
Actions #35

Updated by Tom Clegg over 2 years ago

The cmd/arvados-package test (which we used to skip because it requires docker) fails because it takes longer than 10m. I updated run-tests.sh to change the timeout to 20m for that suite, but I also updated the jenkins config to skip it in [developer-]run-tests-remainder. We can re-enable it after either (a) changing the image prep so a new jenkins worker has a cached build image (which makes the cmd/arvados-package test run much faster) or (b) moving it to a separate run-tests-package / developer-run-tests-package jenkins job.

Also fixed a "missing keep data dir" testing bug, a docker client usage bug, and a flaky error log test.

15370-loopback-dispatchcloud @ bad877eb1d1a84d25c1fab3592e4218774816179 -- developer-run-tests: #3162

retry wb1 developer-run-tests-apps-workbench-integration: #3387

Actions #36

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

The cmd/arvados-package test (which we used to skip because it requires docker) fails because it takes longer than 10m. I updated run-tests.sh to change the timeout to 20m for that suite, but I also updated the jenkins config to skip it in [developer-]run-tests-remainder. We can re-enable it after either (a) changing the image prep so a new jenkins worker has a cached build image (which makes the cmd/arvados-package test run much faster) or (b) moving it to a separate run-tests-package / developer-run-tests-package jenkins job.

Also fixed a "missing keep data dir" testing bug, a docker client usage bug, and a flaky error log test.

15370-loopback-dispatchcloud @ bad877eb1d1a84d25c1fab3592e4218774816179 -- developer-run-tests: #3162

retry wb1 developer-run-tests-apps-workbench-integration: #3387

Is there a reason to pin on a docker API version that is so old? Latest is 1.41, and we're pinning on 1.21.

Otherwise, LGTM, thanks.

Actions #37

Updated by Tom Clegg over 2 years ago

Ward Vandewege wrote:

Is there a reason to pin on a docker API version that is so old? Latest is 1.41, and we're pinning on 1.21.

Sort of. I just figured old API versions are supported for a long time, so there's no particular hurry to use a newer one, in which case we might as well use the same version we use in crunch-run.

If someone wants to use docker 1.9 to build packages, who am I to say no...

Actions #38

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

Ward Vandewege wrote:

Is there a reason to pin on a docker API version that is so old? Latest is 1.41, and we're pinning on 1.21.

Sort of. I just figured old API versions are supported for a long time, so there's no particular hurry to use a newer one, in which case we might as well use the same version we use in crunch-run.

If someone wants to use docker 1.9 to build packages, who am I to say no...

Hmm, actually docker 1.9 would be a problem, because the on-disk image format is different (we went through that whole painful migration in #8568 etc). I don't think anyone is using docker that old anymore.

If we're really going to default to an API version that old, there should be a comment in the code that states there is no actual reason for this, only a desire for maximal backwards compatibility.

This would avoid future concern about upping the API version when - inevitably - we'll run into a version of the Docker Engine that doesn't work with an API that old anymore.

It looks like Docker API 1.21 was introduced with Docker 1.9.0, released on 2015-11-03, which is really old.
For reference:
  • Debian 10 (buster) ships with Docker 18.09, which has Docker API 1.39
  • Ubuntu 18.04 (bionic) originally shipped with Docker 17.12, which has Docker API 1.35

Of course we use the docker package repos to install more recent versions of Docker. Even on CentOS 7 it seems that a recent docker is easily installed, cf. https://docs.docker.com/engine/install/centos/.

Actions #39

Updated by Tom Clegg over 2 years ago

15370-loopback-dispatchcloud @ bac1772ab074713e3c50632a4cad3cc1ce50d0ec -- developer-run-tests: #3163

updated crunch-run to docker API 1.35 and exported it as a const so arvados-package can stay in sync.
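
For illustration only, a shared constant of this kind, plus the check behind the "client version 1.40 is too new" error seen earlier in CI, might look roughly like the sketch below. The names DockerAPIVersion, parseAPIVersion, and ClientTooNew are hypothetical and are not taken from the branch. The comparison is numeric per component, since a plain string comparison would rank "1.9" above "1.35".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DockerAPIVersion is a hypothetical exported constant that several
// programs could import so they all request the same Docker API version.
const DockerAPIVersion = "1.35"

// parseAPIVersion splits a "major.minor" version string into integers.
func parseAPIVersion(v string) (major, minor int, err error) {
	parts := strings.SplitN(v, ".", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("invalid API version %q", v)
	}
	major, err = strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid API version %q: %w", v, err)
	}
	minor, err = strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid API version %q: %w", v, err)
	}
	return major, minor, nil
}

// ClientTooNew reports whether the client API version is newer than the
// maximum version the daemon supports, comparing components numerically.
func ClientTooNew(client, serverMax string) (bool, error) {
	cMaj, cMin, err := parseAPIVersion(client)
	if err != nil {
		return false, err
	}
	sMaj, sMin, err := parseAPIVersion(serverMax)
	if err != nil {
		return false, err
	}
	return cMaj > sMaj || (cMaj == sMaj && cMin > sMin), nil
}

func main() {
	// Check the pinned version against two example daemon maximums.
	for _, serverMax := range []string{"1.39", "1.41"} {
		tooNew, err := ClientTooNew(DockerAPIVersion, serverMax)
		fmt.Println(serverMax, tooNew, err)
	}
}
```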

Actions #40

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

15370-loopback-dispatchcloud @ bac1772ab074713e3c50632a4cad3cc1ce50d0ec -- developer-run-tests: #3163

updated crunch-run to docker API 1.35 and exported it as a const so arvados-package can stay in sync.

Thank you that's great. LGTM!

Actions #41

Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|86660414472d4ff0d8267f9845a753497bd41692.

Actions #42

Updated by Peter Amstutz about 2 years ago

  • Release set to 47