Story #15026

[arvados-dispatch-cloud] Cloud driver/config testing tool

Added by Tom Clegg 6 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
06/21/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Provide an arvados-server "cloudtest" subcommand (lib/cloud/test) that uses the configured credentials (from cluster config file) to verify that
  • the selected driver implements the cloud.Driver interface properly (empty and non-empty instance tag sets; no implicit filtering of instances list; Instances() includes the new instance if called immediately after Create() returns success; Destroy() works)
  • the cloud provider accepts the configured credentials
  • resulting VMs accept the configured SSH private key and run commands as root
This has three main uses:
  1. Dev tests when creating/modifying a driver
  2. CI tests
  3. Verify/debug config while creating/updating a real cluster
Specs:
  • By default, use InstanceSetID "cloudtest-$(whoami)@$(hostname)" so a series of aborted/broken runs will recognize any abandoned instances. Accept a command line argument -instance-set-id=string to override.
  • Use the selected driver directly: don't use a worker.Pool, rateLimitingInstanceSet, etc.
  • Start by listing all instances and checking whether any are tagged with the selected InstanceSetID.
    • If so, and a -clear command line flag was given: destroy them, get an updated list, and repeat until they're all gone.
    • If so, and a -clear command line flag was not given: log a message mentioning the "-clear" option, and error out.
  • Create an instance, using a {"CloudTestPID":"$PID","InstanceSetID":"$InstanceSetID"} tag plus any ResourceTags in the cluster config. If an error is returned, log it (and exit non-zero later), but keep going in case an instance was created.
  • Verify that the Tags() on the returned instance match the ones passed to Create().
  • List all instances. If an error is returned, log it but keep going so the test instance (if any) can be destroyed.
  • Verify that the instance list has an instance with the same ID as the one returned from Create(). If not, keep going, but log an error: the instance is supposed to appear in the very next Instances() call after Create() returns.
  • Verify that the instance returned in the list has the same tags.
  • If a new instance was created (either Create() succeeded or List() returned an instance with our InstanceSetID):
    • Poll Instances() until the instance has a non-empty Address() or TimeoutBooting expires.
    • Use ssh_executor to run BootProbeCommand on the instance (or "docker ps -q" if that's empty). Retry until it succeeds or TimeoutBooting expires.
    • If a -command value is given, execute it as a shell command on the instance, with stdin/stdout/stderr connected to cloudtest's own stdin/stdout/stderr.
    • If the -pause-before-destroy flag is given, show a sample SSH command line for connecting to the instance, and wait for the user to press Enter before proceeding.
    • Destroy the instance.
    • Poll Instances() until the instance disappears.
  • Exit 0 if everything succeeded, otherwise 1.

If the -quiet flag isn't given, log progress to stdout.

$ arvados-server cloudtest -exec 'echo $(hostname) $(date)' -pause-before-destroy
getting instance list
got instance list (N=13)
no instances are tagged with our InstanceSetID (7 instances are not tagged with any InstanceSetID at all)
creating instance with tags map[CloudTestPID:1234, InstanceSetID:cloudtest-ops@4xphq]
created instance with id i-12345abcde
all requested tags are present
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
all requested tags are present
instance has no address
waiting probeInterval 10s
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
instance has no address
waiting probeInterval 10s
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
instance i-12345abcde has addr 10.2.3.4
executing command "docker ps -q" on i-12345abcde addr 10.2.3.4 port 2222
executing command failed (attempt 1): connection refused, output "" 
waiting probeInterval 10s
executing command "docker ps -q" on i-12345abcde addr 10.2.3.4 port 2222
executing command failed (attempt 2): connection refused, output "" 
waiting probeInterval 10s
executing command "docker ps -q" on i-12345abcde addr 10.2.3.4 port 2222
executing command succeeded (attempt 3), output "" 
executing command "echo $(hostname) $(date)" on i-12345abcde addr 10.2.3.4 port 2222
executing command succeeded (attempt 1), output "i-12345abcde.cloud.example Tue Jun 11 11:28:23 EDT 2019\n" 
instance is booted
... you can connect with "ssh -p2222 debian@10.2.3.4" 
... hit Enter when you are finished, and ready to destroy the instance: {pause until user hits Enter}
destroying instance i-12345abcde
destroyed instance i-12345abcde
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
waiting probeInterval 10s
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
waiting probeInterval 10s
getting instance list
got instance list (N=14)
instance i-12345abcde not found in returned list
done
15026-dispatcher.png (64.7 KB) 15026-dispatcher.png Tom Clegg, 06/27/2019 02:55 PM
15026-cloudtest.png (193 KB) 15026-cloudtest.png Tom Clegg, 06/27/2019 02:55 PM

Subtasks

Task #15389: Review 15026-cloudtestResolvedPeter Amstutz


Related issues

Blocks Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatchingNew

Associated revisions

Revision 10d70a1c
Added by Tom Clegg 3 months ago

Merge branch '15026-cloudtest'

refs #15026

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 380a54a7
Added by Tom Clegg 3 months ago

Merge branch '15026-cloudtest'

refs #15026

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 60df7480
Added by Tom Clegg 3 months ago

Merge branch '15026-cloudtest'

closes #15026

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 6 months ago

  • Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

#2 Updated by Tom Clegg 6 months ago

  • Subject changed from [arvados-dispatch-cloud] Generic driver test suite to [arvados-dispatch-cloud] Cloud driver/config testing tool

#3 Updated by Tom Morris 5 months ago

  • Target version set to To Be Groomed

#4 Updated by Tom Clegg 4 months ago

  • Description updated (diff)

#5 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#6 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#7 Updated by Tom Clegg 3 months ago

  • Related to deleted (Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching)

#8 Updated by Tom Clegg 3 months ago

  • Blocks Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

#9 Updated by Nico C├ęsar 3 months ago

Some formatting difference that makes operator life much easier

$ arvados-server cloudtest
[ACTION] req-12345 getting instance list
[INFO] req-12345 got instance list (N=13) ## <--- here "N=13" is some information, but probably we'll need to have more, unsure about the format (N=..)
[WARN] no instances are tagged with our InstanceSetID (7 instances are not tagged with any InstanceSetID at all)
[ACTION] req-12346 creating instance with tags CloudTestPID:1234, InstanceSetID:cloudtest-ops@4xphq   ## <- map[] is a implementation detail in golang
[INFO] req-12346 created instance with id i-12345abcde
[CHECK] all requested tags are present
[ACTION] req-12347 getting instance list
[INFO] req-12347 got instance list (N=14)
[CHECK] found our instance i-12345abcde in returned list
[CHECK] all requested tags are present
[WARN] instance has no address
[INFO] waiting probeInterval 10s
[ACTION] req-12348 getting instance list
[INFO] req-12348 got instance list (N=14)
[CHECK] instance i-12345abcde  has address 10.2.3.4 ## <- Is always good to be able to grep for i-12345abcde  

[ACTION] req-12349 executing command "true" on i-12345abcde addr 10.2.3.4 port 2222 ## <--- what is command "true" ? is it a bash command..? 
[WARN] req-12349 executing command failed: connection refused ## <-- WARN? ... is this "expected"...? maybe INFO with "tries 1 out of 5" kind of message
[INFO] waiting probeInterval 10s
[ACTION] req-12350 executing command "true" on i-12345abcde addr 10.2.3.4 port 2222
[INFO] req-12350 executing command failed: connection refused (tries: 2/5)  ##  <-- like this.
[INFO] waiting probeInterval 10s
[ACTION] req-12351 executing command "true" on i-12345abcde addr 10.2.3.4 port 2222
[INFO] req-12351 executing command succeeded
[ACTION] req-12352 destroying instance i-12345abcde
[INFO] req-12352 destroyed instance i-12345abcde

(..)

[ACTION] req-12360 getting instance list
[INFO] req-12360 got instance list (N=14)
[ERROR] req-12360instance i-12345abcde not found in returned list

#10 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#11 Updated by Tom Morris 3 months ago

  • Story points set to 2.5
  • Target version changed from To Be Groomed to 2019-07-03 Sprint

#12 Updated by Tom Clegg 3 months ago

  • Assigned To set to Tom Clegg

#13 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#14 Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress

#16 Updated by Tom Clegg 3 months ago

15026-cloudtest @ 0684f84ec50069b24b4564474db67d6b44266699 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1337/

As described, except:
  • Log uses logrus "text" format, with timestamps and info/error loglevels
  • Shell command isn't connected to caller's stdin/stdout/stderr: it has no stdin, and stdout/stderr are buffered and reported as log fields instead. This can be changed with a few changes to our "executor" interface, but I'm not sure how important it is: if we get this far we've already satisfied the primary goal here, and we already provide a much more flexible option for doing stuff on the test instance: use -pause-before-destroy and run remote commands with normal tools (ssh).
  • Renamed -clear to -destroy-existing
  • Added -image-id and -instance-type command line options (default instance type is the cheapest one in the config)

#17 Updated by Peter Amstutz 3 months ago

This needs its own page in the admin section of the docs, which should be linked with install-dispatch-cloud.html.

#18 Updated by Peter Amstutz 3 months ago

In refreshTestInstance()

        t.testInstance = worker.TagVerifier{i, t.secret}

I also see this in Run()

    t.testInstance = worker.TagVerifier{t.testInstance, t.secret}

Is this redundantly wrapping testInstance with TagVerifier twice?

Instead of directly running a shell command on the instance, it would be more useful to have "run a local command" passing along the instance type and host:port on the command line. The local script could call ssh and do what it likes. This would enable testing tasks like "try each instance type, make sure it works, then log in and record the RAM and disk space on the actually booted instance." Right now an admin would have to do that manually, or by capturing and parsing the output of cloud-test.

For the benefit of admins, maybe the logging could treat each logical step in the process of bringing up a cloud node as a test and print "PASSED: node appeared on the instance list" or "FAILED: boot probe command did not succeed"

If the boot probe fails, that should be a deferred error. That seems like an obvious situation where the admin wants to log in to the instance and find out why.

#19 Updated by Tom Clegg 3 months ago

Peter Amstutz wrote:

Is this redundantly wrapping testInstance with TagVerifier twice?

Ah, yes. In the "successful create()" case it was getting wrapped twice. Harmless, but confusing. Fixed.

Instead of directly running a shell command on the instance, it would be more useful to have "run a local command" passing along the instance type and host:port on the command line. The local script could call ssh and do what it likes. This would enable testing tasks like "try each instance type, make sure it works, then log in and record the RAM and disk space on the actually booted instance." Right now an admin would have to do that manually, or by capturing and parsing the output of cloud-test.

I think the stdout/stderr changes I mentioned above (note-16) would achieve this ("cloudtest -command 'cat /proc/meminfo' >meminfo.txt") without taking on the challenge of templating the private key/host/port into a given command line.

We're really here to test cloud drivers/configs in order to unblock arvados-dispatch-cloud production use. The "execute shell command" thing is already a "nice easy thing we might as well add" so I'm reluctant to creep scope further.

For the benefit of admins, maybe the logging could treat each logical step in the process of bringing up a cloud node as a test and print "PASSED: node appeared on the instance list" or "FAILED: boot probe command did not succeed"

I'm not quite following the benefit here. I guess seeing a few "PASS" among the "INFO" could be encouraging? We already have "FAILED", except that it's spelled "ERROR"...? I'm imagining the typical user will ignore everything and look out for ERROR and/or non-zero exit, and dig through the rest of the logs if the error message isn't enough to troubleshoot.

If the actions were independent, as they are in a unit test suite, "look at the log entries between the most recent PASS and the error" would be a time-saver, but I'm not sure that approach really works here.

If the boot probe fails, that should be a deferred error. That seems like an obvious situation where the admin wants to log in to the instance and find out why.

Ah, yes. Fixed. This also means the shell command gets attempted, which seems better (or at least not worse).

Same change for "tags weren't saved/loaded properly" -- might as well still wait for boot and try the rest of the stuff, right?

15026-cloudtest @ e61d0c92e9f52c8db5e65026a551f62918aad1b4 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1346/

#21 Updated by Tom Clegg 3 months ago

TODO:
  • Copy wiki page Cloudtest utility to admin docs
  • Report duration of remote operations when logging success/error

#22 Updated by Peter Amstutz 3 months ago

Tom Clegg wrote:

I'm not quite following the benefit here. I guess seeing a few "PASS" among the "INFO" could be encouraging? We already have "FAILED", except that it's spelled "ERROR"...? I'm imagining the typical user will ignore everything and look out for ERROR and/or non-zero exit, and dig through the rest of the logs if the error message isn't enough to troubleshoot.

If the actions were independent, as they are in a unit test suite, "look at the log entries between the most recent PASS and the error" would be a time-saver, but I'm not sure that approach really works here.

The documentation says:

arvados-server cloudtest performs the following sequence:
  1. Create a new instance
  2. Wait for it to finish booting
  3. Run a shell command on the new instance (optional)
  4. Pause while you log in to the new instance and do other tests yourself (optional)
  5. Shut down the instance

So my thought is that it would be nice if the logs divided it up what it is doing into phases corresponding to the documentation, so that the admin can follow along specifically what each step is trying to accomplish.

INFO[] Step 1: creating instance
INFO[] Step 1 passed: created instance
INFO[] Step 2: Wait for it to finish booting
INFO[] Step 2 passed: found our instance in returned list
...

But if you don't think that worth the effort, I won't hold up the review over it (LGTM).

#23 Updated by Tom Clegg 3 months ago

15026-cloudtest @ 4aa3a732cac495fb384a521b2046f6b7f09dd13d -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1349/
  • doc page
  • log duration of create/destroy/list/run ops

#24 Updated by Tom Clegg 3 months ago

new cloudtest doc page:

...

updated section of dispatcher install page:

#25 Updated by Tom Clegg 3 months ago

15026-cloudtest @ b13f1f4222fc0a211d9f03d4078353ef11900907

...with 2x notextile tags added to fix the escaped <span ...> stuff.

#26 Updated by Peter Amstutz 3 months ago

In the code for logging duration:

        lgr := t.Logger.WithField("Instance", t.testInstance.ID())
        if lgr := lgr.WithField("Duration", time.Since(t0)); err != nil {
            lgr.WithError(err).Error("error destroying instance")

What's happening here with embedding "lgr" in the "if" statement? Is that aliasing "lgr" so that it has the "Duration" field inside the "if" block but not in the code following it? Maybe rename the inner variable to avoid confusion?

Otherwise LGTM.

#27 Updated by Tom Clegg 3 months ago

Peter Amstutz wrote:

What's happening here with embedding "lgr" in the "if" statement? Is that aliasing "lgr" so that it has the "Duration" field inside the "if" block but not in the code following it? Maybe rename the inner variable to avoid confusion?

Yes, exactly. Will rename.

#28 Updated by Tom Clegg 3 months ago

The Azure driver uses the InstanceSetID for things that have restrictions.

arvados-server cloudtest

ERRO[2019-07-01T14:43:15.345808788Z] error creating test instance Duration=241.514954ms error="network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=400 Code=\"InvalidResourceName\" Message=\"Resource name compute-cloudtest-root@c97qk.arvadosapi.com-fuk758oeqqcn1ij-nic is invalid. The name can be up to 80 characters long. It must begin with a word character, and it must end with a word character or with '_'. The name may contain word characters or '.', '-', '_'.\" Details=[]"

arvados-server cloudtest -instance-set-id cloudtest-root_c97qk.arvadosapi.com

ERRO[2019-07-01T14:44:28.959516983Z] error creating test instance Duration=822.158368ms error="autorest/azure: Service returned an error. Status=400 Code=\"InvalidParameter\" Message=\"Linux host name cannot exceed 64 characters in length or contain the following characters: ` ~ ! @ # $ % ^ & * ( ) = + _ [ ] { } \\\\ | ; : ' \\\" , < > / ?.\" Target=\"computerName\""

arvados-server cloudtest -instance-set-id cloudtest-root-c97qk.arvadosapi.com

→success

The cloud.Driver code promises the InstanceSetID will be "of the form "zzzzz-zzzzz-zzzzzzzzzzzzzzz" where each z can be any alphanum"... so maybe cloudtest's default should be something like "zzzzz-zzzzz-zzzzzzcloudtest"?

(The root@hostname thing was supposed to make the default behavior "notice if previous runs left instances lying around, but don't mess with another host's concurrent tests" -- but it's so easy for the operator to achieve that with a command line/script that less built-in cleverness is probably better.)

#29 Updated by Peter Amstutz 3 months ago

Tom Clegg wrote:

The cloud.Driver code promises the InstanceSetID will be "of the form "zzzzz-zzzzz-zzzzzzzzzzzzzzz" where each z can be any alphanum"... so maybe cloudtest's default should be something like "zzzzz-zzzzz-zzzzzzcloudtest"?

Yes. Especially if that is part of the interface contract. Maybe the admin should be required to provide one in that form as well?

#30 Updated by Tom Clegg 3 months ago

  • Target version changed from 2019-07-03 Sprint to 2019-07-17 Sprint

#31 Updated by Tom Clegg 3 months ago

  • Story points changed from 2.5 to 0.5

#32 Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF