Bug #14844

[dispatch-cloud] Azure driver bugs discovered in trial run

Added by Tom Clegg 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 02/28/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Release relationship: Auto

Description

  • If creating a VM fails, Create() should attempt to delete the VM's dependent resources (NIC/blob) before returning the error to its caller. As it stands, an unbounded number of unused NICs and blobs pile up whenever VMs can't be created and the dispatcher keeps retrying.
  • Nil pointer panic in (*AzureInstance).Address(), perhaps for a newly created instance that has no IP address assigned yet (see note #2).

Subtasks

Task #14892: Review 14844-cdc-azure-fixes (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (New)

Related to Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy (Resolved, 01/29/2019)

Associated revisions

Revision a310d114
Added by Peter Amstutz 6 months ago

Merge branch '14844-cdc-azure-fixes' closes #14844

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Tom Clegg 6 months ago

  • Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

#2 Updated by Tom Clegg 6 months ago

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x83aab5]
goroutine 102 [running]:
git.curoverse.com/arvados.git/lib/cloud.(*AzureInstance).Address(0xc420478500, 0x7f16da9a9628, 0xc420478500)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/cloud/azure.go:633 +0x15
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).setupSSHClient(0xc420368ea0, 0xc42061a6e7, 0xc420368e01, 0xc4204b88a0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:178 +0x61
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).sshClient(0xc420368ea0, 0x1, 0x0, 0x0, 0x0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:153 +0x10f
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).newSession.func1(0x8f7c01, 0x0, 0x9ebaa0, 0xc4204b88b0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:128 +0x37
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).newSession(0xc420368ea0, 0x0, 0x8e5c40, 0xc420253710)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:136 +0xa0
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).Execute(0xc420368ea0, 0x0, 0xc4201df740, 0x19, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:92 +0x73
git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*worker).probeBooted(0xc4203b0b00, 0x989064, 0xa, 0x97c340, 0xc4204ed6e0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/worker.go:349 +0x91
git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*worker).probeAndUpdate(0xc4203b0b00)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/worker.go:192 +0x1394
git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*worker).ProbeAndUpdate(0xc4203b0b00)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/worker.go:141 +0x57
created by git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*Pool).runProbes
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/pool.go:636 +0x378

Evidently either IPConfigurations or PrivateIPAddress can be nil here:

func (ai *AzureInstance) Address() string {
        return *(*ai.nic.IPConfigurations)[0].PrivateIPAddress
}
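
A nil-safe variant could look like the sketch below. The types are hypothetical stand-ins for the Azure SDK structures (only the two fields named above are modeled), so this is an illustration of the guard logic and not the code that was merged. The real SDK types may involve further embedded pointers that would need the same kind of check.

```go
package main

// ipConfiguration and networkInterface are hypothetical, flattened stand-ins
// for the Azure SDK network types. Only the fields discussed here are modeled.
type ipConfiguration struct {
	PrivateIPAddress *string
}

type networkInterface struct {
	IPConfigurations *[]ipConfiguration
}

// azureInstance is a simplified stand-in for the driver's AzureInstance.
type azureInstance struct {
	nic networkInterface
}

// Address returns the first private IP address of the instance's NIC.
// It returns "" instead of panicking when the instance, the configuration
// list, or the address itself is missing.
func (ai *azureInstance) Address() string {
	if ai == nil || ai.nic.IPConfigurations == nil {
		return ""
	}
	configs := *ai.nic.IPConfigurations
	if len(configs) == 0 || configs[0].PrivateIPAddress == nil {
		return ""
	}
	return *configs[0].PrivateIPAddress
}
```

With this shape, a caller such as the SSH executor sees an empty address for an instance that has no IP yet and can treat that as "not ready" on its own terms.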

#3 Updated by Tom Morris 6 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 1.0

#4 Updated by Tom Clegg 6 months ago

  • Related to Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added

#5 Updated by Tom Morris 6 months ago

  • Target version changed from Arvados Future Sprints to 2019-03-13 Sprint

#6 Updated by Peter Amstutz 6 months ago

  • Assigned To set to Peter Amstutz

#7 Updated by Peter Amstutz 6 months ago

14844-cdc-azure-fixes @ 8c4fb97b1d34b5f8fc50d239698a08c35a63dac3

  • If PrivateIPAddress somehow isn't defined, return empty string (don't panic)
  • If VM create fails, attempt to immediately clean up the VHD and NIC corresponding to that VM (if that doesn't work, the regular cleanup processes should still get around to it).
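
The cleanup-on-failure behavior described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the cloudAPI interface, the createInstance function, the "-nic" and "-os.vhd" naming scheme, and the in-memory fakeCloud are hypothetical and are not the Azure SDK or the actual driver code in the branch.

```go
package main

import (
	"errors"
	"fmt"
)

// cloudAPI is a hypothetical, simplified view of the provider calls involved
// in creating an instance.
type cloudAPI interface {
	CreateNIC(name string) error
	CreateVM(name, nicName string) error
	DeleteNIC(name string) error
	DeleteBlob(name string) error
}

// createInstance creates a NIC and then a VM. If VM creation fails, it makes
// a best-effort attempt to delete the NIC and the VM's disk blob before
// returning the original error, so repeated retries do not pile up orphans.
func createInstance(api cloudAPI, name string) error {
	nicName := name + "-nic"     // assumed naming scheme
	blobName := name + "-os.vhd" // assumed naming scheme
	if err := api.CreateNIC(nicName); err != nil {
		return fmt.Errorf("creating nic: %w", err)
	}
	if err := api.CreateVM(name, nicName); err != nil {
		// Cleanup errors are ignored on purpose: a separate periodic
		// cleanup is expected to catch anything left behind.
		_ = api.DeleteNIC(nicName)
		_ = api.DeleteBlob(blobName)
		return fmt.Errorf("creating vm: %w", err)
	}
	return nil
}

// fakeCloud is an in-memory stand-in used to exercise createInstance.
type fakeCloud struct {
	failVM bool
	nics   map[string]bool
	blobs  map[string]bool
}

func newFakeCloud(failVM bool) *fakeCloud {
	return &fakeCloud{failVM: failVM, nics: map[string]bool{}, blobs: map[string]bool{}}
}

func (f *fakeCloud) CreateNIC(name string) error {
	f.nics[name] = true
	return nil
}

func (f *fakeCloud) CreateVM(name, nicName string) error {
	// Assumption for the sketch: a disk blob exists even if the VM create fails.
	f.blobs[name+"-os.vhd"] = true
	if f.failVM {
		return errors.New("simulated VM create failure")
	}
	return nil
}

func (f *fakeCloud) DeleteNIC(name string) error {
	delete(f.nics, name)
	return nil
}

func (f *fakeCloud) DeleteBlob(name string) error {
	delete(f.blobs, name)
	return nil
}
```

Calling createInstance(newFakeCloud(true), "vm1") returns the wrapped create error and leaves no NICs or blobs in the fake, which is the property the fix is after. Returning the original error, and not a cleanup error, keeps the caller's retry logic focused on why the VM could not be created.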

#8 Updated by Lucas Di Pentima 6 months ago

This LGTM, thanks.

#9 Updated by Peter Amstutz 6 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

#10 Updated by Tom Morris 6 months ago

  • Release set to 15
