Bug #14844

closed

[dispatch-cloud] Azure driver bugs discovered in trial run

Added by Tom Clegg about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 1.0
Release relationship: Auto

Description

  • If creating a VM fails, an attempt should be made to delete the VM's dependent resources (nic/blob) before returning the error to Create()'s caller. As it stands, an unbounded number of new unused nics and blobs pile up during times when VMs can't be created and the dispatcher keeps retrying.
  • nil pointer panic in (*AzureInstance)Address() -- perhaps a newly created instance that has no IP address assigned yet (see note)

Subtasks 1 (0 open, 1 closed)

Task #14892: Review 14844-cdc-azure-fixes (Resolved, Peter Amstutz, 02/28/2019)

Related issues

Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Related to Arvados - Idea #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy (Resolved, Tom Clegg, 01/29/2019)
Actions #1

Updated by Tom Clegg about 5 years ago

  • Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #2

Updated by Tom Clegg about 5 years ago

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x83aab5]
goroutine 102 [running]:
git.curoverse.com/arvados.git/lib/cloud.(*AzureInstance).Address(0xc420478500, 0x7f16da9a9628, 0xc420478500)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/cloud/azure.go:633 +0x15
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).setupSSHClient(0xc420368ea0, 0xc42061a6e7, 0xc420368e01, 0xc4204b88a0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:178 +0x61
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).sshClient(0xc420368ea0, 0x1, 0x0, 0x0, 0x0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:153 +0x10f
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).newSession.func1(0x8f7c01, 0x0, 0x9ebaa0, 0xc4204b88b0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:128 +0x37
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).newSession(0xc420368ea0, 0x0, 0x8e5c40, 0xc420253710)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:136 +0xa0
git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor.(*Executor).Execute(0xc420368ea0, 0x0, 0xc4201df740, 0x19, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/ssh_executor/executor.go:92 +0x73
git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*worker).probeBooted(0xc4203b0b00, 0x989064, 0xa, 0x97c340, 0xc4204ed6e0)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/worker.go:349 +0x91
git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*worker).probeAndUpdate(0xc4203b0b00)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/worker.go:192 +0x1394
git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*worker).ProbeAndUpdate(0xc4203b0b00)
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/worker.go:141 +0x57
created by git.curoverse.com/arvados.git/lib/dispatchcloud/worker.(*Pool).runProbes
        /GOPATH/src/git.curoverse.com/arvados.git/lib/dispatchcloud/worker/pool.go:636 +0x378

Evidently either IPConfigurations or PrivateIPAddress can be nil here:

func (ai *AzureInstance) Address() string {
        return *(*ai.nic.IPConfigurations)[0].PrivateIPAddress
}
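
A nil-safe variant of the accessor above might look like the following. This is only a sketch: the `ipConfiguration` and `networkInterface` types are simplified, hypothetical stand-ins for the Azure SDK types, which nest these fields more deeply, and returning an empty string for "no address yet" is an assumption about what callers can tolerate.

```go
package main

// ipConfiguration and networkInterface are simplified, hypothetical
// stand-ins for the Azure SDK network types; only the fields read by
// Address() are modeled.
type ipConfiguration struct {
	PrivateIPAddress *string
}

type networkInterface struct {
	IPConfigurations *[]ipConfiguration
}

// AzureInstance mirrors only the nic field used by Address().
type AzureInstance struct {
	nic networkInterface
}

// Address returns the first private IP address of the instance's NIC.
// It returns "" instead of panicking when the NIC has no IP
// configurations, or when no address has been assigned yet.
func (ai *AzureInstance) Address() string {
	if ai == nil || ai.nic.IPConfigurations == nil {
		return ""
	}
	configs := *ai.nic.IPConfigurations
	if len(configs) == 0 || configs[0].PrivateIPAddress == nil {
		return ""
	}
	return *configs[0].PrivateIPAddress
}
```

With this shape, a caller such as the SSH executor would presumably need to treat an empty address as "instance not reachable yet" and retry on a later probe.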
Actions #3

Updated by Tom Morris about 5 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 1.0
Actions #4

Updated by Tom Clegg about 5 years ago

  • Related to Idea #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added
Actions #5

Updated by Tom Morris about 5 years ago

  • Target version changed from Arvados Future Sprints to 2019-03-13 Sprint
Actions #6

Updated by Peter Amstutz about 5 years ago

  • Assigned To set to Peter Amstutz
Actions #7

Updated by Peter Amstutz about 5 years ago

14844-cdc-azure-fixes @ 8c4fb97b1d34b5f8fc50d239698a08c35a63dac3

  • If PrivateIPAddress somehow isn't defined, return empty string (don't panic)
  • If VM creation fails, immediately attempt to delete the VHD and NIC created for that VM (if that attempt fails, the regular cleanup processes should still get around to it).
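
The cleanup-on-failure behavior described above could be sketched roughly as follows. This is not the code from the branch: the `cloudAPI` interface, the `createInstance` function, the resource naming scheme, and the `fakeCloud` type are all hypothetical, introduced only to illustrate the best-effort pattern.

```go
package main

import (
	"errors"
	"fmt"
)

// cloudAPI is a hypothetical, minimal view of the cloud operations
// involved in creating an instance. It is not the real driver interface.
type cloudAPI interface {
	CreateNIC(name string) error
	CreateVM(name, nicName, blobName string) error
	DeleteNIC(name string) error
	DeleteBlob(name string) error
}

// createInstance creates a NIC and then a VM. If VM creation fails, it
// makes one best-effort attempt to delete the NIC and the VHD blob before
// returning the original error to the caller.
func createInstance(api cloudAPI, name string) error {
	nicName := name + "-nic"     // assumed naming scheme
	blobName := name + "-os.vhd" // assumed naming scheme
	if err := api.CreateNIC(nicName); err != nil {
		return fmt.Errorf("create nic %s: %w", nicName, err)
	}
	if err := api.CreateVM(name, nicName, blobName); err != nil {
		// Cleanup errors are deliberately ignored here: a separate
		// periodic cleanup process is expected to catch leftovers.
		_ = api.DeleteNIC(nicName)
		_ = api.DeleteBlob(blobName)
		return fmt.Errorf("create vm %s: %w", name, err)
	}
	return nil
}

// fakeCloud is an in-memory fake used to exercise the sketch.
type fakeCloud struct {
	failVM  bool
	deleted []string
}

func (f *fakeCloud) CreateNIC(string) error { return nil }

func (f *fakeCloud) CreateVM(string, string, string) error {
	if f.failVM {
		return errors.New("quota exceeded")
	}
	return nil
}

func (f *fakeCloud) DeleteNIC(name string) error {
	f.deleted = append(f.deleted, name)
	return nil
}

func (f *fakeCloud) DeleteBlob(name string) error {
	f.deleted = append(f.deleted, name)
	return nil
}
```

Calling `createInstance(&fakeCloud{failVM: true}, "compute1")` returns an error and records both `compute1-nic` and `compute1-os.vhd` as deleted, so repeated failed retries do not accumulate unused resources in the fake.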
Actions #8

Updated by Lucas Di Pentima about 5 years ago

This LGTM, thanks.

Actions #9

Updated by Peter Amstutz about 5 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100
Actions #10

Updated by Tom Morris about 5 years ago

  • Release set to 15