Story #7490

[Data Manager] Return errors instead of calling log.Fatal in code that needs to be tested

Added by Brett Smith about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Keep
Target version:
Start date:
09/10/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Currently, parts of Data Manager handle some runtime errors by calling log.Fatal(). This practice makes it unnecessarily hard (impossible?) to test the relevant error functions: log.Fatal() causes the whole test suite to exit, rather than failing the current test.

Any offending bits of Data Manager should be fixed at least to the point where tests can demonstrate Data Manager's behavior when faced a network error during startup or while retrieving collections from the API server.


Subtasks

Task #7783: Review branch 7490-datamanager-dont-die-return-errorResolvedRadhika Chippada

Associated revisions

Revision c4023fb9
Added by Radhika Chippada almost 4 years ago

closes #7490
Merge branch '7490-datamanager-dont-die-return-error'

Revision 30fc3285
Added by Radhika Chippada almost 4 years ago

refs #7490
Merge branch '7490-datamanager-test'

Revision 30fc3285
Added by Radhika Chippada almost 4 years ago

refs #7490
Merge branch '7490-datamanager-test'

History

#1 Updated by Radhika Chippada about 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Radhika Chippada

#2 Updated by Radhika Chippada about 4 years ago

  • Story points changed from 0.5 to 1.0

#3 Updated by Brett Smith about 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-12-02 sprint

#4 Updated by Radhika Chippada about 4 years ago

  • Converted most of fatalfs into errors.
  • Added several tests with errors in keep/keep_test.go; this test was actually not passing and included earlier and this was also corrected
  • Updated datamanager to not continue if not all collections are read. Do let me know if this should not be done or should be done in a better way
        if len(readCollections.UUIDToCollection) == 0 {
            return nil // no collections read so no more work to do?
        }
    
  • Added several additional failing tests in datamanager_test.go
    • I renamed file.go -> writeDataTo as WriteDataTo and collection.go -> heapProfileFilename as HeapProfileFilename so that I could inject errors during datamanager singlerun. I tried to convert these into FlagSet, however due to multiple packages and all, it seemed much too involved and might not be worth the trouble. Of course, this introduces the vulnerability of exposing these two variables outside the package

#5 Updated by Tom Clegg almost 4 years ago

In keep.go → GetKeepServersAndSummarize I think we should return right away if GetKeepServers returns an error. The number of keep disks and replication level distribution won't be correct anyway, and "we'll return this error later" generally seems a bit fragile (it's normal in Go to reuse an existing err variable without worrying about whether it already has a deferred error in it, so we might do that by accident).

Returning empty objects to indicate errors seems a bit fragile too. How about adding a distinct Err field to keep.ServerResponse and collection.ReadCollections? (That way we can report the actual error, rather than "no host info found")

This should let us check if readCollections.Err != nil instead of if len(readCollections.UUIDToCollection) == 0 ...and return the actual error from singlerun instead of treating it as "success because either there's nothing to do or we hit an error and it has already been reported". (Even if there are no collections, it seems correct to continue and do the MaybeWriteData, report block stats, and so on. If we're going to flag "no collections" as a special case that's "probably an accident", perhaps we could handle it by setting the (forthcoming) "dry run" flag and printing a message explaining what's happening.)

In datamanager.go, func makeArvadosClient() seems to be redundant now; we can delete it and just call arvadosclient.MakeArvadosClient() directly.

In datamanager.go, we should probably call loggerutil.FatalWithMessage instead of log.Fatalf when singlerun fails, so we still propagate the error to the API server logs as well.

In summary/file.go, MaybeWriteData's first return value is ignored by the only code that uses it; it might as well only return err.

In keep_test.go, use net.SplitHostPort(ksURL.Host) instead of strings.Split(ksURL.Host, ":") -- that way it will work with IPv6 addresses, too.

In keep_test.go, testGetKeepServersFromAPI could match the error against a testData.errorMatches string field passed by the test function rather than having its own logic about which tests expect which errors. (This should also resolve the question of what TestGetKeepServers_ServerError expects to happen -- it looks like it currently doesn't assert anything.)

#6 Updated by Tom Clegg almost 4 years ago

Pushed a small fix to error messages (3ed87fe) -- using "%#q" avoids newlines in error messages, which makes it easier to match everything with ErrorMatches.

Everything else LGTM, thanks.

#7 Updated by Radhika Chippada almost 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:c4023fb9b27ba31926faf530a4e59b9aee0989b3.

Also available in: Atom PDF