Project

General

Profile

Actions

Bug #6221

closed

[Data Manager] implement Delete

Added by Ward Vandewege almost 9 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Story points:
0.5

Description

By design, Data Manager deletes unneeded blocks by sending "trash lists" to the keepstore servers.

Currently, Data Manager computes lists of trashable blocks, but doesn't send them to the keepstore servers.

Merge #1 should include:
  • trash_worker.go must respect the never_delete flag, and this must be tested. When this is set, it should log a message (including the locator of the block being deleted) instead of deleting the file.
  • Temporarily -- until datamanager has an integration test -- modify keepstore so (except for the test suite) it is impossible to make never_delete==false.
Merge #2 should include:
  • data manager sends its trash lists to keep servers
    • the current code identifies only a subset of trash: blocks that are not referenced by any collection. It doesn't identify overreplicated blocks as trash. This is fine for now -- we're just concentrating on getting the list it does generate to propagate to the keepstore servers.

Merge #3 will be #6260


Subtasks 4 (0 open4 closed)

Task #6629: Review 6221-write-trash-listResolvedTom Clegg07/14/2015Actions
Task #6615: Support sending trash lists from data managerResolvedPeter Amstutz07/14/2015Actions
Task #6620: Review 6221-datamanager-delete @ 66380d0ResolvedPeter Amstutz07/14/2015Actions
Task #6673: merge 6221-write-trash-list only once keepstore is deployed with -no-delete as defaultResolvedPeter Amstutz07/14/2015Actions
Actions #1

Updated by Ward Vandewege over 8 years ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith over 8 years ago

  • Target version set to 2015-07-22 sprint
Actions #3

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
  • Category set to Keep
  • Target version deleted (2015-07-22 sprint)
Actions #4

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
Actions #6

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
Actions #7

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
  • Assigned To set to Peter Amstutz
  • Target version set to 2015-07-22 sprint
Actions #8

Updated by Tom Clegg over 8 years ago

  • Story points set to 1.5
Actions #9

Updated by Tom Clegg over 8 years ago

6221-datamanager-delete @ 66380d0

This changes the default to never_delete==true but doesn't get as far as "impossible to make never_delete==false".

Perhaps the way to do this without making it too confusing is to add something in main (around if maxBuffers < 0?) along these lines?

if !never_delete {
        log.Fatal("never_delete must not be turned off, see #6221")
}

Minor nitpick: the error checking in trash_worker, according to Go style, should be "handle errors first":

if err != nil {
        // handle error condition
} else {
        // do normal thing
}

Other than that, LGTM.

I think it would be nicer to put the never_delete check in volume_unix.go so we're less likely to make the same mistake again. But you've put it at the same level as the previous check, and maybe we should leave the cleanup for another day for the sake of getting the rest of this ticket done. Besides, I suppose when we have multiple volume types we'll want to do something else like wrap every volume with a command-line-flag-obeying wrapper.

Actions #10

Updated by Tom Clegg over 8 years ago

LGTM @ 421c879

Actions #11

Updated by Tom Clegg over 8 years ago

HostPort() now returns scheme://host:port and should be renamed accordingly.

trash_list.go is shamelessly copied and pasted from pull_list.go. More shame, please?

// Writes each pull list to a file.
// The filename is based on the hostname.
//
// This is just a hack for prototyping, it is not expected to be used
// in production.

Please refactor so we don't have two copies of a "write a map to a file" function, at least. Or just delete the new one, since (AFAICT) it is unused.

keepstore's trash_worker seems to have lost a feature (do not delete a blocks if modtime is different than what Data Manager expected). Now it warns, but deletes anyway. Why? Aside from being a sanity check, without this Data Manager can't fix overreplication when a keepstore server has >1 copy of a block (unless it's willing to delete all copies on that server).

Error handling is broken here. Suggest:

-                       if resp.StatusCode != http.StatusOK {
-                               log.Printf("Error sending trash list to %v error: %v", url, err.Error())
-                       }
+                       log.Printf("Sent trash list to %v: response was HTTP %d", url, resp.StatusCode)

(It seems worthwhile to have logs here even when HTTP 200.)

Can we please have unit tests? (Unit tests would have caught the above use of err.Error() when err is nil, and possibly other things.)

Actions #12

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

HostPort() now returns scheme://host:port and should be renamed accordingly.

trash_list.go is shamelessly copied and pasted from pull_list.go. More shame, please?

[...]

Please refactor so we don't have two copies of a "write a map to a file" function, at least. Or just delete the new one, since (AFAICT) it is unused.

You're right, it's gone now.

keepstore's trash_worker seems to have lost a feature (do not delete a blocks if modtime is different than what Data Manager expected). Now it warns, but deletes anyway. Why? Aside from being a sanity check, without this Data Manager can't fix overreplication when a keepstore server has >1 copy of a block (unless it's willing to delete all copies on that server).

That was an error introduced in the refactoring to make the error reporting better. Fixed.

Error handling is broken here. Suggest:
[...]

Fixed.

(It seems worthwhile to have logs here even when HTTP 200.)

Can we please have unit tests? (Unit tests would have caught the above use of err.Error() when err is nil, and possibly other things.)

Added unit test for BuildTrashList(). Testing SendTrashList() is more complicated and seems better suited to integration testing.

Actions #13

Updated by Peter Amstutz over 8 years ago

  • Status changed from New to In Progress
Actions #14

Updated by Tom Clegg over 8 years ago

Indeed, error reporting for "mtime mismatch" is better now, thanks. (Printing actual/expected mtime would be even better, but I guess that's not what we're here for today)

I think this error ("garbage collection is impossible right now") is serious enough to be propagated back to main explicitly rather than "there is no garbage". The appropriate behavior in main, presumably, is to process pull lists anyway but then exit non-zero so there's an easy way for the caller to detect/report that something is wrong.

               log.Printf("Failed to get blobSignatureTtl: %v", err)
               return map[string]keep.TrashList{}

According to golint, we should do this:

-    for block, _ := range keepBlocksNotInCollections {
+    for block := range keepBlocksNotInCollections {

Could we un-pyramid BuildTrashListsInternal into "handle errors first" style?

            if block_on_server.Mtime >= expiry {
                // block is newer than expiry cutoff
                continue
            }
            srv := keepServerInfo.KeepServerIndexToAddress[block_on_server.ServerIndex].String()
            if _, writable := writableServers[srv]; !writable {
                continue
            }
            m[srv] = append(m[srv], keep.TrashRequest{Locator: block.Digest.String(), BlockMtime: block_on_server.Mtime})

Testing SendTrashList() is more complicated and seems better suited to integration testing

I don't think it should be too complicated -- you can make a http://golang.org/pkg/net/http/httptest/ Server with a closure so the test function can see what got POSTed. Integration tests (even when they exist) aren't especially good at testing error conditions and edge cases. This stuff needs unit testing.

It looks unsafe to share kc.Client across goroutines this way. Is there even a reason for passing a keepclient.KeepClient into SendTrashLists and using its kc.Client, instead of just making a new &http.Client{} in each goroutine?

BuildTrashListsInternal should be buildTrashListsInternal.

Actions #15

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

Indeed, error reporting for "mtime mismatch" is better now, thanks. (Printing actual/expected mtime would be even better, but I guess that's not what we're here for today)

FIxed.

I think this error ("garbage collection is impossible right now") is serious enough to be propagated back to main explicitly rather than "there is no garbage". The appropriate behavior in main, presumably, is to process pull lists anyway but then exit non-zero so there's an easy way for the caller to detect/report that something is wrong.
[...]

Done.

According to golint, we should do this:

[...]

Done.

Could we un-pyramid BuildTrashListsInternal into "handle errors first" style?

[...]

I prefer to structure code in a more functional style using just control blocks and avoiding local jumps, but the the prevailing Go style is unapologetically imperative. Fixed.

I don't think it should be too complicated -- you can make a http://golang.org/pkg/net/http/httptest/ Server with a closure so the test function can see what got POSTed. Integration tests (even when they exist) aren't especially good at testing error conditions and edge cases. This stuff needs unit testing.

Done.

It looks unsafe to share kc.Client across goroutines this way. Is there even a reason for passing a keepclient.KeepClient into SendTrashLists and using its kc.Client, instead of just making a new &http.Client{} in each goroutine?

"The Client's Transport typically has internal state (cached TCP connections), so Clients should be reused instead of created as needed. Clients are safe for concurrent use by multiple goroutines."

BuildTrashListsInternal should be buildTrashListsInternal.

Fixed.

Actions #16

Updated by Tom Clegg over 8 years ago

Peter Amstutz wrote:

Tom Clegg wrote:

Indeed, error reporting for "mtime mismatch" is better now, thanks. (Printing actual/expected mtime would be even better, but I guess that's not what we're here for today)

FIxed.

Thanks. Sentence reads funny but I'm ready to move on from this. :)

I think this error ("garbage collection is impossible right now") is serious enough to be propagated back to main explicitly rather than "there is no garbage". The appropriate behavior in main, presumably, is to process pull lists anyway but then exit non-zero so there's an easy way for the caller to detect/report that something is wrong.
[...]

Done.

Thanks. AFAICT the only things that can go wrong are "can't get discovery doc" and "discovery doc doesn't say what TTL is", both of which seem serious enough to mean "configuration error", and both of which would only happen on the first GC run.

I prefer to structure code in a more functional style using just control blocks and avoiding local jumps, but the the prevailing Go style is unapologetically imperative. Fixed.

Thanks. It's easier to read and maintain a code base that has a style, and Go is helping us skip the debate about which style is better.

I don't think it should be too complicated -- you can make a http://golang.org/pkg/net/http/httptest/ Server with a closure so the test function can see what got POSTed. Integration tests (even when they exist) aren't especially good at testing error conditions and edge cases. This stuff needs unit testing.

Done.

Thanks.

The "error" tests don't test much -- just lack of panic() (which is better than nothing, to be sure). Seems like they should also assert that a non-nil error was returned.

"I'm a teapot" should be 418...?

You could de-duplicate TestSendTrashListError and TestSendTrashListError2 by doing something like this

for _, server := range []*http.Server{
        httptest.NewServer(&TestHandlerError{}),
        httptest.NewUnstartedServer(&TestHandler{}),
} {
        defer server.Close()

        // (do the test)
}

(If you really want separate test functions, please name them like "...HTTPErrorResponse" and "...Unreachable" instead of "Error" and "Error2", and you can still use a helper function instead of copying & pasting code.)

It looks unsafe to share kc.Client across goroutines this way. Is there even a reason for passing a keepclient.KeepClient into SendTrashLists and using its kc.Client, instead of just making a new &http.Client{} in each goroutine?

"The Client's Transport typically has internal state (cached TCP connections), so Clients should be reused instead of created as needed. Clients are safe for concurrent use by multiple goroutines."

Ah, good.

trash_list_test.go seems to miss an easy opportunity to test the "don't delete blocks with a too-recent Mtime" code -- just add one new block to keepServerInfo and leave the assertions alone. Can you do that?

Rest of the fixes look good, thanks.

Actions #17

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

You could de-duplicate TestSendTrashListError and TestSendTrashListError2 by doing something like this

[...]

(If you really want separate test functions, please name them like "...HTTPErrorResponse" and "...Unreachable" instead of "Error" and "Error2", and you can still use a helper function instead of copying & pasting code.)

Done.

trash_list_test.go seems to miss an easy opportunity to test the "don't delete blocks with a too-recent Mtime" code -- just add one new block

to keepServerInfo and leave the assertions alone. Can you do that?

I'm pretty sure the third clause "// Test trash list where only block on sv0 is expired" already tests that?

Rest of the fixes look good, thanks.

Ready to merge?

Actions #18

Updated by Tom Clegg over 8 years ago

Peter Amstutz wrote:

Tom Clegg wrote:

You could de-duplicate TestSendTrashListError and TestSendTrashListError2 by doing something like this

[...]

(If you really want separate test functions, please name them like "...HTTPErrorResponse" and "...Unreachable" instead of "Error" and "Error2", and you can still use a helper function instead of copying & pasting code.)

Done.

Yay de-duplication and helpful headings from "go test", thanks.

The new error handling in SendTrashLists looks a bit odd to me -- perhaps it would be better to
  • Call the returned array of errors "errs" instead of "err"
  • Omit nils, so you can find out whether there were any errors with len(errs)==0 instead of having to do your own loop

trash_list_test.go seems to miss an easy opportunity to test the "don't delete blocks with a too-recent Mtime" code -- just add one new block to keepServerInfo and leave the assertions alone. Can you do that?

I'm pretty sure the third clause "// Test trash list where only block on sv0 is expired" already tests that?

Ah, you're right, I missed that.

Thanks.

Actions #19

Updated by Peter Amstutz over 8 years ago

  • Target version changed from 2015-07-22 sprint to 2015-08-05 sprint
Actions #20

Updated by Peter Amstutz over 8 years ago

  • Story points changed from 1.5 to 0.5
Actions #21

Updated by Brett Smith over 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 75 to 100

Applied in changeset arvados|commit:4d6e05c25c6a5d72afee37f8165b006267b4183d.

Actions

Also available in: Atom PDF