Feature #3448
closed[Keep] storage server records (and reports in index) the timestamp of the most recent PUT operation for each block.
Description
- Data manager needs to know this timestamp (via storage server's
GET /
index method) in order to get a more accurate picture of which blocks it will be able to delete. - Storage server itself needs to check this timestamp when performing
DELETE
requests, to catch race conditions (if a client PUT a new copy of a block 10 seconds ago, after data manager decided 20 seconds ago that the block was a good candidate for deletion, the storage server should refuse to delete it).
Implementation
- Use mtime (
touch(1)
, godoc syscall →syscall.Utimes()
?) to store timestamps. This makes it easy for an admin to inspect state using the usual filesystem tools.Volume
andUnixVolume
should grow an UpdateTimestamp method.- GetBlock() in handlers.go should accept an additional argument indicating whether it should also perform an "update timestamp" operation before considering its "Get" operation successful.
- Readonly cases: timestamps can't get updated, so clients won't get a grace period. The safest option is to err on the side of excessive replication: when PUTting a block that already exists on the volume, if the timestamp cannot be updated, consider that a failure and try the next volume.
- In
DELETE
, check the mtime of each file before deleting it. If it's newer than permission signature TTL, don't delete. There's a race condition here too, so use mutexes around {check timestamp; delete file} and around {update timestamp during PUT}. If there's a race, "update timestamp" will fail ENOENT and the PUT operation should either retry the current volume once (which would involve writing) or just give up and move on to the next volume.- Wrap each of the {check timestamp; delete file} and {update timestamp} operations with {open file with mode "a+" and Flock(LOCK_EX) it}
- (Considered using an in-process lock like the -serialize-io mechanism, but that doesn't guard against accidental misconfigurations where multiple keepstore processes point at the same mount point.)
- UnixVolume's Index method already reports modtime. Yay!
Updated by Tom Clegg over 10 years ago
- Description updated (diff)
- Category set to Keep
Updated by Tom Clegg over 10 years ago
At 56d36387
In handlers.go, we call GetBlock(), which walks through some/all of the volumes looking for the block, then call TouchBlock(), which walks through the same volumes looking for the same block, in order to update its timestamp. Aside from the wasted effort, this makes me less confident that the timestamp that gets updated is the one on the very same file whose content we took the trouble to verify. So I still think this is the way to go:- "GetBlock() in handlers.go should accept an additional argument indicating whether it should also perform an "update timestamp" operation before considering its "Get" operation successful."
I think PermissionError is the wrong response for "race condition detected and handled correctly". 202 ("Accepted") isn't perfect, but it's closer. Even 200 would be reasonable, given the outcome is indistinguishable from "delete succeeded, then client wrote anew". (This doesn't need too much bikeshedding given that we're soon moving to an asynchronous "ok-to-delete list" approach.)
The comment in GetBlock() in handlers.go still leaves open the mystery of why we want anything to be writable. Propose: "When reporting success for a PUT operation, we are assuring the caller that the block will not be garbage-collected sooner than permission_ttl. In order to promise this, we must ensure that the timestamp is current, and that all race conditions are accounted for."
Updated by Tim Pierce over 10 years ago
Ready for another look: dc2bf5c
Tom Clegg wrote:
At 56d36387
In handlers.go, we call GetBlock(), which walks through some/all of the volumes looking for the block, then call TouchBlock(), which walks through the same volumes looking for the same block, in order to update its timestamp. Aside from the wasted effort, this makes me less confident that the timestamp that gets updated is the one on the very same file whose content we took the trouble to verify. So I still think this is the way to go:
- "GetBlock() in handlers.go should accept an additional argument indicating whether it should also perform an "update timestamp" operation before considering its "Get" operation successful."
OK, I see why that makes more sense. The interfaces here are getting more and more tangled; I'll file a story with some cleanup ideas that we can revisit after 1.0.
I think PermissionError is the wrong response for "race condition detected and handled correctly". 202 ("Accepted") isn't perfect, but it's closer. Even 200 would be reasonable, given the outcome is indistinguishable from "delete succeeded, then client wrote anew". (This doesn't need too much bikeshedding given that we're soon moving to an asynchronous "ok-to-delete list" approach.)
Fair enough. If the model here isn't "I'm returning an error because I couldn't do the thing you asked me to do" but "I did the right thing for this request based on conditions, even though the file wasn't deleted" then that makes sense. Returning success is by far the easiest thing to do here, but we can tease it into delivering 202 or something.
The comment in GetBlock() in handlers.go still leaves open the mystery of why we want anything to be writable. Propose: "When reporting success for a PUT operation, we are assuring the caller that the block will not be garbage-collected sooner than permission_ttl. In order to promise this, we must ensure that the timestamp is current, and that all race conditions are accounted for."
That's fine, but I don't understand which comment needs clarifying. (Did you mean this comment on Delete()? https://arvados.org/projects/arvados/repository/revisions/56d363875a42c1abd485c67e030f625299548aa0/entry/services/keepstore/volume_unix.go#L254)
Updated by Tom Clegg over 10 years ago
Tim Pierce wrote:
- "GetBlock() in handlers.go should accept an additional argument indicating whether it should also perform an "update timestamp" operation before considering its "Get" operation successful."
OK, I see why that makes more sense. The interfaces here are getting more and more tangled; I'll file a story with some cleanup ideas that we can revisit after 1.0.
Hm, I recommend you don't invest toooo much time in that until we've incorporated (or at least planned) a few more of the pre-1.0 features.
Fair enough. If the model here isn't "I'm returning an error because I couldn't do the thing you asked me to do" but "I did the right thing for this request based on conditions, even though the file wasn't deleted" then that makes sense. Returning success is by far the easiest thing to do here, but we can tease it into delivering 202 or something.
200 is probably best. 202 isn't quite right, and really, everything is 100% OK when this race happens.
The comment in GetBlock() in handlers.go still leaves open the mystery of why we want anything to be writable. Propose: "When reporting success for a PUT operation, we are assuring the caller that the block will not be garbage-collected sooner than permission_ttl. In order to promise this, we must ensure that the timestamp is current, and that all race conditions are accounted for."
That's fine, but I don't understand which comment needs clarifying. (Did you mean this comment on Delete()? https://arvados.org/projects/arvados/repository/revisions/56d363875a42c1abd485c67e030f625299548aa0/entry/services/keepstore/volume_unix.go#L254)
Sorry, I got the file right (handlers.go) but it's in PutBlock, not GetBlock. L498 here, about writable replicas:
Further to that point, I think it's more correct to say that TouchBlock fails if it cannot assure a current timestamp for any reason. All this talk of read-only volumes is just wild speculation about conditions that could cause "assure a current timestamp" to fail...
Updated by Tom Clegg over 10 years ago
Looking at dc2bf5c
This looks much closer, thanks.
Noticed a couple of things- Need error checking on call to lockfile() (if
Flock()
fails,Touch()
must fail) - Need error checking on call to
Touch()
(ifTouch()
fails,PutBlock
must keep looking in other volumes, not proclaim success)
Updated by Tim Pierce over 10 years ago
Updated at e386d20
Tom Clegg wrote:
Looking at dc2bf5c
This looks much closer, thanks.
Noticed a couple of things
- Need error checking on call to lockfile() (if
Flock()
fails,Touch()
must fail)- Need error checking on call to
Touch()
(ifTouch()
fails,PutBlock
must keep looking in other volumes, not proclaim success)
Updated with error checking. Also added a rudimentary unit test case to TestDeleteHandler.
Updated by Tom Clegg about 10 years ago
Tim Pierce wrote:
Updated with error checking. Also added a rudimentary unit test case to TestDeleteHandler.
Looks good, thanks.
I think func (v *UnixVolume) Delete
in volume_unix.go
is still missing its error checking for lockfile(), though.
- PUT (when data is already present) updates the timestamp on the existing file.
- PUT (when data is present but timestamp cannot be updated) writes a second copy.
Are both of those easy enough to write?
(BTW, in commit messages, our system seems to have settled on saying "refs/closes #3448" only in merges/commits to master -- the "3448:" prefix is fine for branch commits.)
Updated by Tim Pierce about 10 years ago
at c15f086:
Tom Clegg wrote:
Tim Pierce wrote:
Updated with error checking. Also added a rudimentary unit test case to TestDeleteHandler.
Looks good, thanks.
I think
func (v *UnixVolume) Delete
involume_unix.go
is still missing its error checking for lockfile(), though.
Ugh, yes, I left it out. Committed now (for sure!)
I feel like we should have two more basic tests:
- PUT (when data is already present) updates the timestamp on the existing file.
- PUT (when data is present but timestamp cannot be updated) writes a second copy.
Are both of those easy enough to write?
Yes, those are good ideas. Added TestPutTouch and TestPutTouchFails to exercise these conditions.
(BTW, in commit messages, our system seems to have settled on saying "refs/closes #3448" only in merges/commits to master -- the "3448:" prefix is fine for branch commits.)
I tend to include "refs #xxxx" just in case and so I don't have to try rewriting my history, but I'll try to remember :-)
Updated by Tom Clegg about 10 years ago
Tim Pierce wrote:
I think
func (v *UnixVolume) Delete
involume_unix.go
is still missing its error checking for lockfile(), though.Ugh, yes, I left it out. Committed now (for sure!)
Ah yes, there it is.
(I notice you used "e" instead of "err" here. Is there a reason for that? I've become accustomed to seeing "err" all over the place in Go, which seems like a nice convention, so I'm wondering if there is some other style/convention I should be filing this "e" under...)
I feel like we should have two more basic tests:
- PUT (when data is already present) updates the timestamp on the existing file.
- PUT (when data is present but timestamp cannot be updated) writes a second copy.
Are both of those easy enough to write?
Yes, those are good ideas. Added TestPutTouch and TestPutTouchFails to exercise these conditions.
In the TestPutBlockTouchFails test, I suspect the "bad" check in Get() will prevent us from ever reaching the code that calls Touch(): L432 of handlers.go will get an error from Get(), look on other volumes, and eventually return, without ever calling Touch(). I think this could be fixed by having an "Untouchable" flag distinct from the ambiguous "Bad" flag. (Or possibly by telling me I'm confused about something.)
I wonder if it would be just as easy for TestPutTouch to set the Mtime to {now minus 5 seconds}, instead of using sleep(1). It seems a shame to make the test suite runtime grow from 0.022s to 1.023s over this. (Even though our Workbench tests take >5 minutes.)
(BTW, in commit messages, our system seems to have settled on saying "refs/closes #3448" only in merges/commits to master -- the "3448:" prefix is fine for branch commits.)
I tend to include "refs #xxxx" just in case and so I don't have to try rewriting my history, but I'll try to remember :-)
Sure, it's not worth rebasing these commits or anything, I was just cultivating future consistency. (Hm, come to think of it, leaving "refs #xxxx" off the branch commits also helps by preventing you from accidentally FF-merging into master...)
Thanks.
Updated by Tom Clegg about 10 years ago
Noticed this seems odd in the test case that returns 200 despite not deleting any data. Shouldn't we expect copies_deleted=0, copies_failed=0?
// Expect response {"copies_deleted":1,"copies_failed":0}
expected_dc = deletecounter{1, 0}
Updated by Tim Pierce about 10 years ago
Updates at cca1529:
Tom Clegg wrote:
(I notice you used "e" instead of "err" here. Is there a reason for that? I've become accustomed to seeing "err" all over the place in Go, which seems like a nice convention, so I'm wondering if there is some other style/convention I should be filing this "e" under...)
There isn't a specific reason. The Go library code and examples tend to use both "err" and "e" for short error checking tests like this. I can revise this to use "err" throughout if it's confusing.
In the TestPutBlockTouchFails test, I suspect the "bad" check in Get() will prevent us from ever reaching the code that calls Touch(): L432 of handlers.go will get an error from Get(), look on other volumes, and eventually return, without ever calling Touch(). I think this could be fixed by having an "Untouchable" flag distinct from the ambiguous "Bad" flag. (Or possibly by telling me I'm confused about something.)
Very good point. Added a Touchable
flag to MockVolume and updated the TODO with some thoughts about clarifying the use of these flags so we don't overload flags like Bad
too much.
I wonder if it would be just as easy for TestPutTouch to set the Mtime to {now minus 5 seconds}, instead of using sleep(1). It seems a shame to make the test suite runtime grow from 0.022s to 1.023s over this. (Even though our Workbench tests take >5 minutes.)
That would be a nice improvement, I just don't want to boil the ocean too much on this right now. Added a TODO.
Noticed this seems odd in the test case that returns 200 despite not deleting any data. Shouldn't we expect copies_deleted=0, copies_failed=0?
Good point. I'll file a bugfix ticket for this. It'll require redefining the Volume interface a little.
This is tricky because we've defined a kind of three-way logic for success and failure: "success, I deleted the block," "failure, I couldn't delete the block because something was wrong", and.... "success, I could have deleted the block but decided that wasn't the right thing to do." This is why I wanted Keepstore to respond 4xx if a DELETE failed for policy reasons like this -- it allows Keepstore to define a very clear success/failure status, and requires the Data Manager to be responsible for interpreting whether that's a problem or not.
Updated by Tom Clegg about 10 years ago
LGTM, please merge.
Tim Pierce wrote:
Noticed this seems odd in the test case that returns 200 despite not deleting any data. Shouldn't we expect copies_deleted=0, copies_failed=0?
Good point. I'll file a bugfix ticket for this. It'll require redefining the Volume interface a little.
Meh, skip it. The "delete" API is about to change to "here's a delete list", so the idea of reporting the result of an individual delete operation with an HTTP status code will be moot.
Thanks!
Updated by Tim Pierce about 10 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:94b5a59631f22e4e57561a6244c24b93db77f589.