Bug #2449

New Keep server can write

Added by Tom Clegg over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
-
Start date:
04/01/2014
Due date:
% Done:

100%

Estimated time:
(Total: 53.00 h)
Story points:
5.0

Description

See current spec at Keep server

Goals for this sprint:
  • Read and write using existing filesystem structure
  • Enforce "one read/write operation per mount point" restriction (with command line flag to disable, for AWS. --no-io-lock?)
  • Command line argument (--port?) specifies port to listen on. Fail if none given.
  • Command line arguments (--ssl-key and --ssl-cert?) enable https mode. Otherwise, plain http.
  • Command line arguments (positional?) specify mount points ("run" script can be in charge of mount|grep etc)
  • Test suite (use a tmpfs?)
Also, hopefully, index and status handlers:
  • GET /index.txt → one line per block stored on disk: hash+size <space> modtime_unix_timestamp
  • GET /status.json → a json document something like this
{
 "volumes":[
  {"mount_point":"/data/disk0","bytes_free":4882337792,"bytes_used":5149708288},
  {"mount_point":"/data/disk1","bytes_free":39614472192,"bytes_used":3314229248}
 ]
}

Subtasks

Task #2463: Keep: command-line flagsClosedTim Pierce

Task #2561: Keep: /index and /status handlersClosedTim Pierce

Task #2603: Review 2449-keep-index-status-handlersClosedTim Pierce

Task #2507: Review 2449-keep-write-blocksClosedTim Pierce

Task #2292: Keep server writes Warehouse blocksClosedTim Pierce

Task #2591: Review 2449-keep-flagsClosedMisha Zatsman

Task #2460: Keep: unit tests for writing blocksClosedTim Pierce

Associated revisions

Revision 2f91702d
Added by Tim Pierce over 6 years ago

Merge branch '2449-keep-index-status-handlers'

Closes #2449, closes #2603, closes #2561.

History

#1 Updated by Tim Pierce over 6 years ago

  • Description updated (diff)
  • Assigned To set to Tim Pierce

#2 Updated by Ward Vandewege over 6 years ago

Review notes:

  • PutBlock logic is not quite right. It shouldn't write to all Keep volumes it has. Only to one. You should assume there will be many keep servers, each with a number of disks, and the client is responsible for writing copies to as many keep servers as appropriate. Each Keep server should never have more than one copy of a block. Fixing this also sorts out the ambiguity of the 503 return code: that code should mean "I don't have any space to write this block".

Other than that, I think this looks fine to merge. The tests pass, and the code builds.

#3 Updated by Ward Vandewege over 6 years ago

Review 2:

  • About your TODO: check for a full volume before trying to write. Please make this compatible with the existing Keepv1 logic (i.e. a symlink called 'full' that points to a timestamp (non-existant file)). Have a look in the Keepv1 codebase.

Reasoning: I want Keepv2 to be as compatible as possible with Keepv1 is that I really want to be able to have it act as a drop-in replacement.

  • About your TODO: should we check here whether the file on disk matches the file we were asked to store?

Yes - that would be good I think. This is an interesting scenario: we've had data corruption on disk, but someone is offering us a new copy of the data, allowing us to repair the corruption. The correct course of action is probably to delete the corrupted file on disk (or mv it to hash.corrupted?) and write the new file.

Otherwise, this looks ready to merge I think. Tests and build pass.

#4 Updated by Tom Clegg over 6 years ago

  • Description updated (diff)

#5 Updated by Tom Clegg over 6 years ago

  • Description updated (diff)

#6 Updated by Tom Clegg over 6 years ago

  • Description updated (diff)

#7 Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg over 6 years ago

Looking at this diff

Notes
  • Avoid leaving bogus data on disk if process/machine dies mid-write: Write to {hash}.tmp, then rename it to {hash} if everything is still OK after flush.
  • Avoid race condition in write: use O_EXCL when opening .tmp file
  • Avoid leaving bogus data on disk if write fails: check errors from f.Close(), and remove .tmp file if Write() or Close() fails
  • L241 looks like it finds corrupt data on disk but reports "Collision". If this is actually the right thing to do, maybe add a comment saying why
  • Use statfs instead of `df` to get free disk space

#9 Updated by Tim Pierce over 6 years ago

  • Assigned To changed from Tim Pierce to Tom Clegg

PTAL

Tom Clegg wrote:

Looking at this diff

Notes
  • Avoid leaving bogus data on disk if process/machine dies mid-write: Write to {hash}.tmp, then rename it to {hash} if everything is still OK after flush.

Good thought. I am using ioutil.TempFile with a filename prefix of "tmp{hash}" instead, so that the tmp files will not show up if someone requests a file index. That seems to me like the desired behavior; please let me know if you don't agree.

  • Avoid race condition in write: use O_EXCL when opening .tmp file

I think using ioutil.TempFile is sufficient to avoid this race, but again, let me know if you disagree.

  • Avoid leaving bogus data on disk if write fails: check errors from f.Close(), and remove .tmp file if Write() or Close() fails

Done.

  • L241 looks like it finds corrupt data on disk but reports "Collision". If this is actually the right thing to do, maybe add a comment saying why

Per discussion on IRC: the desired behavior here is to report the corrupt block internally, then replace it with the good block we have on hand. I've modified this code and its test accordingly.

  • Use statfs instead of `df` to get free disk space

Done.

#10 Updated by Tim Pierce over 6 years ago

  • Assigned To changed from Tom Clegg to Tim Pierce

#11 Updated by Tom Clegg over 6 years ago

Tim Pierce wrote:

  • Avoid leaving bogus data on disk if process/machine dies mid-write: Write to {hash}.tmp, then rename it to {hash} if everything is still OK after flush.

Good thought. I am using ioutil.TempFile with a filename prefix of "tmp{hash}" instead, so that the tmp files will not show up if someone requests a file index. That seems to me like the desired behavior; please let me know if you don't agree.

Agree.

  • Avoid race condition in write: use O_EXCL when opening .tmp file

I think using ioutil.TempFile is sufficient to avoid this race, but again, let me know if you disagree.

Agree. (Docs seem quite clear that ioutil.TempFile guarantees no races.)

  • L241 looks like it finds corrupt data on disk but reports "Collision". If this is actually the right thing to do, maybe add a comment saying why

Per discussion on IRC: the desired behavior here is to report the corrupt block internally, then replace it with the good block we have on hand. I've modified this code and its test accordingly.

Looks good

  • Use statfs instead of `df` to get free disk space

Done.

Nice

Here's an MD5 collision, if you feel like testing that behavior:

foo = '\x0e0eaU\x9a\xa7\x87\xd0\x0b\xc6\xf7\x0b\xbd\xfe4\x04\xcf\x03e\x9epO\x854\xc0\x0f\xfbe\x9cL\x87@\xcc\x94/\xeb-\xa1\x15\xa3\xf4\x15\\\xbb\x86\x07Is\x86em}\x1f4\xa4 Y\xd7\x8fZ\x8d\xd1\xef'
bar = '\x0e0eaU\x9a\xa7\x87\xd0\x0b\xc6\xf7\x0b\xbd\xfe4\x04\xcf\x03e\x9etO\x854\xc0\x0f\xfbe\x9cL\x87@\xcc\x94/\xeb-\xa1\x15\xa3\xf4\x15\xdc\xbb\x86\x07Is\x86em}\x1f4\xa4 Y\xd7\x8fZ\x8d\xd1\xef'
foo == bar
# => False
import hashlib
hashlib.md5(foo).hexdigest()
# => 'cee9a457e790cf20d4bdaa6d69f01e41'
hashlib.md5(bar).hexdigest()
# => 'cee9a457e790cf20d4bdaa6d69f01e41'

Another thought for "tests todo": Write fail despite disk reporting plenty of space. (Perhaps test by using a small tmpfs + lowering MIN_FREE_KILOBYTES?)

Just checked "git diff --check" and it complains about a lot of "tab in indent". gofmt seems to have an "indent with spaces" mode. Is there a strong preference for tabs in the Go world? (If not, we should probably stick to our own coding standards)

I haven't spotted any other problems.

#12 Updated by Tim Pierce over 6 years ago

Tom Clegg wrote:

Here's an MD5 collision, if you feel like testing that behavior:

[...]

Thanks, I've added TestPutBlockCollision.

Another thought for "tests todo": Write fail despite disk reporting plenty of space. (Perhaps test by using a small tmpfs + lowering MIN_FREE_KILOBYTES?)

Good thought. Might be easier to use an interface to mock ioutil.TempFile. I've added a TODO.

Just checked "git diff --check" and it complains about a lot of "tab in indent". gofmt seems to have an "indent with spaces" mode. Is there a strong preference for tabs in the Go world? (If not, we should probably stick to our own coding standards)

There actually is a strong preference, even a formally recommended one: http://golang.org/doc/effective_go.html#formatting

We use tabs for indentation and gofmt emits them by default. Use spaces only if you must.

#13 Updated by Tom Clegg over 6 years ago

Tim Pierce wrote:

Thanks, I've added TestPutBlockCollision.

Looks great

Another thought for "tests todo": Write fail despite disk reporting plenty of space. (Perhaps test by using a small tmpfs + lowering MIN_FREE_KILOBYTES?)

Good thought. Might be easier to use an interface to mock ioutil.TempFile. I've added a TODO.

Great. (But I think the todo comment says "high MIN_FREE" where it should say "low MIN_FREE"?)

Just checked "git diff --check" and it complains about a lot of "tab in indent". gofmt seems to have an "indent with spaces" mode. Is there a strong preference for tabs in the Go world? (If not, we should probably stick to our own coding standards)

There actually is a strong preference, even a formally recommended one: http://golang.org/doc/effective_go.html#formatting

We use tabs for indentation and gofmt emits them by default. Use spaces only if you must.

Indeed. Lovely, now we have to figure out how to make git complain/prohibit tabs in most, but not all, parts of the source tree. (I wouldn't say that qualifies us for the "if you must" clause...)

#14 Updated by Tim Pierce over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 64 to 100

Applied in changeset arvados|commit:5df09e707c313fd88c32ad40f2d99030ecc2e639.

#15 Updated by Peter Amstutz over 6 years ago

Comments relating to 2449-keep-index-status-handlers:

  • Apply filter so that /index doesn't list files that are not keep blocks, such as *.meta, .lock, last_error, etc
  • Add a "device" field to /status.json so the reader can distinguish physical volumes from mount points.

#16 Updated by Tim Pierce over 6 years ago

Peter Amstutz wrote:

Comments relating to 2449-keep-index-status-handlers:

  • Apply filter so that /index doesn't list files that are not keep blocks, such as *.meta, .lock, last_error, etc

Good thought, added an IsValidLocator method to filter out these files and updated the unity test to see that .meta files are properly excluded.

  • Add a "device" field to /status.json so the reader can distinguish physical volumes from mount points.

Added device_num to status.json, populated with the stat(2) st_dev field for the Keep volume directory.

#17 Updated by Peter Amstutz over 6 years ago

  • /index looks good
  • I was envisioning providing the device id, e.g. "/dev/sda3" which would be more useful than fsid, but fsid is good enough to distinguish the physical volumes. Ideally we want to provide quite a bit more useful information about the disk, such as which drive bay the disk is in, but that's a story for another sprint.

I ran ./keep -volumes=~/keep0
This fails because ~ is not expanded, but Keep doesn't report an error until you try to access the volume, e.g.:
IndexHandler: ~/keep0: walking to ~/keep0: lstat ~/keep0: no such file or directory

It should check that the volume exists and is accessible and warn or fail if not.

#18 Updated by Tim Pierce over 6 years ago

Peter Amstutz wrote:

  • /index looks good
  • I was envisioning providing the device id, e.g. "/dev/sda3" which would be more useful than fsid, but fsid is good enough to distinguish the physical volumes. Ideally we want to provide quite a bit more useful information about the disk, such as which drive bay the disk is in, but that's a story for another sprint.

That would probably be more helpful, agreed.

It's not clear to me how much of this makes sense to make Keep's responsibility to track. If a disk is going bad and it's throwing write errors, the paths to the problem files/volumes will be recorded in the Keep error logs and can be used by a sysadmin to diagnose which disk needs replacement. It's not clear to me that there's value in exposing that information in the status handler as well. But yes, we should review these issues in the next sprint or two.

I ran ./keep -volumes=~/keep0
This fails because ~ is not expanded, but Keep doesn't report an error until you try to access the volume, e.g.:
IndexHandler: ~/keep0: walking to ~/keep0: lstat ~/keep0: no such file or directory

It should check that the volume exists and is accessible and warn or fail if not.

Good call. I've added a check for this:

twp@yclept:~/arvados/services/keep$ go run keep.go -volumes=/dsfksjhdf,/lkjshdksd,/kdsfks
2014/04/16 13:41:36 bad Keep volume: stat /dsfksjhdf: no such file or directory
2014/04/16 13:41:36 bad Keep volume: stat /lkjshdksd: no such file or directory
2014/04/16 13:41:36 bad Keep volume: stat /kdsfks: no such file or directory
2014/04/16 13:41:36 could not find any keep volumes
exit status 1

If at least one valid Keep volume is found, the server will launch. It will only die with a fatal error if no valid volumes are specified.

Also available in: Atom PDF