Task #2695

Feature #2620: Keep supports I/O locking

Review 2620-keep-serialize-io

Added by Tim Pierce over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
-
Start date:
04/01/2014
Due date:
% Done:
0%

Estimated time:
2.00 h

Description

This is a very large code review. Here are some hints that may make it easier to understand what's going on here:

A Volume interface has been added to provide an abstraction layer for implementing any kind of storage device. The only Volume that has been implemented so far is a UnixVolume (a volume mounted as a local POSIX filesystem).

When Keep receives a GET or PUT request from a remote user, it invokes a Get or Put method on the appropriate Volume. The Volume is responsible for serializing any I/O requests if necessary.

Suggestions for review:

  1. First review volume.go, which describes the Volume interface and implements a MockVolume used for testing the front end.
  2. It may help then to review keep_test.go, to see how the Volume interface is used in unit tests.
  3. At that point, the changes in keep.go should be easier to understand: they chiefly change the Keep front-end to interact with the Volume interface rather than reading and writing files directly to disk.
  4. The underlying UnixVolume type is implemented in volume_unix.go, including the code for managing any serialized I/O requests. volume_unix_test.go supplies unit tests.

Or, you know, review in whatever order makes sense to you. :-P

History

#1 Updated by Tim Pierce over 7 years ago

  • Description updated (diff)

#2 Updated by Tim Pierce over 7 years ago

  • Assigned To set to Peter Amstutz

#3 Updated by Peter Amstutz over 7 years ago

  • Status changed from New to In Progress

#4 Updated by Peter Amstutz over 7 years ago

Reviewing 4ab25f3

  1. In keep PutBlock(), shouldn't it attempt to do some sort of load balancing across disks, instead of writing to the first available volume? E.g. preferentially writing to whichever volume is the least full (by percentage)
  2. volume_unix Read(), what happens if f.Read(buf) returns a short read? It should loop until EOF.
  3. volume_unix Read(), it returns the entire 'buf' on error. It should probably return nil in the 'buf' position.
  4. Go doesn't have better path manipulation functions than Sprintf? (e.g. something like os.path.join() in Python)
  5. Should index requests be I/O queued as well? If there are a lot of files, scanning the entire directory tree could interfere with other I/O
  6. SSDs don't have seek-time overhead; does it make sense to allow parallel reads and writes on SSDs?
  7. What is the reasoning for having IsFull() record the "full" state with a bogus symlink? Is statfs() an expensive call? And why a symlink and not a regular file?
  8. Empty "return" statements are weird

#5 Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
  • Assigned To changed from Peter Amstutz to Tim Pierce

#6 Updated by Tim Pierce over 7 years ago

Peter Amstutz wrote:

Reviewing 4ab25f3

In keep PutBlock(), shouldn't it attempt to do some sort of load balancing across disks, instead of writing to the first available volume? E.g. preferentially writing to whichever volume is the least full (by percentage)

Yes, good point. We threw around some ideas for what kind of strategies we could use here but didn't settle on any, so I didn't get around to changing it.

I think I like having it round-robin through the volumes -- that should be simpler to implement, it's more likely to reduce disk contention, and it should keep them pretty well balanced for space. Objections?

volume_unix Read(), what happens if f.Read(buf) returns a short read? It should loop until EOF.

Good point. Done.

volume_unix Read(), it returns the entire 'buf' on error. It should probably return nil in the 'buf' position.

I'd like to let the caller decide what (if anything) to do with the buffer in the event of an error. The convention of returning a value and also an error means that it's pretty unambiguous when there is and isn't an error, and there's a very strong expectation that if an error is returned, the caller will check it and do something with it. It seems useful to be able to report an error but also return whatever data we were able to read.

Go doesn't have better path manipulation functions than Sprintf? (e.g. something like os.path.join() in Python)

There is, actually: filepath.Join. Fixed.

Should index requests be I/O queued as well? If there are a lot of files, scanning the entire directory tree could interfere with other I/O

No, the story description for #2620 specifically excludes /index from operations that need to honor an I/O lock. (If we find in practice that /index causes disk contention then of course we can revisit that.)

SSDs don't have seek-time overhead; does it make sense to allow parallel reads and writes on SSDs?

Sure, the admin can just run keep with -serialize=false in that case. At this stage, this feels like a policy decision that is better left to admin control.

What is the reasoning for having IsFull() record the "full" state with a bogus symlink? Is statfs() an expensive call? And why a symlink and not a regular file?

This is for parity with the current Warehouse implementation, which uses a symlink this way to mark a full Keep volume (presumably a symlink rather than a file so as not to risk writing to a full disk). I am not actually sure this degree of feature parity is necessary, but IIRC Ward wants Keep to support this feature specifically to make sure that new Keep can work as a drop-in replacement for Warehouse. I'll be happy to drop it if/when we agree that it's unnecessary.

Empty "return" statements are weird

Agreed!

#8 Updated by Tim Pierce over 7 years ago

Just pushed a change (57b46f9) to replace the KeepVolumes global array with a VolumeManager interface.

See volume.go for a description of the interface and the RRVolumeManager implementation. Then the changes to keep.go and keep_test.go should make sense. IMHO this is a lot cleaner now, and is hopefully easier to read and understand too.

#9 Updated by Peter Amstutz over 7 years ago

  • Assigned To changed from Peter Amstutz to Tim Pierce
  1. keep GetBlock(): If a block is corrupt, would it make sense to continue searching the other volumes for the block in case there is an intact copy?
  2. Some nice to haves:
    1. VolumeManager Choose() specifies whether it is a PUT or a GET, accepts the block hash as a parameter, and returns a slice of Volumes.
    2. Would enable various load balancing policies across disks, such as consistent hashing, or ranking disks in order of preference by load.
    3. If/when we implement the keep proxy, we can plug into the VolumeManager functionality to provide ranked lists of servers to contact to put/get blocks.

#10 Updated by Tim Pierce over 7 years ago

  • Assigned To changed from Tim Pierce to Peter Amstutz

Peter Amstutz wrote:

  1. keep GetBlock(): If a block is corrupt, would it make sense to continue searching the other volumes for the block in case there is an intact copy?

I don't think so; I can't think of a situation where a block would have gotten PUT subsequently and stored on a different volume under the correct hash rather than overwriting the corrupt block. But I'll totally add that if there's a scenario I haven't thought of!

  1. Some nice to haves:
    1. VolumeManager Choose() specifies whether it is a PUT or a GET, accepts the block hash as a parameter, and returns a slice of Volumes.
    2. Would enable various load balancing policies across disks, such as consistent hashing, or ranking disks in order of preference by load.
    3. If/when we implement the keep proxy, we can plug into the VolumeManager functionality to provide ranked lists of servers to contact to put/get blocks.

These are some interesting questions. I like the idea of using consistent hashing to improve GET requests, but I think we really need to put it on the Keep roadmap before tackling it. I don't think we currently have enough information to design an interface that supports it properly, so I'd rather not go too far down that road at this stage.

#11 Updated by Peter Amstutz over 7 years ago

  • Assigned To changed from Peter Amstutz to Tim Pierce

LGTM

#12 Updated by Tim Pierce over 7 years ago

  • Status changed from In Progress to Resolved
  • Remaining (hours) changed from 2.0 to 0.0
