File splits

General approach.

For each file segment (generally one segment per block):

  1. Fetch the assigned block.
  2. Determine the offset of the first record in the assigned block. (If it is ambiguous, check the previous block to see whether a record is split across the boundary.)
  3. Seek ahead to find the last record in the assigned block and determine where it ends (which may be in the next block).
  4. Generate a collection representing a subsection of the original file, spanning the range from the offset of the first record to the end of the last record.
  5. Insert the header segment at the beginning of the file, if required.
  6. Feed the new collection to the target program via SDK or arv-mount.
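
The steps above can be sketched in Python for the simple case of newline-delimited records in a local, seekable file. This is a minimal sketch under stated assumptions, not the Arvados implementation: the function names (record_start_at_or_after, record_aligned_ranges, read_split) are hypothetical, the record delimiter is assumed to be a single byte, a record is assumed to belong to the block in which it starts, and the code reads from an ordinary file object instead of building a collection.

```python
import io


def record_start_at_or_after(f, pos, size, delim=b"\n", chunk=64 * 1024):
    """Return the offset of the first record that starts at or after pos.

    Assumes a single-byte delimiter that terminates each record.
    """
    if pos <= 0:
        return 0
    if pos >= size:
        return size
    # Look one byte back (into the previous block): if that byte is the
    # delimiter, a record starts exactly at pos and nothing is split.
    offset = pos - 1
    f.seek(offset)
    while True:
        data = f.read(chunk)
        if not data:
            return size  # no further delimiter: the record runs to EOF
        i = data.find(delim)
        if i >= 0:
            return offset + i + 1
        offset += len(data)


def record_aligned_ranges(f, size, block_size):
    """Return (start, end) byte ranges, one per block that starts a record."""
    ranges = []
    for block_start in range(0, size, block_size):
        block_end = min(block_start + block_size, size)
        start = record_start_at_or_after(f, block_start, size)
        # The last record of this block ends where the next block's
        # first record begins, which may be past block_end.
        end = record_start_at_or_after(f, block_end, size)
        if start < end:  # skip blocks that lie entirely inside one record
            ranges.append((start, end))
    return ranges


def read_split(f, start, end, header=b""):
    """Read one split; prepend header unless the split starts the file.

    Assumption: the header, if any, is already present at offset 0.
    """
    f.seek(start)
    body = f.read(end - start)
    return body if start == 0 else header + body


if __name__ == "__main__":
    data = b"aaaa\nbb\ncccccc\ndd\n"
    f = io.BytesIO(data)
    for start, end in record_aligned_ranges(f, len(data), block_size=6):
        print(start, end, read_split(f, start, end))
```

Assigning each record to the block in which it starts means every record lands in exactly one split, so concatenating the splits (without headers) reproduces the original file.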

It should be possible to do this either in a dedicated split step or as a parallelization wrapper that runs before the real program.
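
As an illustration of the wrapper option, the following sketch applies a worker function to each record-aligned byte range of a local file in parallel. It is a sketch only: map_over_splits and its parameters are hypothetical names, the ranges are assumed to have been computed already, and a thread pool over a local file stands in for whatever dispatch mechanism (SDK or arv-mount) the real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def map_over_splits(path, ranges, worker, max_workers=4):
    """Call worker(bytes) on each (start, end) range of the file at path.

    Results are returned in the same order as ranges.
    """
    def run(byte_range):
        start, end = byte_range
        # Each task opens its own handle so that seeks do not interfere.
        with open(path, "rb") as f:
            f.seek(start)
            return worker(f.read(end - start))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, ranges))
```

A dedicated split step would instead write each range out as its own file or collection and leave scheduling to a later stage.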

Updated by Peter Amstutz over 9 years ago · 1 revisions