Project

General

Profile

Actions

Feature #22319

closed

Add replace_segments feature to controller's CollectionUpdate API

Added by Tom Clegg about 1 month ago. Updated 10 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
-

Description

From Efficient block packing for small WebDAV uploads
  • caller provides map of {old-smallblock-segment → new-bigblock-segment}
  • can be combined with replace_files and/or caller-provided manifest_text
  • changes are applied after replace_files (i.e., the mapping is applied to segments that appear in the caller-provided manifest_text as well as the existing manifest)
  • if any of the provided old-smallblock-segments are not referenced in the current version, don't apply any of the changes that remap to the same new-bigblock-segment (this avoids a situation where two callers concurrently compute similar-but-different repackings, the first one applies cleanly, and the second one adds a large block that is mostly unreferenced)

Subtasks 1 (0 open1 closed)

Task #22331: Review 22319-replace-segmentsResolvedPeter Amstutz12/09/2024Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Idea #20996: Efficient packing of small files into blocks in keep-webResolvedTom Clegg11/13/2024Actions
Actions #1

Updated by Tom Clegg about 1 month ago

  • Related to Idea #20996: Efficient packing of small files into blocks in keep-web added
Actions #2

Updated by Tom Clegg about 1 month ago

22319-replace-segments @ b3f72d579614d4efe07b647a653146428f04e5af -- developer-run-tests: #4563

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Design did not specify how block segments are encoded. I went with "locator offset length". I considered a colon separator because that's what we use in offset:length:filename tokens, but figured
      • Better to be more obvious that these aren't the same numbers as the offset and length in offset:length:filename tokens
      • Locators will never have spaces because adjacent locators in a manifest are separated by spaces
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • N/A
  • Documentation has been updated.
    • ✅ replace_segments section added to collections API docs
  • Behaves appropriately at the intended scale (describe intended scale).
    • Linear with # file data segments in collection. Expect this to be faster than loading the collection, which we optimized in #21696.
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
Actions #3

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2024-11-20 to Development 2024-12-04
Actions #4

Updated by Peter Amstutz 18 days ago

22319-replace-segments @ b3f72d579614d4efe07b647a653146428f04e5af

The API

Since the source and destination segments must be the same length, including the length on both sides of the mapping is redundant, and it raises an error if they don't match. Maybe just eliminate the length from one side or the other? (implementation-wise I see it's using the same BlockSegment struct and Marshal/Unmarshal methods, making it awkward, so maybe implementation convenience rules out over data normalization).

Documentation

The API documentation would benefit from a sentence or two about what it means by "segment", i.e. a span of bytes within a single block typically containing content of an individual file.

Be more explicit that the intended use of this API is by clients that want to optimize a manifest by describing how to map individual small segments to spans within a larger block, so that the resulting manifest references fewer blocks total.

"The replace_segments option can be used with the update API" should be "update or create".

The "create" API section doesn't link back to the replace_files or replace_segments section.

The replace_segments option is applied after replace_files and manifest_text. This means it can apply to blocks from manifest_text and/or other collections referenced by replace_files."

This is a little confusing because it essentially describes the order of operations backwards. Be more explicit, e.g. "The order of operations starts with the provided manifest_text, then applies replace_files, and then applies replace_segments. This means replace_segments can apply to blocks from manifest_text and/or other collections referenced by replace_files.

The example is missing some explanation. Should have a sentence or two walking through it, e.g. "In this example, two file segments in blocks c4103f122d and 693e9af84d are combined into a single block ca9c491ac66". The collection with manifest_text should be listed first, in the example, so the flow matches the order of operations.

This example is also slightly weird since we previously talk about repacking small files but this example is joining a single file that was split. How about having the example be for file1.txt and file2.txt?

If a key does not match any existing block segment, that key and all other keys referencing the same replacement block will be skipped. Other replacements will still be applied. Replacements that are skipped for this reason do not cause the request to fail.

This sentence is confusing. Instead of "key" can we say "segment to be replaced"?

"keys referencing the same replacement block"

I think this means that if a segment to be replaced (the key on the left side) that doesn't match a segment in the manifest, then look at the block referenced on the right side and exclude all replacement entries that reference to that block on the right side. Is that right?

Could you include a short rationale for this behavior? I'm guessing it is something like: if the manifest has concurrently changed, the assumptions behind optimizing the manifest may no longer make sense, so it is better to leave the manifest alone than to possibly complicate the manifest by introducing references to new blocks.

Actions #5

Updated by Peter Amstutz 17 days ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08
Actions #6

Updated by Tom Clegg 16 days ago

Peter Amstutz wrote in #note-4:

Since the source and destination segments must be the same length, including the length on both sides of the mapping is redundant, and it raises an error if they don't match. Maybe just eliminate the length from one side or the other? (implementation-wise I see it's using the same BlockSegment struct and Marshal/Unmarshal methods, making it awkward, so maybe implementation convenience rules out over data normalization).

I'm not sold on that optimization. It's true that the length part of the replacement could be implied, and that would reduce the encoded size of requests and make "length mismatch" error responses impossible. OTOH I think it's easier to understand if we use the same type -- e.g., map[BlockSegment]BlockSegment evokes "replace each thing with same kind of thing". And yes, it's nice to have less code / smaller API surface.

The API documentation would benefit from a sentence or two about what it means by "segment", i.e. a span of bytes within a single block typically containing content of an individual file.

Agreed. Added an introduction to that term.

Be more explicit that the intended use of this API is by clients that want to optimize a manifest by describing how to map individual small segments to spans within a larger block, so that the resulting manifest references fewer blocks total.

Updated "e.g., replacing a number of small blocks with one larger block" to "typically with the goal of replacing a number of small blocks with one larger block"... I'm not sure how much more discussion we need about why that's valuable. At least on this page, I'm trying to stay focused on the specific API behavior. Perhaps the most valuable "why use this?" guidance would be something like "if you're not writing Arvados system components, you probably won't use this API."

"The replace_segments option can be used with the update API" should be "update or create".

Fixed

The "create" API section doesn't link back to the replace_files or replace_segments section.

Fixed

The replace_segments option is applied after replace_files and manifest_text. This means it can apply to blocks from manifest_text and/or other collections referenced by replace_files."

This is a little confusing because it essentially describes the order of operations backwards. Be more explicit, e.g. "The order of operations starts with the provided manifest_text, then applies replace_files, and then applies replace_segments. This means replace_segments can apply to blocks from manifest_text and/or other collections referenced by replace_files.

"manifest_text, then replace_files" doesn't sound quite right (we look at replace_files first, and if manifest_text is also present then it is used during the replace_files step, not afterward) but I've reversed the wording to

"The replace_files and manifest_text options, if present, are applied before replace_segments. This means replace_segments can apply to blocks from manifest_text and/or other collections referenced by replace_files."

The example is missing some explanation. Should have a sentence or two walking through it, e.g. "In this example, two file segments in blocks c4103f122d and 693e9af84d are combined into a single block ca9c491ac66". The collection with manifest_text should be listed first, in the example, so the flow matches the order of operations.

Fixed

This example is also slightly weird since we previously talk about repacking small files but this example is joining a single file that was split. How about having the example be for file1.txt and file2.txt?

Fixed

If a key does not match any existing block segment, that key and all other keys referencing the same replacement block will be skipped. Other replacements will still be applied. Replacements that are skipped for this reason do not cause the request to fail.

This sentence is confusing. Instead of "key" can we say "segment to be replaced"?

"keys referencing the same replacement block"

I think this means that if a segment to be replaced (the key on the left side) that doesn't match a segment in the manifest, then look at the block referenced on the right side and exclude all replacement entries that reference to that block on the right side. Is that right?

Yes, that's right.

Could you include a short rationale for this behavior? I'm guessing it is something like: if the manifest has concurrently changed, the assumptions behind optimizing the manifest may no longer make sense, so it is better to leave the manifest alone than to possibly complicate the manifest by introducing references to new blocks.

I've updated the wording and added a bit about the rationale.

Updates: 60bc4b6dd2

Merged main: 22319-replace-segments @ 63c9289cc675e7cbc6b6218c4ca03aacf38e30d4 -- developer-run-tests: #4566

Actions #7

Updated by Peter Amstutz 12 days ago

This reads much better. LGTM!

Actions #8

Updated by Tom Clegg 12 days ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF