Feature #22319
closedAdd replace_segments feature to controller's CollectionUpdate API
Added by Tom Clegg about 1 month ago. Updated 10 days ago.
Description
- caller provides map of {old-smallblock-segment → new-bigblock-segment}
- can be combined with replace_files and/or caller-provided manifest_text
- changes are applied after replace_files (i.e., the mapping is applied to segments that appear in the caller-provided manifest_text as well as the existing manifest)
- if any of the provided old-smallblock-segments are not referenced in the current version, don't apply any of the changes that remap to the same new-bigblock-segment (this avoids a situation where two callers concurrently compute similar-but-different repackings, the first one applies cleanly, and the second one adds a large block that is mostly unreferenced)
Updated by Tom Clegg about 1 month ago
- Related to Idea #20996: Efficient packing of small files into blocks in keep-web added
Updated by Tom Clegg about 1 month ago
22319-replace-segments @ b3f72d579614d4efe07b647a653146428f04e5af -- developer-run-tests: #4563
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅
- Design did not specify how block segments are encoded. I went with
"locator offset length"
. I considered a colon separator because that's what we use inoffset:length:filename
tokens, but figured- Better to be more obvious that these aren't the same numbers as the offset and length in
offset:length:filename
tokens - Locators will never have spaces because adjacent locators in a manifest are separated by spaces
- Better to be more obvious that these aren't the same numbers as the offset and length in
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ automated tests added to source:lib/controller/localdb and source:sdk/go/arvados
- New or changed UX/UX and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- ✅ replace_segments section added to collections API docs
- Behaves appropriately at the intended scale (describe intended scale).
- Linear with # file data segments in collection. Expect this to be faster than loading the collection, which we optimized in #21696.
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- ✅
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2024-11-20 to Development 2024-12-04
Updated by Peter Amstutz 18 days ago
22319-replace-segments @ b3f72d579614d4efe07b647a653146428f04e5af
The API¶
Since the source and destination segments must be the same length, including the length on both sides of the mapping is redundant, and it raises an error if they don't match. Maybe just eliminate the length from one side or the other? (implementation-wise I see it's using the same BlockSegment struct and Marshal/Unmarshal methods, making it awkward, so maybe implementation convenience rules out over data normalization).
Documentation¶
The API documentation would benefit from a sentence or two about what it means by "segment", i.e. a span of bytes within a single block typically containing content of an individual file.
Be more explicit that the intended use of this API is by clients that want to optimize a manifest by describing how to map individual small segments to spans within a larger block, so that the resulting manifest references fewer blocks total.
"The replace_segments
option can be used with the update API" should be "update or create".
The "create" API section doesn't link back to the replace_files
or replace_segments
section.
The
replace_segments
option is applied afterreplace_files
andmanifest_text
. This means it can apply to blocks frommanifest_text
and/or other collections referenced byreplace_files
."
This is a little confusing because it essentially describes the order of operations backwards. Be more explicit, e.g. "The order of operations starts with the provided manifest_text
, then applies replace_files
, and then applies replace_segments
. This means replace_segments
can apply to blocks from manifest_text
and/or other collections referenced by replace_files
.
The example is missing some explanation. Should have a sentence or two walking through it, e.g. "In this example, two file segments in blocks c4103f122d and 693e9af84d are combined into a single block ca9c491ac66". The collection with manifest_text
should be listed first, in the example, so the flow matches the order of operations.
This example is also slightly weird since we previously talk about repacking small files but this example is joining a single file that was split. How about having the example be for file1.txt and file2.txt?
If a key does not match any existing block segment, that key and all other keys referencing the same replacement block will be skipped. Other replacements will still be applied. Replacements that are skipped for this reason do not cause the request to fail.
This sentence is confusing. Instead of "key" can we say "segment to be replaced"?
"keys referencing the same replacement block"
I think this means that if a segment to be replaced (the key on the left side) that doesn't match a segment in the manifest, then look at the block referenced on the right side and exclude all replacement entries that reference to that block on the right side. Is that right?
Could you include a short rationale for this behavior? I'm guessing it is something like: if the manifest has concurrently changed, the assumptions behind optimizing the manifest may no longer make sense, so it is better to leave the manifest alone than to possibly complicate the manifest by introducing references to new blocks.
Updated by Peter Amstutz 17 days ago
- Target version changed from Development 2024-12-04 to Development 2025-01-08
Updated by Tom Clegg 16 days ago
Peter Amstutz wrote in #note-4:
Since the source and destination segments must be the same length, including the length on both sides of the mapping is redundant, and it raises an error if they don't match. Maybe just eliminate the length from one side or the other? (implementation-wise I see it's using the same BlockSegment struct and Marshal/Unmarshal methods, making it awkward, so maybe implementation convenience rules out over data normalization).
I'm not sold on that optimization. It's true that the length part of the replacement could be implied, and that would reduce the encoded size of requests and make "length mismatch" error responses impossible. OTOH I think it's easier to understand if we use the same type -- e.g., map[BlockSegment]BlockSegment
evokes "replace each thing with same kind of thing". And yes, it's nice to have less code / smaller API surface.
The API documentation would benefit from a sentence or two about what it means by "segment", i.e. a span of bytes within a single block typically containing content of an individual file.
Agreed. Added an introduction to that term.
Be more explicit that the intended use of this API is by clients that want to optimize a manifest by describing how to map individual small segments to spans within a larger block, so that the resulting manifest references fewer blocks total.
Updated "e.g., replacing a number of small blocks with one larger block" to "typically with the goal of replacing a number of small blocks with one larger block"... I'm not sure how much more discussion we need about why that's valuable. At least on this page, I'm trying to stay focused on the specific API behavior. Perhaps the most valuable "why use this?" guidance would be something like "if you're not writing Arvados system components, you probably won't use this API."
"The
replace_segments
option can be used with the update API" should be "update or create".
Fixed
The "create" API section doesn't link back to the
replace_files
orreplace_segments
section.
Fixed
The
replace_segments
option is applied afterreplace_files
andmanifest_text
. This means it can apply to blocks frommanifest_text
and/or other collections referenced byreplace_files
."This is a little confusing because it essentially describes the order of operations backwards. Be more explicit, e.g. "The order of operations starts with the provided
manifest_text
, then appliesreplace_files
, and then appliesreplace_segments
. This meansreplace_segments
can apply to blocks frommanifest_text
and/or other collections referenced byreplace_files
.
"manifest_text
, then replace_files
" doesn't sound quite right (we look at replace_files
first, and if manifest_text
is also present then it is used during the replace_files
step, not afterward) but I've reversed the wording to
"The replace_files
and manifest_text
options, if present, are applied before replace_segments
. This means replace_segments
can apply to blocks from manifest_text
and/or other collections referenced by replace_files
."
The example is missing some explanation. Should have a sentence or two walking through it, e.g. "In this example, two file segments in blocks c4103f122d and 693e9af84d are combined into a single block ca9c491ac66". The collection with
manifest_text
should be listed first, in the example, so the flow matches the order of operations.
Fixed
This example is also slightly weird since we previously talk about repacking small files but this example is joining a single file that was split. How about having the example be for file1.txt and file2.txt?
Fixed
If a key does not match any existing block segment, that key and all other keys referencing the same replacement block will be skipped. Other replacements will still be applied. Replacements that are skipped for this reason do not cause the request to fail.
This sentence is confusing. Instead of "key" can we say "segment to be replaced"?
"keys referencing the same replacement block"
I think this means that if a segment to be replaced (the key on the left side) that doesn't match a segment in the manifest, then look at the block referenced on the right side and exclude all replacement entries that reference to that block on the right side. Is that right?
Yes, that's right.
Could you include a short rationale for this behavior? I'm guessing it is something like: if the manifest has concurrently changed, the assumptions behind optimizing the manifest may no longer make sense, so it is better to leave the manifest alone than to possibly complicate the manifest by introducing references to new blocks.
I've updated the wording and added a bit about the rationale.
Updates: 60bc4b6dd2
Merged main: 22319-replace-segments @ 63c9289cc675e7cbc6b6218c4ca03aacf38e30d4 -- developer-run-tests: #4566
Updated by Tom Clegg 12 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|05e1189fb680150c7737fe957b43b314e48daeb1.