Bug #5145
closed[Workbench] Combining collections should separate files with identical names, not concatenate them
Description
collection uuid
foo.txt hash 123
add foo.txt hash 234 to collection uuid
Currently this fails, is there a way to change this in a way that the file gets added but maybe the name gets changed? This came up in a customer meeting.
Updated by Brett Smith almost 10 years ago
Bryan,
Can you please explain the original bug/failure in more detail? Specifically, I'd like to know what tool you used to try to add the second file to the existing collection. The Workbench uploader has code to dynamically rename new files.
Updated by Bryan Cosca almost 10 years ago
so in here: qr1hi-j7d0g-0x4b08b68zanwxh
I have new collection and new collection (2)
I have foo.txt in each of them (one has text bar and the other foo)
I then combine the collections together and I got the collection qr1hi-4zz18-020vo4bxoc3vo1b which shows foo.txt concatenated with both of the original foo.txt (barfoo)
I think these should be two separate items.
Updated by Brett Smith almost 10 years ago
- Subject changed from Adding a file to a collection with the same name as a file in a collection to [Workbench] Combining collections should separate files with identical names, not concatenate them
- Category set to Workbench
Updated by Radhika Chippada almost 10 years ago
There are two distinct scenarios here.
- Scenario 1: Two files from two different directories of a Collection are combined. The files have same name and same content. Currently, this results in error: the new collection will have one file with that name whose contents are file1 contents + file2 contents (basically two copies of the same content). I think in this case, we can include just one copy of the file.
- Scenario 2: Two files from two different directories of a Collection are combined. The files have same name but different content. In this case, this results in error: the new collection will have one file with that name whose contents are file1 contents + file2 contents. How do we want resolve this scenario?
Updated by Tom Clegg almost 10 years ago
Radhika Chippada wrote:
Is there a known/common use case for this? I'm hesitant to add special behavior for this situation:Scenario 1: Two files from two different directories of a Collection are combined. The files have same name and same content. Currently, this results in error: the new collection will have one file with that name whose contents are file1 contents + file2 contents (basically two copies of the same content). I think in this case, we can include just one copy of the file.
- Sometimes it's obvious that file1 and file2 have the same content, but sometimes it isn't: different chunking can encode the same content using different block segments/hashes. It's hard for a Workbench user to see this distinction, so this behavior could end up being perceived as "unpredictably, duplicates are sometimes skipped and sometimes not".
- Even if the behavior is consistent, it still seems a bit mysterious: if I combine two collections with 100 files each, I get a collection with 199 files. Now I have to investigate why one file went missing. This could be addressed by flashing a notice "We skipped some duplicate files".
- What if I really wanted 200 files? Circumventing this feature could get awkward. OTOH, it's easy to hit a "delete file" button to remove duplicates by myself if that's what I want. (In that respect, a "delete file" feature would be a higher-priority usability improvement than automatic de-duplication.)
If we do have a de-duplication feature, I think it should be explicit: "You've just added multiple identical copies of some files to your collection. Want me to remove the redundant copies for you?" (But I don't think this feature should be bundled into the current bugfix.)
Scenario 2: Two files from two different directories of a Collection are combined. The files have same name but different content. In this case, this results in error: the new collection will have one file with that name whose contents are file1 contents + file2 contents. How do we want resolve this scenario?
I think we should do what the web uploader does in this situation: When adding "/foo.txt" to a collection that already has "/foo.txt", call the new file "/foo (X).txt" where X is the smallest positive integer that doesn't conflict with an existing file.
For purposes of this bugfix, I think scenario 1 and 2 can be treated exactly the same. ("Adding duplicate files results in duplicate files" might not be what the user wants in some cases, but I wouldn't call it a bug...)
Updated by Radhika Chippada almost 10 years ago
- Status changed from New to In Progress
- Assigned To set to Radhika Chippada
- Target version changed from Bug Triage to 2015-03-11 sprint
Updated by Radhika Chippada almost 10 years ago
- Updated combine action to handle duplicate names by appending a number to the file name being repeated.
- Combined collection may see duplicate names when same file names from different directories of a collection are being combined.
- It is also possible that the collections being combined have directories with same file names with same or different contents. When file names are repeated within a directory name, their names will only conflict with other files within that that same directory / stream. Other files in other streams will not be considered as duplicate names.
- Tested manually by combining collections or files with same names and different contents. Verified by doing keep get on the resulting collection.
- Added controller tests.
Updated by Peter Amstutz almost 10 years ago
normalized = normalized.gsub(manifest_file) {|s| uniq_file}
This needs to match more specifically, otherwise this could happen:
". abc+5 0:5:abc".gsub("abc") { |s| "xyz" } => ". xyz+5 0:5:xyz"
You need something like this:
normalized = normalized.gsub(/\d+:\d+:(#{Regexp.quote manifest_file})/) {|s| uniq_file}
part_match = /\d*:\d*:(.*)/.match(part)
Should be
part_match = /\d+:\d+:(\S+)/.match(part)
and the previous comment for gsub applies here as well:
adjusted_parts << (part.gsub(part_match[1]) {|s| uniq_file})
You should normalize mt
before processing it to ensure that there are no filenames with '/' in them.
Updated by Radhika Chippada almost 10 years ago
- normalized = normalized.gsub(/\d+:\d+:(#{Regexp.quote manifest_file})/) {|s| uniq_file}
This is resulting in removing the \d+:\d+: portion. Instead I split the string on file name index and applied gsub on the last part, which is the file name being replaced.
- part_match = /\d+:\d+:(\S+)/.match(part)
Done
- and the previous comment for gsub applies here as well: (adjusted_parts << (part.gsub(part_match1) {|s| uniq_file}))
This is not required since I am doing gsub on the filename part alone and hence there are no other such occurrences
- You should normalize mt before processing it to ensure that there are no filenames with '/' in them.
Done
- Also needed to update one collection fixture to have correct format (it was missing the locator)
Updated by Peter Amstutz almost 10 years ago
Radhika Chippada wrote:
- normalized = normalized.gsub(/\d+:\d+:(#{Regexp.quote manifest_file})/) {|s| uniq_file}
This is resulting in removing the \d+:\d+: portion. Instead I split the string on file name index and applied gsub on the last part, which is the file name being replaced.
You're right, I was wrong about it just matching the subexpression, here's the right way to do it:
normalized.gsub(/(\d+:\d+:)(#{Regexp.quote manifest_file})/) {|s| "#{$1}#{uniq_file}" }
This the first subexpression goes into the variable $1
and uses that to build a new string using the new filename.
- part_match = /\d+:\d+:(\S+)/.match(part)
Done
I still see
part_match = /\d*:\d*:(\S+)/.match(part)
\d*
is wrong since that will match zero-length strings.
- and the previous comment for gsub applies here as well: (adjusted_parts << (part.gsub(part_match1) {|s| uniq_file}))
This is not required since I am doing gsub on the filename part alone and hence there are no other such occurrences
I think you want do something like:
part_match = /(\d+:\d+:)(\S+)/.match(part) if part_match uniq_file = derive_unique_filename(part_match[2], manifest_files) adjusted_parts << "#{part_match[1]}#{uniq_file}" else adjusted_parts << part end
Updated by Radhika Chippada almost 10 years ago
Peter, thanks for the regexp lesson. Addressed all three items. Thanks.
Updated by Radhika Chippada almost 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:00c9cd4ecab3683d95118ab9d68310ac5069e9f6.