Task #4837

Feature #4823: [SDKs] Good Collection API for Python SDK

[SDKs] Define API and in-memory data structure for collections in Python SDK

Added by Tim Pierce over 4 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
12/17/2014
Due date:
% Done:

0%

Estimated time:
0.00 h

Description

A good data structure for representing the content of a collection in memory will facilitate cleaner, more efficient code for manipulating that content.

Conceptually, a collection contains files and collections (analogous to subdirectories). Currently there are two distinct types of collection:
  1. Collections that correspond to records in the Arvados database (have uuid, portable data hash, manifest_text)
  2. Subdirectories within collections (encoded as stream names and filenames with slashes)

A collection has a name and an array of items. Each item is either a file or a collection.

Each file has a name and an array of data segments.

Each data segment is either
  1. a Keep locator and a list of byte ranges, or
  2. a data buffer (useful when some data is not written to Keep yet).

The file's contents is defined as the concatenation of the specified byte ranges from all data segments, in the order given.

Example:

File{
  name: "foo.txt",
  data_segments: [
    {
      locator: "6a4ff0499484c6c79c95cd8c566bd25f+249025",
      byte_ranges: [ [0, 75], [250, 300], [150, 200] ]
    },
    {
      locator: "ea10d51bcf88862dbcc36eb292017dfd+45",
      byte_ranges: [ [30, 40] ]
    }
  ]
}

(Examples here are written in JSON-ish notation for convenience. This is not intended to be used as a data interchange format.)

This represents a file named foo.txt and consisting of bytes 0-75, 250-300, and 150-200 from block 6a4ff0499484c6c79c95cd8c566bd25f+249025, followed by bytes 30-40 from ea10d51bcf88862dbcc36eb292017dfd+45.

An example for an entire collection:

Collection{
  name: "dir",
  items: [
    File{
      name: "foo.txt",
      data_segments: [
        {
          locator: "ea10d51bcf88862dbcc36eb292017dfd+45",
          byte_ranges: [ [0, 45] ]
        }
      ]
    },
    Collection{
      name: "subdir",
      items: [
        {
          type: "file",
          name: "bar.txt",
          data_segments: [
            {
              locator: "cdd549ae79fe6640fa3d5c6261d8303c+195",
              byte_ranges: [ [0, 195] ],
            }
          ]
        }
      ]
    }
  ]
}

History

#1 Updated by Tom Clegg over 4 years ago

  • Subject changed from [SDKs] structured data representation for collection manifests to [SDKs] Define API and in-memory data structure for collections in Python SDK
  • Description updated (diff)
  • Story points changed from 3.0 to 1.0

#2 Updated by Tom Clegg over 4 years ago

  • Status changed from New to In Progress

#3 Updated by Tom Clegg over 4 years ago

  • Tracker changed from Feature to Task

#4 Updated by Tom Clegg over 4 years ago

  • Parent task set to #4823

#5 Updated by Peter Amstutz over 4 years ago

The current representation is more like this, I would strongly prefer to keep in that way since there are is already bunch of carefully written, debugged and optimized code in StreamReader and FileStreamReader:

File {
  name: "foo.txt",
  data_locators: ["6a4ff0499484c6c79c95cd8c566bd25f+249025", "ea10d51bcf88862dbcc36eb292017dfd+45"],
  data_segments: [ [0, 75], [250, 300], [150, 200], [249055, 40] ]
}

#6 Updated by Tom Clegg over 4 years ago

Peter Amstutz wrote:

The current representation is more like this, I would strongly prefer to keep in that way since there are is already bunch of carefully written, debugged and optimized code in StreamReader and FileStreamReader:

File {
  name: "foo.txt",
  data_locators: ["6a4ff0499484c6c79c95cd8c566bd25f+249025", "ea10d51bcf88862dbcc36eb292017dfd+45"],
  data_segments: [ [0, 75], [250, 300], [150, 200], [249055, 40] ]
}

I don't value the existing code as much as future code. Is this actually easier to work with?

I do like using [start,length] instead of [start,end].

#7 Updated by Peter Amstutz over 4 years ago

Internally, StreamFileReader stores [stream_offset, segment_length, file_offset] and StreamReader stores [keep_locator, block_length, stream_offset]. It uses the 3rd item to perform a binary search in order to efficiently find the 1st block when extracting a range, but that's just an implementation detail that doesn't need to be exposed. The current representation is O(log b) where 'b' is the number of blocks + O(r) where 'r' in the number of segments or blocks that intersect the read range (usually 1 or 2 unless a file is fragmented). If we organized it based on the #4837 description, a random access read would require a complete O(n*m) search across all blocks and file segments to determine which segments and blocks intersect with the read range.

#8 Updated by Peter Amstutz over 4 years ago

File{
  name: "foo.txt",
  data_segments: [
    {
      locator: "6a4ff0499484c6c79c95cd8c566bd25f+249025",
      file_offset: 0,
      block_offset: 0,
      size: 75
    },
    {
      locator: "6a4ff0499484c6c79c95cd8c566bd25f+249025",
      file_offset: 75,
      block_offset: 250,
      size: 50
    },
    {
      locator: "6a4ff0499484c6c79c95cd8c566bd25f+249025",
      file_offset: 125,
      block_offset: 150,
      size: 50
    },
    {
      locator: "ea10d51bcf88862dbcc36eb292017dfd+45",
      file_offset: 175,
      block_offset: 30,
      size: 10
    }
  ]
}

#9 Updated by Tim Pierce over 4 years ago

  • Assigned To changed from Tim Pierce to Peter Amstutz

#10 Updated by Peter Amstutz over 4 years ago

  • Status changed from In Progress to Resolved
  • Remaining (hours) set to 0.0

#11 Updated by Tom Morris 5 months ago

  • Estimated time set to 0.00 h

Also available in: Atom PDF