Keep manifest format

Manifest v1

A manifest is utf-8 encoded text, consisting of zero or more newline-terminated streams.

Each stream consists of three or more space-delimited tokens:
  • The first token is a stream name, consisting of one or more path components, delimited by "/".
    • The first path component is always ".".
    • No path component is empty.
    • No path component is "." or ".." (except the leading ".").
    • The stream name never begins or ends with "/".
  • The second token is a data blob locator (see Keep locator format).
  • ...possibly followed by more data blob locators...
  • The first token that is not a block locator, and all subsequent tokens, are file tokens.
    • A file token has three parts, delimited by ":": position, size, filename.
    • Position and size are given in decimal, and are counted from the beginning of the first data blob.
    • Filename may contain "/" characters, but must not start or end with "/", and must not contain "//".
    • Filename components (delimited by "/") must not be "." or "..".
    • Except: Filename may be "." if size is 0. This does not represent a real file; it is a placeholder used to ensure there is at least one file token in a stream that contains no files.

A manifest contains no TAB characters, nor other ASCII whitespace characters other than the spaces or newline delimiters specified above.

Whitespace, backslashes, and special characters appearing in paths and filenames are encoded as \nnn where nnn is a three-digit octal byte code.
  • A backslash character is encoded as \134.
  • A space is encoded as \040.
  • It is permitted to escape printable characters: "fo\157\057bar" and "foo/bar" are equivalent.

A manifest always ends with a newline -- except the empty (zero-length) string, which is a valid manifest.

Normalized manifest v1

A normalized manifest has the following additional restrictions.
  • Streams are in alphanumeric order.
  • Each stream name is unique within the manifest.
  • Files within a stream are in alphanumeric order.
  • Concatenation stream_name/filename is unique within the manifest. (This can be impossible to accomplish without rewriting blobs.)
  • Filename must not contain "/".

An API call exists will exist to normalize a manifest.

Request:
  • POST /arvados/v1/collections/{hash}/normalize
  • request body: {"collection":{"manifest_text":"...."}}
Response:
  • {"uuid":"...","manifest_text":"..."}
Notes:
  • POST despite no side effects.
  • Returns object with uuid even though no object was stored.

Manifest v2

(Early design stages)

Should probably include:
  • Structured format (JSON?)
  • More than one level of indirection (e.g., manifest references block X, which references data blocks A,B,C)
  • Specify hash algorithm with block hashes