Story #11676


Tile library versioning specification and tool

Added by Abram Connelly over 5 years ago. Updated about 5 years ago.

Finalize a draft of the tile library versioning techniques and create a tool to calculate and check the tile library version.

See the Tile Library Versioning section of the Lightning Genome Library Format document.

The basic idea is to create a "tile path manifest" for each tile path in a library version: an ordered list of tile IDs, each paired with the hash of its sequence. Hashing the tile path manifest gives the library tile path version. The tile path versions are then collected into a second manifest, with one entry per library tile path, which is hashed in turn to give the tile library version.
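
A minimal sketch of this two-level hashing, under stated assumptions: SHA-256 as the hash function, a "tile ID, space, sequence hash, newline" manifest line format, and ordering of tile paths by sorted path ID. None of these are specified in this story, and the function names are hypothetical.

```python
import hashlib


def sequence_hash(seq: str) -> str:
    # Assumption: SHA-256 over the ASCII sequence; the story does not name a hash.
    return hashlib.sha256(seq.encode("ascii")).hexdigest()


def tile_path_manifest(tiles) -> str:
    """Build the manifest text for one tile path.

    tiles: list of (tile_id, sequence) pairs, already in tile order.
    Assumed line format: "<tile_id> <sequence_hash>\n".
    """
    return "".join(f"{tile_id} {sequence_hash(seq)}\n" for tile_id, seq in tiles)


def tile_path_version(tiles) -> str:
    # The tile path version is the hash of that path's manifest.
    return hashlib.sha256(tile_path_manifest(tiles).encode("ascii")).hexdigest()


def tile_library_version(paths) -> str:
    """Hash the per-path versions into one library version.

    paths: dict mapping a tile path ID to its ordered (tile_id, sequence) list.
    Assumption: paths are listed in sorted order of their IDs.
    """
    lines = [f"{path_id} {tile_path_version(paths[path_id])}\n"
             for path_id in sorted(paths)]
    return hashlib.sha256("".join(lines).encode("ascii")).hexdigest()


# Hypothetical toy library with two tile paths.
example_paths = {
    "035e": [("035e.00.0000.0000", "acgt"), ("035e.00.0001.0000", "ttga")],
    "035f": [("035f.00.0000.0000", "ggcc")],
}
example_version = tile_library_version(example_paths)
```

With this layout, changing a single tile sequence changes only that tile path's version and the library version, so unchanged tile paths can be recognized by their unchanged path versions.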

To check whether an individual genome is compatible with the tile library version, a tile manifest is built for the genome: the list of its tile IDs, each paired with its sequence hash. Hashing this manifest gives an individual "genome hash". The genome hash can be calculated on the client side for the specified genome, calculated again on the server side where the tile library resides, and the two values compared to see if they match.
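
The check could look roughly like the sketch below. It assumes SHA-256, the same "tile ID, space, sequence hash" line format on both sides, and a server-side library represented as a plain dict from tile ID to sequence hash. All names here are hypothetical and not part of any existing tool.

```python
import hashlib


def genome_hash(entries) -> str:
    """Hash a genome's tile manifest.

    entries: ordered list of (tile_id, sequence_hash) pairs.
    """
    manifest = "".join(f"{tile_id} {seq_hash}\n" for tile_id, seq_hash in entries)
    return hashlib.sha256(manifest.encode("ascii")).hexdigest()


def server_genome_hash(tile_ids, library) -> str:
    """Recompute the genome hash from the library's own sequence hashes.

    library: dict mapping tile_id -> sequence hash (assumed representation).
    Raises KeyError if the library does not contain one of the tile IDs.
    """
    return genome_hash([(tile_id, library[tile_id]) for tile_id in tile_ids])


def is_compatible(client_entries, library) -> bool:
    # The genome is compatible when both sides arrive at the same hash.
    tile_ids = [tile_id for tile_id, _ in client_entries]
    try:
        return genome_hash(client_entries) == server_genome_hash(tile_ids, library)
    except KeyError:
        # A tile ID unknown to the library cannot match.
        return False
```

In a real deployment only the tile ID list and the client's genome hash would need to cross the network; the comparison is collapsed into one function here for brevity.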

The tile library will need a list of tile IDs to determine the genome hash but this can be cached if speed is a concern.

Care needs to be taken with the genome hash as a genome often has multiple alleles.

  • Should the manifest used for the genome hash interleave the tile IDs of the alleles, or list each allele serially, one after the other?
  • Since the tile IDs are in the manifest and hashed, a standard format has to be specified and explicitly enforced. For example, although "35e.0.0.1+a" and "035e.00.0000.0001+a" specify the same tile ID, they will produce different genome hashes if the client and server aren't using the same tile ID format. It might be better to just allow variable length in this case?
  • We generally treat single-allele portions of the genome (chrX, chrY, chrM) as doubly allelic, since special-casing them isn't worth the hassle, but we need to make sure this isn't confusing when calculating the genome hash. We might want to extend the CGF format with a per-path flag that states explicitly how many alleles each path has, even if the underlying storage format doesn't change.
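
To make the first two questions concrete, here is a small sketch of one possible canonical tile ID form and of the two candidate allele orderings. It assumes the dot-separated fields and the "+" suffix are hexadecimal, and it takes the fixed widths (4, 2, 4, 4) from the "035e.00.0000.0001+a" example above. The function names are hypothetical and nothing here is a decided format.

```python
from itertools import zip_longest

# Assumption: field widths copied from the "035e.00.0000.0001+a" example.
FIELD_WIDTHS = (4, 2, 4, 4)


def normalize_tile_id(tile_id: str) -> str:
    """Rewrite a tile ID with zero-padded, lowercase hex fields."""
    body, sep, suffix = tile_id.partition("+")
    fields = body.split(".")
    if len(fields) != len(FIELD_WIDTHS):
        raise ValueError(f"expected {len(FIELD_WIDTHS)} fields in {tile_id!r}")
    padded = ".".join(format(int(field, 16), f"0{width}x")
                      for field, width in zip(fields, FIELD_WIDTHS))
    if sep:
        # Keep the "+" suffix, written as unpadded lowercase hex.
        return f"{padded}+{int(suffix, 16):x}"
    return padded


def serial_order(allele_a, allele_b):
    # All tile IDs of the first allele, then all of the second.
    return list(allele_a) + list(allele_b)


def interleaved_order(allele_a, allele_b):
    # Alternate between alleles; a shorter allele simply runs out first.
    ordered = []
    for a, b in zip_longest(allele_a, allele_b):
        if a is not None:
            ordered.append(a)
        if b is not None:
            ordered.append(b)
    return ordered
```

Whichever ordering and ID form is chosen, the client and server have to apply the same one before hashing, since the two orderings give different manifests for the same genome.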
