Idea #11676

open

Tile library versioning specification and tool

Added by Abram Connelly almost 7 years ago. Updated almost 7 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Story points:
3.0

Description

Finalize a draft of the tile library versioning techniques and create a tool to calculate and check the tile library version.

See the Tile Library Versioning section of the Lightning Genome Library Format document.

The basic idea is that, for a given library version, we create a "tile path manifest" for each tile path: an ordered list of tile IDs, each paired with the hash of its sequence. Each tile path manifest is hashed to give the library tile path version. The per-path versions are then collected into a manifest on a library tile path basis, which is again hashed to create the tile library version.
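
The two-level hashing described above could be sketched roughly as follows. This is only an illustration under stated assumptions: SHA-256 as the hash, the "tile ID, space, hash, newline" line layout, sorting paths by ID, and the function names are all hypothetical, since the specification has not been finalized.

```python
import hashlib

def sha256_hex(text: str) -> str:
    # Assumption: SHA-256 over ASCII text; the spec may pick another hash.
    return hashlib.sha256(text.encode("ascii")).hexdigest()

def tile_path_manifest(tiles) -> str:
    """tiles: (tile_id, sequence) pairs, already in tile path order."""
    # Assumed line layout: "<tile_id> <sequence hash>\n"
    return "".join(f"{tile_id} {sha256_hex(seq)}\n" for tile_id, seq in tiles)

def tile_path_version(tiles) -> str:
    # The tile path version is the hash of that path's manifest.
    return sha256_hex(tile_path_manifest(tiles))

def library_version(paths) -> str:
    """paths: dict mapping a tile path ID to its ordered tile list."""
    # Assumption: paths are listed in sorted ID order so the result is stable.
    lines = [f"{pid} {tile_path_version(paths[pid])}\n" for pid in sorted(paths)]
    return sha256_hex("".join(lines))

# Hypothetical two-path library with made-up sequences.
path_035e = [("035e.00.0000.0000", "gcatgcat"), ("035e.00.0000.0001", "ttagcatg")]
path_035f = [("035f.00.0000.0000", "ccgtaacg")]
print(library_version({"035e": path_035e, "035f": path_035f}))
```

Because each manifest is ordered, reordering tiles or changing any one sequence changes the tile path version and, through it, the library version.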

To check whether an individual genome is compatible with the tile library version, a tile manifest is created from the list of tile IDs the genome uses along with their sequence hashes, and this manifest is hashed to get an individual "genome hash". The genome hash can be calculated for the specified genome on the client side, say, then calculated again on the server side, where the tile library resides, and the two compared to see if they match.

The tile library will need a list of tile IDs to determine the genome hash but this can be cached if speed is a concern.
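
As a rough sketch of the client/server comparison, assuming SHA-256, a "tile ID, space, hash, newline" manifest line, and a single allele for simplicity (the names `genome_hash` and `is_compatible` are hypothetical):

```python
import hashlib

def sha256_hex(text: str) -> str:
    # Assumption: SHA-256 over ASCII text; the spec may pick another hash.
    return hashlib.sha256(text.encode("ascii")).hexdigest()

def genome_hash(tile_ids, sequence_for) -> str:
    """tile_ids: tile IDs in genome order; sequence_for: tile ID -> sequence."""
    manifest = "".join(
        f"{tid} {sha256_hex(sequence_for[tid])}\n" for tid in tile_ids
    )
    return sha256_hex(manifest)

def is_compatible(client_hash, tile_ids, library) -> bool:
    """Server-side check: recompute the genome hash from the tile library."""
    if any(tid not in library for tid in tile_ids):
        return False  # the library lacks a tile the genome references
    return genome_hash(tile_ids, library) == client_hash

# Hypothetical data: the client holds its own tile sequences,
# the server holds the full tile library.
library = {
    "035e.00.0000.0000": "gcatgcat",
    "035e.00.0000.0001": "ttagcatg",
    "035e.00.0001.0000": "ccgtaacg",
}
client_tiles = {
    "035e.00.0000.0001": "ttagcatg",
    "035e.00.0001.0000": "ccgtaacg",
}
tile_ids = list(client_tiles)
client_hash = genome_hash(tile_ids, client_tiles)
print(is_compatible(client_hash, tile_ids, library))
```

The server only needs the list of tile IDs and the client's hash, so a result computed once for a given list could be cached.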

Care needs to be taken with the genome hash as a genome often has multiple alleles.

  • Should the genome hash interleave the tile IDs of the alleles in the manifest, or list each allele serially, one after the other?
  • Since the tile IDs are in the manifest and hashed, a standard format has to be specified and explicitly enforced. For example, though "35e.0.0.1+a" and "035e.00.0000.0001+a" specify the same tile ID, they'll mess up the genome hash if the client and server aren't using the same tile ID format. It might be better to just allow variable length in this case?
  • We generally treat single-allele portions of the genome (chrX, chrY, chrM) as doubly allelic, since it's not worth the hassle of making special considerations for them, but we need to make sure this isn't confusing when calculating the genome hash. We might want to extend the CGF format with a flag per path to explicitly tell each path how many alleles it has, even if the underlying storage format doesn't change.
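
To make the first two questions above concrete, here is a small sketch. The canonical form chosen (zero-padded hex fields of width 4.2.4.4, with the span after "+" left unpadded) is inferred from the example IDs and is an assumption, as are the helper names; the open questions themselves are not settled here.

```python
from itertools import chain, zip_longest

def canonical_tile_id(tile_id: str) -> str:
    """Normalize a tile ID so equivalent spellings hash identically."""
    body, sep, span = tile_id.partition("+")
    fields = body.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields: {tile_id!r}")
    widths = (4, 2, 4, 4)  # assumed field widths, taken from the example ID
    out = ".".join(format(int(f, 16), f"0{w}x") for f, w in zip(fields, widths))
    if sep:
        out += "+" + format(int(span, 16), "x")  # span kept unpadded
    return out

def serial_order(allele_a, allele_b):
    # All of allele A's tile IDs, then all of allele B's.
    return list(allele_a) + list(allele_b)

def interleaved_order(allele_a, allele_b):
    # Alternate A, B, A, B; alleles may differ in length, so pad and drop.
    pairs = zip_longest(allele_a, allele_b)
    return [tid for tid in chain.from_iterable(pairs) if tid is not None]

print(canonical_tile_id("35e.0.0.1+a"))
print(serial_order(["a0", "a1"], ["b0", "b1"]))
print(interleaved_order(["a0", "a1"], ["b0", "b1"]))
```

Either ordering works as long as client and server agree on it; the two produce different manifests and therefore different genome hashes.
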
Actions #1

Updated by Abram Connelly almost 7 years ago

  • Target version set to Lightning Sprint (2017-05-15 to 2017-05-29)
Actions #2

Updated by Abram Connelly almost 7 years ago

Since we're only concerned with whether a tile library has the tiles that the genome references, we don't need the nocall information for the genome when creating the "genome library hash".

There are three cases:

  • The sequence as it appears in the genome data. e.g. gcatgcatnnnngcat...
  • The sequence as it appears in the tile library. Nocalls should not appear. e.g. gcatgcatatatgcat...
  • The sequence as it appears when creating the tagmask hash of the tile. e.g. gcatgcatATATgcat... (since the nocalls fall on the beginning tag, say).

I mention it because there are different ways of creating the hash of the tile and they'll all give potentially different hashes.

I don't think we need to worry about the sequence as it appears in the genome or the "tagmask" hash when calculating the genome tile library version. We care about the tile as it appears in the tile library, which we can recreate from the tile library and the tiled vector for the genome, ignoring nocall data.
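
A quick illustration of why the choice matters, using the three example strings above (truncated at the "..."). SHA-256 and the helper name `library_tile_hash` are assumptions made for the sketch, as is the tile ID used in the lookup.

```python
import hashlib

def sha256_hex(seq: str) -> str:
    # Assumption: SHA-256; any fixed hash shows the same effect.
    return hashlib.sha256(seq.encode("ascii")).hexdigest()

forms = {
    "genome": "gcatgcatnnnngcat",   # as in the genome data, with nocalls
    "library": "gcatgcatatatgcat",  # as in the tile library, no nocalls
    "tagmask": "gcatgcatATATgcat",  # as used for the tagmask hash
}
hashes = {name: sha256_hex(seq) for name, seq in forms.items()}
for name, digest in hashes.items():
    print(name, digest)

def library_tile_hash(tile_id: str, library: dict) -> str:
    """Hash the tile as stored in the library, ignoring genome nocall data."""
    return sha256_hex(library[tile_id])

# Hypothetical lookup: the tile ID comes from the genome's tiled vector.
example_library = {"035e.00.0000.0001": forms["library"]}
print(library_tile_hash("035e.00.0000.0001", example_library))
```

All three forms hash differently, so the genome hash only stays comparable between client and server if both sides hash the library form.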
