Idea #11672

closed

Add "band" functionality to fjt tool

Added by Abram Connelly almost 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assigned To:
Target version: -
Story points: 3.0

Description

Extend the fjt tool to output "band format".

We've settled on an "intermediate" format for describing tiled genomes, which we call "band format". It consists of 2 or 4 vectors of numbers. The first pair of vectors holds the tile variant IDs, with negative numbers indicating a non-trivial tile. The second pair of vectors holds the "low quality" (no-call) information: each position is an array with an even number of entries, where the even-indexed entries (counting from 0) give the start of a no-call relative to the start of the tile and the odd-indexed entries give the length of that no-call.

The variant tile values can be summarized as:

  • >=0: Tile variant with the appropriate value (0 being the "default" or canonical tile)
  • -1: Indicates a non-anchor spanning tile
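
As a rough illustration of the value conventions above, here is a minimal Python sketch. The function name and the returned labels are made up for this example and are not part of fjt or cgft; any negative value other than -1 is reported as "unknown" because the ticket only defines -1.

```python
def describe_variant(value):
    """Label a single band-format tile variant value.

    Hypothetical helper for illustration only; the labels are not
    terms used by the fjt or cgft tools.
    """
    if value == 0:
        return "canonical"   # the "default" tile
    if value > 0:
        return "variant"     # tile variant with that ID
    if value == -1:
        return "spanning"    # non-anchor spanning tile
    return "unknown"         # other negatives are not defined in this ticket


# A short slice in the style of the example vectors
labels = [describe_variant(v) for v in [79, 8, 0, -1]]
print(labels)
```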

For example, here is a band representation of path 0x35e:

[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1]
[ 79 2 0 0 0 0 0 -1 0 0 0 390 0 0 0 0 0 1 0 0 0 0 0 0 26 0 0 1 0 0 -1 34 -1 185 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
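
To make the layout concrete, the following Python sketch reads band lines of the shape shown above. It assumes the textual form is exactly what appears in this ticket (whitespace-separated integers inside square brackets); the function names are hypothetical and this is not how fjt or cgft parse the data internally.

```python
import re


def parse_variant_vector(line):
    """Parse a line like '[ 79 8 0 -1 ]' into a list of ints."""
    return [int(tok) for tok in line.strip().strip("[]").split()]


def parse_nocall_vector(line):
    """Parse a line like '[[ ][ 903 1 ][ ]]' into one list per tile position.

    Each inner list holds (start, length) pairs: even-indexed entries are
    start offsets from the tile start, odd-indexed entries are lengths.
    """
    inner = line.strip()[1:-1]  # drop the outer pair of brackets
    positions = []
    for group in re.findall(r"\[([^\[\]]*)\]", inner):
        nums = [int(tok) for tok in group.split()]
        if len(nums) % 2 != 0:
            raise ValueError("no-call entries must come in (start, length) pairs")
        positions.append(list(zip(nums[0::2], nums[1::2])))
    return positions


# Shortened sample in the same style as the example above
variants = parse_variant_vector("[ 79 8 0 -1 ]")
nocalls = parse_nocall_vector("[[ ][ 903 1 ][ ][ 16 1 96 2 ]]")
print(variants)
print(nocalls)
```

A consumer would typically also check that each variant vector and its no-call vector have the same number of positions.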

With this format, we can easily create a CGFv3 file.
This suggests a basic workflow:

  • Convert from source format to FastJ
  • Collect, deduplicate and "impute" all tiles to create the tile library
  • Use the source FastJ and tile library to create a band format
  • Use the band format to create the CGFv3

The cgft can create CGFv3 from band data. The fjt tool should take care of the third step, converting from FastJ and tile library (SGLF) to band format.

As a general rule, splitting by tile path seems to be a good compromise between functionality, memory footprint and speed. With this, an example usage of fjt could be:

fjt -L 035e.sglf -i 035e.fj -B -p 862

This should output the band format shown above.
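
Note that 862 is the decimal value of 0x35e, so the -p argument in the example appears to take the tile path in decimal while the file names use four-digit hex. That reading is inferred from the numbers in this ticket rather than stated in it. A small Python sketch of the conversion, with hypothetical helper names:

```python
def path_to_decimal(hex_path):
    """Convert a hex tile path such as '035e' or '0x35e' to an integer."""
    return int(hex_path, 16)


def decimal_to_path(number, width=4):
    """Format an integer tile path as zero-padded lowercase hex.

    The default width of 4 matches the '035e' file names in the example;
    it is an assumption, not a documented convention.
    """
    return format(number, "0{}x".format(width))


print(path_to_decimal("035e"))
print(decimal_to_path(862))
```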

Actions #1

Updated by Abram Connelly almost 7 years ago

  • Status changed from New to In Progress
  • Assigned To set to Abram Connelly

To make sure that the conversions to the new CGFv3 format don't introduce errors, it'd be nice to double check the FastJ against the CGF created. To facilitate this, it'd be nice to be able to convert directly from FastJ to band format.

Since other CGF conversion tickets depend on this, I'm prioritizing FastJ to band format conversion by updating the fjt tool.

Actions #2

Updated by Abram Connelly almost 7 years ago

  • Status changed from In Progress to Closed

Update has been merged.

In addition, some simple tests along with a test script have been added to fjt. Tests are minimal for now: they cover only a small tile path (0x035e) and only one allele. Future tests should cover a bigger tile path along with both alleles.
