Story #7427

CGF of 1kg and PGP for paths 2c5 and 247 (cont.)

Added by Abram Connelly over 3 years ago. Updated almost 3 years ago.

#1 Updated by Sarah Guthrie over 3 years ago

  • Project changed from Curoverse Science to Lightning
  • Target version changed from 111 to 2015-10-23 Lightning sprint

#2 Updated by Abram Connelly over 3 years ago

  • Assigned To set to Abram Connelly

Continued from #7098

#3 Updated by Abram Connelly over 3 years ago

  • Status changed from New to In Progress

Updates to CGF schema:

Some preliminary estimates put the nocall information at about 17Kb for a single path for a single sample. This scales to about 17Mb for a whole genome. Gzipping the resulting binary file pushes it down to 11Kb (which would imply about 11Mb whole genome).

I think a lesson should be taken from the CGLF: I should use this representation and move on. We can optimize the nocall information at a future date. For now the resulting nocall binary structure can be gzipped and unpacked when needed. A Code field has been added to future-proof the LowQualityInfo structure. For now a Code of 0 represents the structure presented. Maybe we can use a Code of 1 to represent the gzipped portion.
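A minimal sketch of how a Code field could select between the raw and gzipped forms. The one-byte Code prefix, the constant names, and the `pack_loq`/`unpack_loq` helpers are assumptions for illustration only and are not taken from the actual CGF schema; a Code of 1 for gzip is only the possibility floated above.

```python
import gzip

CODE_RAW = 0   # Code 0: payload is the structure as presented
CODE_GZIP = 1  # hypothetical Code 1: payload is gzip-compressed

def pack_loq(payload: bytes, compress: bool = False) -> bytes:
    """Prefix the payload with a one-byte Code (assumed layout)."""
    if compress:
        return bytes([CODE_GZIP]) + gzip.compress(payload)
    return bytes([CODE_RAW]) + payload

def unpack_loq(blob: bytes) -> bytes:
    """Return the uncompressed payload, dispatching on the Code byte."""
    code, body = blob[0], blob[1:]
    if code == CODE_RAW:
        return body
    if code == CODE_GZIP:
        return gzip.decompress(body)
    raise ValueError(f"unknown Code {code}")
```

A reader that only understands Code 0 can then reject unknown codes cleanly instead of misparsing the bytes.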

This is still an estimate, so the nocall information might need closer to 20Mb all told.

#4 Updated by Abram Connelly over 3 years ago

Packed representation is progressing. Currently there's a final nocall entry that's being missed at the end but otherwise it looks good. The current snapshot is:

Current schema is:

I don't think there are any structural changes. The schema (data structure section) has been clarified to indicate exactly what the Offset and StepPosition arrays hold. The Offset array holds the byte offset of the Stride*k low quality entry, measured from LoqInfo[0]. The StepPosition array holds the tile position entry for the Stride*k LoqInfo record. This means the Vector and potentially Overflow (and FinalOverflowMap) structures above will need to be consulted to reconstruct the tile position that the low quality information is for.
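The strided index described here can be sketched roughly as follows. The one-byte length prefix on each record, the function names, and the plain integer step values are assumptions made so the example is self-contained; the real LoqInfo record encoding is not reproduced here.

```python
def build_index(records, steps, stride):
    """Pack records into one blob and index every stride-th record.

    records: list of byte strings (each under 256 bytes in this sketch)
    steps:   tile position entry for each record (same length as records)
    """
    blob = bytearray()
    offset, step_position = [], []
    for i, (rec, step) in enumerate(zip(records, steps)):
        if i % stride == 0:
            offset.append(len(blob))    # byte offset from the start of the blob
            step_position.append(step)  # position entry of record stride*k
        blob.append(len(rec))           # hypothetical 1-byte length prefix
        blob += rec
    return bytes(blob), offset, step_position

def read_record(blob, offset, stride, i):
    """Jump to the nearest indexed record, then walk forward to record i."""
    pos = offset[i // stride]
    for _ in range(i % stride):
        pos += 1 + blob[pos]            # skip the length byte plus the body
    n = blob[pos]
    return blob[pos + 1:pos + 1 + n]
```

A larger Stride shrinks the two index arrays at the cost of walking more records per lookup.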

The HetHomFlag (to be renamed HomFlag) is set to true if the record is homozygous. The array is a bit vector where a 1 indicates that the corresponding entry in the LoqInfo structure is homozygous. Note that "homozygous" here refers to the low quality entries and does not refer to the tiles or their variants. This means that there could be a heterozygous tile pair/sequence/group that still has "homozygous" low quality information. If the low quality information is homozygous, all this means is that the low quality information for that tile position is the same on both alleles. The bits are read LSB first, so that in byte k, bit 0 represents low quality record 8*k, bit 1 represents low quality record (8*k)+1, etc.
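As a small illustration of the LSB-first bit ordering, the following sketch packs and reads such a bit vector. The function names `pack_hom` and `is_hom` are hypothetical and are not part of the cgf program.

```python
def pack_hom(flags):
    """Pack a list of booleans into bytes, LSB first within each byte."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def is_hom(hom_flag: bytes, i: int) -> bool:
    """True if low quality record i is marked homozygous."""
    return bool((hom_flag[i // 8] >> (i % 8)) & 1)
```

For example, record 8 lives in bit 0 of byte 1, and record 3 lives in bit 3 of byte 0.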

#5 Updated by Abram Connelly over 3 years ago

Trailing no-calls taken care of.

Initial version of CGF (for two paths on a single sample) has been created.

In the process of creating a CGF reader that will read the binary CGF and confirm the bytes written are what we expect.

#6 Updated by Sarah Guthrie over 3 years ago

  • Target version changed from 2015-10-23 Lightning sprint to 2015-11-13 Lightning sprint

#7 Updated by Abram Connelly over 3 years ago

  • Status changed from In Progress to Closed

Around 600 samples for the two paths, 0x247 and 0x2c5, have had their CGF created. The CGF has been verified to produce FastJ that matches the input FastJ.

Current source for the cgf program is at:
