CGF of 1kg and PGP for paths 2c5 and 247 (cont.)
#3 Updated by Abram Connelly over 3 years ago
- Status changed from New to In Progress
Updates to CGF schema:
Some preliminary estimates put the nocall information at about 17Kb for a single path for a single sample. This scales to about 17Mb for a whole genome. Gzipping the resulting binary file pushes it down to 11Kb (which would imply about 11Mb whole genome).
I think a lesson should be taken from the CGLF and I should use this representation and move on. We can optimize the nocall information at a future date. For now the resulting nocall binary structure can be gzipped and unpacked when needed. A Code field has been added to future-proof the LowQualityInfo structure. For now a Code of 0 represents the structure presented. Maybe we can put in a Code of 1 to represent the gzipped portion.
It's still an estimate so the nocall information might need more like ~20Mb all told.
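A minimal sketch of how a Code field could select between the plain and gzipped forms of the nocall structure. The 4-byte little-endian Code prefix, the constant names, and the functions pack_low_quality and unpack_low_quality are assumptions made for illustration; they are not taken from the CGF schema.

```python
import gzip
import struct

CODE_RAW = 0   # the structure as presented in the schema
CODE_GZIP = 1  # assumption: the "maybe" value floated above for a gzipped payload


def pack_low_quality(payload: bytes, compress: bool) -> bytes:
    """Prefix the payload with a Code so readers know how to decode it.

    The 4-byte little-endian prefix is a hypothetical layout.
    """
    code = CODE_GZIP if compress else CODE_RAW
    body = gzip.compress(payload) if compress else payload
    return struct.pack("<I", code) + body


def unpack_low_quality(blob: bytes) -> bytes:
    """Read the Code and return the uncompressed nocall bytes."""
    (code,) = struct.unpack_from("<I", blob, 0)
    body = blob[4:]
    if code == CODE_RAW:
        return body
    if code == CODE_GZIP:
        return gzip.decompress(body)
    raise ValueError(f"unknown Code {code}")
```

Readers that only know Code 0 would fail loudly on an unknown Code instead of misreading compressed bytes, which is the point of reserving the field now.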
#4 Updated by Abram Connelly over 3 years ago
Packed representation is progressing. Currently there's a final nocall entry that's being missed at the end but otherwise it looks good. The current snapshot is:
Current schema is:
I don't think there are any structural changes. The schema (data structure section) has been clarified to indicate exactly what the offset and position arrays hold. The Offset array holds the byte offset of the Stride*k low quality entry, starting from [...]. The StepPosition array holds the Stride*k tile position entry for the LoqInfo record. This means the above Vector (and potentially FinalOverflowMap) structures will need to be consulted to reconstruct the tile position that the low quality information is for.
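The strided Offset array can be pictured with a small sketch: jump to the checkpoint for entry Stride*k, then walk forward over the remaining entries. The record encoding used here (one length byte followed by that many bytes) and the names build_index and nth_record are hypothetical, chosen only to make the lookup runnable; the real LoqInfo record layout is defined by the schema.

```python
def build_index(records: list, stride: int) -> tuple:
    """Serialize records and note the byte offset of every stride-th one.

    Hypothetical encoding: each record is a length byte plus its bytes.
    """
    loq_info = bytearray()
    offset = []
    for i, rec in enumerate(records):
        if i % stride == 0:
            offset.append(len(loq_info))  # byte offset of entry stride*k
        loq_info.append(len(rec))
        loq_info += rec
    return bytes(loq_info), offset


def nth_record(loq_info: bytes, offset: list, stride: int, n: int) -> bytes:
    """Return low quality entry n using the strided offsets."""
    pos = offset[n // stride]        # start at entry stride*k, k = n // stride
    for _ in range(n % stride):      # skip forward to entry n
        pos += 1 + loq_info[pos]
    length = loq_info[pos]
    return loq_info[pos + 1 : pos + 1 + length]
```

A larger Stride shrinks the index at the cost of a longer forward scan per lookup. A StepPosition array would sit parallel to Offset, with one tile position per checkpoint.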
HetHomFlag (to be renamed HomFlag) is set to true if the record is homozygous. The array is a bit vector where a 1 indicates that the corresponding entry in the LoqInfo structure is homozygous. Note that "homozygous" here refers to the low quality entries and does not refer to the tiles or their variants. This means that there could be a heterozygous tile pair/sequence/group that still has "homozygous" low quality information. If the low quality information is homozygous, all this means is that the low quality information for that tile position is the same on both alleles. The corresponding bit position is read LSB first, so that bit 0 of byte k represents the 8*k low quality record, bit position 1 represents the (8*k)+1 low quality record, etc.
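The LSB-first bit layout can be sketched as follows. The function names is_hom and pack_hom_flags are hypothetical; only the bit ordering (bit j of byte k maps to low quality record 8*k + j) comes from the description above.

```python
def pack_hom_flags(flags: list) -> bytes:
    """Pack booleans into a bit vector, least significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for n, is_set in enumerate(flags):
        if is_set:
            out[n // 8] |= 1 << (n % 8)
    return bytes(out)


def is_hom(hom_flag: bytes, n: int) -> bool:
    """True if low quality record n is the same on both alleles."""
    return bool((hom_flag[n // 8] >> (n % 8)) & 1)
```

For example, records 0, 3 and 8 being homozygous packs to the two bytes 0x09 and 0x01.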
#5 Updated by Abram Connelly over 3 years ago
Trailing no-calls taken care of.
Initial version of CGF (for two paths on a single sample) has been created.
In the process of creating a CGF reader that will read the binary CGF and confirm the bytes written are what we expect.
#7 Updated by Abram Connelly over 3 years ago
- Status changed from In Progress to Closed
Around 600 samples for the two paths, 0x247 and 0x2c5, have had their CGF created. The CGF has been verified to produce FastJ that matches the input FastJ.
Current source for the cgf program is at: https://github.com/abeconnelly/cgf/tree/02fa80d2665bec3a1da54a230cc4f9cba5fd444a
#8 Updated by Abram Connelly almost 3 years ago
The current lightning prototype has both of these tile paths:
- Underlying data: https://workbench.su92l.arvadosapi.com/projects/su92l-j7d0g-2hk0kr9bayye8n0#Data_collections
- Specification and documentation: https://github.com/abeconnelly/l7g
- Proxy server: https://github.com/abeconnelly/lci
- Tile Server: https://github.com/abeconnelly/glfd
- CGF tools and server: https://github.com/abeconnelly/cgf
- Phenotype server (untap): https://github.com/abeconnelly/l7g-p7e-untap
- Variant server (ClinVar): https://github.com/abeconnelly/l7g-v5t-clinvar
- Prototype Docker image: https://hub.docker.com/r/abeconnelly/lightning/