This tool is nearing completion. Once local tests converting a set of 10 Harvard PGP 23andMe genotyping files to CGF are done, I will submit it for review.
The 23andMe data reports the Y and MT chromosomes as single-allele (as it should). For datasets that have a Y chromosome, the X chromosome is mostly single-allele, with some positions on the X chromosome reported with two alleles. For datasets with no Y chromosome, all reported X positions have two alleles.
I think it's best to store all datasets as having two alleles, with the unreported allele treated as a 'nocall'. This means we lose allele-count information at the low level. To keep that information, I think we should add another structure at a higher level, at worst in the header of the CGF, recording which portions of the tilepath have which allele count. We can't split it by tilepath, as the break might fall in the middle of a tilepath, and I would be hesitant to even assume the regions are contiguous. The information content should be low enough entropy that it won't inflate the CGF significantly.
The `gtconv` sub-directory has a few auxiliary tools to help with conversion. Here is a brief list:

- `gtconv` - convert a genotype file (23andMe, Ancestry) to FastJ CSV format or print out its "low quality" band information
- `fjcsv2band` - convert a FastJ CSV file along with an SGLF file into a "variant band"
- `run-whole-gt-conversion` - convert a complete genotype file to CGF

There are some auxiliary scripts and programs to do the minutiae of the conversion (`merge-gt-bands`) and some other auxiliary programs and scripts to make sure the conversion went correctly (`band2gt`), but those are the main programs. Note that `run-whole-gt-conversion` requires other `l7g` tools (`merg-sglf`, `cgft`, etc.).
The complexity of the conversion comes from needing to add to the tile library (SGLF), then use that information to look up the implied tiles. Loading the SGLF for each conversion is slow, so the process is batched: much of the work comes from consolidating information up front and then splitting it back out afterwards.
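The batching idea can be sketched as follows. Function and variable names here are hypothetical, not the actual `l7g` tool interfaces: the point is only that the slow SGLF load happens once per tilepath rather than once per dataset, with the results split back out per dataset at the end.

```python
# Hypothetical sketch of the batching scheme: group per-dataset work by
# tilepath so each SGLF chunk is loaded once, applied to every dataset
# that needs it, and the results are then split back out per dataset.
from collections import defaultdict

def batch_by_tilepath(jobs):
    """jobs: iterable of (dataset, tilepath) pairs.
    Returns {tilepath: [datasets]} so each SGLF chunk is loaded once."""
    grouped = defaultdict(list)
    for dataset, tilepath in jobs:
        grouped[tilepath].append(dataset)
    return dict(grouped)

def run_batched(jobs, load_sglf, convert):
    """Consolidate: load each tilepath's SGLF once and convert every
    dataset against it, then split results back out per dataset."""
    results = defaultdict(dict)
    for tilepath, datasets in batch_by_tilepath(jobs).items():
        sglf = load_sglf(tilepath)  # the slow step, done once per tilepath
        for dataset in datasets:
            results[dataset][tilepath] = convert(dataset, sglf)
    return dict(results)
```

The trade-off is exactly the one described above: the consolidation step amortizes the SGLF load, at the cost of an extra splitting pass at the end.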
The CGFs created from the genotype files have their "low quality" information inverted, with the low quality band information indicating which positions are high quality instead of which positions are low quality. For example:
```
[ 1 0 1 ... ]
[ 2 0 1 ... ]
[[ 150 1 ][ ][ 90 1 95 1 ]... ]
[[ 150 1 ][ ][ 90 1 95 1 ]... ]
```
This would imply that tile step `0` had a called base at tile position `150`, tile step `1` was not called at all, and tile step `2` had two called positions, at positions `90` and `95` (relative to the start of tile step `2`).
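The inverted reading of one allele's band can be sketched like this. The list-of-pairs layout is an assumption for illustration, not the on-disk CGF encoding: each `[start len]` pair marks a *called* (high quality) region relative to its tile step, and an empty entry means the step was not called at all.

```python
# Sketch (assumed layout, not the CGF encoding) of reading one allele's
# inverted "low quality" band: each [start, len] pair marks a called
# region relative to its tile step; an empty entry means an uncalled step.

def called_regions(band):
    """band: list with one flat [start, len, start, len, ...] array per
    tile step. Returns {step: [(start, length), ...]} for called steps."""
    called = {}
    for step, flat in enumerate(band):
        pairs = [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]
        if pairs:
            called[step] = pairs  # uncalled steps are simply omitted
    return called
```

Running this on the example band above would report step `0` called at `(150, 1)`, step `1` absent, and step `2` called at `(90, 1)` and `(95, 1)`.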
There will be a flag set in the header of the CGF to indicate the inverted interpretation of the low quality band information, though this hasn't been implemented yet.
Testing with 10 genotype files, the amortized run-time is around 2 hours per conversion.
Though I haven't monitored memory usage closely, it looks like some conversions can balloon to 20Gb+, with most staying under 1Gb. The underlying problem is that the SGLF for some tilepaths is too large to fit in memory, and it will remain a problem until we figure out an alternative. For larger data sets, solving this will be close to a requirement.
The disk usage is in the region of 30Gb for the 10 conversions (3Gb per genotype conversion), though this could be made more efficient by removing data once it's no longer required.
TODO:

- Allow for removal of temporary data as it's used, to mitigate disk usage
- Add the flag in the CGF to indicate when the CGF should be considered a "genotype" file
- Confirm the `Loq` bit mask is inverted in the genotype CGF file
- Update Docker image with `l7g` programs and scripts