Idea #13506: Update Genotype conversion tool - Lightning - Arvados

Idea #13506

closed

Update Genotype conversion tool

Added by Abram Connelly almost 6 years ago. Updated almost 6 years ago.

Status:

Closed

Priority:

Normal

Assigned To:

Abram Connelly

Target version:

Start date:

06/15/2018

Due date:

Story points:

Description

The genotyping conversion tool, gtconv, was never properly tested or used. Update this tool to be able to convert genotyping data files (VCF, 23andMe, Ancestry, etc.). The output needs to be thought about, as the data is sparse enough that it doesn't warrant creating full FastJ; an intermediate file might be more appropriate.

At the very least, an SGLF file should be created so that it can be merged into the library, and an option should be given to produce a band file so that it can be pumped into cgft (or other CGF creation tools).

Batching should be considered as the genotype files are small compared to the auxiliary files needed, such as the reference.

Documentation should be added as well as test cases to make sure things are working properly.

A short summary:

Creation of an intermediate format to facilitate conversion, or a determination that it's unnecessary
Creation of SGLF files from input genotype file (23andMe, Ancestry, VCF)
Investigate batching option and implement if it seems reasonable
Conversion to "band format" (from the intermediate format, say)
Documentation
Tests

Subtasks 1 (0 open — 1 closed)

Updated by Abram Connelly almost 6 years ago

Status changed from New to In Progress
Assigned To set to Abram Connelly

This tool is nearing completion. Once some local tests are done with a set of 10 Harvard PGP 23andMe genotyping files that are converted to CGF I will submit for review.

The 23andMe data reports the Y and MT chromosomes with a single allele (as it should), but for the datasets that have a Y chromosome, the X chromosome mostly has a single allele, with some places on the X chromosome having two alleles. For datasets with no Y chromosome, all reported X positions have two alleles.

I think it's best to store all datasets as having two alleles, with the unreported allele being considered a 'nocall'. This means we will lose allele count information at the low level. To keep that information, I think we should add another structure at a higher level, at worst in the header of the CGF, to tell us which portions of the tilepath have which allele count. We can't split it by tilepath, as the break might come in the middle of a tilepath. I would be hesitant to even consider it contiguous. The information should have low enough entropy that it does not inflate the CGF significantly.
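A minimal sketch of this idea, in Python rather than the actual gtconv code: each call is padded to two alleles with a nocall placeholder, and the reported allele counts are kept separately as run-length-encoded spans. The `NOCALL` marker, the function names, and the (start index, run length, allele count) layout are assumptions for illustration only, not the CGF header format.

```python
from itertools import groupby

NOCALL = "-"  # hypothetical placeholder for an unreported allele


def pad_to_diploid(genotype):
    """Return a two-allele tuple, filling an unreported allele with NOCALL."""
    if not 1 <= len(genotype) <= 2:
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return tuple(genotype.ljust(2, NOCALL))


def allele_count_runs(genotypes):
    """Run-length encode reported allele counts as (start, length, count)."""
    runs = []
    start = 0
    for count, group in groupby(len(g) for g in genotypes):
        length = sum(1 for _ in group)
        runs.append((start, length, count))
        start += length
    return runs


# Toy calls: two diploid positions, three single-allele positions, one diploid.
calls = ["AG", "CC", "T", "G", "A", "CT"]
padded = [pad_to_diploid(g) for g in calls]
runs = allele_count_runs(calls)
```

Because allele counts change rarely along a chromosome, the run list stays short even when the runs are not contiguous with tilepath boundaries.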

The gtconv sub-directory has a few auxiliary tools to help with conversion. Here is a brief list:

gtconv - convert a genotype file (23andMe, Ancestry) to FastJ CSV format or print out its "low quality" band information
fjcsv2band - convert a FastJ CSV file along with an SGLF file into a "variant band"
run-whole-gt-conversion - convert a complete genotype file to CGF

Those are the main programs. There are also auxiliary scripts and programs that handle the minutiae of the conversion (merge-gt-bands) and others that check that the conversion went correctly (band2gt). Note that run-whole-gt-conversion requires other l7g tools (merg-sglf, cgft, etc.).

The complexity of conversion comes from needing to add to the tile library (SGLF), then use that information to look up the implied tiles. Loading the SGLF for each conversion is slow, so this process is batched, and a lot of the work comes from consolidating information and, later, splitting it back out.
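The consolidate-then-split pattern can be sketched roughly as follows. This is not the code in run-whole-gt-conversion; `load_library` and `lookup` are hypothetical stand-ins for the slow SGLF load and the tile lookup, and the tilepath names and sequences are toy values.

```python
from collections import defaultdict


def consolidate(samples):
    """Group every sample's (tilepath, record) pairs by tilepath."""
    by_tilepath = defaultdict(list)
    for sample_id, records in samples.items():
        for tilepath, record in records:
            by_tilepath[tilepath].append((sample_id, record))
    return by_tilepath


def convert_batch(samples, load_library, lookup):
    """Load each tilepath's library once, then split results per sample."""
    results = defaultdict(list)
    for tilepath, entries in sorted(consolidate(samples).items()):
        library = load_library(tilepath)  # the slow step, once per tilepath
        for sample_id, record in entries:
            results[sample_id].append((tilepath, lookup(library, record)))
    return dict(results)


# Toy stand-ins (assumptions) that record how often the library is loaded.
loads = []


def toy_load(tilepath):
    loads.append(tilepath)
    return {"acgt": 0, "acga": 1}


def toy_lookup(library, sequence):
    return library[sequence]


samples = {
    "s1": [("035e", "acgt"), ("0360", "acga")],
    "s2": [("035e", "acga"), ("0360", "acgt")],
}
out = convert_batch(samples, toy_load, toy_lookup)
```

With two samples and two tilepaths, the library is loaded twice rather than four times; the saving grows with the batch size.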

The CGFs created from the genotype files have their "low quality" information inverted, with the low quality band information indicating which positions are high quality instead of which positions are low quality. For example:

[ 1 0 1 ... ]
[ 2 0 1 ... ]
[[ 150 1 ][ ][ 90 1 95 1 ]... ]
[[ 150 1 ][ ][ 90 1 95 1 ]... ]

This would imply that tile step 0 had a called base at tile position 150, tile step 1 was not called at all, and tile step 2 had two called positions, at 90 and 95 (relative to the start of tile step 2).
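As a rough illustration of the inverted reading, the sketch below parses one low quality band line and lists the called positions per tile step. It assumes each pair of numbers is (start position, run length), which matches the example above but is not confirmed here; the function names are hypothetical and this is not the cgft parser.

```python
import re


def parse_loq_line(line):
    """Parse '[[ 150 1 ][ ][ 90 1 95 1 ]]' into per-step (position, length) pairs."""
    text = line.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected an outer [ ... ] pair")
    steps = []
    # Each inner [ ... ] holds the numbers for one tile step.
    for body in re.findall(r"\[([^\[\]]*)\]", text[1:-1]):
        nums = [int(tok) for tok in body.split()]
        if len(nums) % 2:
            raise ValueError("expected (position, length) pairs")
        steps.append(list(zip(nums[0::2], nums[1::2])))
    return steps


def called_positions(steps):
    """Under the inverted (genotype) reading, expand runs into called positions."""
    return [
        [pos + i for pos, length in pairs for i in range(length)]
        for pairs in steps
    ]


steps = parse_loq_line("[[ 150 1 ][ ][ 90 1 95 1 ]]")
called = called_positions(steps)
```

Here `called` has one entry per tile step, and an empty list means nothing was called in that step.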

There will be a flag set in the header of the CGF to indicate the inverted interpretation of the low quality band information, though this hasn't been implemented yet.

Testing with 10 genotype files, the amortized run-time is around 2 hours per conversion.

Though I haven't monitored the memory usage closely, it looks like some conversions can balloon to 20 GB+, with most staying under 1 GB. This is the underlying problem of the SGLF being too large to fit in memory for some tilepaths, and it will remain a problem until we figure out an alternative. For larger data sets, solving this will be close to a requirement.

The disk usage is in the region of 30 GB for the 10 conversions (3 GB per genotype conversion), though this can be reduced by removing data once it is no longer required.

TODO:

Allow temporary data to be removed once it has been used, to mitigate disk usage
Add the flag in the CGF to indicate when the CGF should be considered a "genotype" file
Confirm the Loq bit mask is inverted in the genotype CGF file
Update Docker image with l7g programs and scripts

Updated by Abram Connelly almost 6 years ago

There's a problem with the CGF creation step. The 'low quality' band information's interpretation should be inverted since it's a genotype CGF but the cgft tool doesn't know how to do that.

I think the solution is to alter the cgft tool to allow the 'low quality' band information to be fed in separately. Regardless, I think this is out of scope for this conversion tool, since it creates the intermediate "band format" that can be used to convert to a final CGF or numpy form.

Updated by Abram Connelly almost 6 years ago

The CGF conversion tools will be updated to accommodate the genotype files and will be incorporated into these tools when that's finished.

Temporary disk usage has been left untouched. The CWL might have opinions on how best to handle that and what to save so I've left it as is for now.
Genotyping CGF files will be handled by other CGF tools and will be incorporated back in when they're ready
Docker image still needs updating
