Convert openSNP genotyping data to CGF
Updated by Abram Connelly almost 7 years ago
- Status changed from New to In Progress
The data has been downloaded and I'm doing an initial investigation into automating the process. For ease of conversion I'm focusing on only a subset of the available data, but that subset includes both build 37 and build 36 genotyping data. The total number of datasets is in the 2000+ range.
The general tactic is to create the library tiles that each genotyping file needs and collect them all into an auxiliary simple genome library format (SGLF) file, one per tile path. Once collected, the tiles are deduplicated and merged with the current SGLF files so that the genotyping files can then be converted to tiles/cgf.
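A minimal sketch of the deduplicate-and-merge step, for illustration only. It assumes each library can be modelled as a mapping from a tile position to a list of tile sequences. The function name `merge_tile_libraries`, the position strings and the use of MD5 digests as the uniqueness key are assumptions made for this example, not the actual SGLF tooling.

```python
import hashlib


def merge_tile_libraries(current, auxiliary):
    """Merge `auxiliary` into a copy of `current`, dropping duplicate sequences.

    Both arguments map a (hypothetical) tile position string to a list of
    tile sequences. Sequences already in `current` keep their list index,
    and new sequences are appended after them.
    """
    merged = {pos: list(seqs) for pos, seqs in current.items()}
    for pos, seqs in auxiliary.items():
        existing = merged.setdefault(pos, [])
        # Digests of the sequences already known at this tile position.
        seen = {hashlib.md5(s.encode()).hexdigest() for s in existing}
        for seq in seqs:
            digest = hashlib.md5(seq.encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                existing.append(seq)
    return merged


if __name__ == "__main__":
    current = {"0000.0000": ["acgt", "acga"]}
    auxiliary = {"0000.0000": ["acga", "accc", "accc"], "0000.0001": ["ttga"]}
    print(merge_tile_libraries(current, auxiliary))
```

Appending new sequences after the existing ones keeps the indices of already-known tile variants stable in this sketch, so anything encoded against the old library would still refer to the same sequences.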
Updated by Abram Connelly over 6 years ago
This conversion is now in progress. There are a total of 1874 23andMe genotype files to be converted. The conversion is done in batches of 100 files, with each snapshot of the tile library uploaded to Arvados. The project can be found at:
The double checks that verify the converted "epa" files and the converted cgf against the original source have been removed because they take too much time. Hopefully these checks can be parallelized more efficiently after the complete conversion. There are already some errors of the form:
    terminate called after throwing an instance of 'std::logic_error'
      what(): vlc_vector cannot decode values smaller than 1!
    /bin/bash: line 6: 5942 Done cat $bandfn
      5943 Aborted (core dumped) | cgft -e $p $ocgf
    terminate called after throwing an instance of 'std::logic_error'
      what(): vlc_vector cannot decode values smaller than 1!
    /bin/bash: line 6: 10699 Done cat $bandfn
      10700 Aborted (core dumped) | cgft -e $p $ocgf
    terminate called after throwing an instance of 'std::logic_error'
      what(): vlc_vector cannot decode values smaller than 1!
    /bin/bash: line 6: 11234 Done cat $bandfn
      11235 Aborted (core dumped) | cgft -e $p $ocgf
This looks to be a problem with the encoding, so the tile libraries should be unaffected and the affected files can be re-encoded later if the cgf files prove to be corrupt. I did test runs of the first 100 files (where the errors above came from) before starting the whole process, so I'm not quite sure why this happened. Maybe stopping and restarting the batch run causes a cgf file to be trampled on? Maybe there were some memory issues that show up sporadically?
Currently, on lightning-dev1, the process takes just under 5 hours using 8-10 cores to convert 100 23andMe genotype files. That works out to about 3 minutes of wall-clock time per file, or roughly 30 minutes of core time per conversion (amortized). Note that one of the major bottlenecks is loading the SGLF file into memory (on a per tile path basis).
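The timing figures can be checked with a few lines of arithmetic. This is a rough sketch that treats "just under 5 hours" as exactly 5 hours and assumes the per-batch time stays constant across the whole run. The function names are made up for the example.

```python
import math


def per_file_cost(batch_hours, batch_size, cores):
    """Return (wall-clock minutes, core minutes) spent per converted file."""
    wall_minutes = batch_hours * 60 / batch_size
    core_minutes = wall_minutes * cores
    return wall_minutes, core_minutes


def total_run_hours(total_files, batch_size, batch_hours):
    """Return (number of batches, total hours), assuming a constant batch time."""
    n_batches = math.ceil(total_files / batch_size)
    return n_batches, n_batches * batch_hours


if __name__ == "__main__":
    for cores in (8, 10):
        wall, core = per_file_cost(5, 100, cores)
        print(f"{cores} cores: {wall} min wall-clock, {core} core-min per file")
    batches, hours = total_run_hours(1874, 100, 5)
    print(f"{batches} batches, about {hours} hours ({hours / 24:.1f} days)")
```

With 1874 files in batches of 100, this gives 19 batches and an upper bound of about 95 hours, or roughly four days of continuous running.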
For completeness, the process is roughly as follows:
- Take the genotype file (23andMe), keep only the SNPs and create the tile sequences that have a called SNP on them. Do this for some number of files at a time (100, say).
- Collect all the implied tile sequences and merge them into a new tile library snapshot
- Use the tile library snapshot to create a 'band' format for each in the batch (of 100, say).
- Use the band format to convert to CGF.
This particular run uploads the tile library snapshot after each batch and deletes the band files, the per-file tile library sequences and other temporary data, while preserving the tile library snapshot from the previous step.
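The batch loop described above can be sketched as follows. This is an outline under stated assumptions, not the actual conversion scripts: the four step functions (`make_tiles`, `merge`, `to_band`, `to_cgf`) are hypothetical callables supplied by the caller, and the upload to Arvados is represented only by collecting each snapshot in a list.

```python
def batched(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_in_batches(files, library, make_tiles, merge, to_band, to_cgf,
                       batch_size=100):
    """Run the four conversion steps batch by batch.

    All four step functions are hypothetical placeholders:
      make_tiles(file)        -> tile sequences implied by that file's SNPs
      merge(library, tiles)   -> new tile library snapshot
      to_band(file, library)  -> band representation of the file
      to_cgf(band, library)   -> cgf output
    """
    cgfs = {}
    snapshots = []
    for batch in batched(files, batch_size):
        # Step 1: tile sequences implied by each genotype file in the batch.
        implied = [make_tiles(f) for f in batch]
        # Step 2: merge them into a new tile library snapshot.
        library = merge(library, implied)
        snapshots.append(library)  # stands in for the upload of the snapshot
        # Steps 3 and 4: band format, then cgf, against the new snapshot.
        for f in batch:
            band = to_band(f, library)
            cgfs[f] = to_cgf(band, library)
        # Bands and implied tiles are temporary; only the snapshot and the
        # cgf output are kept once the batch is done.
    return cgfs, snapshots


if __name__ == "__main__":
    sizes = [len(b) for b in batched(list(range(1874)), 100)]
    print(len(sizes), sizes[-1])
```

Because each batch is converted against the snapshot produced in its own merge step, an interrupted run could in principle resume from the last uploaded snapshot rather than from the beginning, assuming the snapshots are self-contained.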
Assuming no errors, this process should complete later this week around Friday (2017-07-14).
This process should