Convert Harvard PGP exome data to CGF
The Harvard Personal Genome Project has publicly available exome data. Find as much of it as possible and convert it to the new CGF format.
A number of ad-hoc scripts under l7g/sandbox convert from some base format (GFF, gVCF, etc.) to FastJ.
This might be a good opportunity to create better tools that make this conversion easier, though I'm a little skeptical that it can be done completely generally.
The exome conversion will no doubt need special considerations of its own.
The tile library will need to be extended in order to do the CGF conversion.
The basic workflow is:
- Convert gVCF or GFF to FastJ
- Collect, deduplicate and "impute" the FastJ to create the "Simple Genome Library Format" (SGLF)
- Merge the generated SGLF with the current tile library
- Use the source FastJ to generate band information for each of the datasets
- Convert the band dataset to CGF
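
As a rough illustration of the collect, deduplicate and merge steps above, here is a minimal Python sketch. It is not the l7g tooling and does not read or write real FastJ or SGLF: tiles are modelled as hypothetical `(position, sequence)` pairs, the library as an in-memory dict, and the "impute" step is left out entirely. The function names and the position strings are made up for the example. The merge is written as append-only, which is an assumption about how the library could be extended so that variant indices already in the library stay put.

```python
def dedup_tiles(tiles):
    """Collect (position, sequence) pairs into {position: [unique sequences]}.

    Sequences keep first-seen order within each position.
    """
    library = {}
    for position, seq in tiles:
        variants = library.setdefault(position, [])
        if seq not in variants:
            variants.append(seq)
    return library


def merge_library(current, new):
    """Append-only merge of `new` into a copy of `current`.

    Existing variants keep their list index; unseen variants are appended.
    """
    merged = {pos: list(seqs) for pos, seqs in current.items()}
    for pos, seqs in new.items():
        variants = merged.setdefault(pos, [])
        for seq in seqs:
            if seq not in variants:
                variants.append(seq)
    return merged


# Hypothetical existing library and hypothetical tiles from exome data.
current = {"0000.00.0000": ["acgt", "acga"]}
exome_tiles = [
    ("0000.00.0000", "acga"),  # already in the library
    ("0000.00.0000", "accc"),  # new variant at a known position
    ("0000.00.0001", "ttga"),  # new position
    ("0000.00.0001", "ttga"),  # duplicate within the new data
]

new = dedup_tiles(exome_tiles)
merged = merge_library(current, new)
```

In this toy run the two variants already in `current` keep indices 0 and 1 in `merged`, and only the unseen sequences are appended after them.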
The band datasets can be easily converted to numpy arrays for machine learning purposes if desired.
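
A minimal sketch of what that could look like, assuming each band has already been parsed into two equal-length lists of integer tile variant indices, one per allele. That in-memory shape, the function name `bands_to_array` and the sample values are assumptions for illustration, not the actual band file layout.

```python
import numpy as np


def bands_to_array(bands):
    """Stack per-dataset bands into an array of shape (n_datasets, 2 * n_positions).

    Each band is a pair (allele0, allele1) of equal-length integer lists;
    the two alleles are concatenated into a single feature row.
    """
    rows = [
        np.concatenate([
            np.asarray(allele0, dtype=np.int32),
            np.asarray(allele1, dtype=np.int32),
        ])
        for allele0, allele1 in bands
    ]
    return np.stack(rows)


# Two hypothetical datasets, three tile positions each.
bands = [
    ([0, 1, 0], [0, 0, 2]),
    ([1, 1, 0], [0, 3, 0]),
]
X = bands_to_array(bands)
```

Here `X` has one row per dataset and one column per (allele, tile position) pair, which is the layout most machine learning libraries expect for a feature matrix.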
Note that extending the tile library should have no effect on previously generated CGF.