Task #12139
Closed
Create GRCh38 tile assembly
Description
Create the appropriate assembly files for GRCh38.
This involves creating the files:
assembly.00.grch38.fw.gz
assembly.00.grch38.fw.fwi
assembly.00.grch38.fw.gzi
where assembly.00.grch38.fw.gz is the compressed "fixed width" assembly file and the other two are the index files used to access it.
This should involve mapping the tagset onto the GRCh38 sequences. We'll need to figure out what to do with alternative assembly regions in GRCh38. The focus should be on the main assembly regions.
Updated by Abram Connelly over 6 years ago
- Status changed from New to In Progress
- Assigned To set to Abram Connelly
- Priority changed from Normal to High
This is a necessary step to be able to convert genomes that use GRCh38 as a reference.
Updated by Abram Connelly over 6 years ago
- Status changed from In Progress to Closed
- Remaining (hours) set to 0.0
There is an hg38 tile assembly in the assembly collection in keep.
The l7g repo has been updated with an hg38 tile liftover script to create the new assembly file.
Some notes:
- There are empty tile paths in the hg38 tile assembly which might need special consideration when converting to cgf
- Some tags that were unique in hg19 are now duplicated in places in hg38
- Some tiles can be significantly longer than the original tiles in hg19
As a reminder, the format of each record is "<tilestep> <end position, 0 reference, non inclusive>". The end position holds the end of the tile step, except for the last tile step in a tile path, in which case it is the end of the tile path.
For example,
...
1536 2367984
1537 2368209
1538 2368410
>hg38:chr1:0001
0000 2368786
0001 2369403
0002 2369634
...
The fields are tab delimited and padded with spaces to make everything fixed width. The .gzi and .fwi index files are also provided for efficient random access into the file.
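As a rough illustration of the record format above, here is a minimal Python sketch that groups the "<tilestep> <end position>" records under their ">" header lines and lists tile paths with no tile steps. The function names parse_tile_assembly and empty_paths are hypothetical and not part of the l7g repo. The sketch keeps the tile step as a string, since it makes no assumption about its numeric base. The second header in the sample is made up to show an empty tile path.

```python
from typing import Dict, Iterable, List, Tuple


def parse_tile_assembly(lines: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Group "<tilestep> <end position>" records under their ">" header lines."""
    paths: Dict[str, List[Tuple[str, int]]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line such as ">hg38:chr1:0001" starts a new tile path.
            current = line[1:]
            paths[current] = []
            continue
        if current is None:
            raise ValueError("record found before any '>' header line")
        # split() with no argument handles both the tab delimiter
        # and the space padding used to make the fields fixed width.
        step, end = line.split()
        paths[current].append((step, int(end)))
    return paths


def empty_paths(paths: Dict[str, List[Tuple[str, int]]]) -> List[str]:
    """Return the names of tile paths that contain no tile steps."""
    return [name for name, steps in paths.items() if not steps]


sample = [
    ">hg38:chr1:0001",
    "0000\t   2368786",
    "0001\t   2369403",
    "0002\t   2369634",
    ">hg38:chr1:0002",  # hypothetical empty tile path, for illustration only
]

paths = parse_tile_assembly(sample)
print(paths["hg38:chr1:0001"])
print(empty_paths(paths))
```

For the real file, the same function could be fed lines from gzip.open("assembly.00.grch38.fw.gz", "rt"), assuming the file can be read sequentially as a standard gzip stream. This reads the whole file in order and does not use the .fwi or .gzi indexes.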
I will mark this as done with the understanding that this might need to be updated in the future if any errors are discovered.