Task #12111
closedTool to identify which mitochondrial reference an input genome file (gVCF, GFF, 23andMe, etc.) was referenced against
Description
There are two different mitochondrial reference files that we've seen when trying to convert build37/hg19 referenced genomes. To help automate the choice of which reference mitochondrial DNA to use, we should build a tool to help with this.
I'm imagining a tool that will the mitochondrial portion of a variant file and report back which is the most closely matched reference mitochondrial DNA.
Updated by Abram Connelly over 6 years ago
A rough initial prototype is available at which-ref.
The tool needs to be cleaned up in terms of usage explanation, error handling and reporting but a basic usage is as follows:
./which-ref \ <( samtools faidx /data-sdd/data/ref/human_g1k_v37.fa.gz MT | egrep -v '^>' | tr '[:upper:]' '[:lower:]' | tr 'm' 'n' | tr -d '\n' ) \ <( samtools faidx /data-sdd/data/ref/hg38.fa.gz chrM | egrep -v '^>' | tr '[:upper:]' '[:lower:]' | tr 'm' 'n' | tr -d '\n' ) \ <( samtools faidx /data-sdd/data/ref/hg19.fa.gz chrM | egrep -v '^>' | tr '[:upper:]' '[:lower:]' | tr 'm' 'n' | tr -d '\n' ) \ <( tabix /data-sdd/data/pgp-gff/hu826751-GS03052-DNA_B01.gff.gz chrM | pasta -a gff-rotini -r <( refstream chrM ) | pasta -a rotini-alt0 | tr -d '\n' )
This reports (on lightning-dev1
):
... min_score: 63 min_idx:2 name:/dev/fd/61TODO:
- print usage and allow for other command line options
- option for all matches
- warn when there are identical input reference sequences or when there are identical scores
- allow for max score bailout
- check for errors
Updated by Abram Connelly over 6 years ago
- Status changed from New to In Progress
Updated by Abram Connelly over 6 years ago
- Status changed from In Progress to Closed
- Remaining (hours) set to 0.0
A rudimentary version is on github.com/curoverse/l7g/tools/which-ref. Better usage reporting, error reporting and command line processing should be added but the tool is functional now.