Bug #12319
Closed
cgfv3 files have errors in last tile path (0x35e)
Description
Some of the cgfv3 files located in l7g Data cgfv3 have errors in their last tile path (862, 0x35e).
For example, dataset huEA4EE5-GS01669-DNA_G02.cgfv3 only has 33 positions in that tile path whereas there should be a total of 35:
cgft -b 862 huEA4EE5-GS01669-DNA_G02.cgfv3
[ 169 -1 5 0 1 0 0 -1 0 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 -1 0 37]
[ 169 -1 5 0 1 0 0 -1 0 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 -1 0 37]
[[ 0 63 ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 482 1 1003 1 1058 1 3262 1 4261 1 4705 1 4712 1 5094 1 5467 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 758 2 ]]
[[ 0 63 ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 482 1 1003 1 1058 1 3262 1 4261 1 4705 1 4712 1 5094 1 5467 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 758 2 ]]
I suspect this is caused by incorrect handling of the last tilestep in the tilepath. I think the error is triggered when the last tile in the last tilepath is a spanning tile.
This ticket is to go through the cgfv3 files, figure out which ones have a bad last tile path and correct them.
When converting from CGFv2 to CGFv3, I (Abram) noticed an incorrect 0x35e SGLF file and updated it. After updating the SGLF file, I updated the corresponding CGF files. Since they used a new SGLF, I most likely had to go back to the FastJ and convert from there. Whatever method was used to convert to the new format probably had a bug that dropped the last tiles when they were non-anchor spanning tiles. Though this should be checked, I suspect only the last tile path has this issue, as the other tile paths didn't need to be recomputed.
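As a starting point for finding the affected files, here is a minimal sketch that counts the positions in the tilepath 862 band and reports any file that does not have the expected 35. It assumes that `cgft -b` prints the first allele's band vector on its first output line (as in the example above) and that the files sit in a local `cgfv3/` directory; that directory name and the `count_band_positions` helper are hypothetical.

```shell
#!/bin/bash
# Sketch: flag cgfv3 files whose tilepath 862 (0x35e) band has an unexpected length.
# Assumptions: files live in ./cgfv3 (hypothetical), and the first line of
# `cgft -b 862 <file>` is a bracketed vector with one integer per position.

expected=35   # number of positions tilepath 862 should have, per the ticket

# Count the integers on the first line of band output read from stdin.
count_band_positions() {
  awk 'NR == 1 { gsub(/\[|\]/, " "); print NF; exit }'
}

# Only scan when cgft is actually installed.
if command -v cgft >/dev/null 2>&1 ; then
  for f in cgfv3/*.cgfv3 ; do
    [ -e "$f" ] || continue
    n=$( cgft -b 862 "$f" | count_band_positions )
    n=${n:-0}                                   # treat empty output as 0 positions
    [ "$n" -eq "$expected" ] || echo "$f $n"    # report file and actual count
  done
fi
```

Fed the first band line of huEA4EE5-GS01669-DNA_G02.cgfv3 quoted above, `count_band_positions` reports 33.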
Updated by Abram Connelly over 6 years ago
- Status changed from New to In Progress
There are 182 files (out of 720 total) with a corrupted last tile path (0x35e):
HG00436-GS000016683-ASM.cgfv3 HG00437-GS000016672-ASM.cgfv3 HG00438-GS000016673-ASM.cgfv3 HG00442-GS000016869-ASM.cgfv3 HG00449-GS000016677-ASM.cgfv3 HG00450-GS000016678-ASM.cgfv3 HG00476-GS000016710-ASM.cgfv3 HG00477-GS000016711-ASM.cgfv3 HG00537-GS000016720-ASM.cgfv3 HG00538-GS000016721-ASM.cgfv3 HG00611-GS000016851-ASM.cgfv3 HG00612-GS000016852-ASM.cgfv3 HG00626-GS000017125-ASM.cgfv3 HG00627-GS000016860-ASM.cgfv3 HG00662-GS000016981-ASM.cgfv3 HG00671-GS000019202-ASM.cgfv3 HG01566-GS000016976-ASM.cgfv3 HG01567-GS000012114-ASM.cgfv3 HG01924-GS000013460-ASM.cgfv3 HG01925-GS000013413-ASM.cgfv3 HG01926-GS000013203-ASM.cgfv3 HG01933-GS000012137-ASM.cgfv3 HG01934-GS000012138-ASM.cgfv3 HG01967-GS000013456-ASM.cgfv3 HG02003-GS000017113-ASM.cgfv3 HG02004-GS000017112-ASM.cgfv3 HG02105-GS000016959-ASM.cgfv3 HG02106-GS000016960-ASM.cgfv3 HG02147-GS000016341-ASM.cgfv3 HG02148-GS000016961-ASM.cgfv3 HG02286-GS000016419-ASM.cgfv3 HG02287-GS000017903-ASM.cgfv3 HG02601-GS000016693-ASM.cgfv3 HG02602-GS000016692-ASM.cgfv3 HG02604-GS000016690-ASM.cgfv3 HG02605-GS000016689-ASM.cgfv3 HG02657-GS000016544-ASM.cgfv3 HG02658-GS000017179-ASM.cgfv3 HG02659-GS000017180-ASM.cgfv3 HG02685-GS000017225-ASM.cgfv3 HG02686-GS000016439-ASM.cgfv3 HG02697-GS000017229-ASM.cgfv3 HG02698-GS000017230-ASM.cgfv3 HG02728-GS000017019-ASM.cgfv3 HG02729-GS000017017-ASM.cgfv3 HG02734-GS000017233-ASM.cgfv3 HG02735-GS000016431-ASM.cgfv3 HG02783-GS000017016-ASM.cgfv3 HG02784-GS000017015-ASM.cgfv3 HG02785-GS000017012-ASM.cgfv3 HG02787-GS000017021-ASM.cgfv3 hu0211D6-GS01175-DNA_E02.cgfv3 hu025CEA-GS01669-DNA_D02.cgfv3 hu034DB1-GS01669-DNA_A03.cgfv3 hu050E9C-GS01173-DNA_G06.cgfv3 hu085B6D-GS03132-DNA_A01.cgfv3 hu089792-GS02269-DNA_B02.cgfv3 hu0E7AAF-GS03023-DNA_A01.cgfv3 hu1187FF-GS02269-DNA_A04.cgfv3 hu15FECA-GS01175-DNA_F03.cgfv3 hu19C09F-GS01669-DNA_F05.cgfv3 hu27FD1F-GS02269-DNA_B04.cgfv3 hu3A1B15-GS02269-DNA_C01.cgfv3 hu474789-GS02269-DNA_G04.cgfv3 hu49F623-GS02269-DNA_F01.cgfv3 hu553620-GS01669-DNA_B08.cgfv3 
hu5B8771-GS02269-DNA_B01.cgfv3 hu5E55F5-GS02269-DNA_B03.cgfv3 hu5FCE15-GS01195-DNA_B01.cgfv3 hu602487-GS02269-DNA_F02.cgfv3 hu620F18-GS02269-DNA_E02.cgfv3 hu7123C1-GS01669-DNA_D07.cgfv3 hu76CAA5-GS02269-DNA_G03.cgfv3 hu775356-GS01175-DNA_A07.cgfv3 hu83BC6A-GS03023-DNA_C01.cgfv3 hu868880-GS02269-DNA_E01.cgfv3 hu925B56-GS03274-DNA_B01.cgfv3 hu92FD55-GS01669-DNA_A04.cgfv3 hu939B7C-GS01670-DNA_C01.cgfv3 huA02824-GS02269-DNA_G01.cgfv3 huA3A02C-GS03166-DNA_E02.cgfv3 huA49E22-GS01669-DNA_E04.cgfv3 huA4F281-GS01669-DNA_H10.cgfv3 huA5FD8B-GS02269-DNA_C02.cgfv3 huAA245C-GS02269-DNA_D03.cgfv3 huAE4A11-GS01669-DNA_F02.cgfv3 huB1FD55-GS01173-DNA_B07.cgfv3 huB4883B-GS02269-DNA_H04.cgfv3 huB4F9B2-GS02269-DNA_A05.cgfv3 huBA30D4-GS01173-DNA_H05.cgfv3 huBEDA0B-GS01669-DNA_E02.cgfv3 huCCAFD0-GS01669-DNA_B10.cgfv3 huD10E53-GS01669-DNA_B04.cgfv3 huD649F1-GS03133-DNA_D02.cgfv3 huD9EE1E-GS01669-DNA_F09.cgfv3 huE2E371-GS02269-DNA_H03.cgfv3 huEA4EE5-GS01669-DNA_G02.cgfv3 huF2DA6F-GS02269-DNA_A01.cgfv3 huF80F84-GS01669-DNA_B01.cgfv3 huFA70A3-GS01670-DNA_G02.cgfv3 huFCC1C1-GS02269-DNA_A02.cgfv3 NA07000-GS000016078-ASM.cgfv3 NA07029-GS000013213-ASM.cgfv3 NA10852-GS000016045-ASM.cgfv3 NA10861-GS000016044-ASM.cgfv3 NA10864-GS000016043-ASM.cgfv3 NA11829-GS000016042-ASM.cgfv3 NA11894-GS000016470-ASM.cgfv3 NA11994-GS000016468-ASM.cgfv3 NA11995-GS000016467-ASM.cgfv3 NA12003-GS000016035-ASM.cgfv3 NA12046-GS000016022-ASM.cgfv3 NA12386-GS000016459-ASM.cgfv3 NA12400-GS000016011-ASM.cgfv3 NA12413-GS000016457-ASM.cgfv3 NA12750-GS000016415-ASM.cgfv3 NA12753-GS000016412-ASM.cgfv3 NA12763-GS000016398-ASM.cgfv3 NA12775-GS000016397-ASM.cgfv3 NA12801-GS000016407-ASM.cgfv3 NA12813-GS000016394-ASM.cgfv3 NA12814-GS000016393-ASM.cgfv3 NA12864-GS000016382-ASM.cgfv3 NA12873-GS000016699-ASM.cgfv3 NA18498-GS000017237-ASM.cgfv3 NA18501-GS000017371-ASM.cgfv3 NA18503-GS000017173-ASM.cgfv3 NA18505-GS000017185-ASM.cgfv3 NA18507-GS000017029-ASM.cgfv3 NA18521-GS000017023-ASM.cgfv3 NA18870-GS000017227-ASM.cgfv3 
NA18872-GS000017119-ASM.cgfv3 NA18923-GS000017262-ASM.cgfv3 NA18924-GS000017261-ASM.cgfv3 NA18933-GS000017045-ASM.cgfv3 NA18935-GS000017047-ASM.cgfv3 NA19093-GS000017050-ASM.cgfv3 NA19097-GS000017053-ASM.cgfv3 NA19100-GS000017041-ASM.cgfv3 NA19109-GS000017039-ASM.cgfv3 NA19116-GS000017035-ASM.cgfv3 NA19117-GS000017051-ASM.cgfv3 NA19120-GS000017052-ASM.cgfv3 NA19130-GS000017055-ASM.cgfv3 NA19143-GS000017054-ASM.cgfv3 NA19145-GS000017133-ASM.cgfv3 NA19146-GS000017131-ASM.cgfv3 NA19152-GS000017413-ASM.cgfv3 NA19154-GS000017411-ASM.cgfv3 NA19172-GS000017279-ASM.cgfv3 NA19173-GS000017257-ASM.cgfv3 NA19189-GS000017277-ASM.cgfv3 NA19190-GS000017276-ASM.cgfv3 NA19191-GS000017275-ASM.cgfv3 NA19201-GS000017244-ASM.cgfv3 NA19202-GS000017174-ASM.cgfv3 NA19238-GS000017268-ASM.cgfv3 NA19239-GS000017267-ASM.cgfv3 NA19240-GS000018625-ASM.cgfv3 NA19256-GS000017258-ASM.cgfv3 NA19404-GS000017239-ASM.cgfv3 NA19434-GS000016668-ASM.cgfv3 NA19435-GS000016558-ASM.cgfv3 NA19440-GS000016557-ASM.cgfv3 NA19443-GS000016556-ASM.cgfv3 NA07357-200-37.cgfv3 NA10851-200-37.cgfv3 NA18501-200-37.cgfv3 NA18505-200-37.cgfv3 NA18537-200-37.cgfv3 NA18942-200-37.cgfv3 NA19020-200-37.cgfv3 NA19129-200-37.cgfv3 NA19238-L2-200.cgfv3 NA19239-L2-200.cgfv3 NA19240-L2-200.cgfv3 NA19701-200-37.cgfv3 NA19703-200-37.cgfv3 NA20850-200-37.cgfv3 NA21732-200-37.cgfv3 NA21733-200-37.cgfv3 NA21737-200-37.cgfv3
Updated by Abram Connelly over 6 years ago
For all but 20 of the 182, recalculating the bands with the fjt tool and the SGLF for tile path 862 (0x035e) works. For example:
fjt -B \
  -L <( zcat $sglf ) \
  -i <( zcat $fjfn ) > $oband
Where $sglf is the location of the SGLF library for tilepath 862 (0x35e), $fjfn is the location of the FastJ file and $oband is the band file to write.
20 of the 182 fail with this method. I suspect they used an alternate mitochondrial DNA reference (assembly.00.human_g1k_v37.fw.gz) and for some reason didn't get converted properly, nor were the new tiles added to the SGLF.
The 20 are:
hu089792-GS02269-DNA_B02 hu1187FF-GS02269-DNA_A04 hu27FD1F-GS02269-DNA_B04 hu3A1B15-GS02269-DNA_C01 hu474789-GS02269-DNA_G04 hu49F623-GS02269-DNA_F01 hu5B8771-GS02269-DNA_B01 hu5E55F5-GS02269-DNA_B03 hu5FCE15-GS01195-DNA_B01 hu602487-GS02269-DNA_F02 hu620F18-GS02269-DNA_E02 hu76CAA5-GS02269-DNA_G03 hu868880-GS02269-DNA_E01 huA02824-GS02269-DNA_G01 huAA245C-GS02269-DNA_D03 huB4883B-GS02269-DNA_H04 huB4F9B2-GS02269-DNA_A05 huE2E371-GS02269-DNA_H03 huF2DA6F-GS02269-DNA_A01 huFCC1C1-GS02269-DNA_A02
Though this is now done in CWL and should be standard fare, the steps are to:
- create the FastJ
- extend the SGLF
- convert to band format
As an example, here are the scripts to do those three steps:
To convert to FastJ:
#!/bin/bash

gffdir="/data-sdd/data/pgp-gff"
afn="/data-sdd/data/l7g/assembly/assembly.00.human_g1k_v37.fw.gz"
other_ref="/data-sdd/data/ref/human_g1k_v37.fa.gz"

for xid in `cat list`; do
  gff=$gffdir/$xid.gff.gz

  echo $xid

  mkdir -p out-data/$xid

  tabix $gff chrM | \
    pasta -a gff-rotini \
      -r <( refstream $other_ref MT ) \
      -T <( tagset 035e ) \
      -A <( tile-assembly tilepath $afn 035e.00 ) \
      --chrom chrM | \
    pasta -a rotini-fastj \
      --tilepath 035e \
      -T <( tagset 035e ) \
      -A <( tile-assembly tilepath $afn 035e.00 ) | \
    bgzip -c > out-data/$xid/035e.fj.gz

  fjt -C <( zcat out-data/$xid/035e.fj.gz ) > out-data/$xid/035e.esglf
done
Where list holds the list of 20.
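The scripts refer to tilepath 862 by its four-digit hexadecimal form, 035e. A quick check of that conversion in both directions, using only shell built-ins:

```shell
#!/bin/bash
# Decimal tilepath number to the zero-padded hex form used in file names.
hexpath=$(printf '%04x' 862)
echo "$hexpath"

# And back from hex to decimal.
decpath=$(( 0x35e ))
echo "$decpath"
```

This prints 035e followed by 862.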
To build the sglf:
#!/bin/bash
#

fastj2cgflib -V \
  -f <( zcat out-data/*/035e.fj.gz ) \
  -t <( ../verbose_tagset 035e ) | \
  egrep -v '^#' | cut -f2- -d, > 035e-mt-extended-from-20.sglf

merge-tilelib <( zcat /data-sdd/data/sglf/035e.sglf.gz | sort ) \
  <( cat 035e-mt-extended-from-20.sglf | sort ) | \
  sort | \
  bgzip -c > 035e-merge.sglf.gz
And to convert to band format:
#!/bin/bash

sglf="./035e-merge.sglf.gz"

for fjfn in `find out-data -name '*.fj.gz'` ; do
  b=`basename $fjfn .fj.gz`
  d=`dirname $fjfn`
  bb=`basename $d`

  oband="$d/$bb-$b.band"

  echo $b $bb '..' $d $oband
  #continue

  fjt -B \
    -L <( zcat $sglf ) \
    -i <( zcat $fjfn ) > $oband
done
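To make the output naming in the band conversion script concrete, here is a worked example of how the band file path is derived from a FastJ path. It uses one of the 20 dataset IDs with the directory layout from the FastJ step and does not call fjt.

```shell
#!/bin/bash
# Worked example of the path handling only; nothing is read or written.
fjfn="out-data/hu089792-GS02269-DNA_B02/035e.fj.gz"

b=$(basename "$fjfn" .fj.gz)   # tilepath part of the file name: 035e
d=$(dirname "$fjfn")           # dataset directory: out-data/hu089792-GS02269-DNA_B02
bb=$(basename "$d")            # dataset ID: hu089792-GS02269-DNA_B02

oband="$d/$bb-$b.band"
echo "$oband"
```

The band file therefore lands next to its FastJ input, as out-data/hu089792-GS02269-DNA_B02/hu089792-GS02269-DNA_B02-035e.band.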
For reference, verbose_tagset is:
#!/bin/bash

tagver="00"
tilepath=$1
#tagset="/data-sdd/data/l7g/tagset.fa/tagset.fa.gz"
tagset="/data-sdd/data/l7g/tagset.fa/tagset.fa.gz"

if [ "$tilepath" == "" ] ; then
  echo "provide tilepath"
  exit 1
fi

echo '>{"type":"tagset","path":"'$tilepath'","field":{0:"path",1:"step",2:"startTag"}}'
echo "$tilepath,0000,"

tilestep=1
while read line ; do
  hstep=`printf '%04x' $tilestep`
  echo "$tilepath,$hstep,$line"
  let tilestep="$tilestep + 1"
done < <( samtools faidx $tagset $tilepath.$tagver | egrep -v '^>' | tr -d '\n' | fold -w 24 ; echo )
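To illustrate what the loop in verbose_tagset produces, here is a small sketch that runs the same fold-and-number logic on a made-up sequence instead of the samtools output. The two 24-base tags are invented for illustration, `emit_tag_rows` is a hypothetical helper name, and the JSON header line is left out.

```shell
#!/bin/bash
# Toy stand-in for the concatenated tag sequence: 24 A's followed by 24 C's.
# These are not real tags.
tilepath="035e"
tags="$(printf '%024d' 0 | tr 0 A)$(printf '%024d' 0 | tr 0 C)"

emit_tag_rows() {
  # Step 0000 has no start tag, so its tag field is empty.
  echo "$tilepath,0000,"
  # Cut the sequence into 24-base tags; the trailing echo terminates the last line.
  { printf '%s' "$tags" | fold -w 24 ; echo ; } | {
    tilestep=1
    while read -r line ; do
      [ -n "$line" ] || continue          # defensive: skip a stray empty line
      hstep=$(printf '%04x' "$tilestep")  # tilestep as four hex digits
      echo "$tilepath,$hstep,$line"
      tilestep=$(( tilestep + 1 ))
    done
  }
}

emit_tag_rows
```

This prints three rows: `035e,0000,` with an empty tag field, then `035e,0001,` followed by the A tag and `035e,0002,` followed by the C tag.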
I haven't yet verified that the round-trip conversion matches, but from an initial check the band conversion looks good. Both the SGLF for tilepath 862 (0x35e) and the cgfv3 files still need to be updated. I'm going to wait for confirmation before updating either.
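One possible shape for the outstanding round-trip check is to compare the band written by fjt with the band read back from a rebuilt cgfv3 via `cgft -b 862`. The sketch below covers only the comparison. `normalize_band` and `bands_match` are hypothetical helpers, and whitespace is normalized on the assumption that the two tools may space their output differently.

```shell
#!/bin/bash
# Collapse tabs and runs of spaces, and trim line ends, so purely cosmetic
# spacing differences do not count as a mismatch.
normalize_band() {
  tr '\t' ' ' < "$1" | tr -s ' ' | sed -e 's/^ //' -e 's/ $//'
}

# Succeed (exit status 0) when both band files hold the same content.
bands_match() {
  [ "$(normalize_band "$1")" = "$(normalize_band "$2")" ]
}

# Hypothetical usage once a rebuilt cgfv3 exists (file names are placeholders):
#   cgft -b 862 rebuilt.cgfv3 > readback.band
#   bands_match "$oband" readback.band || echo "round trip differs"
```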
Updated by Abram Connelly over 6 years ago
- Status changed from In Progress to Closed
We've decided to use the CWL workflow as the 'ground truth'. Running it will recreate the SGLF and the CGF. The CGFv3 files can then be produced by converting the CWL-generated CGFv2 to band format and converting that to CGFv3.
After the CWL gets run, we should put the SGLF, CGF, FastJ and other artifacts in the "l7g Data" project.