Bug #12319

closed

cgfv3 files have errors in last tile path (0x35e)

Added by Abram Connelly over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Assigned To:
Target version: -
Story points: -

Description

Some of the cgfv3 files located in l7g Data cgfv3 have errors in their last tile path (862, 0x35e).

For example, dataset huEA4EE5-GS01669-DNA_G02.cgfv3 has only 33 positions in that tile path, whereas there should be 35:

cgft -b 862 huEA4EE5-GS01669-DNA_G02.cgfv3 
[ 169 -1 5 0 1 0 0 -1 0 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 -1 0 37]
[ 169 -1 5 0 1 0 0 -1 0 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 -1 0 37]
[[ 0 63 ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 482 1 1003 1 1058 1 3262 1 4261 1 4705 1 4712 1 5094 1 5467 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 758 2 ]]
[[ 0 63 ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 482 1 1003 1 1058 1 3262 1 4261 1 4705 1 4712 1 5094 1 5467 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 758 2 ]]

I suspect this is caused by incorrect handling of the last tilestep in the tilepath: the error appears to be triggered when the last tile in the last tilepath is a spanning tile.

This ticket is to go through the cgfv3 files, figure out which ones have a bad last tile path and correct them.
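
A minimal sketch of how such a scan could be done, assuming `cgft -b 862 <file>` prints the band vector for the first allele on its first line, as in the output above. The function names, the output file name, and the use of 35 as the expected length (taken from the description) are illustrative assumptions, not necessarily the method that was actually used.

```shell
#!/bin/bash

# Count the integers in the first band vector (first line) read from stdin.
# Brackets are replaced by spaces so only the values themselves are counted.
count_band_positions() {
  head -n 1 | sed 's/[][]/ /g' | wc -w | tr -d ' '
}

# Print the names of the cgfv3 files whose tile path 862 band does not have
# the expected number of positions. Assumes `cgft` is available on the PATH.
find_bad_last_tilepath() {
  local expected=35 f n
  for f in "$@"; do
    n=$(cgft -b 862 "$f" | count_band_positions)
    if [ "$n" -ne "$expected" ]; then
      echo "$f"
    fi
  done
}

# Example usage (hypothetical directory and output file):
#   find_bad_last_tilepath /path/to/cgfv3/*.cgfv3 > bad-last-tilepath.list
```

Applied to the first line of the huEA4EE5-GS01669-DNA_G02 output above, `count_band_positions` prints 33.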

When converting from CGFv2 to CGFv3, I (Abram) noticed an incorrect 0x35e SGLF file and updated it. After updating the SGLF file, I updated the corresponding CGF files. Since they used a new SGLF, I most likely had to go back to the FastJ and convert from there. Whatever method was used to convert to the new format probably had a bug that dropped the last tiles when they were non-anchor spanning tiles. Though this should be checked, I suspect only the last tile path has this issue, as the other tile paths didn't need to be recomputed.

Actions #1

Updated by Abram Connelly over 6 years ago

  • Status changed from New to In Progress

182 of the 720 cgfv3 files have a corrupted last tile path (0x35e):

HG00436-GS000016683-ASM.cgfv3
HG00437-GS000016672-ASM.cgfv3
HG00438-GS000016673-ASM.cgfv3
HG00442-GS000016869-ASM.cgfv3
HG00449-GS000016677-ASM.cgfv3
HG00450-GS000016678-ASM.cgfv3
HG00476-GS000016710-ASM.cgfv3
HG00477-GS000016711-ASM.cgfv3
HG00537-GS000016720-ASM.cgfv3
HG00538-GS000016721-ASM.cgfv3
HG00611-GS000016851-ASM.cgfv3
HG00612-GS000016852-ASM.cgfv3
HG00626-GS000017125-ASM.cgfv3
HG00627-GS000016860-ASM.cgfv3
HG00662-GS000016981-ASM.cgfv3
HG00671-GS000019202-ASM.cgfv3
HG01566-GS000016976-ASM.cgfv3
HG01567-GS000012114-ASM.cgfv3
HG01924-GS000013460-ASM.cgfv3
HG01925-GS000013413-ASM.cgfv3
HG01926-GS000013203-ASM.cgfv3
HG01933-GS000012137-ASM.cgfv3
HG01934-GS000012138-ASM.cgfv3
HG01967-GS000013456-ASM.cgfv3
HG02003-GS000017113-ASM.cgfv3
HG02004-GS000017112-ASM.cgfv3
HG02105-GS000016959-ASM.cgfv3
HG02106-GS000016960-ASM.cgfv3
HG02147-GS000016341-ASM.cgfv3
HG02148-GS000016961-ASM.cgfv3
HG02286-GS000016419-ASM.cgfv3
HG02287-GS000017903-ASM.cgfv3
HG02601-GS000016693-ASM.cgfv3
HG02602-GS000016692-ASM.cgfv3
HG02604-GS000016690-ASM.cgfv3
HG02605-GS000016689-ASM.cgfv3
HG02657-GS000016544-ASM.cgfv3
HG02658-GS000017179-ASM.cgfv3
HG02659-GS000017180-ASM.cgfv3
HG02685-GS000017225-ASM.cgfv3
HG02686-GS000016439-ASM.cgfv3
HG02697-GS000017229-ASM.cgfv3
HG02698-GS000017230-ASM.cgfv3
HG02728-GS000017019-ASM.cgfv3
HG02729-GS000017017-ASM.cgfv3
HG02734-GS000017233-ASM.cgfv3
HG02735-GS000016431-ASM.cgfv3
HG02783-GS000017016-ASM.cgfv3
HG02784-GS000017015-ASM.cgfv3
HG02785-GS000017012-ASM.cgfv3
HG02787-GS000017021-ASM.cgfv3
hu0211D6-GS01175-DNA_E02.cgfv3
hu025CEA-GS01669-DNA_D02.cgfv3
hu034DB1-GS01669-DNA_A03.cgfv3
hu050E9C-GS01173-DNA_G06.cgfv3
hu085B6D-GS03132-DNA_A01.cgfv3
hu089792-GS02269-DNA_B02.cgfv3
hu0E7AAF-GS03023-DNA_A01.cgfv3
hu1187FF-GS02269-DNA_A04.cgfv3
hu15FECA-GS01175-DNA_F03.cgfv3
hu19C09F-GS01669-DNA_F05.cgfv3
hu27FD1F-GS02269-DNA_B04.cgfv3
hu3A1B15-GS02269-DNA_C01.cgfv3
hu474789-GS02269-DNA_G04.cgfv3
hu49F623-GS02269-DNA_F01.cgfv3
hu553620-GS01669-DNA_B08.cgfv3
hu5B8771-GS02269-DNA_B01.cgfv3
hu5E55F5-GS02269-DNA_B03.cgfv3
hu5FCE15-GS01195-DNA_B01.cgfv3
hu602487-GS02269-DNA_F02.cgfv3
hu620F18-GS02269-DNA_E02.cgfv3
hu7123C1-GS01669-DNA_D07.cgfv3
hu76CAA5-GS02269-DNA_G03.cgfv3
hu775356-GS01175-DNA_A07.cgfv3
hu83BC6A-GS03023-DNA_C01.cgfv3
hu868880-GS02269-DNA_E01.cgfv3
hu925B56-GS03274-DNA_B01.cgfv3
hu92FD55-GS01669-DNA_A04.cgfv3
hu939B7C-GS01670-DNA_C01.cgfv3
huA02824-GS02269-DNA_G01.cgfv3
huA3A02C-GS03166-DNA_E02.cgfv3
huA49E22-GS01669-DNA_E04.cgfv3
huA4F281-GS01669-DNA_H10.cgfv3
huA5FD8B-GS02269-DNA_C02.cgfv3
huAA245C-GS02269-DNA_D03.cgfv3
huAE4A11-GS01669-DNA_F02.cgfv3
huB1FD55-GS01173-DNA_B07.cgfv3
huB4883B-GS02269-DNA_H04.cgfv3
huB4F9B2-GS02269-DNA_A05.cgfv3
huBA30D4-GS01173-DNA_H05.cgfv3
huBEDA0B-GS01669-DNA_E02.cgfv3
huCCAFD0-GS01669-DNA_B10.cgfv3
huD10E53-GS01669-DNA_B04.cgfv3
huD649F1-GS03133-DNA_D02.cgfv3
huD9EE1E-GS01669-DNA_F09.cgfv3
huE2E371-GS02269-DNA_H03.cgfv3
huEA4EE5-GS01669-DNA_G02.cgfv3
huF2DA6F-GS02269-DNA_A01.cgfv3
huF80F84-GS01669-DNA_B01.cgfv3
huFA70A3-GS01670-DNA_G02.cgfv3
huFCC1C1-GS02269-DNA_A02.cgfv3
NA07000-GS000016078-ASM.cgfv3
NA07029-GS000013213-ASM.cgfv3
NA10852-GS000016045-ASM.cgfv3
NA10861-GS000016044-ASM.cgfv3
NA10864-GS000016043-ASM.cgfv3
NA11829-GS000016042-ASM.cgfv3
NA11894-GS000016470-ASM.cgfv3
NA11994-GS000016468-ASM.cgfv3
NA11995-GS000016467-ASM.cgfv3
NA12003-GS000016035-ASM.cgfv3
NA12046-GS000016022-ASM.cgfv3
NA12386-GS000016459-ASM.cgfv3
NA12400-GS000016011-ASM.cgfv3
NA12413-GS000016457-ASM.cgfv3
NA12750-GS000016415-ASM.cgfv3
NA12753-GS000016412-ASM.cgfv3
NA12763-GS000016398-ASM.cgfv3
NA12775-GS000016397-ASM.cgfv3
NA12801-GS000016407-ASM.cgfv3
NA12813-GS000016394-ASM.cgfv3
NA12814-GS000016393-ASM.cgfv3
NA12864-GS000016382-ASM.cgfv3
NA12873-GS000016699-ASM.cgfv3
NA18498-GS000017237-ASM.cgfv3
NA18501-GS000017371-ASM.cgfv3
NA18503-GS000017173-ASM.cgfv3
NA18505-GS000017185-ASM.cgfv3
NA18507-GS000017029-ASM.cgfv3
NA18521-GS000017023-ASM.cgfv3
NA18870-GS000017227-ASM.cgfv3
NA18872-GS000017119-ASM.cgfv3
NA18923-GS000017262-ASM.cgfv3
NA18924-GS000017261-ASM.cgfv3
NA18933-GS000017045-ASM.cgfv3
NA18935-GS000017047-ASM.cgfv3
NA19093-GS000017050-ASM.cgfv3
NA19097-GS000017053-ASM.cgfv3
NA19100-GS000017041-ASM.cgfv3
NA19109-GS000017039-ASM.cgfv3
NA19116-GS000017035-ASM.cgfv3
NA19117-GS000017051-ASM.cgfv3
NA19120-GS000017052-ASM.cgfv3
NA19130-GS000017055-ASM.cgfv3
NA19143-GS000017054-ASM.cgfv3
NA19145-GS000017133-ASM.cgfv3
NA19146-GS000017131-ASM.cgfv3
NA19152-GS000017413-ASM.cgfv3
NA19154-GS000017411-ASM.cgfv3
NA19172-GS000017279-ASM.cgfv3
NA19173-GS000017257-ASM.cgfv3
NA19189-GS000017277-ASM.cgfv3
NA19190-GS000017276-ASM.cgfv3
NA19191-GS000017275-ASM.cgfv3
NA19201-GS000017244-ASM.cgfv3
NA19202-GS000017174-ASM.cgfv3
NA19238-GS000017268-ASM.cgfv3
NA19239-GS000017267-ASM.cgfv3
NA19240-GS000018625-ASM.cgfv3
NA19256-GS000017258-ASM.cgfv3
NA19404-GS000017239-ASM.cgfv3
NA19434-GS000016668-ASM.cgfv3
NA19435-GS000016558-ASM.cgfv3
NA19440-GS000016557-ASM.cgfv3
NA19443-GS000016556-ASM.cgfv3
NA07357-200-37.cgfv3
NA10851-200-37.cgfv3
NA18501-200-37.cgfv3
NA18505-200-37.cgfv3
NA18537-200-37.cgfv3
NA18942-200-37.cgfv3
NA19020-200-37.cgfv3
NA19129-200-37.cgfv3
NA19238-L2-200.cgfv3
NA19239-L2-200.cgfv3
NA19240-L2-200.cgfv3
NA19701-200-37.cgfv3
NA19703-200-37.cgfv3
NA20850-200-37.cgfv3
NA21732-200-37.cgfv3
NA21733-200-37.cgfv3
NA21737-200-37.cgfv3
Actions #2

Updated by Abram Connelly over 6 years ago

For all but 20 of the 182, recalculating the bands with the fjt tool and the SGLF for tile path 862 (0x35e) works. For example:

  fjt -B \
    -L <( zcat $sglf ) \
    -i <( zcat $fjfn ) > $oband

Where $sglf is the location of the SGLF library for tilepath 862 (0x35e) and $fjfn is the location of the FastJ file.

The other 20 of the 182 fail with this method. I suspect they used an alternate mitochondrial DNA reference (assembly.00.human_g1k_v37.fw.gz) and, for some reason, were not converted properly, and the new tiles were never added to the SGLF.

The 20 are:

hu089792-GS02269-DNA_B02
hu1187FF-GS02269-DNA_A04
hu27FD1F-GS02269-DNA_B04
hu3A1B15-GS02269-DNA_C01
hu474789-GS02269-DNA_G04
hu49F623-GS02269-DNA_F01
hu5B8771-GS02269-DNA_B01
hu5E55F5-GS02269-DNA_B03
hu5FCE15-GS01195-DNA_B01
hu602487-GS02269-DNA_F02
hu620F18-GS02269-DNA_E02
hu76CAA5-GS02269-DNA_G03
hu868880-GS02269-DNA_E01
huA02824-GS02269-DNA_G01
huAA245C-GS02269-DNA_D03
huB4883B-GS02269-DNA_H04
huB4F9B2-GS02269-DNA_A05
huE2E371-GS02269-DNA_H03
huF2DA6F-GS02269-DNA_A01
huFCC1C1-GS02269-DNA_A02
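
The 162 samples that can be fixed by plain recalculation are the 182 affected files minus these 20. A minimal sketch of that set difference, assuming the two lists have been saved one entry per line in hypothetical files (the 182 with their `.cgfv3` suffix, the 20 as bare sample IDs, as they appear in this ticket):

```shell
#!/bin/bash

# Print the sample IDs that are in the "bad" list but not in the "failed" list.
#   $1: file of cgfv3 file names, one per line, with the ".cgfv3" suffix
#   $2: file of sample IDs that failed recalculation, one per line, no suffix
list_recomputable() {
  local bad="$1" failed="$2"
  # Strip the suffix, then drop every line that exactly matches a failed ID.
  sed 's/\.cgfv3$//' "$bad" | grep -vxFf "$failed"
}

# Example usage (hypothetical file names):
#   list_recomputable bad-182.list failed-20.list > recomputable-162.list
```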

Though this is now done in CWL and should be standard fare, the steps are to:

  • create the FastJ
  • extend the SGLF
  • convert to band format

As an example, here are the scripts to do those three steps:

To convert to FastJ:

#!/bin/bash
#
# Convert the chrM calls of every sample named in `list` to FastJ
# for tile path 035e.

gffdir="/data-sdd/data/pgp-gff"
afn="/data-sdd/data/l7g/assembly/assembly.00.human_g1k_v37.fw.gz"
other_ref="/data-sdd/data/ref/human_g1k_v37.fa.gz"

for xid in $(cat list); do
  gff="$gffdir/$xid.gff.gz"

  echo "$xid"

  mkdir -p "out-data/$xid"

  # GFF -> rotini -> FastJ, using the tag set and tile assembly for 035e
  tabix "$gff" chrM | \
    pasta -a gff-rotini \
          -r <( refstream "$other_ref" MT ) \
          -T <( tagset 035e ) \
          -A <( tile-assembly tilepath "$afn" 035e.00 ) \
          --chrom chrM | \
    pasta -a rotini-fastj \
          --tilepath 035e \
          -T <( tagset 035e ) \
          -A <( tile-assembly tilepath "$afn" 035e.00 ) | \
    bgzip -c > "out-data/$xid/035e.fj.gz"

  # Write 035e.esglf from the FastJ with fjt -C
  fjt -C <( zcat "out-data/$xid/035e.fj.gz" ) > "out-data/$xid/035e.esglf"

done

Here, list is a file holding the 20 sample IDs given above.

To build the sglf:

#!/bin/bash
#

fastj2cgflib -V \
  -f <( zcat out-data/*/035e.fj.gz ) \
  -t <( ../verbose_tagset 035e ) | \
  egrep -v '^#' | cut -f2- -d, > 035e-mt-extended-from-20.sglf

merge-tilelib <( zcat /data-sdd/data/sglf/035e.sglf.gz | sort ) \
  <( cat 035e-mt-extended-from-20.sglf | sort ) | \
  sort | \
  bgzip -c > 035e-merge.sglf.gz

And to convert to band format:

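#!/bin/bash
#
# Recompute the band for every FastJ under out-data using the merged SGLF.

sglf="./035e-merge.sglf.gz"

for fjfn in $(find out-data -name '*.fj.gz') ; do
  b=$(basename "$fjfn" .fj.gz)   # tile path, e.g. 035e
  d=$(dirname "$fjfn")           # out-data/<sample ID>
  bb=$(basename "$d")            # sample ID

  oband="$d/$bb-$b.band"

  echo "$b" "$bb" '..' "$d" "$oband"

  fjt -B \
    -L <( zcat "$sglf" ) \
    -i <( zcat "$fjfn" ) > "$oband"
done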

For reference, verbose_tagset is:

#!/bin/bash

tagver="00"
tilepath=$1
tagset="/data-sdd/data/l7g/tagset.fa/tagset.fa.gz"

if [ "$tilepath" == "" ] ; then
  echo "provide tilepath" 
  exit 1
fi

echo '>{"type":"tagset","path":"'$tilepath'","field":{0:"path",1:"step",2:"startTag"}}'
echo "$tilepath,0000," 

tilestep=1

while read line ; do
  hstep=`printf '%04x' $tilestep`

  echo "$tilepath,$hstep,$line" 

  let tilestep="$tilestep + 1" 
done < <( samtools faidx $tagset $tilepath.$tagver | egrep -v '^>' | tr -d '\n' | fold -w 24 ; echo )
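
To illustrate the line format this produces without needing samtools or the tagset file, here is a sketch of the same loop applied to a tag sequence supplied on stdin. The function name `emit_tagset_lines` and the sample sequence are made up for illustration, and the JSON header line is omitted; the 24-base tag width and the 4-digit hex tile step follow the script above.

```shell
#!/bin/bash

# Read a concatenated tag sequence on stdin and print one CSV line per tile
# step: "<tilepath>,<hex step>,<24-base start tag>". Step 0000 has no start
# tag, matching the first data line emitted by verbose_tagset.
emit_tagset_lines() {
  local tilepath="$1"
  echo "$tilepath,0000,"
  # Join the input into one line, cut it into 24-base tags, and terminate
  # the final tag with a newline so that `read` sees it.
  { tr -d '\n' | fold -w 24; echo; } | {
    tilestep=1
    while read -r line; do
      printf '%s,%04x,%s\n' "$tilepath" "$tilestep" "$line"
      tilestep=$((tilestep + 1))
    done
  }
}

# Example with a made-up 48-base sequence (two tags):
#   printf 'acgtacgtacgtacgtacgtacgtttttggggccccaaaattttgggg\n' | emit_tagset_lines 035e
# prints:
#   035e,0000,
#   035e,0001,acgtacgtacgtacgtacgtacgt
#   035e,0002,ttttggggccccaaaattttgggg
```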

I haven't yet verified that the round-trip conversion matches, but from an initial check the band conversion looks good. The SGLF for tilepath 862 (0x35e) and the cgfv3 files both still need to be updated. I'm going to wait for confirmation before updating either.
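
One possible way to do the round-trip check is sketched below. It assumes that the band text written by `fjt -B` and the band text printed by `cgft -b 862` on the rebuilt cgfv3 should be identical apart from whitespace; that assumption, the function names, and the file names in the usage comment are all hypothetical, and none of this has been run against the actual data.

```shell
#!/bin/bash

# Normalize band text: put single spaces around brackets, squeeze runs of
# blanks, and trim each line, so purely cosmetic differences are ignored.
normalize_band() {
  sed 's/[[]/ [ /g; s/[]]/ ] /g' "$1" | tr -s ' \t' ' ' | sed 's/^ //; s/ $//'
}

# Return 0 if the two band files hold the same values, 1 otherwise.
bands_match() {
  diff -q <(normalize_band "$1") <(normalize_band "$2") > /dev/null
}

# Example usage (hypothetical file names):
#   cgft -b 862 rebuilt.cgfv3 > roundtrip.band
#   bands_match out-data/$xid/$xid-035e.band roundtrip.band && echo "match"
```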

Actions #3

Updated by Abram Connelly over 6 years ago

  • Status changed from In Progress to Closed

We've decided to use the CWL as the 'ground truth'. Running it will recreate the SGLF and the CGF. Conversion to CGFv3 can then be done by converting the CWL-generated CGFv2 to band format and converting that to CGFv3.

After the CWL gets run, we should put the SGLF, CGF, FastJ and other artifacts in the "l7g Data" project.
