For all but 20 of the 182, recalculating the bands with the fjt tool and the SGLF for tile path 862 (0x035e) works. For example:
fjt -B \
-L <( zcat $sglf ) \
-i <( zcat $fjfn ) > $oband
Where $sglf
is the location of the SGLF library for tilepath 862 (0x35e) and $fjfn
is the location of the FastJ file.
20 of the 182 fail with this method. I suspect they used an alternate mitochondrial DNA reference (assembly.00.human_g1k_v37.fw.gz
) and for some reason didn't get converted properly, nor were the new tiles added to the sglf.
The 20 are:
hu089792-GS02269-DNA_B02
hu1187FF-GS02269-DNA_A04
hu27FD1F-GS02269-DNA_B04
hu3A1B15-GS02269-DNA_C01
hu474789-GS02269-DNA_G04
hu49F623-GS02269-DNA_F01
hu5B8771-GS02269-DNA_B01
hu5E55F5-GS02269-DNA_B03
hu5FCE15-GS01195-DNA_B01
hu602487-GS02269-DNA_F02
hu620F18-GS02269-DNA_E02
hu76CAA5-GS02269-DNA_G03
hu868880-GS02269-DNA_E01
huA02824-GS02269-DNA_G01
huAA245C-GS02269-DNA_D03
huB4883B-GS02269-DNA_H04
huB4F9B2-GS02269-DNA_A05
huE2E371-GS02269-DNA_H03
huF2DA6F-GS02269-DNA_A01
huFCC1C1-GS02269-DNA_A02
Though this is now done in CWL and should be standard fare, the steps are to:
- create the FastJ
- extend the SGLF
- convert to band format
As an example, here are the scripts to do those three steps:
To convert to FastJ:
#!/bin/bash
gffdir="/data-sdd/data/pgp-gff"
afn="/data-sdd/data/l7g/assembly/assembly.00.human_g1k_v37.fw.gz"
other_ref="/data-sdd/data/ref/human_g1k_v37.fa.gz"
for xid in `cat list`; do
gff=$gffdir/$xid.gff.gz
echo $xid
mkdir -p out-data/$xid
tabix $gff chrM | \
pasta -a gff-rotini \
-r <( refstream $other_ref MT ) \
-T <( tagset 035e ) \
-A <( tile-assembly tilepath $afn 035e.00 ) \
--chrom chrM | \
pasta -a rotini-fastj \
--tilepath 035e \
-T <( tagset 035e ) \
-A <( tile-assembly tilepath $afn 035e.00 ) | \
bgzip -c > out-data/$xid/035e.fj.gz
fjt -C <( zcat out-data/$xid/035e.fj.gz ) > out-data/$xid/035e.esglf
done
Where list
holds the list of 20.
To build the sglf:
#!/bin/bash
#
fastj2cgflib -V \
-f <( zcat out-data/*/035e.fj.gz ) \
-t <( ../verbose_tagset 035e ) | \
egrep -v '^#' | cut -f2- -d, > 035e-mt-extended-from-20.sglf
merge-tilelib <( zcat /data-sdd/data/sglf/035e.sglf.gz | sort ) \
<( cat 035e-mt-extended-from-20.sglf | sort ) | \
sort | \
bgzip -c > 035e-merge.sglf.gz
And to convert to band format:
#!/bin/bash
sglf="./035e-merge.sglf.gz"
for fjfn in `find out-data -name '*.fj.gz'` ; do
b=`basename $fjfn .fj.gz`
d=`dirname $fjfn`
bb=`basename $d`
oband="$d/$bb-$b.band"
echo $b $bb '..' $d $oband
#continue
fjt -B \
-L <( zcat $sglf ) \
-i <( zcat $fjfn ) > $oband
done
For reference, verbose_tagset
is:
#!/bin/bash
tagver="00"
tilepath=$1
#tagset="/data-sdd/data/l7g/tagset.fa/tagset.fa.gz"
tagset="/data-sdd/data/l7g/tagset.fa/tagset.fa.gz"
if [ "$tilepath" == "" ] ; then
echo "provide tilepath"
exit 1
fi
echo '>{"type":"tagset","path":"'$tilepath'","field":{0:"path",1:"step",2:"startTag"}}'
echo "$tilepath,0000,"
tilestep=1
while read line ; do
hstep=`printf '%04x' $tilestep`
echo "$tilepath,$hstep,$line"
let tilestep="$tilestep + 1"
done < <( samtools faidx $tagset $tilepath.$tagver | egrep -v '^>' | tr -d '\n' | fold -w 24 ; echo )
Though I haven't checked the conversion to make sure the round trip conversion matches, from an initial check, the band conversion looks good. The SGLF needs to be updated for tilepath 862 (0x35e) and the cgfv3 needs to be updated. I'm going to wait for confirmation before updating the SGLF and the cgfv3.