Idea #11671

Convert 650+ CGF files to new CGFv3

Added by Abram Connelly almost 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: High
Assigned To: -
Story points: 4.0

Description

CGF files in the l7g Data project need to be converted over to the new CGFv3 format.

There are two directories of CGF files:

The workflow should be to convert each CGF file to 'band' information, then use the new cgft to convert to the CGFv3 format.

This should be fairly straightforward, except for some CGF files that have errors in their mitochondrial DNA sequence conversion (tile path 0x35e).

A new tile library for tile path 0x35e needs to be generated before this can be properly completed.

The resulting CGF files should also be stored in the l7g Data project, with the old ones renamed with a timestamp in that same project to differentiate them.
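
The renaming step could look something like the sketch below. The directory argument, the `.cgf` extension, and the timestamp format are assumptions made for illustration; the ticket does not specify any of them, and `rename_with_timestamp` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: rename every *.cgf file in a directory by inserting a timestamp
# before the extension, e.g. sample.cgf -> sample.20170515-120000.cgf.
# The directory layout and naming scheme are assumptions, not project conventions.

rename_with_timestamp() {
  local dir="$1"
  # Allow a fixed stamp as the second argument so a whole batch shares one value.
  local stamp="${2:-$(date +%Y%m%d-%H%M%S)}"
  local f
  for f in "$dir"/*.cgf ; do
    [ -e "$f" ] || continue            # skip when the glob matches nothing
    mv -- "$f" "${f%.cgf}.$stamp.cgf"
  done
}
```

Called as `rename_with_timestamp ./old-cgf`, this stamps every file in one pass. Running it a second time would stamp the already renamed files again, so it is meant to be run once per directory.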

Actions #1

Updated by Abram Connelly almost 7 years ago

The cgb tool should be able to convert to band format from the old CGF version.

fjt from #11672 can be used to convert to the new CGFv3 format.

Actions #2

Updated by Abram Connelly almost 7 years ago

  • Target version set to Lightning Sprint (2017-05-15 to 2017-05-29)

Actions #3

Updated by Abram Connelly almost 7 years ago

After conversion, a double check needs to occur to make sure the conversion went correctly. I think the best way is to do the following:

  • For all tile paths except 0x35e, check that the band format derived from the original CGF matches the band format derived from the new CGFv3 file. cgb can be used to get the band format from the old CGF (CGFv2) and cgft can be used to produce the band format from CGFv3.
  • For the mitochondrial DNA tile path, 0x35e, checking that the hashes of the sequence produced by concatenating the FastJ are the same as what's produced from the CGFv3 should be sufficient.
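
For the first check, once both band outputs have been written to files, a byte-wise comparison should be enough. The sketch below assumes the old and new band outputs are stored under the same file name in two hypothetical directories (one file per sample and tile path, with a `.band` extension). How those files are produced with cgb and cgft is deliberately left out, and `compare_band_dirs` is an illustrative name.

```shell
# Sketch: report band files that are missing or differ between two directories.
# Directory layout and the .band extension are assumptions for illustration.
compare_band_dirs() {
  local old_dir="$1"
  local new_dir="$2"
  local status=0
  local old name
  for old in "$old_dir"/*.band ; do
    [ -e "$old" ] || continue          # skip when the glob matches nothing
    name=$(basename "$old")
    if [ ! -e "$new_dir/$name" ] ; then
      echo "MISSING: $name"
      status=1
    elif ! cmp -s "$old" "$new_dir/$name" ; then
      echo "MISMATCH: $name"
      status=1
    fi
  done
  return "$status"                     # non-zero if anything was missing or differed
}
```

The non-zero return value makes it easy to stop a batch job as soon as one sample fails the check.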

The conversion from CGFv3 to sequence can be done via:

  • cgft to band format
  • extend the fjt tool (or extend/make a tool) to take in band format (and an SGLF file) and output FastJ (or CSV)
  • concatenate FastJ (or CSV) to sequence

This process is slow in general, but tile path 0x35e is small enough that it should be quick to run.
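
Because the two sequences may be wrapped at different line lengths, the hash comparison only makes sense if line breaks are stripped first. A minimal sketch of that idea, using only standard tools (`seq_md5` is a hypothetical helper name):

```shell
# Sketch: hash a sequence read from stdin while ignoring line breaks, so that
# the same sequence with different line wrapping yields the same hash.
seq_md5() {
  tr -d '\n' | md5sum | cut -f1 -d' '
}
```

Any command that writes the sequence to stdout can be piped into `seq_md5`, and the two resulting hashes compared as plain strings.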

Actions #4

Updated by Abram Connelly almost 7 years ago

  • Status changed from New to In Progress

Actions #5

Updated by Abram Connelly almost 7 years ago

  • Status changed from In Progress to Closed

720 CGFv3 files have been converted/created. They've been uploaded to the cgfv3 collection under the l7g Data project.

I've checked the mitochondrial sequences to make sure they match. The script was run on lightning-dev1, so its paths make it hard to re-run elsewhere, but it's provided here to give an idea of what's involved:

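#!/bin/bash
# For each sample, compare the mitochondrial sequence (tile path 0x35e,
# decimal 862) recovered from the CGFv3 file against the original FastJ.

sglfgz="/data-sdd/data/sglf/035e.sglf.gz"
cgfdir="stage.cgfv3"

for fjgz in $(find ./stage ./stage.okg -name 035e.fj.gz) ; do
  # The sample name is the directory that holds the FastJ file.
  name=$(basename "$(dirname "$fjgz")")
  echo "$name"

  cgfv3="$cgfdir/$name.cgfv3"

  # a0/a1: CGFv3 -> band -> FastJ -> sequence hash (fjt -c 0 and -c 1).
  # b0/b1: original FastJ -> sequence hash (fjt -c 0 and -c 1).
  a0=$(cgft -b 862 "$cgfv3" | fjt -b -L <( zcat "$sglfgz" ) | fjt -c 0 | tr -d '\n' | md5sum | cut -f1 -d' ')
  b0=$(fjt -c 0 <( zcat "$fjgz" ) | tr -d '\n' | md5sum | cut -f1 -d' ')

  a1=$(cgft -b 862 "$cgfv3" | fjt -b -L <( zcat "$sglfgz" ) | fjt -c 1 | tr -d '\n' | md5sum | cut -f1 -d' ')
  b1=$(fjt -c 1 <( zcat "$fjgz" ) | tr -d '\n' | md5sum | cut -f1 -d' ')

  if [[ "$a0" != "$b0" ]] || [[ "$a1" != "$b1" ]] ; then
    echo "ERROR: $cgfv3 mismatch between mt sequences"
  else
    echo "  ok"
  fi

done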

A new sglf collection was also created with the new 0x35e sglf tile path library. This was needed for the FastJ conversion.

I'm considering this issue closed. If further checks are needed, we can open another ticket to take care of them.
