<p>Arvados: Lightning - Idea #11671: Convert 650+ CGF files to new CGFv3 (https://dev.arvados.org/issues/11671?journal_id=51677), 2017-05-11T05:25:23Z, Abram Connelly, abram.connelly@gmail.com</p>
<ul></ul><p>The <a href="https://github.com/abeconnelly/cgf/tree/master/cpp" class="external">cgb</a> tool should be able to convert to band format from the old CGF version.</p>
<p><code>fjt</code> from <a class="issue tracker-6 status-5 priority-4 priority-default closed" title="Idea: Add &quot;band&quot; functionality to fjt tool (Closed)" href="https://dev.arvados.org/issues/11672">#11672</a> can be used to convert to the new CGFv3 format.</p>
<p>Lightning - Idea #11671: Convert 650+ CGF files to new CGFv3 (https://dev.arvados.org/issues/11671?journal_id=51736), 2017-05-13T21:58:39Z, Abram Connelly, abram.connelly@gmail.com</p>
<ul><li><strong>Target version</strong> set to <i>Lightning Sprint (2017-05-15 to 2017-05-29)</i></li></ul>
<p>Lightning - Idea #11671: Convert 650+ CGF files to new CGFv3 (https://dev.arvados.org/issues/11671?journal_id=51738), 2017-05-13T22:21:51Z, Abram Connelly, abram.connelly@gmail.com</p>
<ul></ul><p>After conversion, the results need to be double-checked to make sure the conversion was correct. I think the best way is to do the following:</p>
<ul>
<li>For all tile paths except 0x35e, check that the band format derived from the original CGF matches the band format derived from the CGFv3 file. <code>cgb</code> can be used to get the band format for CGFv2 and <code>cgft</code> can be used to produce the band format for CGFv3.</li>
<li>For the mitochondrial DNA tile path, 0x35e, checking that the hashes of the sequence produced by concatenating the FastJ are the same as what's produced from the CGFv3 should be sufficient.</li>
</ul>
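<p>The band comparison in the first item could be organized along these lines. This is only a sketch: <code>bands_match</code> and <code>nonmt_tilepaths</code> are hypothetical helper names, the exact <code>cgb</code> and <code>cgft</code> invocations that produce the band files are left to the caller, and it assumes that tile paths are numbered contiguously from 0 and that the two tools emit byte-identical band output when the data agree.</p>

```shell
# Hypothetical helper: succeed if two band-format dumps are identical.
# Arguments are paths to files that already hold band output; producing
# them (cgb for CGFv2, cgft for CGFv3) is left to the caller.
# Assumption: matching data yields byte-identical band output.
bands_match() {
  cmp -s "$1" "$2"
}

# Print every tile path below the mitochondrial path 0x35e, one per
# line, as "<decimal> <four-digit hex>".
# Assumption: tile paths are numbered contiguously from 0.
nonmt_tilepaths() {
  p=0
  while [ "$p" -lt 862 ]; do   # 862 == 0x35e
    printf '%d %04x\n' "$p" "$p"
    p=$((p + 1))
  done
}
```

<p>A driver loop would read each pair from <code>nonmt_tilepaths</code>, write the two band dumps for that tile path to temporary files, and report any tile path for which <code>bands_match</code> fails.</p>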
<p>The conversion from CGFv3 to sequence can be done via:</p>
<ul>
<li><code>cgft</code> to band format</li>
<li>extend the <code>fjt</code> tool (or extend/make a tool) to take in band format (and an SGLF file) and output FastJ (or CSV)</li>
<li>concatenate FastJ (or CSV) to sequence</li>
</ul>
<p>This process is slow, but since tile path 0x35e is so small, it should be quick enough in practice.</p>
<p>Lightning - Idea #11671: Convert 650+ CGF files to new CGFv3 (https://dev.arvados.org/issues/11671?journal_id=51739), 2017-05-14T06:48:04Z, Abram Connelly, abram.connelly@gmail.com</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul>
<p>Lightning - Idea #11671: Convert 650+ CGF files to new CGFv3 (https://dev.arvados.org/issues/11671?journal_id=51758), 2017-05-16T10:19:07Z, Abram Connelly, abram.connelly@gmail.com</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Closed</i></li></ul><p>720 CGFv3 files have been converted/created. They've been <a href="https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-6dkjzstlu2bc8o6" class="external">uploaded to the cgfv3 collection under the l7g Data project</a>.</p>
<p>I've checked the mitochondrial sequences to make sure they match. The script was run on lightning-dev1, so the context makes it hard to re-run elsewhere, but it's provided here to give an idea of what's involved:</p>
<pre><code class="bash syntaxhl"><span class="c">#!/bin/bash</span>
<span class="nv">sglfgz</span><span class="o">=</span><span class="s2">"/data-sdd/data/sglf/035e.sglf.gz"</span>
<span class="nv">cgfdir</span><span class="o">=</span><span class="s2">"stage.cgfv3"</span>
<span class="k">for </span>fjgz <span class="k">in</span> <span class="sb">`</span>find ./stage ./stage.okg <span class="nt">-name</span> 035e.fj.gz<span class="sb">`</span> <span class="p">;</span> <span class="k">do
</span><span class="nv">name</span><span class="o">=</span><span class="sb">`</span><span class="nb">basename</span> <span class="si">$(</span> <span class="nb">dirname</span> <span class="nv">$fjgz</span> <span class="si">)</span><span class="sb">`</span>
<span class="nb">echo</span> <span class="nv">$name</span>
<span class="nv">cgfv3</span><span class="o">=</span><span class="s2">"</span><span class="nv">$cgfdir</span><span class="s2">/</span><span class="nv">$name</span><span class="s2">.cgfv3"</span>
<span class="nv">a0</span><span class="o">=</span><span class="sb">`</span>cgft <span class="nt">-b</span> 862 <span class="nv">$cgfv3</span> | fjt <span class="nt">-b</span> <span class="nt">-L</span> &lt;<span class="o">(</span> zcat <span class="nv">$sglfgz</span> <span class="o">)</span> | fjt <span class="nt">-c</span> 0 | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span> | <span class="nb">md5sum</span> | <span class="nb">cut</span> <span class="nt">-f1</span> <span class="nt">-d</span><span class="s1">' '</span><span class="sb">`</span>
<span class="nv">b0</span><span class="o">=</span><span class="sb">`</span>fjt <span class="nt">-c</span> 0 &lt;<span class="o">(</span> zcat <span class="nv">$fjgz</span> <span class="o">)</span> | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span> | <span class="nb">md5sum</span> | <span class="nb">cut</span> <span class="nt">-f1</span> <span class="nt">-d</span><span class="s1">' '</span><span class="sb">`</span>
<span class="nv">a1</span><span class="o">=</span><span class="sb">`</span>cgft <span class="nt">-b</span> 862 <span class="nv">$cgfv3</span> | fjt <span class="nt">-b</span> <span class="nt">-L</span> &lt;<span class="o">(</span> zcat <span class="nv">$sglfgz</span> <span class="o">)</span> | fjt <span class="nt">-c</span> 1 | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span> | <span class="nb">md5sum</span> | <span class="nb">cut</span> <span class="nt">-f1</span> <span class="nt">-d</span><span class="s1">' '</span><span class="sb">`</span>
<span class="nv">b1</span><span class="o">=</span><span class="sb">`</span>fjt <span class="nt">-c</span> 1 &lt;<span class="o">(</span> zcat <span class="nv">$fjgz</span> <span class="o">)</span> | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span> | <span class="nb">md5sum</span> | <span class="nb">cut</span> <span class="nt">-f1</span> <span class="nt">-d</span><span class="s1">' '</span><span class="sb">`</span>
<span class="k">if</span> <span class="o">[[</span> <span class="s2">"</span><span class="nv">$a0</span><span class="s2">"</span> <span class="o">!=</span> <span class="s2">"</span><span class="nv">$b0</span><span class="s2">"</span> <span class="o">]]</span> <span class="o">||</span> <span class="o">[[</span> <span class="s2">"</span><span class="nv">$a1</span><span class="s2">"</span> <span class="o">!=</span> <span class="s2">"</span><span class="nv">$b1</span><span class="s2">"</span> <span class="o">]]</span> <span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: </span><span class="nv">$cgfv3</span><span class="s2"> mismatch between mt sequences"</span>
<span class="k">else
</span><span class="nb">echo</span> <span class="s2">" ok"</span>
<span class="k">fi
done</span>
</code></pre>
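<p>The script above repeats the same hashing tail four times. If it were ever reworked, that tail could be factored out roughly as follows. This is a sketch only: <code>seq_md5</code> and <code>digests_match</code> are hypothetical names that are not part of the original script, and the <code>cgft</code>/<code>fjt</code> stages would still be piped in exactly as shown above.</p>

```shell
# Hash a sequence read from stdin, ignoring line breaks, and print only
# the hex digest. This is the same tr | md5sum | cut tail the script uses.
seq_md5() {
  tr -d '\n' | md5sum | cut -f1 -d' '
}

# Hypothetical helper: succeed only if both digest pairs agree.
# Arguments: a0 b0 a1 b1, following the script's variable names.
digests_match() {
  [ "$1" = "$2" ] && [ "$3" = "$4" ]
}
```

<p>With these in place, each of the four assignments would end in <code>| seq_md5</code>, and the final test would become <code>if ! digests_match "$a0" "$b0" "$a1" "$b1"</code>.</p>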
<p>A new <a href="https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-fkbdz2w6b25ayj3" class="external">sglf collection</a> was also created with the new 0x35e sglf tile path library. This was needed for the FastJ conversion.</p>
<p>I'm considering this issue closed. If further checks are needed, we can open another ticket to take care of them.</p>