Abram Connelly, 11/15/2016 04:42 PM


cgb

cgb is a tool for accessing files in the binary compact genome format (CGF). It is still in the prototyping stage.

Code for cgb can be found on github.com/abeconnelly/cgf.

Quick start

$ git clone https://github.com/abeconnelly/cgf
$ cd cgf/cpp
$ ./cmp.sh
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -s 0 -B -k -p 862
[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1]
[ 79 2 0 0 0 0 0 -1 0 0 0 390 0 0 0 0 0 1 0 0 0 0 0 0 26 0 0 1 0 0 -1 34 -1 185 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]

Brief overview

cgb is meant to help debug and inspect CGF files. It has two main features: reporting the contents of a CGF file as tile variants and low-quality information, and performing basic tile concordance operations. The code cgb uses is shared with the Lightning CGF server, so cgb also serves in part to test functionality used there.

Concordance

A CGF file stores information in different 'tiers': a bit vector recording whether each tile is canonical, a cache holding the first 8 tile variants, and overflow tables that are used when the cache is exceeded. For testing and rough estimates, cgb supports different 'levels' of concordance:

  • Level 0 - compare canonical tiles only
  • Level 1 - compare canonical tiles and cache
  • Level 2 - a full tile concordance

Example:

$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -i data/hg19.cgf -l 0
level: 0, canonical match: 6491163
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -i data/hg19.cgf -l 1
level: 1, canonical+cache match: 6519788, loq: 148760
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -i data/hg19.cgf -l 2
#match_tot: 6610685
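
The three runs above can be compared side by side with a small script. The following is a minimal sketch in Python, not part of cgb: the function name `parse_concordance` is hypothetical, the line formats are taken only from the sample output shown here, and treating the counts as cumulative across levels is an assumption that the increasing numbers suggest but this page does not state.

```python
import re

# Sample lines copied from the cgb output shown above.
SAMPLE_OUTPUT = """\
level: 0, canonical match: 6491163
level: 1, canonical+cache match: 6519788, loq: 148760
#match_tot: 6610685
"""


def parse_concordance(text):
    """Return {level: match_count} from cgb concordance output.

    Only the three line shapes seen in the sample are recognised. The
    level 2 line carries no level number, so it is mapped to 2 here.
    """
    counts = {}
    for line in text.splitlines():
        m = re.match(r"level: (\d+), [\w+]+ match: (\d+)", line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
            continue
        m = re.match(r"#match_tot: (\d+)", line)
        if m:
            counts[2] = int(m.group(1))
    return counts


counts = parse_concordance(SAMPLE_OUTPUT)
# Extra matches picked up by each higher level (assumes cumulative counts).
gained = {level: counts[level] - counts[level - 1] for level in (1, 2)}
print(counts)
print(gained)
```

Under that assumption, the sample shows 28625 additional matches at level 1 and a further 90897 at level 2.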

CGF Inspection

JSON Tile Path information

Get tile path 862 (0x35e, which is chrM), starting at tile 0 and including low-quality information, and print it in JSON format.

$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -p 862 -s 0 -B
{
  "035e":{
    "tilepath":862,
    "start_tilestep":0,
    "allele":[
      [ 79, 8, 0, 0, 0, 0, 0, -1, 0, 0, 0, 389, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 34, 
      -1, 185, 1 ],
      [ 79, 2, 0, 0, 0, 0, 0, -1, 0, 0, 0, 390, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 26, 0, 0, 1, 0, 0, -1, 34, 
      -1, 185, 1 ]
    ],
    "loq_info":[
      [ [ ], [ ], [ ], [ ], [ ], [ ], [ 903, 1 ], [ ], [ 16, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ 96, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], 
        [ ], [ ], [ 291, 2 ] ],
      [ [ ], [ ], [ ], [ ], [ ], [ ], [ 903, 1 ], [ ], [ 16, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ 96, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], 
        [ ], [ ], [ 291, 2 ] ]
    ]
  }
}
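
Because the output is plain JSON, it can be consumed directly by a script. The sketch below is an illustration in Python under stated assumptions: `run_cgb_json` and `loq_positions` are hypothetical helper names, the command line is built only from the flags used in the command above, and `SAMPLE` is a shortened stand-in (first 9 entries of each array) for the real output so that the example runs without a cgb binary.

```python
import json
import subprocess


def run_cgb_json(cgf_path, tilepath, start_tilestep=0, cgb="./cgb"):
    """Run cgb with the flags from the command above and parse its stdout."""
    argv = [cgb, "-i", cgf_path, "-p", str(tilepath),
            "-s", str(start_tilestep), "-B"]
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def loq_positions(report):
    """For each tile path key, list the array indices whose loq_info entry
    is non-empty, one list per allele."""
    return {
        key: [[i for i, info in enumerate(allele) if info]
              for allele in entry["loq_info"]]
        for key, entry in report.items()
    }


# Shortened stand-in for real output (assumption: same shape as shown above).
SAMPLE = json.loads("""
{ "035e": { "tilepath": 862, "start_tilestep": 0,
    "allele": [ [79, 8, 0, 0, 0, 0, 0, -1, 0], [79, 2, 0, 0, 0, 0, 0, -1, 0] ],
    "loq_info": [ [[], [], [], [], [], [], [903, 1], [], [16, 1]],
                  [[], [], [], [], [], [], [903, 1], [], [16, 1]] ] } }
""")

for key, entry in SAMPLE.items():
    # In the sample, the key is the tile path number in hex: 862 == 0x35e.
    assert int(key, 16) == entry["tilepath"]

print(loq_positions(SAMPLE))
```

With a built cgb and the sample data file, `run_cgb_json("data/hu826751-GS03052-DNA_B01.cgf", 862)` would be expected to return the full structure shown above.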

Tile Path Compact Representation

Get tile path 862 (0x35e, which is chrM), starting at tile 0 and including low-quality information, and print it in the compact representation.

$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -p 862 -s 0 -B -L -k
[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1]
[ 79 2 0 0 0 0 0 -1 0 0 0 390 0 0 0 0 0 1 0 0 0 0 0 0 26 0 0 1 0 0 -1 34 -1 185 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
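
The compact lines are bracketed, whitespace-separated integers, so a short tokenizer is enough to turn them into nested lists. This is a minimal sketch in Python rather than anything shipped with cgb; `parse_compact` is a hypothetical name and the grammar is inferred only from the four lines of output above.

```python
import re


def parse_compact(line):
    """Parse one line of cgb compact output into nested Python lists."""
    stack = [[]]  # stack[0] is a dummy root that collects the top-level list
    for tok in re.findall(r"\[|\]|-?\d+", line):
        if tok == "[":
            stack.append([])
        elif tok == "]":
            done = stack.pop()
            if not stack:
                raise ValueError("unbalanced ']'")
            stack[-1].append(done)
        else:
            stack[-1].append(int(tok))
    root = stack[0]
    if len(stack) != 1 or len(root) != 1 or not isinstance(root[0], list):
        raise ValueError("expected exactly one bracketed list")
    return root[0]


# First allele line and first low-quality line, copied from the output above.
allele = parse_compact(
    "[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1]"
)
loq = parse_compact(
    "[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]"
)

print(len(allele), len(loq))
print([i for i, info in enumerate(loq) if info])
```

The same parser handles both the flat allele lines and the nested low-quality lines, which keeps downstream code independent of which of the two it is reading.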

Inspect Binary File (Debug)

Get a debugging printout of the information in the CGF file.

$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -D -i data/hg19.cgf
Magic: "cgf.b"{ (7b22622e66676322)
CGFVersion: 0.1.0
LibVersion: 0.1.0
PathCount: 863
TileMapLength: 7044
TileMap:
   [[0+1],[0+1]], [[0+1],[1+1]], [[1+1],[0+1]], [[1+1],[1+1]], [[0+1],[2+1]], [[2+1],[0+1]], [[0+1,0+1],[1+2]], [[1+2],[0+1,0+1]], [[0+2],[0+2]], [[0+1],[3+1]], [[3+1],[0+1]], [[1+1,0+1],[0+2]], [[0+2],[1+1,0+1]], [[0+1],[4+1]], [[4+1],[0+1]], [[1+2],[1+2]], [[2+1],[2+1]], [[1+1],[3+1]], [[3+1],[1+1]], [[1+1],[2+1]], [[2+1],[1+1]], [[0+1],[5+1]], [[5+1],[0+1]], [[0+1],[6+1]], [[6+1],[0+1]], [[0+1,0+1],[2+2]], [[2+2],[0+1,0+1]], [[0+1,0+1],[3+2]], [[3+2],[0+1,0+1]], [[3+1],[3+1]], [[0+1],[7+1]], [[7+1],[0+1]],
...

  035e.Loq.LoqFlagByteCount: 5
  035e.Loq.LoqFlag[5]:
     40 01 20 00 04

  035e.Loq.LoqInfoByteCount: 18
  035e.Loq.LoqInfo[18]:
     01 02 83 87 01 01 02 10 01 01 02 60 01 01 02 81 23 02
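
Two of the fields in this dump can be cross-checked against the rest of the page with a few lines of Python. The sketch below rests on inferences, not on a format specification: it assumes `LoqFlag` is a bit vector read least significant bit first within each byte, and that the hex value printed next to `Magic` is the eight magic bytes read as a little-endian integer. Both readings are assumptions drawn from this one example; `set_bits_lsb_first` is a hypothetical helper name.

```python
def set_bits_lsb_first(data):
    """Return indices of set bits, counting bit 0 as the least
    significant bit of byte 0."""
    return [
        8 * byte_index + bit
        for byte_index, byte in enumerate(data)
        for bit in range(8)
        if (byte >> bit) & 1
    ]


# 035e.Loq.LoqFlag[5] from the dump above.
loq_flag = bytes.fromhex("40 01 20 00 04")
flagged = set_bits_lsb_first(loq_flag)
print(flagged)

# Magic string and the hex value printed beside it in the dump.
magic_text = b'"cgf.b"{'
magic_value = int.from_bytes(magic_text, "little")
print(hex(magic_value))
```

Read this way, the set bits fall at indices 6, 8, 21 and 34, the same positions that carry non-empty low-quality entries in the tile path 862 output earlier on this page, and the magic bytes reproduce 0x7b22622e66676322.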

Updated by Abram Connelly about 8 years ago · 3 revisions