Cgb » History » Version 1

Abram Connelly, 11/15/2016 04:41 PM

h1. cgb

@cgb@ is a tool to help with access to the binary compact genome format (CGF).  The tool is still in the prototyping stage.

Code for @cgb@ can be found on [github.com/abeconnelly/cgf](https://github.com/abeconnelly/cgf).

h2. Quick start

<pre>
$ git clone https://github.com/abeconnelly/cgf
$ cd cgf/cpp
$ ./cmp.sh
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -s 0 -B -k -p 862
[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1]
[ 79 2 0 0 0 0 0 -1 0 0 0 390 0 0 0 0 0 1 0 0 0 0 0 0 26 0 0 1 0 0 -1 34 -1 185 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
</pre>

h2. Brief overview

@cgb@ is meant to help debug and inspect CGF files. It has two main features: reporting the contents of a CGF file in terms of tile variants and low-quality information, and performing some basic tile concordance operations. The code that @cgb@ uses is shared with the Lightning CGF server, so @cgb@ is also meant in part to test functionality used there.

h3. Concordance

A CGF file holds different 'tiers' of information: a bit vector representing whether each tile is canonical, a cache holding the first 8 tile variants, and overflow tables that are used if the cache is exceeded. For testing and for rough estimates, @cgb@ supports different 'levels' of concordance:

* Level 0 - compare canonical tiles only
* Level 1 - compare canonical tiles and cache
* Level 2 - a full tile concordance

h5. Example

<pre>
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -i data/hg19.cgf -l 0
level: 0, canonical match: 6491163
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -i data/hg19.cgf -l 1
level: 1, canonical+cache match: 6519788, loq: 148760
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -i data/hg19.cgf -l 2
#match_tot: 6610685
</pre>
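
The three levels can be pictured with a toy model. The sketch below is not the @cgb@ implementation and does not read CGF files. It assumes that each tile position is tagged with the tier that stores it (the labels @canon@, @cache@ and @overflow@ are invented here) together with a tile variant id, and that each level also counts everything the lower levels count. Low-quality information is left out.

```python
from typing import List, Tuple

# Hypothetical tier labels for this sketch; not identifiers from the cgf code base.
CANON, CACHE, OVERFLOW = "canon", "cache", "overflow"

# Tiers each concordance level may consult (assumed to be cumulative).
TIERS_BY_LEVEL = {
    0: {CANON},
    1: {CANON, CACHE},
    2: {CANON, CACHE, OVERFLOW},
}

Tile = Tuple[str, int]  # (tier that stores the tile, tile variant id)


def concordance(a: List[Tile], b: List[Tile], level: int) -> int:
    """Count positions where both genomes hold the same variant,
    looking only at tiles stored in the tiers allowed for `level`."""
    allowed = TIERS_BY_LEVEL[level]
    matches = 0
    for (tier_a, var_a), (tier_b, var_b) in zip(a, b):
        if tier_a in allowed and tier_b in allowed and var_a == var_b:
            matches += 1
    return matches


# Made-up data: four tile positions in two genomes.
genome_a = [(CANON, 0), (CACHE, 3), (OVERFLOW, 12), (CANON, 0)]
genome_b = [(CANON, 0), (CACHE, 3), (OVERFLOW, 12), (CACHE, 1)]

for level in (0, 1, 2):
    print(level, concordance(genome_a, genome_b, level))
```

In this model the count can only grow as the level increases, which is the same pattern as in the three match counts reported above.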

h3. CGF Inspection

h5. JSON Tile Path information

Get tile path 862 (0x35e, which is chrM), starting at tile 0 and including low-quality information, and print it in JSON format.

<pre>
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -p 862 -s 0 -B
{
  "035e":{
    "tilepath":862,
    "start_tilestep":0,
    "allele":[
      [ 79, 8, 0, 0, 0, 0, 0, -1, 0, 0, 0, 389, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 34,
      -1, 185, 1 ],
      [ 79, 2, 0, 0, 0, 0, 0, -1, 0, 0, 0, 390, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 26, 0, 0, 1, 0, 0, -1, 34,
      -1, 185, 1 ]
    ],
    "loq_info":[
      [ [ ], [ ], [ ], [ ], [ ], [ ], [ 903, 1 ], [ ], [ 16, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ 96, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ],
        [ ], [ ], [ 291, 2 ] ],
      [ [ ], [ ], [ ], [ ], [ ], [ ], [ 903, 1 ], [ ], [ 16, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ 96, 1 ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ],
        [ ], [ ], [ 291, 2 ] ]
    ]
  }
}
</pre>
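
Because this output is plain JSON, it can be post-processed with any JSON library. The sketch below lists, for each allele, the tile steps that carry low-quality information. The function name @loq_tilesteps@ is made up for this example, the sample is an abbreviated stand-in with the same keys as the output above, and the code assumes that entry @i@ of a @loq_info@ list belongs to tile step @start_tilestep + i@.

```python
import json
from typing import Dict, List


def loq_tilesteps(cgb_json: str) -> Dict[str, List[List[int]]]:
    """For each tile path, list per allele the tile steps whose
    loq_info entry is non-empty (assumed index-to-tile-step mapping)."""
    result = {}
    for path_hex, path in json.loads(cgb_json).items():
        start = path["start_tilestep"]
        result[path_hex] = [
            [start + i for i, entry in enumerate(allele_loq) if entry]
            for allele_loq in path["loq_info"]
        ]
    return result


# Abbreviated, made-up sample with the same shape as the cgb output.
sample = """
{
  "035e":{
    "tilepath":862,
    "start_tilestep":0,
    "allele":[ [ 79, 8, 0 ], [ 79, 2, 0 ] ],
    "loq_info":[ [ [ ], [ 903, 1 ], [ ] ], [ [ ], [ 903, 1 ], [ ] ] ]
  }
}
"""

print(loq_tilesteps(sample))  # {'035e': [[1], [1]]}
```

The top-level key is the tile path in hexadecimal, so @int("035e", 16)@ gives back 862.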

h5. Tile Path Compact Representation

Get tile path 862 (0x35e, which is chrM), starting at tile 0 and including low-quality information, and print it in the compact representation.

<pre>
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -p 862 -s 0 -B -L -k
[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1]
[ 79 2 0 0 0 0 0 -1 0 0 0 390 0 0 0 0 0 1 0 0 0 0 0 0 26 0 0 1 0 0 -1 34 -1 185 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
</pre>
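
The compact lines differ from JSON arrays only in using spaces instead of commas, so they can be converted with two substitutions. This is a minimal sketch based on the lines shown above, not a parser shipped with @cgb@. The helper name @parse_compact@ is hypothetical, and the sample lines are shortened lines in the same format.

```python
import json
import re


def parse_compact(line: str):
    """Turn one line of space-separated bracket output into nested lists."""
    # Put a comma between two numbers that are separated by whitespace.
    text = re.sub(r"(?<=\d)\s+(?=[-\d])", ",", line.strip())
    # Put a comma between a closing bracket and the next opening bracket.
    text = re.sub(r"\]\s*\[", "],[", text)
    return json.loads(text)


# Shortened lines in the same format as the cgb output.
print(parse_compact("[ 79 8 0 -1 34 -1 185 1]"))
print(parse_compact("[[ ][ ][ 903 1 ][ ][ 16 1 ]]"))
```

The first call yields a flat list of integers and the second a list of lists in which an empty list marks a tile without low-quality information.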

h5. Inspect Binary File (Debug)

Get a debugging printout of the information in the CGF file.

<pre>
$ ./cgb -i data/hu826751-GS03052-DNA_B01.cgf -D -i data/hg19.cgf
Magic: "cgf.b"{ (7b22622e66676322)
CGFVersion: 0.1.0
LibVersion: 0.1.0
PathCount: 863
TileMapLength: 7044
TileMap:
   [[0+1],[0+1]], [[0+1],[1+1]], [[1+1],[0+1]], [[1+1],[1+1]], [[0+1],[2+1]], [[2+1],[0+1]], [[0+1,0+1],[1+2]], [[1+2],[0+1,0+1]], [[0+2],[0+2]], [[0+1],[3+1]], [[3+1],[0+1]], [[1+1,0+1],[0+2]], [[0+2],[1+1,0+1]], [[0+1],[4+1]], [[4+1],[0+1]], [[1+2],[1+2]], [[2+1],[2+1]], [[1+1],[3+1]], [[3+1],[1+1]], [[1+1],[2+1]], [[2+1],[1+1]], [[0+1],[5+1]], [[5+1],[0+1]], [[0+1],[6+1]], [[6+1],[0+1]], [[0+1,0+1],[2+2]], [[2+2],[0+1,0+1]], [[0+1,0+1],[3+2]], [[3+2],[0+1,0+1]], [[3+1],[3+1]], [[0+1],[7+1]], [[7+1],[0+1]],
...

  035e.Loq.LoqFlagByteCount: 5
  035e.Loq.LoqFlag[5]:
     40 01 20 00 04

  035e.Loq.LoqInfoByteCount: 18
  035e.Loq.LoqInfo[18]:
     01 02 83 87 01 01 02 10 01 01 02 60 01 01 02 81 23 02

</pre>
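
Two of the values in this printout can be cross-checked by hand. The sketch below is a reading of this particular output, not a description of the CGF specification. It assumes that the @Magic@ number is the eight printed ASCII bytes read as a little-endian 64-bit integer, and that @LoqFlag@ holds one bit per tile, counted from the least significant bit of the first byte. The helper name @set_bits@ is made up for this example.

```python
from typing import List

# The eight ASCII bytes printed for Magic, read as a little-endian
# 64-bit integer, give the hexadecimal value shown in parentheses.
magic = b'"cgf.b"{'
magic_value = int.from_bytes(magic, "little")
print(hex(magic_value))  # 0x7b22622e66676322


def set_bits(flag_bytes: bytes) -> List[int]:
    """Indices of set bits, counting from the least significant
    bit of the first byte (assumed bit order)."""
    return [
        8 * i + bit
        for i, byte in enumerate(flag_bytes)
        for bit in range(8)
        if (byte >> bit) & 1
    ]


# LoqFlag bytes for tile path 035e, copied from the printout.
loq_flag = bytes.fromhex("40 01 20 00 04")
print(set_bits(loq_flag))  # [6, 8, 21, 34]
```

Under these assumptions the set bits fall on positions 6, 8, 21 and 34, which are the positions of the non-empty low-quality entries in the tile path 862 output shown earlier.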