Nancy Ouyang, 10/13/2015 12:09 PM

h1. Lightning

Lightning is software designed to enable fast queries and machine learning on genomic data. We currently focus on human whole genomes that have been aligned and called by external software and then imported into Lightning. From there, Lightning:

* Stores quality information
* Stores phased and unphased genomes
* Allows fast retrieval of called sequences from regions of interest
* Defines flexible queries:
** Filters by subsets of the population
** And/or by specific regions of interest
* Normalizes standard called genome files (such as VCF and gVCF), such that each variant is expressed in the same way
* Incorporates new data quickly and painlessly
* Stores annotations from ClinVar and annotation pipelines, such as CAVA

Lightning is made possible by tiling, which takes advantage of the high degree of redundancy in a population of genomes. Tiling partitions genomes into tiles: overlapping, variable-length sequences that begin and end with unique k-mers, termed tags. Once a genome has been tiled, the sequences for each tile are stored in a tile library. These sequences can be annotated using Annotile. The tiled genomes are stored in Compact Genome Format (CGF) files, which are loaded into Lantern, our in-memory database designed to respond to queries quickly. Finally, Sprite is a web browser application for interacting with Lightning.

h2. Overview of methodology

Stated another way, Lightning’s basic method is to treat short snippets of genomic sequence as the basic building blocks of genomes. These snippets are of variable length, but most are around 250 base pairs long. Splitting genomes into short segments saves storage, because only a single copy of each redundant sequence needs to be kept.

Each genome is partitioned into these short segments, called tiles. From all tiles in a population, a tile library can be constructed. Tiles are chosen to have 24-mer tags on either end that overlap with neighboring tiles. Tags are chosen subject to a uniqueness constraint and provide convenient anchor points to differentiate tiles from one another.
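
The partitioning step can be sketched as follows. This is only an illustration, not Lightning's implementation: the function name @tile_sequence@ is hypothetical, the toy tags are 4 bases long rather than 24 so the example stays readable, and sequence before the first tag or after the last tag is simply ignored.

```python
def tile_sequence(sequence, tags):
    """Split `sequence` into tiles that begin and end on a tag.

    Hypothetical helper for illustration. `tags` is a set of equal-length
    strings. Neighboring tiles share the tag that separates them, so
    consecutive tiles overlap by exactly one tag length.
    """
    tag_len = len(next(iter(tags)))
    # Every position in the sequence where one of the known tags occurs.
    positions = [
        i
        for i in range(len(sequence) - tag_len + 1)
        if sequence[i:i + tag_len] in tags
    ]
    # A tile runs from the start of one tag to the end of the next tag.
    return [
        sequence[start:end + tag_len]
        for start, end in zip(positions, positions[1:])
    ]


# Toy data: three 4-mer tags standing in for 24-mer tags.
toy_tags = {"ACGT", "TTGA", "CCAG"}
toy_sequence = "ACGT" + "GGGCA" + "TTGA" + "ATATAT" + "CCAG"

for tile in tile_sequence(toy_sequence, toy_tags):
    print(tile)
# ACGTGGGCATTGA
# TTGAATATATCCAG
```

The second tile starts with the same tag (@TTGA@) that ends the first, which is the overlap described above.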

Currently, all tags are chosen so that every pair of tags is at an edit distance of at least 2 from each other. The tag set is fixed, and the tags act as anchor points for partitioning any future sample genomes to be analyzed.
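
A minimal sketch of checking this constraint on a candidate tag set is shown below. The text does not say which edit distance is meant, so Levenshtein distance (substitutions, insertions, and deletions) is an assumption here. The function names are hypothetical, and the all-pairs comparison is only practical for small sets, not for a genome-scale tag set.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b` (assumed metric)."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def tags_are_well_separated(tags, min_distance=2):
    """True if every pair of tags is at least `min_distance` edits apart."""
    return all(
        edit_distance(a, b) >= min_distance
        for a, b in combinations(tags, 2)
    )


print(tags_are_well_separated(["ACGT", "AGGA", "TTTT"]))  # True
print(tags_are_well_separated(["ACGT", "ACGA"]))          # False: one substitution apart
```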

The hope is that tiles, along with information on the population used to generate them, can also be used to aid in read placement.

Because most genomic sequences are redundant, duplicate tiles need not be stored in a population of genomic sequences. At each tile position, multiple tile variants are stored, representing the variation in the population for that tile. Given a partitioned genomic sequence and a tile library, a compact representation of a genome can be constructed by storing, for each tile position in order, the number of the tile variant that the genome carries.
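
The idea can be sketched with a toy tile library. This is a simplified illustration and not the actual CGF layout: the names @encode_genome@ and @decode_genome@ are hypothetical, the library is modeled as a plain list of variant lists indexed by tile position, and variant numbers are assigned in order of first appearance.

```python
def encode_genome(tiles, library):
    """Return the variant number of each tile, adding unseen variants to `library`.

    `tiles` lists a genome's tile sequences in tile-position order.
    `library[position]` is the list of distinct sequences seen so far
    at that position (a hypothetical, simplified tile library).
    """
    variant_numbers = []
    for position, tile in enumerate(tiles):
        if position == len(library):
            library.append([])      # first genome to reach this position
        variants = library[position]
        if tile not in variants:
            variants.append(tile)   # store each distinct sequence only once
        variant_numbers.append(variants.index(tile))
    return variant_numbers


def decode_genome(variant_numbers, library):
    """Recover the tile sequences from the compact representation."""
    return [library[pos][num] for pos, num in enumerate(variant_numbers)]


library = []
genome_a = ["ACGTGGGCATTGA", "TTGAATATATCCAG"]
genome_b = ["ACGTGGGCATTGA", "TTGAATCTATCCAG"]  # differs only in the second tile

print(encode_genome(genome_a, library))  # [0, 0]
print(encode_genome(genome_b, library))  # [0, 1]
print(len(library[0]), len(library[1]))  # 1 2
```

The shared first tile is stored once, and each genome reduces to a short list of integers that @decode_genome@ can expand back into tile sequences.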