Abram Connelly, 07/11/2017 02:55 PM

h1. Lightning

Lightning is software designed to enable fast queries and machine learning on genomic data. We currently focus on human whole genomes that are aligned and called by external software and then imported into Lightning. From there, Lightning:

* Stores quality information
* Stores phased and unphased genomes
* Allows fast retrieval of called sequences from regions of interest
* Defines flexible queries:
** Filters by subsets of the population
** And/or by specific regions of interest
* Normalizes standard called genome files (such as VCF and gVCF), so that each variant is expressed in the same way
* Incorporates new data quickly and painlessly
* Stores annotations from ClinVar and annotation pipelines, such as CAVA

Lightning is made possible by the process of tiling, which takes advantage of the high degree of redundancy in a population of genomes. Tiling partitions genomes into tiles: overlapping, variable-length sequences that begin and end with unique k-mers, termed tags. Once a genome has been tiled, the sequences for each tile are stored in a tile library. These sequences may be annotated by using Annotile. The tiled genomes are stored in Compact Genome Format (CGF) files. Genomes stored as CGF files are loaded into Lantern, which is our in-memory database designed to respond to queries quickly. Finally, Sprite is a web browser application for interacting with Lightning.

h2. Overview of methodology

Put another way, Lightning treats short snippets of genomic sequence as the basic building blocks of genomes. These snippets vary in length, but most are about 250 base pairs long. Splitting genomes into short segments saves storage, because only a single copy of each redundant sequence needs to be stored.

Each genome is partitioned into these short segments (tiles). From all tiles in a population, a tile library can be constructed. Tiles are chosen to have 24-mer tags on either end that overlap with neighboring tiles. Tags are chosen under a uniqueness constraint and provide convenient anchor points to differentiate tiles from one another.
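
As a rough illustration of how tags partition a sequence, the sketch below cuts a toy sequence into overlapping tiles. It is not Lightning's implementation: the function name @tile_sequence@, the 4-mer tags (real tags are 24-mers), and the assumption that every tag occurs exactly once and in genomic order are hypothetical simplifications.

```python
def tile_sequence(seq, tags):
    """Split seq into overlapping tiles anchored on tags.

    Illustrative assumptions: all tags have the same length, each tag
    occurs exactly once in seq, and tags are listed in genomic order.
    Sequence before the first tag or after the last tag is ignored.
    """
    k = len(tags[0])
    starts = []
    for tag in tags:
        pos = seq.find(tag)
        if pos < 0:
            raise ValueError(f"tag {tag!r} not found in sequence")
        starts.append(pos)
    if starts != sorted(starts):
        raise ValueError("tags are not in genomic order")

    tiles = []
    # Each tile runs from the start of one tag to the end of the next tag,
    # so neighboring tiles share the k bases of the tag between them.
    for left, right in zip(starts, starts[1:]):
        tiles.append(seq[left:right + k])
    return tiles


toy_seq = "ACGTTTGACCGGAATTCAGT"
toy_tags = ["ACGT", "CCGG", "CAGT"]  # toy 4-mers standing in for 24-mer tags
toy_tiles = tile_sequence(toy_seq, toy_tags)
```

Here @toy_tiles@ holds two tiles, and the last four bases of the first tile are the first four bases of the second, which mirrors the overlap on tags described above.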

Currently, all tags are chosen to be an edit distance of at least 2 from each other. The tag set is fixed, and the tags act as anchor points for partitioning future sample genomes that are to be analyzed.
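
The separation constraint can be checked with a pairwise comparison. The sketch below assumes that "edit distance" means Levenshtein distance (insertions, deletions, and substitutions); the text above does not say which metric is used, and the helper names @edit_distance@ and @tags_are_well_separated@ are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def tags_are_well_separated(tags, min_distance=2):
    """True if every pair of tags is at least min_distance edits apart."""
    return all(edit_distance(a, b) >= min_distance
               for a, b in combinations(tags, 2))


# Toy 4-mers for illustration; real tags are 24-mers.
good_tags = ["ACGT", "CCGG", "CAGT"]
bad_tags = ["ACGT", "ACGA"]  # differ by a single substitution
```

The pairwise check is quadratic in the number of tags, which is fine for a sketch but would likely need indexing or pruning for a genome-scale tag set.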

The hope is that tiles, along with information on the population used to generate them, can also be used to aid in read placement.

Because most genomic sequences are redundant, duplicate tiles need not be stored for a population of genomic sequences. At each tile position, multiple tile variants are stored, representing the variation in the population for that tile. Given a partitioned genomic sequence and a tile library, a compact representation of a genome can be constructed by storing the variant numbers contiguously.
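
To make the idea concrete, the sketch below builds a toy tile library and encodes each genome as a list of variant numbers, one per tile position. It only illustrates the principle: it is not the CGF format, and the names @build_tile_library@, @decode@, and the sample data are hypothetical.

```python
def build_tile_library(tiled_genomes):
    """Build a tile library and encode genomes as variant numbers.

    tiled_genomes maps a sample name to its list of tile sequences,
    one per tile position (all samples are assumed to have the same
    number of positions). Returns (library, encoded), where
    library[pos] lists the distinct sequences seen at that position
    and encoded[name][pos] is an index into library[pos].
    """
    n_positions = len(next(iter(tiled_genomes.values())))
    library = [[] for _ in range(n_positions)]
    encoded = {}
    for name, tiles in tiled_genomes.items():
        if len(tiles) != n_positions:
            raise ValueError(f"{name} has {len(tiles)} tiles, expected {n_positions}")
        variants = []
        for pos, tile_seq in enumerate(tiles):
            known = library[pos]
            if tile_seq not in known:
                known.append(tile_seq)  # store each distinct tile only once
            variants.append(known.index(tile_seq))
        encoded[name] = variants
    return library, encoded


def decode(library, variants):
    """Recover the tile sequences of a genome from its variant numbers."""
    return [library[pos][v] for pos, v in enumerate(variants)]


samples = {
    "sample_a": ["ACGTTTGACCGG", "CCGGAATTCAGT"],
    "sample_b": ["ACGTTAGACCGG", "CCGGAATTCAGT"],  # one substitution in tile 0
    "sample_c": ["ACGTTTGACCGG", "CCGGAATTCAGT"],  # identical to sample_a
}
library, encoded = build_tile_library(samples)
```

In this toy population, three genomes with two tile positions each need only three stored tile sequences, and each genome reduces to a short list of integers such as @[1, 0]@ for @sample_b@.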

h2. Motivation

We developed Lightning because merging VCFs, querying subsets of a population, finding poorly sequenced regions, and similar tasks were difficult and time-consuming. After using various ad hoc solutions, we stepped back and committed time and effort to developing a more sensible and sustainable solution. We hope it will be useful to the broader research community and welcome your feedback.

h2. Corresponding Author

awz@curoverse.com

---

h2. 3rd Party Libraries Used by Lightning

h3. github.com/curoverse/l7g

* "minimal (css) by orderedlist":https://github.com/orderedlist/minimal CC-BY-SA
  - note: gives credit and license in @login page@
* "Login by Marco Biedermann":https://codepen.io/marcobiedermann/pen/Fybpf GPL (v?)
  - note: has gpl in @login-license@ file
* "SO question 2144386 by Oscar":http://stackoverflow.com/questions/2144386/javascript-delete-cookie CC-BY-SA (as per SO's policy)
  - Code snippet with proper attribution
  - "Stack Overflow's Legal Policy":https://stackexchange.com/legal

h3. github.com/curoverse/glfd

* "Dan Eden Animate CSS":http://daneden.me/animate MIT
  - note: in @html/index.html@, accreditation referenced in file

h3. github.com/curoverse/cgf

* "Duktape":http://duktape.org/ MIT
  - note: available in @cpp/muduk/lib@. The Duktape source includes its license and attribution
* "Dan Eden Animate CSS":http://daneden.me/animate MIT
  - note: in @cpp/muduk/html/index.html@, accreditation referenced in file

h3. github.com/curoverse/lightning

See repository for details

h3. github.com/curoverse/l7g-p7e-untap

* "Dan Eden Animate CSS":http://daneden.me/animate MIT
  - note: in @html/index.html@, accreditation referenced in file

h3. github.com/curoverse/l7g-v5t-clinvar

* "Dan Eden Animate CSS":http://daneden.me/animate MIT
  - note: in @html/index.html@, accreditation referenced in file

h3. github.com/curoverse/lci

* "jquery":https://jquery.org MIT
  - note: using @jquery.min.js@ in @html/js@ directory
* "bootstrap":http://getbootstrap.com MIT
  - note: using @bootstrap.min.js@ in @html/js@ directory
* bootstrap css & html MIT
  - note: using "dashboard.css":https://getbootstrap.com/examples/dashboard/
* "Dan Eden Animate CSS":http://daneden.me/animate MIT
  - note: in @html/index.html@, accreditation referenced in file