Future sprints (Lightning)

over 4 years late (10/13/2020) (Sprint start date 09/29/2020)

90%

29 issues (26 closed, 3 open)

Lightning

Lightning is software designed to enable fast queries and machine learning on genomic data. The genomic data we currently focus on are human whole genomes that have been aligned and called by external software and then imported into Lightning. From there, Lightning:

  • Stores quality information
  • Stores phased and unphased genomes
  • Allows fast retrieval of called sequences from regions of interest
  • Defines flexible queries:
      • Filters by subsets of the population
      • And/or by specific regions of interest
  • Normalizes standard called genome files (such as VCF and gVCF), such that each variant is expressed in the same way
  • Incorporates new data fast and painlessly
  • Stores annotations from ClinVar and annotation pipelines, such as CAVA

Lightning is made possible by the process of tiling, which takes advantage of the high degree of redundancy in a population of genomes. Tiling partitions genomes into tiles: overlapping, variable-length sequences that begin and end with unique k-mers, termed tags. Once a genome has been tiled, the sequences for each tile are stored in a tile library. These sequences may be annotated by using Annotile. The tiled genomes are stored in Compact Genome Format (CGF) files. Genomes stored as CGF files are loaded into Lantern, which is our in-memory database designed to respond to queries quickly. Finally, Sprite is a web browser application for interacting with Lightning.
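As a rough illustration of the tiling step described above, the following minimal sketch cuts a sequence into overlapping tiles that begin and end with a tag. The function name `tile_sequence`, the toy 4-mer tags, and the example sequence are assumptions made for readability; they are not Lightning's actual code or data, and the sketch ignores no-calls, phasing, and tags that are missing from a sample.

```python
def tile_sequence(seq, tags, k):
    """Split seq into overlapping tiles that each begin and end with a tag.

    Hypothetical helper for illustration only. All tags are assumed to
    have the same length k.
    """
    tag_set = set(tags)
    # Record every position in seq where a known tag occurs.
    positions = [i for i in range(len(seq) - k + 1) if seq[i:i + k] in tag_set]
    tiles = []
    # A tile runs from the start of one tag to the end of the next tag,
    # so neighboring tiles share (overlap on) the tag between them.
    for start, next_start in zip(positions, positions[1:]):
        tiles.append(seq[start:next_start + k])
    return tiles


# Toy example with 4-mer tags so the output stays readable.
toy_tags = ["ACGT", "CCGG", "TTAA"]
toy_seq = "ACGTGAGACCGGTCTCTCTTAA"
toy_tiles = tile_sequence(toy_seq, toy_tags, k=4)
print(toy_tiles)  # ['ACGTGAGACCGG', 'CCGGTCTCTCTTAA']
```

Note that the two tiles have different lengths and that the last tag of the first tile is the first tag of the second.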

Overview of methodology

Stated another way, Lightning’s basic method is to treat short snippets of genomic sequence as the building blocks of genomes. These snippets vary in length, but most are around 250 base pairs long. Splitting genomes into short segments saves storage, because only a single copy of each redundant sequence needs to be stored.

Each genome is partitioned into these short segments, called tiles. From all tiles in a population, a tile library can be constructed. Tiles are chosen to have 24-mer tags on either end that overlap with neighboring tiles. Tags are chosen under a uniqueness constraint and provide convenient anchor points to differentiate tiles from one another.

Currently, all tags are chosen to be at an edit distance of at least 2 from each other. The tag set is fixed, and its tags act as anchor points for partitioning any future sample genomes that are to be analyzed.
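The separation constraint on tags can be checked with a short sketch. It assumes that "edit distance" means Levenshtein distance (insertions, deletions, and substitutions), which the text does not state explicitly, and the helper names `edit_distance` and `tags_are_separated` are hypothetical. The pairwise check is quadratic in the number of tags, so it is meant to convey the idea rather than to scale to a genome-wide tag set.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tags_are_separated(tags, min_distance=2):
    """True if every pair of tags is at least min_distance edits apart."""
    return all(edit_distance(a, b) >= min_distance
               for a, b in combinations(tags, 2))


# Toy 4-mer tags for readability; the tags described above are 24-mers.
print(tags_are_separated(["AAAA", "CCCC", "GGGG"]))  # True
print(tags_are_separated(["ACGT", "ACGA", "TTTT"]))  # False: ACGT and ACGA differ by one edit
```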

The hope is that tiles, along with information on the population used to generate them, can also be used to aid in read placement.

Because most genomic sequences are redundant, duplicate tiles need not be stored in a population of genomic sequences. At each tile position, multiple tile variants are stored representing the variation in a population for that tile. Given a partitioned genomic sequence and a tile library, a compact representation of a genome can be constructed by storing the variant numbers contiguously.
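The idea of storing variant numbers instead of sequences can be sketched as follows. This is a simplified illustration, not the CGF format: the function names, the sample names, and the toy tiles are assumptions, and the sketch assumes that every genome has exactly one tile at every tile position (no phasing, no-calls, or missing tags).

```python
def build_library(tiled_genomes):
    """Build a tile library and encode each genome as variant numbers.

    tiled_genomes maps a sample name to its list of tile sequences,
    one per tile position (hypothetical input layout).
    """
    n_positions = len(next(iter(tiled_genomes.values())))
    library = [[] for _ in range(n_positions)]  # library[pos] = distinct tile variants
    encoded = {}
    for name, tiles in tiled_genomes.items():
        variants = []
        for pos, tile in enumerate(tiles):
            if tile not in library[pos]:
                library[pos].append(tile)  # store each distinct sequence only once
            variants.append(library[pos].index(tile))
        encoded[name] = variants
    return library, encoded


def decode(library, variants):
    """Recover the tile sequences of one genome from its variant numbers."""
    return [library[pos][v] for pos, v in enumerate(variants)]


def query(encoded, samples, first_pos, last_pos):
    """Variant numbers for a subset of samples over a range of tile positions."""
    return {s: encoded[s][first_pos:last_pos + 1] for s in samples}


genomes = {
    "sample_a": ["ACGTGAGACCGG", "CCGGTCTCTCTTAA"],
    "sample_b": ["ACGTGATACCGG", "CCGGTCTCTCTTAA"],  # differs from sample_a in the first tile
    "sample_c": ["ACGTGAGACCGG", "CCGGTCTCTCTTAA"],
}
library, encoded = build_library(genomes)
print(encoded)  # {'sample_a': [0, 0], 'sample_b': [1, 0], 'sample_c': [0, 0]}
print(query(encoded, ["sample_a", "sample_b"], 0, 0))  # {'sample_a': [0], 'sample_b': [1]}
```

In this toy population, six tiles are represented by three stored sequences plus two small integers per genome, and restricting a query to a subset of samples or a region reduces to selecting keys and slicing lists.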

Motivation

We developed Lightning in response to the difficulty and time-consuming nature of merging VCFs, querying subsets of a population, finding poorly sequenced regions, and similar issues. After using various ad-hoc solutions, we eventually stepped back and committed time and effort to developing a more sensible and sustainable solution. We hope it will be useful to the broader research community and welcome your feedback.

Corresponding Author

Time tracking
Estimated time 8.00 hours
Issues by
Bug 2/2
Task 4/4
Idea 17/17

Related issues
# Subject Story Points
Idea #5056 Lightning deployment plan / Uploading data specifications 0.5
Idea #7488 gVCF/VCF Tyler Specification 1.0
Idea #5002 Implement Annotation database in Sprite
Idea #5983 align2gvcf test suite 2.0
Idea #7441 [API] Implement GET /callsets/{callset-name}/vcf API call
Idea #5470 [API] Implement "GET /assemblies" API Calls
Idea #3538 Ensure GFF to FASTJ in Lightning git repo
Idea #3536 FASTQ to FASTJ Implementation
Bug #5302 CGF path vectors have no-calls on end 2.0
Bug #5903 [gff2fj] Does not correctly label no-calls at end of cytogenetic band
Idea #5904 [gff2fj] Add in error checks to check common errors and do sanity checks on resulting FastJ files
Idea #6653 Implement ProbabilisticPCA on Tiled Genomes 2.5
Idea #5336 Specify/Document how to use Lightning for family analysis
Idea #5526 Specify lifting-over assemblies that change the tags 3.0
Idea #7440 Build support for lifting-over assemblies that change the tags
Idea #7079 [Tyler] Implementation 3.0
Idea #4070 [Sprite] Polyphen data addition to annotations
Idea #13376 Test the effect of phasing imputation workflow
Idea #13216 Write phasing imputation workflow
Total story points: 14.0