A Standard Database Architecture Will Build a Stronger Foundation for Genome Discoveries


The general adoption of the Genomically-Ordered Relational database (GOR) as a data standard for storing genomic data may greatly accelerate the spread of sequencing and its effectiveness as a tool for advancing medicine.

It is widely accepted that the ability to share the analysis and insights from DNA sequencing will be a key driver of discovery and innovation. But one current limitation to extending this knowledge is that sequencing and analysis platforms, as well as samples, are often proprietary to and stored at different institutions. Perhaps more important, the structures and formats in which genomic data has customarily been stored (the relational databases developed by the likes of IBM and Oracle) make the data unwieldy to analyze as its volume grows, and very difficult to share. The upshot is that institutions cannot easily share and consolidate information to generate more robust analyses and clinically relevant insights. This presents a serious hurdle to discovery both in rare disorders, where samples need to be pooled in order to generate adequate analytical power, and in complex ones, where truly massive studies can tease apart different facets of disease and reveal their causes.

Over the past decade, a novel and comprehensive database model has been developed to remove this bottleneck in a flexible and fast way. It is called the Genomically-Ordered Relational database, or GOR, and was designed to manage and query the detailed genomic data amassed by deCODE genetics in Iceland, the world’s first and still by far the largest and most comprehensive population-based genomic database.

The thinking behind the GOR is as simple as it is revolutionary. Genomic data is a sort of big data, but one with an important difference: it is divided up into distinct packets, the chromosomes, and then arranged within each chromosome in linear fashion. The GOR makes use of this by storing and querying sequence data according to its unique position in the genome, rather than as huge files that each span an entire sequence. This radically reduces the data burden of querying even large numbers of whole genomes, while making it possible to store and instantly visualize the raw sequence underlying an analysis.
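To make the idea of position-ordered storage concrete, here is a minimal sketch in Python. It is not the GOR implementation or its query language. The rows, sample IDs and the `query_region` helper are invented for illustration. The sketch only assumes that rows are kept sorted by chromosome and position, so that a region can be located by binary search instead of scanning whole genomes.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows (chromosome, position, sample_id, genotype), invented for illustration.
rows = [
    ("chr1", 1500, "S2", "A/G"),
    ("chr2", 300, "S1", "C/T"),
    ("chr1", 1000, "S1", "A/A"),
    ("chr1", 2500, "S1", "G/G"),
    ("chr2", 900, "S2", "T/T"),
]

# Keep rows in genomic order: by chromosome, then by position.
# Note: chromosome names are compared as plain strings here, which is
# good enough for a sketch but is not a natural chromosome ordering.
rows.sort(key=lambda r: (r[0], r[1]))
keys = [(r[0], r[1]) for r in rows]

def query_region(chrom, start, end):
    """Return rows on `chrom` with start <= position <= end via binary search."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return rows[lo:hi]

hits = query_region("chr1", 1000, 2000)
for row in hits:
    print(row)
```

Because the data is ordered, the cost of the lookup depends on the size of the region requested rather than on the total amount of sequence stored, which is the property the paragraph above describes.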

In practice, the GOR enables researchers to home in on specific variants without first having to call up entire patient genomes, and it separates raw data from annotations so that a query touches only the most relevant components. Functions and features of this kind can be applied consistently across data storage systems, allowing more multi-institutional, collaborative research and more consistent outcomes worldwide.
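One way to picture keeping raw data separate from annotations is a join performed at query time. The sketch below is an illustration only, not the GOR's actual mechanism. The tables, gene labels and the `join_by_position` function are hypothetical. It assumes both tables are already sorted by chromosome and position and that the annotation table has at most one row per position.

```python
def join_by_position(variants, annotations):
    """Merge-join two position-sorted lists on (chromosome, position).

    Assumes at most one annotation row per position (an assumption of this sketch).
    """
    out = []
    i = j = 0
    while i < len(variants) and j < len(annotations):
        vkey, akey = variants[i][:2], annotations[j][:2]
        if vkey < akey:
            i += 1          # variant has no annotation at this position
        elif vkey > akey:
            j += 1          # annotation has no matching variant
        else:
            # Attach the annotation fields to the variant row; keep the
            # annotation in place so later variants at the same site match too.
            out.append(variants[i] + annotations[j][2:])
            i += 1
    return out

# Hypothetical, pre-sorted tables invented for illustration.
variants = [
    ("chr1", 1000, "S1", "A/G"),
    ("chr1", 1000, "S2", "G/G"),
    ("chr1", 2500, "S1", "C/T"),
    ("chr2", 300, "S2", "T/T"),
]
annotations = [
    ("chr1", 1000, "GENE_A"),
    ("chr1", 1800, "GENE_B"),
    ("chr2", 300, "GENE_C"),
]

joined = join_by_position(variants, annotations)
for row in joined:
    print(row)
```

Since both inputs share the same genomic order, the join is a single pass over each list, and the annotation table can be updated or swapped without rewriting the raw variant data.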

Leaders in the genomic research community are now beginning to create coalitions and working groups to underpin and coordinate the adoption of standards for sharing genomic data. As these groups create flexible and efficient policy frameworks, the GOR is tested and ready to support the fundamental data requirements of global data sharing and the acceleration of discoveries in genome-based medicine. The general adoption of the GOR as a data standard for storing genomic data may greatly accelerate the spread of sequencing and its effectiveness as a tool for advancing medicine around the world.