As Cancer Databases Grow, A Global Platform Leaps the Big Data Hurdle

As massive cancer databases like The Cancer Genome Atlas (TCGA) proliferate and expand worldwide, WuXi NextCODE expects to see—and to drive—a boom in discoveries of cancer biomarkers that will advance our ability to treat cancer and improve outcomes for patients.

One of the fastest-growing areas in medicine today is the creation of massive cancer databases. Their aim is to provide the scale of data required to unravel the complexity and heterogeneity of cancer—the key to getting patients more precise diagnoses faster, and to getting them the best treatments for their particular disease.

In short, this data has the potential to save lives.

Such databases are not new, but they are now proliferating and expanding at an unprecedented pace. Driven by governments, hospitals, and pharmaceutical companies, they catalogue a growing range of genetic data and biomarkers together with clinical information about their effects on disease, therapy, and outcomes.

Only with such data can we answer the key questions: Does a certain marker suggest that a cancer will be especially aggressive? Does it signal that the tumor responds best to particular treatments? Are there new pathways involved in particular cancers that we can target to develop new drugs?

It’s the cutting edge of oncology, but to be powered to answer these questions, these databases have to be very, very big. They have to bring together whole-genome sequence data on patients and their tumors as well as a host of other ‘omics and biological data. One of the biggest challenges to realizing this potential is to manage and analyze datasets of that scale around the world. It’s one we are addressing in a unique manner through our global platform.

One of the most renowned and widely used of these databases is The Cancer Genome Atlas, a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). TCGA data is freely available to those who qualify, and there is a lot of it. It already comprises 2.5 petabytes of data describing tumor tissue and matched normal tissues for 33 tumor types from more than 11,000 patients. Researchers all over the world can apply to use this data for their own studies, and many have.

Yet asking questions of TCGA alone can take months for most groups and requires sophisticated tools. At Boston’s recent Bio-IT World conference, WuXi NextCODE’s director of tumor product development, Jim Lund, explained how we have put TCGA on our global platform—providing a turnkey solution with integrated analytics to transform the data into valuable findings.

Jim and his team have imported into WuXi NextCODE’s cloud platform virtually all key TCGA data: raw whole-exome sequence data from patients and tumors, as well as variant calls made with MuTect2 and VarScan2; RNA and microRNA sequence and expression data; and data on copy number variation, methylation arrays, and some 150 different clinical attributes. But this data isn’t just hosted in the cloud: it can all now be queried directly and at high speed online, enabling researchers to quickly ask and answer highly complex questions without having to download any data or provide their own bioinformatics software.

To demonstrate the power of this approach, Jim’s team decided to rerun the queries from a recently published study that looked at sequence data from the exons of 173 genes in 2,433 primary breast tumors (Pereira et al., Nature Communications 2016). They were specifically looking for the mutations that drive cancer’s growth and spread. In a matter of minutes, rather than months, they were able to replicate key mutations identified in the study. That analysis was then extended to all cancer genes, and additional driver genes were found. More important, because they were able to correlate these mutations with clinical outcomes data, they were also able to begin systematically matching specific mutation patterns to patient outcomes.

Next, Jim’s team looked at the genomics of lung adenocarcinoma, the leading cause of death from cancer worldwide. Following up on the findings in another published study (Collisson et al., Nature 2014), they profiled the 230 samples examined in the paper and immediately made several observations. Eighteen genes were mutated in a significant number of samples; EGFR mutations (which are well known) were more common in samples from women; and RBM10 mutations were more common in samples from men. These results were extended to 613 samples and shown to be robust. But because they had a wide range of data including mRNA, microRNA, DNA sequencing, and methylation, Jim’s team was further able to suggest some actual biological processes that may be fueling the origin and growth of lung adenocarcinomas.
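For readers wondering how a claim like "more common in samples from women" is typically checked, here is a minimal sketch using Fisher's exact test on a 2x2 table. The post does not say which test the team used, so this is an assumption, and the counts below are toy numbers invented for illustration, not figures from the study. The function name `fisher_exact_two_sided` is likewise hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every possible table that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )

# Toy counts (NOT from the study): rows are women / men,
# columns are samples with / without a mutation in the gene of interest.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

With real cohorts the same function would take the observed counts per group; a small p-value would suggest the difference between groups is unlikely to be chance alone.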

What’s making this type of research possible? It’s our global platform for genomic data. The platform spans everything required to make the genome useful for helping patients around the world, from CLIA/CAP sequencing to the world’s most widely used system for organizing, mining, and sharing large genomic datasets. At its heart is our database—the Genomically Ordered Relational database (GORdb). Because it references sequence data according to its position on the genome, it makes queries of tens of thousands of samples computationally efficient, enabling the fast, online mining of vast datasets stored in multiple locations.
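To give a feel for why position-ordered data lends itself to mining datasets held in several places, the sketch below lazily merges two variant streams that are each already sorted by genomic position. It is an illustration of the general idea only, not GORdb's actual implementation; the site names, the tuple layout, and the function `merged_variants` are all made up for this example.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# (chromosome number, position, sample id, alternate allele); all values are invented.
Variant = Tuple[int, int, str, str]

def merged_variants(*streams: Iterable[Variant]) -> Iterator[Variant]:
    """Lazily merge streams that are each already sorted by (chromosome, position)."""
    return heapq.merge(*streams, key=lambda v: (v[0], v[1]))

# Two hypothetical storage locations, each holding position-sorted rows.
site_a = [(1, 100, "A1", "T"), (1, 500, "A2", "G"), (2, 50, "A1", "C")]
site_b = [(1, 300, "B1", "A"), (2, 10, "B2", "T")]

merged = list(merged_variants(site_a, site_b))
```

Because every source shares the same ordering, the merge only ever needs to hold one row per source in memory, no matter how large the underlying datasets are.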

That’s how we are making the TCGA—and every major reference dataset in the world—available and directly minable by any researcher using our platform. Those users can combine all that data with their own to conduct original research at massive scale.

These breast and lung cancer studies are just two of more than a thousand that have been carried out so far on TCGA data. As more such datasets become available, we expect to see—and to drive—a boom in discoveries of cancer markers that will advance our ability to treat cancer and improve outcomes for patients. For those who want to go further still, our proprietary DeepCODE AI tools offer a means of layering in even more datasets to drive insights even deeper into the biology of cancer and other diseases. And that’s a topic I’ll return to in the weeks ahead.

Global Projects Move Genomic Medicine to the Next Level

NextCODE takes top marks in Genomics England analysis and interpretation “bake-off”: NextCODE’s proven population-scale platform delivered the best results in rare disease and cancer clinical interpretation, as well as secondary analysis and variant refinement.

New genomics-based technologies and tools are making their way into a range of exciting research programs and clinical studies around the world. Leading-edge organizations are quickly adopting hardware for sequencing and systems for collecting genomic data. Now, the focus has turned to analysis and interpretation – the critical component necessary to gain the insights from the sequence data that will transform medicine.

Earlier this year, Genomics England announced investments for broad sequencing and analysis of 100,000 human genomes. At the time, Genomics England had selected Illumina as its sequencing partner and was coordinating resources and centers to support the effort, including resourcing for analysis and interpretation. [See blog post here.] Other initiatives, such as the Qatar genomics program and the initiatives by Longevity and Regeneron, also represent the accelerated progress in seeking medical advancements from genomic data insights. [See blog post here.]

This week, Genomics England announced a select group of companies with advanced capabilities that will move to the next stage of evaluation to provide clinical interpretation for the 100K Genomes Project. At the top was NextCODE, which received top marks from Genomics England for its analytical capabilities across all the categories evaluated: rare disease interpretation, secondary pipeline analysis, and cancer interpretation. [See press release here.] The company’s advanced Genomically-Ordered Relational database, or GOR, combined with its clinical and discovery interfaces, offers the most advanced and reliable capabilities to support the ambitious tasks undertaken by Genomics England, and is already proven at population scale. [Read more on the GOR database here.]

The coming months will be a very exciting time for genomic medicine, with interpretation taking the spotlight as we take leaps toward the next stage of personalized medicine.

Genome Data Interpretation: How to Ease the Bottleneck

Bloomberg BNA Business’ “Diagnostic Testing & Emerging Technologies” highlights how NextCODE is providing a qualitatively different way to store and analyze genomic information to meet growing opportunities in personalized medicine.

With advances in sequencing technology and reduced costs, more and more data are generated every day on the genetic basis of disease. The challenge has become how to derive meaningful information from these mountains of data.

While various systems have been established in recent years to store the large amounts of genomic data from patients’ DNA, a remaining obstacle is to “break the bottleneck” so that researchers can process the vast data in multiple human genomes in order to identify and isolate a small, useful piece of information about disease. Conventional databases and algorithms have not been able to efficiently and reliably identify subset information among the millions of genetic markers in order to inform clinical decisions. This has become a major data management roadblock.

The key is to find new approaches for databases and algorithms that accommodate the unique ways that genomic information is analyzed and interpreted. As discussed in Bloomberg BNA, Diagnostic Testing & Emerging Technologies, NextCODE is already easing this bottleneck by providing a qualitatively different way to store and analyze genomic information and apply it to meet the growing opportunities for personalized medicine.

NextCODE’s Genomically Ordered Relational (or GOR) database infrastructure is a truly different way of storing this huge amount of data. The principle is very simple: rather than store sequence and reference data in vast unwieldy files, it ties data directly to its specific genomic position. As a result, queries are vastly more efficient than in a traditional relational database, because the algorithms can isolate data by location in the genome. That makes analysis faster, more powerful, and radically more efficient in terms of clinicians’ and researchers’ time as well as computer infrastructure, I/O, and CPU usage.
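The principle of tying data to genomic position can be sketched in a few lines. The toy class below keeps records sorted by position within each chromosome, so a region query is two binary searches and a slice rather than a scan of everything stored. This is only an illustration of the idea under simplifying assumptions (in-memory lists, invented positions and sample names); `PositionIndex` is a hypothetical name, not part of any NextCODE product.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class PositionIndex:
    """Toy store: per-chromosome records kept sorted by genomic position."""

    def __init__(self):
        self._positions = defaultdict(list)  # chromosome -> sorted positions
        self._payloads = defaultdict(list)   # chromosome -> payloads, same order

    def add(self, chrom, pos, payload):
        # Insert so that both lists stay sorted by position.
        i = bisect_right(self._positions[chrom], pos)
        self._positions[chrom].insert(i, pos)
        self._payloads[chrom].insert(i, payload)

    def query(self, chrom, start, end):
        """Return payloads with start <= position <= end on one chromosome."""
        positions = self._positions.get(chrom, [])
        payloads = self._payloads.get(chrom, [])
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        return payloads[lo:hi]

# Invented records: (chromosome, position, sample label).
index = PositionIndex()
index.add("chr7", 2000, "s1")
index.add("chr7", 1000, "s2")
index.add("chr7", 3000, "s3")
index.add("chr1", 2000, "s4")

hits = index.query("chr7", 1500, 3000)
```

The cost of `query` grows with the logarithm of the number of records on that chromosome plus the size of the answer, which is the intuition behind isolating data "by location in the genome".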

This holistic approach applies broadly to the priorities of genome scientists around the world, helping them eliminate the data management bottleneck and identify more of the culprits behind many inherited diseases, more quickly and cost-effectively.

Read more about NextCODE’s work here.

Trends in Sequencing and Analysis Today Leading to Tomorrow’s Clinical Advances

The insights we’re gaining from sequencing and analysis techniques are delivering new advances in healthcare with ever greater speed and precision.

The challenge for programs seeking to accelerate their research discoveries with genomic data is how to analyze the wealth of information—to make it clinically relevant and rapidly deliver reliable insights to better inform patient care.

It’s a particularly exciting time to be a part of this evolving industry, with continual opportunities for new clinical applications of these technologies and platforms.

Companies like Illumina and others who are delivering next-generation sequencing technologies are gaining global exposure. New partnerships and programs are placing these advanced techniques into the hands of the world’s leading clinicians and researchers, who are then applying them to some of today’s greatest medical challenges. Recently, plans to integrate sequencing technologies have been announced by world-renowned organizations like the Baylor College of Medicine in the U.S., Genomics England, and Sidra Medical and Research Center in Qatar.

The challenge for these and other programs seeking to accelerate their research discoveries with genomic data is how to analyze this wealth of information – to make it clinically relevant and rapidly deliver reliable insights to better inform patient care.

NextCODE Health is working to advance this piece of the puzzle with its Genomically Ordered Relational (GOR) database and its clinical and discovery interfaces (the Clinical Sequence Analyzer™ and Sequence Miner™). Combining next-generation sequencing techniques with increasingly robust analysis tools, NextCODE Health is helping to accelerate global research progress today to deliver unprecedented advances in patient care in the years just ahead.

A Standard Database Architecture Will Build a Stronger Foundation for Genome Discoveries

The general adoption of the Genomically-Ordered Relational database (GOR) as a data standard for storing genomic data may greatly accelerate the spread of sequencing and its effectiveness as a tool for advancing medicine.

It is widely accepted that the ability to share the analysis and insights from DNA sequencing will be a key driver of discovery and innovation. But one current limitation to extending this knowledge is that sequencing and analysis platforms, as well as samples, are often proprietary to and stored at different institutions. Perhaps more important, the structures and formats in which genomic data has customarily been stored (the relational databases developed by the likes of IBM and Oracle) make it unwieldy to analyze as the amount of data grows, and very difficult to share. The upshot is that institutions cannot easily share and consolidate information to generate more robust analyses and clinically relevant insights. This presents a serious hurdle to discovery both in rare disorders, where samples need to be gathered in order to generate adequate analytical power, and in complex ones, where truly massive studies can tease apart different facets of disease and reveal their causes.

Over the past decade, a novel and comprehensive database model has been developed to overcome these problems in a flexible and fast way. It is called the Genomically-Ordered Relational database, or GOR, and was designed to manage and query the detailed genomic data amassed by deCODE genetics in Iceland, the world’s first and still by far the largest and most comprehensive population-based genomic database.

The thinking behind the GOR is as simple as it is revolutionary. Genomic data is a sort of big data but one with an important difference: It is divided up in distinct packets—the chromosomes—and then arranged within each chromosome in linear fashion. The GOR makes use of this by storing and querying sequence data according to its unique position in the genome, rather than as huge files as long as the sequence. This radically reduces the data burden of querying even large numbers of whole genomes, at the same time making it possible to store and visualize instantly the raw sequence underlying an analysis.

In practice, the GOR thereby enables researchers to home in on specific variants without having first to call up entire patient genomes, and separates raw data from annotations to focus in on only the most relevant search components. It’s these types of functions and features that can be consistently applied across data storing systems to allow for more multi-institutional, collaborative research and consistency in outcomes worldwide.
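One way to picture keeping raw data separate from annotations is as two position-ordered tables that are joined only when a question requires it. The sketch below walks both tables once, in genomic order, to attach a label to each variant. It is a simplified illustration, not the GOR's actual query engine: it assumes numeric chromosome identifiers, non-overlapping annotation intervals, and invented gene labels, and `annotate` is a hypothetical helper.

```python
def annotate(variants, intervals):
    """Join sorted variants to sorted, non-overlapping intervals in a single pass.

    variants:  list of (chrom, pos, sample), sorted by (chrom, pos)
    intervals: list of (chrom, start, end, label), sorted by (chrom, start)
    """
    matches = []
    j = 0
    for chrom, pos, sample in variants:
        # Skip intervals that lie entirely before this variant; because both
        # inputs are sorted, they can never match a later variant either.
        while j < len(intervals) and (intervals[j][0], intervals[j][2]) < (chrom, pos):
            j += 1
        if j < len(intervals):
            i_chrom, start, end, label = intervals[j]
            if i_chrom == chrom and start <= pos <= end:
                matches.append((sample, pos, label))
    return matches

# Invented raw variant rows and a separate, invented annotation table.
variants = [(1, 150, "s1"), (1, 450, "s2"), (2, 20, "s1")]
intervals = [(1, 100, 200, "GENE_A"), (1, 300, 400, "GENE_B"), (2, 10, 30, "GENE_C")]

annotated = annotate(variants, intervals)
```

Since neither table is ever re-scanned, the annotations can be swapped or updated independently of the raw data, and the join cost stays proportional to the combined size of the two inputs.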

Leaders in the genomic research community are now beginning to create coalitions and working groups to underpin and coordinate the adoption of standards for sharing genomic data. As these groups create flexible and efficient policy frameworks, the GOR is tested and ready to support the fundamental data requirements of global data sharing and the acceleration of discoveries in genome-based medicine. The general adoption of the GOR as a data standard for storing genomic data may greatly accelerate the spread of sequencing and its effectiveness as a tool for advancing medicine around the world.