Domain Expertise: Jumpstarting Artificial Intelligence in Biomedicine

Is artificial intelligence the “single most transformative technology in modern history”? That’s the view of Tom Chittenden, who leads WuXi NextCODE’s AI program. And Tom is not alone in his enthusiasm: numerous analysts predict that AI will be one of the fastest-growing fields in the world.

In recent talks at Boston’s Bio-IT World and the EmTech conference in Hong Kong, Tom described some of the strides we’ve been making with our DeepCODE AI tools. Their power comes in part from a novel causal statistical-learning method and a deep-learning classification strategy. But another advantage is that they were built on, and are extending the reach of, our global platform for genomic data. That gives Tom’s team a rare combination of the key ingredients for AI to make an impact in biomedicine: cutting-edge algorithms on the one hand, and deep domain expertise and access to the biggest datasets on the other.

Tom—who also holds appointments at Harvard, MIT, and Boston Children’s Hospital—and his growing team have the former in spades; our platform and expertise in genomics provide a key edge in the latter. Our platform has been built over more than 20 years and today underpins the majority of the world’s largest genomics efforts and includes all major global reference databases. It stores, manages, and integrates any type of genomic data and correlates it with phenotype, ‘omics’, biology, outcome, and virtually any other type of data that may be relevant to a particular medical challenge.

That means that we can routinely train and test our AI tools on some of the most comprehensive data sets in the world, such as that in The Cancer Genome Atlas (TCGA). “Today we can take ‘omics data and clinical information and map those to curated resources such as SNOMED CT and biomedical ontologies, and then use AI to identify patterns that lead us to novel findings,” Tom says.

This is a powerful approach to tease out which of hundreds of genetic variants are really involved in a particular disease, based on which ones are actually associated with aberrant expression pathways. You may find hundreds of genetic mutations in a single type of breast cancer tumor, for example, but it is determining which ones are drivers of the disease that matters.

Put simply, AI can lead us to both better diagnoses and easier discovery of more and better drug targets by taking a range of genomic data and marrying it to clinical information and scientific knowledge. AI is not just going to better match patients to the right drugs; it is also going to further our understanding of the relationships between genes and complex molecular signaling networks, one of the most challenging arenas in our field and the most sought-after starting point for discovering validated pathways and targets.

Valuable insights into real-world medical challenges are already emerging from this AI effort, which has been developed on, and applied to, the genomic and medical data that matter most.

WuXi NextCODE recently presented preliminary data from analyses using our novel AI technology to diagnose subtypes of tumors. Our DeepCODE tools were validated on six patient-derived tumor xenografts from mouse models and then tested against approximately 8,200 human tumors spanning 22 cancer types in the National Cancer Institute’s TCGA collection. That study included five ’omics data types. We achieved 98% accuracy overall, and our analyses of human breast and lung cancer subtypes were accurate in 96% and 99% of cases, respectively. This points to an improvement over current methods for matching patients to treatments for their particular cancer, and we have since refined that accuracy further. This capability is also going to be central to the development of liquid biopsies.

In another oncology study, using the same multi-omics data, DeepCODE identified a signal predictive of survival across 21 cancers, pointing to novel and holistic pathways for developing broad oncotherapies.

A recent study published in Nature, meanwhile, describes a potential new role for a well-known growth factor. That report, led by Yale University scientist Michael Simons, looked at blood vessel growth regulation—a crucial process in some very common conditions, including cardiovascular disease and cancer. Our Shanghai team provided RNA sequencing for this study. Our Cambridge AI team drove some of the key insights pointing to novel disease mechanisms.

Simons’ team studied knockout mice in which fibroblast growth factor (FGF) receptor genes were turned off. They showed, for the first time, that FGFs have a key role in blood vessel growth, uncovering some metabolic processes that were “a complete surprise,” according to scientists on the team. Further, they mapped out pathways that could help provide new drug leads.

Our AI team is just getting started. We’re looking forward to many more intriguing findings from this group as they leverage their expertise and massive amounts of the relevant data to improve medicine and healthcare.


As Cancer Databases Grow, A Global Platform Leaps the Big Data Hurdle

As massive cancer databases like The Cancer Genome Atlas (TCGA) proliferate and expand worldwide, WuXi NextCODE expects to see—and to drive—a boom in discoveries of cancer biomarkers that will advance our ability to treat cancer and improve outcomes for patients.

One of the fastest-growing areas in medicine today is the creation of massive cancer databases. Their aim is to provide the scale of data required to unravel the complexity and heterogeneity of cancer—the key to getting patients more precise diagnoses faster, and to getting them the best treatments for their particular disease.

In short, this data has the potential to save lives.

Such databases are not new, but they are now proliferating and expanding at an unprecedented pace. Driven by governments, hospitals, and pharmaceutical companies, they catalogue a growing range of genetic data and biomarkers together with clinical information about their effects on disease, therapy, and outcomes.

Only with such data can we answer the key questions: Does a certain marker suggest that a cancer will be especially aggressive? Does it signal that the tumor responds best to particular treatments? Are there new pathways involved in particular cancers that we can target to develop new drugs?

It’s the cutting edge of oncology, but to be powered to answer these questions, these databases have to be very, very big. They have to bring together whole-genome sequence data on patients and their tumors as well as a host of other ‘omics and biological data. One of the biggest challenges to realizing this potential is to manage and analyze datasets of that scale around the world. It’s one we are addressing in a unique manner through our global platform.

One of the most renowned and widely used of these databases is The Cancer Genome Atlas, a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). TCGA data is freely available to those who qualify, and there is a lot of it. It already comprises 2.5 petabytes of data describing tumor tissue and matched normal tissues for 33 tumor types from more than 11,000 patients. Researchers all over the world can apply to use this data for their own studies, and many have.

Yet asking questions of TCGA alone can take months for most groups and requires sophisticated tools. At Boston’s recent Bio-IT World conference, WuXi NextCODE’s director of tumor product development, Jim Lund, explained how we have put TCGA on our global platform—providing a turnkey solution with integrated analytics to transform the data into valuable findings.

Jim and his team have imported into WuXi NextCODE’s cloud platform virtually all key TCGA data: raw whole-exome sequence data from patients and tumors, as well as variant calls from MuTect2 and VarScan2; RNA and microRNA sequence and expression data; and data on copy number variation, methylation arrays, and some 150 different clinical attributes. But this data isn’t just hosted in the cloud: it can all now be queried directly and at high speed online, enabling researchers to quickly ask and answer highly complex questions without having to download any data or provide their own bioinformatics software.

To demonstrate the power of this approach, Jim’s team decided to rerun the queries from a recently published study that looked at sequence data from the exons of 173 genes in 2,433 primary breast tumors (Pereira et al., Nature Communications 2016). That study was specifically looking for mutations that drive cancer’s spread and growth. In a matter of minutes, rather than months, the team was able to replicate key mutations identified in the study. That analysis was then extended to all cancer genes, and additional driver genes were found. More important, because they were able to correlate these mutations with clinical outcomes data, they were also able to begin systematically matching specific mutation patterns to patient outcomes.
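The two steps described above, counting how many tumors carry a mutation in each gene and then relating mutation status to a clinical outcome, can be sketched in a few lines of Python. This is only an illustration and not the platform's query language: the sample IDs, mutation calls, outcome flags, and function names are all hypothetical, and only the gene symbols are real.

```python
from collections import defaultdict

# Hypothetical toy data: one (sample_id, gene) pair per called mutation.
mutation_calls = [
    ("T1", "TP53"), ("T1", "PIK3CA"),
    ("T2", "PIK3CA"),
    ("T3", "TP53"), ("T3", "TP53"),  # two calls in one gene count once per sample
    ("T4", "GATA3"),
]
# Hypothetical clinical attribute per sample (True = outcome event observed).
had_event = {"T1": True, "T2": False, "T3": True, "T4": False}


def samples_by_gene(calls):
    """Map each gene to the set of samples carrying at least one mutation in it."""
    by_gene = defaultdict(set)
    for sample, gene in calls:
        by_gene[gene].add(sample)
    return by_gene


def event_rate_by_status(gene_samples, outcomes):
    """Return (event rate among mutated samples, event rate among the rest)."""
    def rate(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    mutated = [flag for s, flag in outcomes.items() if s in gene_samples]
    others = [flag for s, flag in outcomes.items() if s not in gene_samples]
    return rate(mutated), rate(others)


by_gene = samples_by_gene(mutation_calls)
recurrence = {gene: len(samples) for gene, samples in by_gene.items()}
tp53_rates = event_rate_by_status(by_gene["TP53"], had_event)
```

A real analysis would use survival models rather than raw event rates, and would need far more samples than this toy set before any difference could be called meaningful.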

Next, Jim’s team looked at the genomics of lung adenocarcinoma, the leading cause of death from cancer worldwide. Following up on the findings in another published study (Collisson et al., Nature 2014), they profiled the 230 samples examined in the paper and immediately made several observations: eighteen genes were mutated in a significant number of samples; EGFR mutations (which are well known) were more common in samples from women; and RBM10 mutations were more common in samples from men. These results were extended to 613 samples and shown to be robust. And because they had a wide range of data, including mRNA, microRNA, DNA sequencing, and methylation, Jim’s team was further able to suggest some of the biological processes that may be fueling the origin and growth of lung adenocarcinomas.
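One common way to check whether a mutation really is more frequent in one group than another is Fisher's exact test on a 2x2 table of counts. The sketch below is a plain-Python version for illustration; it is an assumption that a test of this kind fits the comparison, since the text does not say which statistic the team used, and the counts shown are invented rather than taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. women, men); columns are mutated / not mutated.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Hypothetical counts, NOT from the study:
# women: 20 mutated, 100 not mutated; men: 8 mutated, 102 not mutated.
p_value = fisher_exact_two_sided(20, 100, 8, 102)
```

A small p-value would suggest the difference between groups is unlikely to be chance alone. When many genes are tested at once, the p-values also need a multiple-testing correction.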

What’s making this type of research possible? It’s our global platform for genomic data. The platform spans everything required to make the genome useful for helping patients around the world, from CLIA/CAP sequencing to the world’s most widely used system for organizing, mining, and sharing large genomic datasets. At its heart is our database—the Genomically Ordered Relational database (GORdb). Because it references sequence data according to its position on the genome, it makes queries of tens of thousands of samples computationally efficient, enabling the fast, online mining of vast datasets stored in multiple locations.
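Why ordering by genomic position makes queries efficient can be shown with a small sketch. This is not GORdb itself: it assumes a hypothetical in-memory table of variant rows kept sorted by (chromosome, position), so that a region query becomes two binary searches instead of a scan of every row. The class and field names, and the coordinates in the toy rows, are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    sample: str
    alt: str


class GenomicTable:
    """Toy table of variants kept in (chrom, pos) order."""

    def __init__(self, rows):
        # Sort once up front; every later region query reuses this order.
        self.rows = sorted(rows, key=lambda r: (r.chrom, r.pos))
        self.keys = [(r.chrom, r.pos) for r in self.rows]

    def query(self, chrom, start, end):
        """Return rows on `chrom` with start <= pos <= end."""
        lo = bisect_left(self.keys, (chrom, start))
        hi = bisect_right(self.keys, (chrom, end))
        return self.rows[lo:hi]


# Hypothetical rows with made-up coordinates.
table = GenomicTable([
    Variant("chr7", 55_000_100, "S1", "T"),
    Variant("chr17", 7_500_000, "S2", "A"),
    Variant("chr7", 55_000_900, "S2", "G"),
    Variant("chr7", 140_000_000, "S3", "A"),
])
hits = table.query("chr7", 55_000_000, 55_001_000)
```

Each query costs time proportional to the logarithm of the table size plus the number of rows returned, which is why position-ordered storage scales to tens of thousands of samples. A production system would add on-disk indexing and distributed storage, which this sketch leaves out.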

That’s how we are making the TCGA—and every major reference dataset in the world—available and directly minable by any researcher using our platform. Those users can combine all that data with their own to conduct original research at massive scale.

These breast and lung cancer studies are just two of more than a thousand that have been carried out so far on TCGA data. As more such datasets become available, we expect to see—and to drive—a boom in discoveries of cancer markers that will advance our ability to treat cancer and improve outcomes for patients. For those who want to go further still, our proprietary DeepCODE AI tools offer a means of layering in even more datasets to drive insights even deeper into the biology of cancer and other diseases. And that’s a topic I’ll return to in the weeks ahead.