The world of DNA sequencing and bioinformatics has evolved at a staggering pace. In 2009, the biggest problem for researchers was sequencing DNA efficiently. At that time, sequencing a single human genome cost $100,000 and took 30 days of chemical processing to produce the raw sequencing output. In the four years since, DNA sequencing technology has advanced rapidly, outpacing even Moore’s Law. Today, researchers can sequence a whole human genome for about $3,000 and get the results back in a single day. The challenge has now shifted from sequencing DNA to managing and understanding the extraordinary mass of data that each sequenced genome produces.
The field of Next Generation Sequencing (NGS) bioinformatics has exploded. Three key areas have emerged as the focus for innovative solutions critical to keeping the research pipeline moving: data storage, data analysis, and improved accuracy of data analysis.
Data storage may not seem critical in an era where you can buy a 3 terabyte hard drive for less than $150, but for the sequencing industry, it’s a pain point. Every human genome that is sequenced produces about 200 gigabytes of raw data. That means that a researcher only needs to sequence five humans to fill up an entire terabyte drive. The size of genomic datasets, in combination with new genome-in-a-day sequencers, archiving requirements for clinical applications, and finite on-site data storage at sequencing facilities, is a recipe for disaster. Ask most sequencing centers how they are coping with the data explosion, and it’s apparent that the current solution of an ever-expanding physical data center is fundamentally unsustainable.
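To make the storage arithmetic concrete, here is a minimal sketch in Python. The 200 gigabytes per genome figure comes from the paragraph above. The decimal conversion (1 terabyte = 1,000 gigabytes), the function names, and the one-genome-per-day year are assumptions made for illustration, not figures from any particular sequencing center.

```python
GB_PER_GENOME = 200   # raw data per human genome (figure quoted above)
GB_PER_TB = 1000      # assumption: decimal units, as drive vendors use


def genomes_per_drive(drive_tb: float) -> int:
    """Whole genomes of raw data that fit on a drive of the given size (TB)."""
    return int(drive_tb * GB_PER_TB // GB_PER_GENOME)


def raw_storage_tb(genomes: int) -> float:
    """Terabytes of raw data produced by sequencing this many genomes."""
    return genomes * GB_PER_GENOME / GB_PER_TB


print(genomes_per_drive(1))   # 5
print(genomes_per_drive(3))   # 15
print(raw_storage_tb(365))    # 73.0
```

Under these assumptions, a single genome-in-a-day sequencer running every day would generate roughly 73 terabytes of raw data in a year, before any archival copies or analysis output are counted.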
The analysis of this mass of data poses an even larger challenge. Even state-of-the-art sequencing centers with dedicated computing clusters of 1,000 or more machines struggle to churn through the growing mountain of data. Most sequencing centers take between 5 and 14 days to perform basic analysis of the raw reads and produce a list of annotated variants. If you are lucky enough to have one of the new genome-in-a-day sequencers, your data bottleneck is painfully obvious: every month of data produced takes five months to analyze.
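The five-to-one ratio above implies a backlog that grows for as long as the sequencer keeps running. The sketch below works through that arithmetic. It assumes sequencing and analysis run continuously side by side at fixed rates, which is a simplification, and the function names are hypothetical.

```python
ANALYSIS_MONTHS_PER_DATA_MONTH = 5  # ratio quoted above


def backlog_after(months_sequencing: float) -> float:
    """Months' worth of sequencer output still unanalyzed after this long."""
    analyzed = months_sequencing / ANALYSIS_MONTHS_PER_DATA_MONTH
    return months_sequencing - analyzed


def months_to_clear(backlog: float) -> float:
    """Analysis time needed to work through a backlog once sequencing stops."""
    return backlog * ANALYSIS_MONTHS_PER_DATA_MONTH


backlog = backlog_after(10)
print(backlog)                   # 8.0
print(months_to_clear(backlog))  # 40.0
```

In this simplified model, ten months of continuous sequencing leaves eight months of output untouched, and clearing it would take another 40 months of analysis at the same capacity.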
As an industry, our focus has been on the first two areas of storage and processing raw data; however, we now need to shift our focus to addressing the gaping blind spots in our current analysis methodology. Data analysis is only as useful as the data is accurate.
This is best exemplified in BGI’s 2010 published findings. The publication “Building the Sequence Map of the Human Pan-Genome” revealed that as much as 50% of human variation is missed by the current industry-standard analysis methodologies. This is in part because the variants are much more complex than the current tools are able to detect. Current algorithms cannot detect the vast majority of complex genetic variations such as larger insertions, deletions and structural rearrangements. As an industry, we are quickly discovering that many of the genetically influenced conditions that have been most challenging to understand, including autism, schizophrenia, and Alzheimer’s disease, are heavily correlated with these more complex variations. Given the existing data processing challenges, switching to the methodology described by BGI, which is far more computationally intense, is not a realistic option. Until someone provides a feasible solution, the industry will continue to struggle with a puzzle that is missing one third of the pieces.
It’s clear that we in the genetics research community have a lot of work to do if we are going to fulfill the promise of personalized medicine, advanced crop development and effective biofuel production, among other things. The good news is that there are thousands of biologists, computer scientists and statisticians working to solve these problems. We are witnessing a new wave of innovation, and the velocity of genetic advancements is the fastest that we have seen in the history of humanity. Looking to the future, it’s safe to say that bioinformatics is poised to change the world.
Adina Mangubat is the Chief Executive of Spiral Genetics.
During the 2013 BIO International Convention, she joined a group of Forbes magazine’s “30 Under 30” rising stars in science and healthcare for a keynote luncheon discussion with BIO President and CEO Jim Greenwood. Learn more about her on the BIO International Convention web site.