Discovery: Big Data’s
From analyzing reams of ones and zeros comes vast potential for new drug discovery
By Louis Greenstein
Photo illustration by Clint Blowers
What do Netflix and Google have in common with biomedical research? They all leverage large datasets to identify trends and make recommendations. While e-commerce engines collect keywords and purchasing histories to recommend everything from antiques to Zumba, researchers participate in groundbreaking big data consortia that help understand diseases and recommend treatments — not by relying on traditional diagnostics, but by analyzing mountains of aggregated data.
“We really do live in the age of big data,” said Stephan Schürer, Ph.D., a member of Sylvester Comprehensive Cancer Center, professor of molecular and cellular pharmacology at the Miller School of Medicine, and program director of drug discovery at the University of Miami Center for Computational Science (CCS).
Big data is more than reams of ones and zeros. It’s also how they’re processed into usable knowledge. And as big data gets bigger, standardizing how information is extracted is critical so that researchers can contribute to and share the knowledge.
Machine learning and artificial intelligence rely on big data. Most everyone is familiar with AI applications that use algorithms to recognize faces. The same technology can identify tumors.
“AI has advanced to a point where it has exceeded human performance in some areas,” Dr. Schürer said. Not only can it recognize faces more quickly, but it even recognizes melanoma. “The top algorithms are now on par with the best dermatologists in the world,” he said. “You can take images of the eye, for example, and diagnose certain eye diseases. Cancer-no cancer, or a positive-negative diagnosis based on MRIs. Currently these methods work best for images, because that’s what the deep neural network architectures have initially been developed for and that’s where we have a huge amount of data. Humans are 90% visual in how we recognize our environment. Now the same technology is being reused for biomedical images.”
The Vs and Ds of Big Data
But the analogy to facial and object recognition only goes so far. “Common images such as a cat, a dog or even some biomedical images are mostly uncontroversial,” Dr. Schürer said. Chemical structures are different: there are many ways to describe and encode a chemical structure, and the choice of representation influences how well a model works, on top of the large amounts of data that AI models require.
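To make the point about encoding concrete, here is a minimal sketch in Python. It is an illustration only, not a method described by the researchers. It assumes SMILES, a common text notation for molecules, in which "CCO" and "OCC" are both valid ways to write ethanol. The two encoding functions are hypothetical and deliberately naive: one depends on the order of characters, the other only counts atoms.

```python
# Illustrative sketch: two valid SMILES strings for the same molecule (ethanol).
ETHANOL_A = "CCO"
ETHANOL_B = "OCC"


def positional_encoding(smiles, alphabet="CO", length=3):
    """One-hot encode each character by position (a deliberately naive scheme)."""
    vector = []
    for position in range(length):
        char = smiles[position] if position < len(smiles) else None
        vector.extend(1 if char == symbol else 0 for symbol in alphabet)
    return vector


def atom_counts(smiles, alphabet="CO"):
    """Order-independent encoding: count how often each atom symbol appears."""
    return [smiles.count(symbol) for symbol in alphabet]


enc_a = positional_encoding(ETHANOL_A)
enc_b = positional_encoding(ETHANOL_B)

# Same molecule, but the order-sensitive encoding produces different inputs.
print(enc_a, enc_b, enc_a == enc_b)

# The count-based encoding agrees, at the cost of discarding structure.
print(atom_counts(ETHANOL_A), atom_counts(ETHANOL_B))
```

A model trained on the first encoding would see one molecule as two unrelated inputs, while the second encoding cannot tell apart molecules that share the same atoms. Neither is adequate for real work, which is one reason the choice of representation matters so much.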
“Volume is one characteristic, but not the only characteristic,” said Nicholas Tsinoremas, Ph.D., vice provost for research computing and data, and founding director of CCS. “Computer scientists often refer to the three Vs of big data: volume, velocity, variety. It’s not just how much data we have, but more about the variety and velocity that determines how quickly they can be turned into actionable information.”
In addition to the three Vs, “we look at two Ds,” Dr. Tsinoremas said. The first is disruption. Take Uber, for example. “Somewhere in the Bay Area is a supercomputer that disrupted the taxi and the automotive industries. In the medical field, data is disrupting how we see patients.”
The second D is democratization — making sure that researchers can access the data. One CCS project ensures that the data it collects are not only secure, but also accessible to Miller School faculty so they can develop ideas and design studies.
Moving Data to Researchers’ Hands
Dr. Schürer’s research group currently works in three national research consortia that are part of the NIH Common Fund, a home for collaborative high-risk, high-reward biomedical explorations. One, the LINCS Consortium, is developing a library of molecular signatures that describe how different types of cells respond when exposed to agents that disrupt normal cellular functions.
“Everything is open source, so scientists can go online and find the best algorithms for their own projects,” said Vasileios Stathias, Ph.D., lead data scientist at Sylvester and the Miller School’s Department of Molecular and Cellular Pharmacology. “The open source attribute is the most important thing — a few years ago it was all proprietary. Open source encourages democratization by getting algorithms and data out of private warehouses and into researchers’ hands.”
Another NIH consortium is the Big Data to Knowledge (BD2K) program, which supports the development of tools to accelerate the use of big data in biomedical research. BD2K addresses issues such as how to extract knowledge from data. And it supports efforts toward making datasets “FAIR” — Findable, Accessible, Interoperable and Reusable.
The third, Illuminating the Druggable Genome (IDG), is an international research consortium funded by the NIH to develop technologies that will help validate novel therapeutic targets, in particular those that are currently ignored and receive almost no research or funding.
“One of the large efforts in the lab is to harmonize and integrate data from many different projects and resources,” Dr. Stathias said. “For example, LINCS has generated data from over 1.2 million experiments, and the next release is going to make available more than 2 million experiments.”
“The ultimate goal, of course, is to find the best treatment for a specific patient, and that’s going to come from the aggregation and analysis of data from multiple areas,” Dr. Stathias said. “Scientists can spend up to 80% of their time finding and cleaning data. The integrated repositories would enable researchers to spend more time analyzing the data than cleaning. I think we are maybe a year away from the first version. We want to make it future-proof, so that regardless of new technologies, it will improve the job of scientists.”
Dr. Schürer adds: “Another important goal is to preserve the data, enable re-analysis, for example if better tools or algorithms are available, and to have all data available as standardized signatures, prepared for immediate analysis and AI application. Remember, data is now the most valuable resource. We need to make sure we preserve it here at Sylvester and the University of Miami and enable researchers to take advantage of state-of-the-art AI technologies.”
Extending Big Data’s Reach
Big data’s reach isn’t limited to faster diagnostics, Dr. Tsinoremas said. “We’re talking about digital therapeutics — bringing big data to the next level, such as sensors on bottles that tell us when a patient didn’t take their medications. You won’t be able to just tell a physician you’re taking your meds, because the physician will be able to see!”
As for those who worry about AI, machine learning and big data eclipsing humanity, Dr. Schürer is less concerned. Whether we’re using it to diagnose diseases or prioritize drugs, big data ultimately comes from people.
“If you really think about it, all the training information that goes into making predictions in the first place comes from humans,” Dr. Schürer said. “You have to label the data, provide context knowledge and interpret the results, so I don’t really see how the current methods can eclipse human capabilities. That doesn’t mean they can’t exceed human performance, because they are very fast. But asking the right questions is most critical to do research. AI at the moment doesn’t function independently of the human context. All that said, data privacy and data ownership are critical, currently unresolved issues.”