Medicine today is imprecise. For the top 20 prescription drugs in the U.S., 80% of patients are non-responders. Recent disruptions in sensor technology have enabled precise categorization of diseases and treatment effects. For example, sequencing technology has reached the exciting milestone of the $1000 human genome. Major cancer centers have begun to sequence tumors routinely for personalized cancer diagnosis and treatment.
However, progress in precision medicine is difficult, as genome-scale knowledge and reasoning become the ultimate bottleneck in deciphering cancer and other complex diseases. Today, it takes hours for a molecular tumor board of many highly trained specialists to review one patient’s omics data and make treatment decisions. With 1.6 million new cancer cases and 600 thousand deaths in the U.S. alone each year, this is clearly not scalable.
We envision that AI-powered decision support for precision medicine will become an explosive growth area in cloud-based health analytics. In Project Hanover, building on prior work in Literome, we are making progress in three directions:
Biomedical knowledge has been growing at an explosive rate. PubMed adds two new papers every minute, over one million each year. Machine reading automatically converts text to structured databases, making it easy to search and reason with this vast body of knowledge. However, natural languages are ambiguous, and the same meaning can be expressed in many ways. Traditional machine learning approaches require a large number of annotated examples to train an extractor, and are hard to generalize to new domains.
An overarching theme of our research seeks to overcome this annotation bottleneck by leveraging prior knowledge and joint inference for indirect supervision. In the past, we have successfully applied this paradigm to a variety of NLP tasks, including coreference resolution (EMNLP-08), morphology (NAACL-09 Best Paper), and semantic parsing (EMNLP-09 Best Paper).
More recently, we have been developing methods that use existing knowledge bases to automatically annotate a large number of noisy training examples (a.k.a. distant supervision) for precision medicine. We have successfully applied distant supervision to PubMed-scale extraction of cancer pathways (PSB-15), developed the first approach to extract complex, nested gene regulation relations without direct supervision (NAACL-15), and developed the first distant-supervision approach for cross-sentence relation extraction of drug-gene interactions. To facilitate efficient inference of facts not explicitly stated in text, we have developed deep learning approaches to embed knowledge and text into vector representations (EMNLP-15, ACL-16).
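To make the idea of distant supervision concrete, here is a minimal sketch, not the system described above. It assumes a tiny hand-written knowledge base and simple string matching for entity mentions; the names `KNOWN_INTERACTIONS`, `DRUGS`, `GENES`, `SENTENCES`, and `distant_label` are hypothetical and introduced only for illustration. Any sentence that mentions a drug and a gene is labeled positive if the pair appears in the knowledge base, and negative otherwise.

```python
import re
from typing import List, NamedTuple


class Example(NamedTuple):
    sentence: str
    drug: str
    gene: str
    label: int  # 1 if (drug, gene) is in the knowledge base, else 0


# Hypothetical toy knowledge base and entity lexicons (illustration only).
KNOWN_INTERACTIONS = {("gefitinib", "EGFR"), ("imatinib", "ABL1")}
DRUGS = {"gefitinib", "imatinib"}
GENES = {"EGFR", "ABL1", "KRAS"}

# Hypothetical input sentences (illustration only).
SENTENCES = [
    "Gefitinib inhibits EGFR in lung cancer cells.",
    "Patients with KRAS mutations rarely respond to gefitinib.",
    "EGFR status was unknown when gefitinib was withdrawn.",
]


def distant_label(sentences: List[str]) -> List[Example]:
    """Label every co-occurring drug-gene pair using the knowledge base."""
    examples = []
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        lowered = {token.lower() for token in tokens}
        for drug in sorted(DRUGS):
            if drug not in lowered:  # drug names are matched case-insensitively
                continue
            for gene in sorted(GENES):
                if gene not in tokens:  # gene symbols are matched exactly
                    continue
                label = int((drug, gene) in KNOWN_INTERACTIONS)
                examples.append(Example(sentence, drug, gene, label))
    return examples


for example in distant_label(SENTENCES):
    print(example.label, example.drug, example.gene, "|", example.sentence)
```

The third sentence receives a positive label even though it does not state that the drug acts on the gene. This is the kind of noise that automatically annotated training examples carry, and that the learning method has to tolerate.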
We plan to make our extracted knowledge available through Literome on Azure. This knowledge can be applied to downstream analytics tasks, e.g., as rich features for machine learning in cancer decision support. We are also augmenting Literome with a feedback interface, with the goal of facilitating computer-assisted knowledge curation. Instead of starting from scratch reading millions of papers, curators would simply go through the extracted results and fix errors with a mouse click. The machine reading system can incorporate feedback on the fly and continuously refine extraction quality. We expect that such symbiosis could drastically improve the productivity and coverage of manual curation efforts. One concrete use case under development is drug-gene interactions, where major cancer centers are investing heavily to support molecular tumor boards.
The code and data for our 2017 TACL journal paper Cross-Sentence N-ary Relation Extraction with Graph LSTMs are available here: Download
Cancer is actually a thousand diseases that share similar symptoms. Cancer arises from genetic mutations, which come in a great many varieties as combinations among the 20,000 genes in the human genome. In principle, the tumor genome contains almost all the secrets of the individual cancer. However, it is not easy to translate from the genome to actionable decisions. A tumor might have hundreds, if not thousands, of mutations, only a handful of which actually drive the tumor growth. Genes interact with each other in elaborate ways, forming a complex gene network where feedback loops and cross-talks abound. Consequently, it remains a formidable challenge to diagnose cancers given the omics data, and relapses still occur frequently even if the treatment elicits a strong initial response.
Our research aims to advance AI technology for empowering cancer researchers and tumor boards. First, we strive to automate repetitive processes that are done manually but hard to scale. The prime example is the knowledge bottleneck. Specifically, given the mutations in a tumor genome, what do we already know about these genes, what network submodules or functions do they participate in, and what drugs are known to target them, directly or through network intermediates? Manually curated knowledge bases have sparse coverage, and struggle to keep up with the rapid growth of research literature. Oncologists often resort to keyword search in PubMed to find relevant information, which is laborious and slow. Our machine reading efforts extract such knowledge from research literature, and make it available in the cloud to facilitate search and computation. We are also developing tools for computer-assisted knowledge curation and tumor board decision support.
Furthermore, we are developing AI tools to support tasks that are very difficult to perform, even manually. The prominent example is the reasoning bottleneck. Today's tumor boards are limited to consideration of single genes and drugs. Next-generation molecular tumor boards should factor in interactions among mutations and recommend combinations of drugs to attain synergistic effects and preempt relapse. With hundreds of candidate targeted drugs, there are tens of thousands of combinations even if only pairs are considered. Exhaustive personalized experimentation is infeasible. To combat this combinatorial explosion, we are developing a machine learning approach that models complex drug interactions and off-target effects using the Literome gene network. We are collaborating with the Knight Cancer Institute in the BeatAML project.
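The size of this combinatorial explosion is easy to check. The short sketch below counts unordered drug combinations using the binomial coefficient; the drug counts (100, 200, 300) are illustrative assumptions standing in for "hundreds of candidate targeted drugs", and `count_combinations` is a hypothetical helper name.

```python
from math import comb


def count_combinations(n_drugs: int, size: int) -> int:
    """Number of unordered combinations of `size` drugs chosen from `n_drugs`."""
    return comb(n_drugs, size)


# Illustrative drug counts (assumptions, not figures from the text).
for n_drugs in (100, 200, 300):
    pairs = count_combinations(n_drugs, 2)
    triples = count_combinations(n_drugs, 3)
    print(f"{n_drugs} drugs: {pairs:,} pairs, {triples:,} triples")
```

With 200 drugs there are 19,900 pairs, and with 300 drugs there are 44,850, consistent with the "tens of thousands" figure. Moving from pairs to triples pushes the count into the millions (1,313,400 for 200 drugs), which is why exhaustive experimentation per patient is out of reach.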
At the present rate, U.S. healthcare spending will exceed 20% of GDP by 2025. It is estimated that one third of this spending does not lead to any improvement in health, resulting in a staggering waste of $750 billion per year, much of which stems from imprecision medicine. Chronic diseases take up 86% of healthcare spending. It thus becomes imperative to develop predictive, preventive, and personalized medicine for managing chronic diseases, including but not limited to cancer.
For cancer, omics data is particularly informative. For general chronic diseases, electronic medical records (EMRs), activity sensors, and even query logs provide key data sources for disease management. Analogous to the cancer case, we are working on two fronts where AI is particularly helpful. First, we are developing medical machine reading methods for converting text in EMRs (e.g., clinical notes) to structured databases, which provide rich information for machine learning. Second, we are developing novel methods for modeling chronic disease progression. Our ultimate goal is to infer health states for individual patients, predict impending state transitions, and suggest interventions to prevent unfavorable ones.