Machine Reading for Precision Medicine

Overview

Bringing biomedicine and computer science together

Medicine today is imprecise. For the top 20 prescription drugs in the U.S., 80% of patients are non-responders. The advent of big data heralds a new era of precision medicine, where treatments become increasingly effective by being tailored to individual patients. For example, rapid technical advances have reached the exciting disruption point of the $1,000 personal genome, making it affordable to sequence genetic mutations in individual tumors. Unfortunately, big data also leads to information overload, making it hard to separate signal from noise and discern knowledge from data. Today, it takes hours for a molecular tumor board of many highly trained specialists to review a patient's genomics data and make treatment decisions. With 1.7 million new cancer cases and 600,000 deaths in the U.S. each year, this is clearly not scalable.

Biomedical text contains valuable structured information that can facilitate harnessing big data for precision medicine. Examples include oncology knowledge in biomedical literature and patient treatment outcomes in electronic medical records (EMRs). These opportunities have spawned rapid growth in "Curation-as-a-Service" (CaaS), with Roche's $2-billion acquisition of Flatiron being a prominent case in point. Current CaaS vendors, however, generally rely on manual curation by human experts, and face steep challenges in scalability.

In Project Hanover, we aspire to advance the state of the art of machine reading for accelerating CaaS in precision medicine. Standard machine reading approaches require painstakingly annotating many examples by hand, which limits their applicability. We developed a general framework for incorporating diverse forms of indirect supervision to compensate for the lack of labeled examples, by combining deep learning with probabilistic logic. Motivated by biomedical applications, we expanded the scope of machine reading from single sentences to cross-sentence and document-level extraction, and proposed novel neural architectures such as graph LSTMs for incorporating and reasoning with linguistic constraints.

We envision that CaaS can be accelerated by orders of magnitude via a well-constructed human-computer symbiosis. Indirect supervision bootstraps a machine reader for a specific domain with little labeled data, while expert curators quickly vet machine-read results in an assisted curation interface. As long as the initial machine reader attains sufficiently high recall and reasonable precision, assisted curation will be more efficient than manual curation. Once assisted curation takes hold, the curation decisions provide direct supervision to keep improving machine reading.
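To make the recall and precision condition concrete, here is a minimal cost sketch. It is not a model used by the project; the function names, the per-item times, and the assumption that facts the reader misses still have to be curated manually are all illustrative.

```python
def manual_cost(num_facts: float, minutes_per_manual_fact: float) -> float:
    """Total curator time when every fact is found and recorded by hand."""
    return num_facts * minutes_per_manual_fact


def assisted_cost(
    num_facts: float,
    recall: float,
    precision: float,
    minutes_per_manual_fact: float,
    minutes_per_vetted_candidate: float,
) -> float:
    """Total curator time with a machine reader in the loop (hypothetical model).

    The curator vets every candidate the reader proposes, then curates by hand
    the facts the reader missed.
    """
    if not (0.0 <= recall <= 1.0 and 0.0 < precision <= 1.0):
        raise ValueError("recall must be in [0, 1] and precision in (0, 1]")
    found = recall * num_facts        # true facts surfaced by the reader
    candidates = found / precision    # true facts plus false positives to vet
    missed = num_facts - found        # facts left for manual curation
    return (candidates * minutes_per_vetted_candidate
            + missed * minutes_per_manual_fact)


if __name__ == "__main__":
    # Illustrative numbers only: 10 minutes to curate a fact by hand,
    # 1 minute to accept or reject a machine-read candidate.
    print(manual_cost(1000, 10.0))
    print(assisted_cost(1000, recall=0.9, precision=0.5,
                        minutes_per_manual_fact=10.0,
                        minutes_per_vetted_candidate=1.0))
```

Under these assumptions, assisted curation is cheaper whenever precision exceeds the ratio of vetting time to manual curation time, and recall determines how much of the manual workload is replaced. That is one way to read "sufficiently high recall and reasonable precision".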

Currently, we focus on three representative areas: molecular tumor board, real-world evidence, and clinical trial matching. Each is important in its own right, and collectively they span the full spectrum of precision health applications. In the long run, we're also very interested in combining machine reading results with causal machine learning to facilitate cancer decision support and chronic disease management.

Molecular Tumor Board

Cancer is actually a thousand diseases driven by disparate genetic mutations. Advances in sequencing technology make these mutations easily accessible for individual patients, yet deciphering such "assembly code" of cancer requires staying on top of a vast biomedical literature, which comprises tens of millions of papers and grows by thousands per day.

By combining deep learning and probabilistic logic, we have developed machine reading technology to automatically extract knowledge from publications, thus empowering molecular tumor boards to curate much faster and "leave no fact behind." Building on past work in Literome, these advances enable us to create literature machine readers for a variety of domains, from fundamental biology (e.g., genetic pathways) to translational medicine (e.g., precision oncology), all without labeled examples. Our latest system has read all publicly available biomedical literature (30 million PubMed abstracts and 5 million PMC full-text articles). In a matter of minutes, it found several times as many facts as a whole year of manual curation at an NCI-designated cancer center, which could be quickly validated by expert curators in an assisted curation interface on Azure.

Today's tumor boards focus on single genes and drugs. Next-generation tumor boards should factor in elaborate interactions among genes and mutations, and consider treatment combinations to attain synergistic effects and preempt relapse. However, the resulting combinatorial explosion is hard to contain by manual effort. In addition to the traditional slash-burn-poison regimens, there are thousands of targeted drugs available, along with the rapidly growing arsenal of immunotherapies. Experimental data will remain relatively scarce given the astronomical number of potential combinations. Harnessing prior knowledge about drugs, genes, and mutations could provide key support for learning to prioritize promising treatment combinations.
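The scale of this combinatorial explosion is easy to check with a short calculation. The sketch below simply counts unordered drug combinations; the figure of 1,000 drugs is an illustrative stand-in for "thousands", not a number from this page, and the function name is hypothetical.

```python
from math import comb


def count_combinations(num_drugs: int, max_size: int) -> dict[int, int]:
    """Number of unordered combinations of each size from 1 up to max_size."""
    return {size: comb(num_drugs, size) for size in range(1, max_size + 1)}


if __name__ == "__main__":
    # Assumed drug count, chosen only to illustrate the growth rate.
    for size, count in count_combinations(1000, 3).items():
        print(f"combinations of {size} drug(s): {count:,}")
```

Even before genes and mutations are factored in, pairs already number in the hundreds of thousands and triples in the hundreds of millions, far more than can be tested experimentally or reviewed by hand.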

Real-World Evidence

Developing an FDA-approved drug now takes over a decade and costs more than $2 billion. Randomized controlled trials are the gold standard of medicine, but they are expensive and time-consuming, while covering only a tiny fraction of patients. Electronic medical records (EMRs) contain valuable clinical observations that can be used to augment clinical trial data. For example, Flatiron's seminal work shows that synthetic control using EMRs can alleviate the burden of recruiting real controls. EMRs also document off-label prescriptions when the standard of care fails, thus offering important leads for drug repurposing. Finally, even after a drug has been approved, it is important to conduct post-market surveillance to monitor adverse effects and efficacy in the general population. This traditionally requires additional trials that often cost as much as all pre-approval trials combined. Harnessing real-world evidence from EMRs can potentially accelerate drug development at substantially reduced cost.

Real-world evidence curation currently relies on manual effort and can take hours per patient, with the bulk of the time spent on chart review (i.e., reading doctors' notes). Compared to literature machine reading, clinical text presents additional challenges due to the prevalent use of idiosyncratic abbreviations and greater variation in how things are written. Meanwhile, EMRs also provide additional opportunities for indirect supervision, e.g., using available structured elements such as claim codes. We have obtained promising results in extracting cancer recurrences in a preliminary exploration and have started developing general methods for extracting real-world evidence in oncology.
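As a rough illustration of how structured elements can act as indirect supervision, the sketch below assigns noisy labels to clinical notes from nearby claim codes. It is not the project's method: the record types, the 30-day window, and the codes in RECURRENCE_CODES are hypothetical placeholders rather than real billing codes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Placeholder identifiers, NOT real claim codes; a real system would need a
# clinically vetted code list.
RECURRENCE_CODES = {"REC-001", "REC-002"}


@dataclass
class Claim:
    patient_id: str
    code: str
    service_date: date


@dataclass
class Note:
    patient_id: str
    note_date: date
    text: str


def weak_label(note: Note, claims: list[Claim],
               window_days: int = 30) -> Optional[int]:
    """Return a noisy recurrence label for a note, or None to abstain.

    1    = a recurrence-like code was billed within the window around the note
    0    = the patient has claims in the window, none of them recurrence-like
    None = no claims in the window, so there is no structured evidence
    """
    nearby = [
        c for c in claims
        if c.patient_id == note.patient_id
        and abs((c.service_date - note.note_date).days) <= window_days
    ]
    if not nearby:
        return None
    return 1 if any(c.code in RECURRENCE_CODES for c in nearby) else 0
```

Labels produced this way are noisy, since a code can be missing, delayed, or wrong, so they are better treated as soft evidence for training a reader than as ground truth.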

Clinical Trial Matching

Over 20% of clinical trials fail due to insufficient patient enrollment. Patient recruitment is largely done by word of mouth, placing the burden on physicians and patients to keep track of thousands of open trials and match elaborate eligibility criteria to a given patient's case. For drug development, matching efficiency could determine the success or failure of a trial. For a patient, it can be a matter of life or death. Machine reading can speed up clinical trial matching by extracting patient attributes from EMRs and structured requirements from eligibility criteria, so that the two can be compared automatically. We are in discussions with various stakeholders to explore potential opportunities for assisted curation in clinical trial matching.
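Once attributes have been extracted on both sides, the matching step itself can be quite simple. The sketch below assumes, purely for illustration, that patient attributes arrive as a dictionary and that each eligibility criterion has been reduced to a single-attribute test; the Criterion type, the attribute names, and the three status values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    description: str
    attribute: str                   # key expected in the patient record
    test: Callable[[object], bool]   # predicate over the attribute value
    inclusion: bool = True           # False marks an exclusion criterion


def match_trial(patient: dict[str, object],
                criteria: list[Criterion]) -> tuple[str, list[str]]:
    """Return (status, unresolved), where status is 'eligible', 'ineligible',
    or 'needs_review', and unresolved lists criteria that could not be checked
    because the attribute was not extracted."""
    unresolved: list[str] = []
    for criterion in criteria:
        if criterion.attribute not in patient:
            unresolved.append(criterion.description)
            continue
        satisfied = criterion.test(patient[criterion.attribute])
        # An inclusion criterion must hold; an exclusion criterion must not.
        if satisfied != criterion.inclusion:
            return "ineligible", unresolved
    return ("needs_review" if unresolved else "eligible"), unresolved


if __name__ == "__main__":
    criteria = [
        Criterion("age 18 or older", "age", lambda v: v >= 18),
        Criterion("melanoma diagnosis", "diagnosis", lambda v: v == "melanoma"),
        Criterion("prior immunotherapy", "prior_immunotherapy",
                  lambda v: bool(v), inclusion=False),
    ]
    print(match_trial({"age": 54, "diagnosis": "melanoma"}, criteria))
```

Returning "needs_review" instead of guessing when an attribute is missing keeps a human curator in the loop, in line with the assisted curation approach described in the overview.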

About

Core team

Past Contributors

  • Chris Quirk
  • Kristina Toutanova
  • Scott Wen-tau Yih
  • Ravi Pandya
  • David Heckerman
  • Bill Bolosky
  • Lucy Vanderwende
  • Andrey Rzhetsky
  • Jeff Tyner
  • Brian Druker

Interns

  • Maxim Grechkin
  • Stephen Mayhew
  • Sheng Wang
  • Victoria Lin
  • Daniel Fried
  • Nanyun Peng
  • Hai Wang
  • Robin Jia

Resources

  • Tutorials: AAAI-18, ACL-17 [Slides]
  • Deep Probabilistic Logic: A Unifying Framework for Indirect Supervision [Paper, Code]
  • Cross-Sentence N-ary Relation Extraction with Graph LSTMs [Paper, Code]
  • Literome: PubMed-Scale Genomic Knowledge Base in the Cloud [Paper, Azure]