Natural Language Processing Lab
We are currently involved in many multi-site and interdisciplinary research projects:
i2b2 - Informatics for Integrating Biology & the Bedside
We are currently involved in several i2b2 driving biology projects. Partnering with clinicians across multiple sites such as BWH, MGH and CHB allows us to closely align our research with clinical needs. These projects include cohort identification and disease activity classification from plain text notes within the EHR, in domains such as rheumatoid arthritis, Type 2 diabetes, multiple sclerosis and inflammatory bowel disease.
MiPACQ - Multi-source Integrated Platform for Answering Clinical Questions
Clinical question answering (cQA) systems focus on the needs of the physician, usually at the point of care, or of the investigator in the lab. The questions asked usually require either information highly specific to a patient, e.g. the patient’s lab results or previous history, which is answered by the patient’s health record, or more general information, which is answered through generally available information sources. MiPACQ aims to provide a unified multi-source solution for semantic retrieval, access and summarization of relevant information at the point of care or in the lab. It represents a high-impact area with the potential to improve healthcare delivery because it addresses needs that have been well documented and studied. MiPACQ will be released as open source under an Apache license. MiPACQ is built within UIMA (Unstructured Information Management Architecture), the engineering framework on which the IBM Watson question answering system was built. It incorporates cTAKES and ClearTK.
The project is a collaboration between Children's Hospital Boston/Harvard Medical School, University of Colorado (Profs. Martha Palmer, Jim Martin and Wayne Ward), and Mayo Clinic (Dr. Christopher Chute).
PGRN - PharmacoGenomics Research Network
We are developing a rheumatoid arthritis (RA) disease activity level classifier for clinical notes taken directly from Electronic Health Records, using chart review and Natural Language Processing techniques. Each document can be represented as a vector of terms specified by domain experts, or as a vector of comprehensive features populated automatically by NLP tools. We are experimenting with feature selection techniques as well as mainstream classification methods to automate the determination of disease activity at the document level. A further goal is to define disease activity level at the patient level in cohort datasets. We aim to derive robust and generic methods that can be applied to other Driving Biology Projects.
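As a rough illustration of the document-as-vector idea, the sketch below counts expert-specified terms in each note and assigns a document-level label with a nearest-centroid classifier. It is not the project's method: the term list, the toy notes, the "high"/"low" labels and the choice of classifier are all assumptions made for the example, and feature selection is left out.

```python
import re
from collections import Counter

# Hypothetical term list; a real one would be specified by domain experts.
EXPERT_TERMS = ["swelling", "tender", "stiffness", "remission", "flare"]


def vectorize(note, terms=EXPERT_TERMS):
    """Represent a note as counts of each expert term (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z]+", note.lower()))
    return [counts[t] for t in terms]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(labeled_notes):
    """Compute one centroid per label from (note, label) pairs."""
    by_label = {}
    for note, label in labeled_notes:
        by_label.setdefault(label, []).append(vectorize(note))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(note, centroids):
    """Return the label whose centroid is closest (squared Euclidean)."""
    v = vectorize(note)

    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))

    return min(centroids, key=lambda label: dist(centroids[label]))


# Invented toy notes, for illustration only.
TRAINING = [
    ("Joint swelling and morning stiffness, several tender joints.", "high"),
    ("Patient reports a flare with swelling in both wrists.", "high"),
    ("Disease in remission, no tender joints.", "low"),
    ("Remains in remission on current therapy.", "low"),
]

model = train(TRAINING)
prediction = classify("New flare with swelling and stiffness.", model)
print(prediction)
```

A patient-level label could then be derived by aggregating the document-level predictions for each patient, although how to aggregate them is itself a design decision the text leaves open.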
ShARe - Shared Annotated Resources
Much of the clinical information required for accurate clinical research, active decision support, and broad-coverage surveillance is locked in text files in an electronic medical record (EMR). The only feasible way to leverage this information for translational science is to extract and encode the information using natural language processing. Over the last two decades, several research groups have developed NLP tools for clinical notes, but a major bottleneck preventing progress in clinical NLP is the lack of standard, annotated data sets for training and evaluating NLP applications. Without these standards, individual NLP applications abound without the ability to train different algorithms on standard annotations, share and integrate NLP modules, or compare performance. Under the ShARe project, we are developing standards and infrastructure that can enable technology to extract scientific information from textual medical records. We are annotating a 500K-word clinical narrative corpus for syntactic information following the Penn Treebank guidelines and for semantic information following the UMLS definitions. The corpus will be made available to the research community in 2014.
The project is a collaboration between Children's Hospital Boston/Harvard Medical School, Columbia University (Prof. Noemie Elhadad), University of California at San Diego (Prof. Wendy Chapman) and University of Colorado (Prof. Martha Palmer).
SHARPn - Strategic Health IT Advanced Research Projects
For the SHARPn project, we are creating several open source NLP modules for semantic analysis of clinical narratives, including modules for coreference resolution, relation extraction, and predicate-argument structure analysis of the sentence.
Our approach to NLP relies heavily on machine learning algorithms, which require annotated data for component training and evaluation. We are currently involved in several annotation tasks that aim to create a richly annotated corpus of clinical texts. This corpus will include multiple layers of syntactic and semantic annotation such as Treebank, PropBank, and UMLS annotations.
We are also involved in projects that focus on using active learning to reduce the cost of annotation.
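To make the cost argument concrete, here is a minimal sketch of uncertainty sampling, one common active learning strategy. The text does not say which strategy the projects use, so this choice is an assumption, as are the function names, the note identifiers and the probabilities. The idea is to send annotators the items the current model is least sure about, rather than a random sample.

```python
def uncertainty(prob_positive):
    """Score in [0, 1]: 1.0 when the model is at 0.5, 0.0 when it is certain."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_for_annotation(scored_pool, batch_size):
    """Pick the batch_size unlabeled items with the most uncertain predictions.

    scored_pool is a list of (item_id, probability_of_positive_class) pairs
    produced by whatever model is currently trained.
    """
    ranked = sorted(scored_pool, key=lambda item: uncertainty(item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:batch_size]]


# Hypothetical model outputs on an unlabeled pool.
pool = [("note-1", 0.97), ("note-2", 0.52), ("note-3", 0.08), ("note-4", 0.41)]
chosen = select_for_annotation(pool, 2)
print(chosen)
```

In a full loop, the chosen items would be annotated, added to the training set, and the model retrained before the next batch is selected.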
THYME - Temporal History of Your Medical Events
Temporal relations are of prime importance in biomedicine as they are intrinsically linked to diseases, signs and symptoms, and treatments. Understanding the timeline of clinically relevant events is key to the next generation of translational research where the importance of generalizing over large amounts of data holds the promise of deciphering biomedical puzzles. The goal of our current proposal is to automatically discover temporal relations from clinical free text and create a timeline.
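The last step, turning discovered relations into a timeline, can be sketched as a topological sort over pairwise BEFORE relations. This is only an illustration under simplifying assumptions: the events and relations below are invented, only one relation type is handled, and extracting the relations from text (the hard part) is not shown.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Invented example relations of the form (earlier_event, "BEFORE", later_event).
RELATIONS = [
    ("chest pain", "BEFORE", "admission"),
    ("admission", "BEFORE", "angioplasty"),
    ("angioplasty", "BEFORE", "discharge"),
    ("admission", "BEFORE", "discharge"),
]


def build_timeline(relations):
    """Order events so that every BEFORE relation is respected.

    Raises ValueError if the relations contradict each other (a cycle)
    or if a relation type other than BEFORE is supplied.
    """
    sorter = TopologicalSorter()
    for earlier, relation, later in relations:
        if relation != "BEFORE":
            raise ValueError(f"unsupported relation: {relation}")
        # The later event is registered with the earlier one as its predecessor.
        sorter.add(later, earlier)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("temporal relations are inconsistent") from exc


timeline = build_timeline(RELATIONS)
print(timeline)
```

When the relations only partially order the events, several timelines are valid and the sort returns one of them. Real clinical text also involves relations such as overlap and containment, which this sketch ignores.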
The project is a collaboration between Children's Hospital Boston/Harvard Medical School, University of Colorado (Profs. Martha Palmer, Jim Martin and Wayne Ward), Brandeis University (Prof. James Pustejovsky) and Mayo Clinic (Drs. Piet de Groen and Bradley Ericson).