Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks: site, subsite, laterality, histology, behavior, and grade. We expect that with minimal modifications, our add-on can be applied towards a wide range of other clinical text-based tasks.

Notes/Citation Information

Published in PLOS ONE, v. 15, no. 5, 0232840, p. 1-21.

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding Information

Georgia Tourassi (GT) at the Oak Ridge National Laboratory received funding from the Department of Energy (energy.gov) and the National Cancer Institute (cancer.gov). The grant number is 2450-Z301-19. These funds were used to facilitate this study. The funding provided through this grant was used to support salaries for SG, MA, NS, AR, and GT. In addition to the grant, the National Cancer Institute (NCI) employs or provides funding for authors from NCI (LP), state registries (XCW and EBD), and Information Management Services Inc (LC) as part of the Surveillance, Epidemiology, and End Results program and authorized their participation in this study. Their efforts included data collection, cleaning, analysis, or final review of this study. This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Related Content

The data used in our experiments may be made available to individuals, upon request and with subsequent authorized approvals, by contacting the Louisiana Tumor Registry (LTR-info@lsuhsc.edu) and the Kentucky Cancer Registry (ericd@kcr.uky.edu).

pone.0232840.s001.tif (1409 kB)
S1 Fig. https://doi.org/10.1371/journal.pone.0232840.s001

pone.0232840.s002.tif (1394 kB)
S2 Fig. Histogram of number of pathology reports associated with each unique tumor ID. https://doi.org/10.1371/journal.pone.0232840.s002

pone.0232840.s003.pdf (80 kB)
S1 Table. McNemar’s tests of statistical significance. https://doi.org/10.1371/journal.pone.0232840.s003

pone.0232840.s004.pdf (78 kB)
S2 Table. Case-level context F-score breakdown by class. https://doi.org/10.1371/journal.pone.0232840.s004

pone.0232840.s005.pdf (73 kB)
S3 Table. Modular vs end-to-end training. https://doi.org/10.1371/journal.pone.0232840.s005