Abstract

Background: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model.

Results: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes.
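The abstract names marginal and ratio uncertainty sampling without defining them. As a rough sketch only, assuming the commonly used definitions (margin: the difference between the two highest predicted class probabilities; ratio: the quotient of the second-highest over the highest), the selection step might look like the following. The function names, the toy probability matrix, and the batch size are hypothetical and not taken from the article; the study's Convolutional Neural Network is replaced here by a plain array of softmax outputs.

```python
import numpy as np

def top_two(probs):
    """Return the highest and second-highest class probability for each row."""
    ordered = np.sort(probs, axis=1)  # ascending within each row
    return ordered[:, -1], ordered[:, -2]

def margin_scores(probs):
    """Assumed margin definition: p(top1) - p(top2). Lower = more uncertain."""
    p1, p2 = top_two(probs)
    return p1 - p2

def ratio_scores(probs, eps=1e-12):
    """Assumed ratio definition: p(top2) / p(top1). Closer to 1 = more uncertain."""
    p1, p2 = top_two(probs)
    return p2 / (p1 + eps)

def select_batch(probs, batch_size, strategy="margin", seed=0):
    """Pick indices of unlabelled samples to send to annotators."""
    if strategy == "margin":
        order = np.argsort(margin_scores(probs), kind="stable")   # smallest margin first
    elif strategy == "ratio":
        order = np.argsort(-ratio_scores(probs), kind="stable")   # largest ratio first
    elif strategy == "random":
        # Baseline: random sampling, i.e., no active learning.
        order = np.random.default_rng(seed).permutation(len(probs))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return order[:batch_size]

# Toy softmax outputs for four unlabelled documents over three classes.
toy_probs = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # top two classes are close
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],   # nearly uniform, most uncertain
])
chosen = select_batch(toy_probs, batch_size=2, strategy="margin")
```

In one iteration of active learning, the selected documents would be labelled, added to the training set, and the model retrained before scoring the remaining pool again. Both scores depend only on the top two probabilities, so on this toy input they rank the documents identically; they can diverge on other inputs.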

Conclusions: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.

Document Type

Article

Publication Date

3-9-2021

Notes/Citation Information

Published in BMC Bioinformatics, v. 22, article no. 113.

© The Author(s) 2021

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Digital Object Identifier (DOI)

https://doi.org/10.1186/s12859-021-04047-1

Funding Information

This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under contract DE-AC05-00OR22725. The funding offices from the DOE and NCI did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Related Content

The data used for our experiments consist of cancer pathology reports that may contain identifiers as defined under HIPAA and would therefore be protected health information; as such, we are not authorized to make our dataset publicly available. The data were provided to us under an approved IRB protocol and data use agreement with the data owners: the National Cancer Institute’s Louisiana, Kentucky, Utah, and New Jersey Surveillance, Epidemiology, and End Results Program (SEER) cancer registries. The data may be accessible, upon request and with subsequent authorized approvals, by contacting the Louisiana Tumor Registry, Kentucky Cancer Registry, Utah Cancer Registry, and New Jersey State Cancer Registry.

12859_2021_4047_MOESM1_ESM.tif (3536 kB)
Additional file 1. Dataset class imbalance plots.

12859_2021_4047_MOESM2_ESM.docx (14 kB)
Additional file 2. Text preprocessing steps.

12859_2021_4047_MOESM3_ESM.docx (13 kB)
Additional file 3. Bootstrapping procedure for confidence.

12859_2021_4047_MOESM4_ESM.tif (2838 kB)
Additional file 4. Performance of cold start ratio sampling, warm start ratio sampling, and random sampling on the histology task (small dataset).

12859_2021_4047_MOESM5_ESM.xlsx (10 kB)
Additional file 5. Training and inference time for the histology experiment (large dataset) using ratio sampling.

12859_2021_4047_MOESM6_ESM.xlsx (10 kB)
Additional file 6. Large dataset: micro/macro F-1 scores table - histology task.

12859_2021_4047_MOESM7_ESM.xlsx (10 kB)
Additional file 7. Large dataset: micro/macro F-1 scores table - subsite task.

12859_2021_4047_MOESM8_ESM.xlsx (11 kB)
Additional file 8. Small dataset: micro/macro F-1 scores table - histology task.

12859_2021_4047_MOESM9_ESM.xlsx (10 kB)
Additional file 9. Small dataset: micro/macro F-1 scores table - subsite task.

12859_2021_4047_MOESM10_ESM.tif (1426 kB)
Additional file 10. Large dataset: class imbalance - histology task.

12859_2021_4047_MOESM11_ESM.tif (1426 kB)
Additional file 11. Large dataset: class imbalance - subsite task.

12859_2021_4047_MOESM12_ESM.tif (1426 kB)
Additional file 12. Small dataset: class imbalance - histology task.

12859_2021_4047_MOESM13_ESM.tif (1427 kB)
Additional file 13. Small dataset: class imbalance - subsite task.

12859_2021_4047_MOESM14_ESM.tif (2825 kB)
Additional file 14. Large dataset: class proportion plots - histology.

12859_2021_4047_MOESM15_ESM.tif (2822 kB)
Additional file 15. Large dataset: class proportion plots - subsite task.

12859_2021_4047_MOESM16_ESM.tif (2822 kB)
Additional file 16. Small dataset: class proportion plots - histology task.

12859_2021_4047_MOESM17_ESM.tif (2822 kB)
Additional file 17. Small dataset: class proportion plots - subsite task.

12859_2021_4047_MOESM18_ESM.tif (1430 kB)
Additional file 18. Document embeddings generated via TSNE for histology task (small dataset) with and without 10 iterations of active learning. Documents are colored by majority class (number of total samples in dataset above average) and minority class (number of total samples in dataset below average).
