Without high-quality crash data and robust tools for analyzing them, transportation agencies will struggle to develop evidence-based strategies for improving road safety. Crash narratives are one element of crash reports that pose especially acute interpretive challenges. Authored by responding law enforcement officers, these narratives supplement coded data with a written account of the incident. Although the narratives are valuable, manually reviewing the 150,000+ crash reports and narratives issued in Kentucky each year is not feasible. To address this challenge, reviewers examined approximately 8,000 crash narratives from calendar year 2020 using a proprietary web-based quality control tool to identify discrepancies between narratives and coded data. The most pronounced inconsistencies between coded data and narratives were found in questions related to aggressive driving, distracted driving, intersection and secondary crashes, and travel direction. Building on this exercise, researchers developed a machine learning algorithm that automatically classifies attributes in crash records by interpreting unstructured narrative text. Although this model performed well, goodness-of-fit metrics showed that a Google AI Language model (Bidirectional Encoder Representations from Transformers [BERT]) achieved higher accuracy, precision, and recall. Future crash data quality control efforts that incorporate machine learning applications should use BERT; however, the latest advances in AI technology need to be integrated into new applications and models as they are developed.
