Enhancing Primary Care with NLP Medical Coding: Advances and Future Directions

The application of Natural Language Processing (NLP) to medical coding in primary care is gaining momentum, promising to streamline clinical workflows and improve data analysis. Our evaluation of various text classifiers demonstrates the potential of NLP to accurately categorize medical consultations, with conventional BERT models achieving a top F1 score of 0.55 on our test dataset. This performance, with a recall of 56% and precision of 55%, clearly surpasses traditional n-gram-based classifiers, marking a substantial step toward objective analysis of patient interactions in primary care.
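As a quick arithmetic check (ours, not a figure from the study), the F1 score is the harmonic mean of precision and recall, and the reported 55% precision and 56% recall do recover an F1 of roughly 0.55:

```python
# Sanity check: the harmonic mean of the reported precision (0.55)
# and recall (0.56) is consistent with the top F1 of ~0.55.
precision, recall = 0.55, 0.56
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.555
```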

This improved classification was achieved using models trained on medical code descriptions, which notably outperformed standard supervision on a limited dataset of 191 consultations (each with codes, transcripts, and notes); the supervised models reached an F1 score of only 0.45. The inclusion of patient speech transcripts also proved beneficial: excluding them reduced the F1 score from 0.55 to 0.45, underscoring the value of capturing the full scope of the doctor-patient conversation for effective medical coding. While these results are encouraging, further refinement is needed to fully realize the potential of NLP medical coding in primary care.
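For concreteness, here is a minimal sketch of this kind of distant supervision, assuming a hypothetical `code_descriptions` mapping from ICPC-2 chapters to their scraped description text; the snippet is illustrative rather than our actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one "document" per ICPC-2 chapter, built
# from its code descriptions rather than from labelled transcripts.
code_descriptions = {
    "A": "general and unspecified fever pain weakness ...",
    "F": "eye vision conjunctivitis glaucoma ...",
    # ... remaining chapters ...
}

codes = list(code_descriptions.keys())
texts = [code_descriptions[c] for c in codes]

# Train on the descriptions, then apply to real consultation transcripts.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(texts, codes)

transcript = "the patient reports blurred vision in the left eye ..."
print(clf.predict([transcript]))  # e.g. ['F']
```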

It is crucial to assess whether classifiers at this level of performance can genuinely aid clinicians in daily practice. Our scores sit at the lower end of results reported for comparable multiclass text categorization tasks, where RoBERTa classifiers have achieved average accuracies between 53% and 86% with just 100 training examples, and BERT-based intent classification on dialogue benchmarks has approached 93% accuracy with as few as 10 training examples. Future research should draw on insights from these related tasks to identify strategies for improving NLP classifiers for primary care medical coding.

Interestingly, Naive Bayes (NB) classifiers proved competitive with BERT, suggesting that unigrams and bigrams are strong indicators of health-related topics. It also suggests that datasets like ours, drawn from the One in a Million (OIAM) archive, may be too small to fully exploit deep learning models. Contrary to initial expectations, conventional BERT marginally outperformed BERT MLM on the test set. The computational costs also differ markedly: the BERT variants each require hours of GPU training compared to seconds for NB, and inference with BERT is roughly 100 times slower, though this may be acceptable if training is a one-off step before deployment. Future research could explore replacing PubMedBERT with other domain-specific pre-trained models, such as BioBERT and ClinicalBERT, to improve the performance and efficiency of NLP medical coding for primary care. Extremely large language models (LLMs) also offer an avenue for improved few-shot learning, although they demand extensive prompt engineering and substantial computational resources. The ability of LLMs to generate explanations for their coding decisions could be particularly valuable, highlighting the relevant segments of a conversation for clinicians.
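Swapping the encoder is typically a one-line change with the Hugging Face transformers API; the checkpoint IDs below are the commonly published Hub names for these models, shown as an illustration rather than our exact configuration:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Either published checkpoint could replace PubMedBERT; the Hub IDs
# below reflect the usual published names (illustrative, unverified here).
checkpoint = "dmis-lab/biobert-v1.1"              # BioBERT
# checkpoint = "emilyalsentzer/Bio_ClinicalBERT"  # ClinicalBERT

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=17,  # e.g. one label per ICPC-2 chapter; adjust to the coding scheme
)
```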

Multilabel classifiers in our study underperformed compared with multiclass classifiers. This could be attributed to the imbalanced training data, which may harm recall, or to cases where multiple labels were assigned when only one was appropriate, which hurts precision. However, the broad and complex nature of primary care consultations demands the ability to suggest multiple medical areas, so future work on stronger multilabel methods is essential for clinically relevant NLP medical coding systems in primary care.
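As a point of reference, one standard formulation of the multilabel task is binary relevance, which trains one binary classifier per code; here is a minimal scikit-learn sketch with invented example data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented examples: a consultation can carry more than one ICPC-2 chapter.
texts = [
    "cough and wheeze, also asked about a rash on the arm",
    "knee pain after running, no other concerns",
]
labels = [["R", "S"], ["L"]]  # R: respiratory, S: skin, L: musculoskeletal

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # binary indicator matrix, one column per code

# Binary relevance: one independent classifier per code.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, Y)

pred = clf.predict(["itchy rash and a persistent cough"])
print(mlb.inverse_transform(pred))  # e.g. [('R', 'S')]
```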

Overfitting emerged as a challenge for supervised learning, especially given the scarcity of examples for some medical codes (e.g., only five consultations coded as ‘F: eye’); it was evident in the markedly higher performance on the training set than on the validation and test sets. Distant supervision using NICE CKS health topics and ICPC-2 code descriptions brought clear improvements. The key phrases in ICPC-2 descriptions suit NB models well: these features are individually informative, which lets linear models such as NB perform effectively. The imperfect mapping between CKS topics and ICPC-2 codes may have slightly reduced NB’s performance on CKS topics; improving the mapping would require resource-intensive manual editing of the scraped CKS health topics, since some lack a one-to-one correspondence with ICPC-2 codes. Even so, CKS topics achieved performance competitive with BERT, which is pre-trained on complete sentences, indicating that the health topics carry a valuable training signal. Future research could explore ensemble methods that stack models trained on different data sources, leveraging the strengths of each approach to NLP medical coding for primary care.
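A simpler cousin of full stacking is soft voting, averaging predicted probabilities across models trained on different sources; a minimal sketch, assuming `clf_icpc`, `clf_cks`, and `clf_supervised` are already-fitted classifiers sharing one label set:

```python
import numpy as np

def ensemble_predict(transcripts, classifiers, weights=None):
    """Soft-vote ensemble: average predict_proba across classifiers trained
    on different sources (code descriptions, CKS topics, labelled transcripts).
    Assumes all classifiers expose the same classes_ in the same order."""
    weights = weights or [1.0] * len(classifiers)
    probs = sum(w * clf.predict_proba(transcripts)
                for w, clf in zip(weights, classifiers))
    return classifiers[0].classes_[np.argmax(probs, axis=1)]

# Hypothetical usage with three fitted pipelines sharing one label set:
# preds = ensemble_predict(test_transcripts, [clf_icpc, clf_cks, clf_supervised])
```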

To better understand common errors, a clinician on our research team reviewed individual consultation transcripts alongside the human-assigned and NLP-predicted codes. Several error types emerged. First, the simpler classifiers made basic linguistic mistakes, such as misreading idioms: the phrase ‘keeping an eye on it’ led the NB classifier to code a consultation as ophthalmological. BERT, which considers context beyond isolated words, avoided such errors. Second, reviewing consultations with large coding discrepancies revealed errors in the original human labeling. Third, the ‘A: General’ category was often selected incorrectly because of its non-specific nature (precision of only 0.154 for NB multiclass trained on ICPC-2 descriptions), yet removing this category tended to hurt overall performance. Finally, a lack of clinical knowledge sometimes caused errors, such as the classifier coding a wrist consultation as musculoskeletal rather than neurological (e.g., carpal tunnel syndrome).

Many of these error types trace back to dataset limitations: size, labeling quality, and the coding scheme itself. We believe dataset size is the most critical issue. Expanding datasets for NLP medical coding in primary care must also address the current dataset’s limitations of being exclusively in English and drawn from a single region of the UK. Clinical machine learning has seen significant success in radiology and pathology, largely thanks to large, accessible, anonymized datasets; creating a comparably large, anonymized, free-text dataset for primary care would be invaluable for advancing this field. The COVID-19 pandemic has accelerated the adoption of online consultations, generating potential sources of patient-entered free text and recorded audio/video consultations. We advocate routinely seeking consent to use digitally recorded clinical consultations in research, alongside robust anonymization protocols, to enable valuable translational work on NLP medical coding for primary care.

Future research directions include processing consultations in real time and assigning more granular codes based on NICE CKS health topics rather than ICPC-2 codes, which would let the system automatically point clinicians to the relevant health topic guidelines during a consultation. Combining the textual data with other information from the electronic medical record could further improve both the performance and the clinical utility of NLP medical coding in primary care.
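As an illustration of what real-time suggestion might look like, the sketch below re-scores the growing transcript after each utterance and attaches a guideline link where one is known; the helper function and URL mapping are hypothetical, not an existing system:

```python
# Hypothetical mapping from a predicted topic to its guideline page;
# the URL here is illustrative, not a verified CKS address.
GUIDELINE_URLS = {
    "conjunctivitis": "https://cks.nice.org.uk/topics/conjunctivitis-infective/",
}

def suggest_live(clf, utterances, top_k=3):
    """Re-classify the transcript so far after each utterance and yield
    the current top-k topic suggestions with any known guideline link."""
    transcript = ""
    for utterance in utterances:
        transcript += " " + utterance
        probs = clf.predict_proba([transcript])[0]
        top = sorted(zip(clf.classes_, probs), key=lambda t: -t[1])[:top_k]
        yield [(topic, GUIDELINE_URLS.get(topic)) for topic, _ in top]
```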
