Truth or Tale: Decoding LLM Hallucinations in Healthcare
How often do you have a bad experience with your map application, getting sent on an unexpected detour or struggling to figure out which turn to take at an intersection? Just as we have learned not to trust everything our map application suggests, we must be cautious about the outputs generated by large language models (LLMs), especially in healthcare implementations.
LLMs have shown tremendous potential in healthcare, demonstrating high diagnostic accuracy, empathy, and bedside manner across multiple benchmarks. We have already seen early adoption in use cases like task automation, healthcare conversational assistants, and diagnostic aids/co-pilots.
Despite their vast knowledge and advanced capabilities, these models can sometimes produce incorrect outputs, i.e., hallucinations. Healthcare is a highly regulated, fact- and science-driven field where precision is paramount for any technology adoption. Hallucinations in LLMs are one such challenge that must be navigated for a successful implementation.
What are hallucinations?
In the LLM context, hallucinations refer to instances where the model generates outputs that are factually incorrect or nonsensical. This happens because of the probabilistic nature of foundation models: they predict the next word in a sequence based on patterns learned from their training data.
Why do LLMs hallucinate?
Hallucinations can occur in LLMs for several reasons:
1. Flawed Training Data: Biased, contradictory, or incomplete training data can lead to inaccurate outputs from the LLM. The absence of domain-specific datasets or cultural context in the training data can worsen the situation. For example, if the model has not been trained on a recent clinical guideline, it may fail to reflect it in a treatment plan.
2. Lack of Grounding: Since LLMs lack human-like world experience, their comprehension and response accuracy are limited. For example, a doctor knows from practical experience how to store certain medications or adjust the dosage for specific patients, while an LLM only knows what is documented in textbooks, journals, and reports.
3. Model Architecture: LLMs excel at pattern recognition but cannot make logical connections or judgments, so they sometimes fail to truly understand the meaning behind a pattern. For example, a model may generate erroneous output if it connects unrelated medical symptoms based solely on how frequently they appear together.
How to navigate LLM hallucinations?
1. System Prompts: LLMs are more likely to hallucinate when prompts are ambiguous. Providing more context through system prompts (the instructions and contextual information given to the model) helps guide the model on how to interpret and respond to user queries. Output randomness, which often drives creative but inaccurate responses, is controlled by a separate parameter called Temperature. Pairing a clear system prompt with a low temperature setting can increase the accuracy and reliability of the model (a minimal sketch follows this list).
2. Advanced Prompt Engineering Techniques: Advanced techniques like in-context learning, iterative querying, and prompt chaining can provide more clarity on a complex task and its expected output. For example, prompt chaining can be far more effective for a patient symptom collection use case than a single, complicated prompt: it breaks the task down into a series of simplified instructions, leading to a more accurate model response (see the chaining sketch after this list).
3. High-Quality Data & Model Training: Use-case-specific, accurate data can reduce the instances of hallucination in an LLM. Select a model that comes pre-trained in the context of your application, or implement techniques like RAG (Retrieval-Augmented Generation) and fine-tuning to equip the LLM with additional data (a minimal RAG sketch also follows this list). For example, LLMs with high scores on medical benchmarks (like MedQA, MedMCQA, etc.) have been trained and tested on healthcare data, and therefore have a lower probability of hallucinating in healthcare settings than LLMs with low scores on those benchmarks.
4. Explainable AI (xAI): Explainable AI is a set of techniques that clarify how LLMs produce their outputs, making them more transparent and understandable. For example, surfacing the step-by-step process by which an LLM arrives at a diagnosis can help the doctor-in-the-loop assess the accuracy of the recommendation (a sketch of this pattern closes out the examples below).
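To make technique 1 concrete, here is a minimal sketch of a focused system prompt paired with a low temperature. It assumes the OpenAI Python SDK (v1.x) and a placeholder model name purely for illustration; any LLM API that exposes a system role and a temperature parameter works the same way.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK v1.x; adapt to your LLM provider

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A system prompt that narrows the model's role and instructs it to admit uncertainty,
# combined with a low temperature to reduce random (and potentially inaccurate) output.
SYSTEM_PROMPT = (
    "You are a clinical documentation assistant. "
    "Answer only from the information provided by the user. "
    "If the information is insufficient, say so instead of guessing."
)

response = client.chat.completions.create(
    model="gpt-4o",    # placeholder model name; substitute your own deployment
    temperature=0.1,   # low temperature -> more deterministic, less 'creative' output
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize this discharge note: ..."},
    ],
)

print(response.choices[0].message.content)
```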
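For technique 2, the following sketch chains three narrow prompts for a hypothetical patient symptom collection flow: extract symptoms, generate follow-up questions, then summarize for the clinician. The SDK, model name, helper function, and example patient message are all illustrative assumptions, not a prescribed implementation.

```python
from openai import OpenAI  # assumed SDK, as in the previous sketch

client = OpenAI()

def ask(prompt: str, context: str = "") -> str:
    """One narrowly scoped call in the chain (illustrative helper)."""
    response = client.chat.completions.create(
        model="gpt-4o",   # placeholder model name
        temperature=0.1,
        messages=[
            {"role": "system", "content": "You are a patient-intake assistant."},
            {"role": "user", "content": f"{context}\n\n{prompt}".strip()},
        ],
    )
    return response.choices[0].message.content

patient_message = "I've had a dull headache for three days and feel dizzy when I stand up."

# Step 1: extract only the symptoms that were explicitly mentioned.
symptoms = ask("List only the symptoms explicitly mentioned, one per line.", patient_message)

# Step 2: generate targeted follow-up questions for those symptoms.
follow_ups = ask(
    "For each symptom below, write one clarifying question (onset, severity, triggers).",
    symptoms,
)

# Step 3: produce a structured intake summary from the accumulated context.
summary = ask(
    "Write a short intake summary for a clinician using only the information above.",
    f"Patient message:\n{patient_message}\n\nSymptoms:\n{symptoms}\n\nFollow-up questions:\n{follow_ups}",
)
print(summary)
```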
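For technique 3, here is a deliberately simplified RAG sketch: a toy keyword-overlap retriever stands in for a real vector search over vetted clinical guidelines, and the retrieved excerpts are injected into the prompt so the model answers only from them. The snippets, model name, and retrieval logic are illustrative assumptions only.

```python
from openai import OpenAI  # assumed SDK, as above

client = OpenAI()

# Toy in-memory "knowledge base"; in practice this would be an indexed store of
# vetted clinical guidelines, retrieved with embeddings or a search engine.
GUIDELINE_SNIPPETS = [
    "Hypertension guideline (illustrative): confirm elevated readings on separate visits before diagnosis.",
    "Type 2 diabetes guideline (illustrative): first-line therapy is lifestyle change plus metformin unless contraindicated.",
    "Asthma guideline (illustrative): assess inhaler technique and adherence before stepping up therapy.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval, standing in for a real vector search."""
    q_words = set(question.lower().split())
    ranked = sorted(
        GUIDELINE_SNIPPETS,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]

question = "What is the recommended first-line treatment for type 2 diabetes?"
context = "\n".join(retrieve(question))

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder model name
    temperature=0.0,
    messages=[
        {
            "role": "system",
            "content": (
                "Answer strictly from the provided guideline excerpts. "
                "If they do not contain the answer, say 'Not found in the provided guidelines.'"
            ),
        },
        {"role": "user", "content": f"Guideline excerpts:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```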
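Finally, for technique 4, one lightweight way to approximate explainability with an off-the-shelf LLM is to require a structured response that exposes the model's reasoning steps for the doctor-in-the-loop to review. The JSON schema, case description, and model name below are assumptions for illustration; a production system would add output validation and stricter controls.

```python
import json

from openai import OpenAI  # assumed SDK, as above

client = OpenAI()

case = "55-year-old with fatigue, increased thirst, and frequent urination. Fasting glucose 140 mg/dL."

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder model name
    temperature=0.0,
    response_format={"type": "json_object"},  # JSON mode; remove if your provider lacks it
    messages=[
        {
            "role": "system",
            "content": (
                "You are a diagnostic decision-support tool. Respond with JSON only, using keys "
                "'differential' (list), 'reasoning_steps' (list of strings citing the specific findings used), "
                "and 'confidence' (low/medium/high). A clinician reviews this output before any action is taken."
            ),
        },
        {"role": "user", "content": case},
    ],
)

report = json.loads(response.choices[0].message.content)

# Surface the reasoning alongside the suggestion so the clinician can check each step.
for step in report["reasoning_steps"]:
    print("-", step)
print("Suggested differential:", report["differential"], "| model confidence:", report["confidence"])
```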
Conclusion
LLM hallucination is a key consideration when planning Generative AI integration in healthcare, a field where precision, accuracy, and trust are of utmost importance. It is important to understand what can trigger these hallucinations and to implement the relevant techniques to mitigate the potential risks. High-quality, use-case-specific datasets, layers that pre-process the information going into the LLM (prompt engineering, RAG), and a post-processing layer (human-in-the-loop and xAI) can together enable successful healthcare implementations.
Namit Chugh
Namit Chugh is Principal, W Health Ventures