In the previous blog, we discussed user requirements for Generative Artificial Intelligence (Gen AI) solutions to work in healthcare. We proposed not only that the application must take responsibility for meeting user expectations, but also that we need a way to formalize the output of Large Language Models (LLMs) to detect errors more reliably. However, the question remains: are LLMs enough in and of themselves? Every other day, a groundbreaking discovery is announced. What began as chain-of-thought and tool use by LLMs has evolved into agentic systems that promise to execute all sorts of complex workflows. Not to mention that LLMs now review each other's work to increase accuracy and effectiveness. It is tempting to conclude that a sophisticated set of LLMs orchestrated in the right agentic framework is all that is required to accomplish anything; indeed, some AI experts believe that Artificial General Intelligence (AGI) is imminent.
In this blog, we argue that LLMs alone cannot meet all the user requirements in healthcare, regardless of how large they are or how much data they have been trained on. Make no mistake: LLMs are necessary to reimagine the user experience in natural language, both to understand user input and to produce user-friendly output. However, they are insufficient to deliver reliable software that non-expert end users will adopt on their own.
There are many reasons for this, but the main one is the fundamental nature of LLMs: they must learn facts and store them in a way that can be accessed through natural language queries. Only recently have controlled studies been conducted to better understand how LLMs store and retrieve information. These studies provide key insights into what influences the quantity of information stored and the quality of retrieval; here is a concise summary.
To store and retrieve a piece of information reliably, an LLM needs to see that piece of information expressed in natural language in a variety of ways. This must happen during pre-training; it cannot be deferred to the fine-tuning or prompting phase. To recapitulate: pre-training is typically done on a large volume of lower-quality data and is the most expensive step; fine-tuning is done with a small volume of high-quality data and is cheaper, especially if techniques like low-rank adaptation (LoRA) are used; and prompting is the cheapest because it takes advantage of the ability of these LLMs to follow instructions and requires only inference. Net-net, without additional structure such as a Retrieval Augmented Generation (RAG) pipeline, using the LLM for storage and retrieval will only be effective if the LLM has seen the same piece of information expressed in multiple ways during the most expensive step: pre-training.
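As a rough illustration of why fine-tuning is the cheap step, here is a minimal sketch using the Hugging Face transformers and peft libraries (our assumed tooling here, not a prescription), applying low-rank adaptation to a small open model and printing how few parameters actually train:

```python
# A minimal LoRA setup (assumes `pip install transformers peft`).
# gpt2 is used purely as a small, openly available stand-in model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora = LoraConfig(
    r=8,                       # rank of the low-rank update matrices
    lora_alpha=16,             # scaling factor for the update
    target_modules=["c_attn"], # gpt2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# Only the small adapter matrices are trainable; the base weights stay
# frozen, which is why fine-tuning costs a fraction of pre-training.
model.print_trainable_parameters()
# e.g. trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.24
```

Cheap as this is, it adapts style and task-following far better than it injects new facts, which is precisely why the facts themselves must already be in pre-training.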
The amount of information an LLM of a given size can reliably store is determined by how many times each piece of information is presented during training. If a piece of information appears about 100 times in the training data (expressed, as noted above, in a variety of ways), an LLM can store roughly one piece of information per parameter: an 8-billion-parameter model can store about 8 billion pieces of information. If it sees each piece about 1,000 times, it can store twice as much, or around 16 billion pieces. However, if there is garbage in the data, capacity can be reduced by up to 20x, depending on the signal-to-noise ratio; the capacity of an 8-billion-parameter model can drop to 400 million pieces of information.
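To make the arithmetic concrete, here is a back-of-the-envelope calculation; the exposure-to-capacity ratios are the approximate figures discussed above, not exact constants:

```python
# Rough knowledge-capacity estimates for an 8-billion-parameter model,
# using the approximate exposure-to-capacity ratios discussed above.
params = 8_000_000_000

cap_100_exposures = params * 1       # ~1 piece/parameter at ~100 exposures
cap_1000_exposures = params * 2      # roughly doubles at ~1,000 exposures
cap_noisy = cap_100_exposures // 20  # up to 20x loss from low-quality data

print(f"{cap_100_exposures:,}")   # 8,000,000,000
print(f"{cap_1000_exposures:,}")  # 16,000,000,000
print(f"{cap_noisy:,}")           # 400,000,000
```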
The medical information needs of healthcare providers fall into two broad buckets:
Medical knowledge that is publicly available, including textbooks and published research; over a million new articles are indexed in PubMed each year.
Patient data that is private and confidential, held in electronic health records (EHRs), either within a single EHR system or spread across multiple systems.
LLMs trained on public data, such as Med-PaLM 2, claim to do well on the USMLE examination, but it is unclear whether they perform as well on questions that are not publicly available. It is also unclear whether the training regimen for these medical LLMs followed the best practices implied by the results above. In either case, we at ThetaRho are currently focused on offering a natural language interface for accessing confidential patient information. While we would love to use these large medical LLMs in our application, the cost would be prohibitive, even for inference alone. We believe we can provide a significantly more cost-effective alternative, especially for smaller clinics that do not have large IT budgets.
When it comes to private information, we cannot share it with a vendor LLM and must use a "separable" instance of the LLM per clinic to ensure compliance with regulations such as HIPAA. From an economic standpoint, this means we need to use smaller LLMs, preferably open source and aware of the domain-specific vocabulary. To determine what size of LLM is sufficient, we will assume the following:
An average small clinic has 20,000 patients (assuming 2,000 patients per physician and ten physicians per clinic on average),
An average of 1,000 pieces of information become available per patient per year, including vitals and a range of laboratory test results. This figure could be substantially higher if we could use patient-generated data from all the health-tech gadgets patients carry; for this exercise, however, we assume this information is generated by the various interactions between patients and physicians.
The average length of time a patient stays with a clinic is 20 years.
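Multiplying these assumptions out gives the scale we need to support:

```python
# Back-of-the-envelope information footprint of a small clinic,
# multiplying out the assumptions listed above.
patients = 2_000 * 10                # patients/physician x physicians/clinic
facts_per_patient_per_year = 1_000
years_per_patient = 20

total_facts = patients * facts_per_patient_per_year * years_per_patient
print(f"{total_facts:,}")            # 400,000,000 pieces of information
```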
Under these assumptions, the average small clinic needs access to up to 400 million pieces of information about its patients. The problem is that the current user experience in EHR systems gives physicians a fragmented view of this information, optimized for billing rather than patient care. A natural language interface powered by an LLM can be enormously valuable here. However, for an LLM to reliably answer questions in natural language, each of these pieces of information would have to be presented to the LLM in a variety of ways, and multiple times, during pre-training. While a model with a few billion parameters can store all this information in theory, it is difficult to achieve the variety of expression and repetition required during training to store and access it reliably. Given the current economics of GPUs and the state of the art in automation, it is unclear whether we can train an LLM from scratch on a single clinic's data in an economically viable manner.
Hence, we must consider RAG-like frameworks that augment the base LLM with the retrieval it needs, because the application can control many critical aspects that improve the quality of the results. For example, given a query, the application can determine its type, select the most relevant documents from various repositories, and construct the prompt that will elicit the best response from the LLM. One does not need an LLM with a particularly large context window to do this well; in fact, a large context window reduces accuracy slightly and incurs a significant performance penalty. In addition, if the documents provided to the LLM are in a canonical form typical for the domain, such as JSON representations of Fast Healthcare Interoperability Resources (FHIR) resources in healthcare, grounding the LLM's output becomes easier. In an application-driven RAG-like framework, we can shrink the demands on the LLM to the point where a smaller model is sufficient, as the sketch below illustrates.
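Here is a minimal, self-contained sketch of such an application-driven pipeline. Everything in it is a placeholder for illustration: the toy FHIR store, the query router, and the canned model response are all hypothetical stand-ins, not our production components. The point is the division of labor: the application classifies the query, selects the documents, shapes the prompt, and checks grounding, so the LLM's job stays small.

```python
import json
import re

# Toy in-memory "repository": FHIR Observation resources in canonical JSON.
FHIR_STORE = [
    {"resourceType": "Observation", "id": "obs-1",
     "code": {"text": "Hemoglobin A1c"},
     "valueQuantity": {"value": 6.8, "unit": "%"}},
]

def classify_query(question: str) -> str:
    # Placeholder router; a real one maps intent to a repository/query type.
    return "lab_results"

def search_fhir_store(query_type: str, top_k: int = 5) -> list:
    # Placeholder retrieval; a real one would query a FHIR server.
    return FHIR_STORE[:top_k]

def call_llm(prompt: str) -> str:
    # Placeholder for the model call; returns a canned, cited answer.
    return "The latest Hemoglobin A1c is 6.8% [obs-1]."

def answer_question(question: str) -> str:
    query_type = classify_query(question)      # the app picks the query type
    resources = search_fhir_store(query_type)  # the app picks the documents
    context = "\n".join(json.dumps(r) for r in resources)
    prompt = (                                 # the app shapes the prompt
        "Answer using ONLY the FHIR resources below, "
        "citing each fact with its resource id in brackets.\n"
        f"{context}\nQuestion: {question}"
    )
    answer = call_llm(prompt)
    # Grounding check: every cited id must be one we actually retrieved.
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    if not cited or not cited <= {r["id"] for r in resources}:
        return "Unable to ground an answer in the retrieved records."
    return answer

print(answer_question("What is the patient's latest A1c?"))
```

Because the retrieved records arrive in canonical FHIR JSON, the grounding check reduces to verifying citations against a known set of resource ids, which is exactly the kind of formalized output we argued for in the previous blog.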
If one accepts the claims of the community promoting agentic systems, a manifestation of compound AI systems, then agents can achieve whatever application logic can. There are many hurdles, though, including:
There is objective evidence that LLMs are not good at planning, a key skill required for autonomous agentic systems.
In domains like healthcare, where the cost of mistakes is high, getting agentic systems to be accurate can be expensive because multiple LLMs must be invoked repeatedly until the goal is reached.
More complex reasoning and planning require larger models; the number of layers in a model limits the complexity of the reasoning it can perform.
In conclusion, we hope this post has convinced you that LLMs cannot, by themselves, wave a magic wand and provide a natural language experience for retrieving patient information that works reliably. However, with clever application logic and a RAG-like framework, we can not only provide a natural language interface to all the patient data physicians need access to, but also "ground" the output in many cases, increasing confidence in the AI system.
Our journey has just begun. There is a fair amount of AI design, application logic development, operations hardening, and persistent testing and validation to be done to make the output of Gen AI usable. But we are well on our way, with established beta deployments.
We are seeking a few physician groups that use Athenahealth to help finalize the product. To learn more, please visit ThetaRho.ai and sign up.
Spend less time in the EHR so you can spend more time taking care of your patients, your family, and yourself.