Best Practices for Controlling LLM Hallucinations at the Application Level
Large language models (LLMs) offer transformative potential for building applications but pose a critical challenge with hallucinations. Discover practical, engineering-led strategies to build trustworthy LLM-infused features.
Large language models (LLMs) are being integrated into applications at a breathtaking pace, promising to revolutionize user experiences. But as any team working on the front lines will tell you, that integration comes with a critical challenge: hallucinations.
An LLM’s ability to invent, extrapolate, and confidently fabricate information isn’t a bug to be fixed, but a core characteristic of its creative nature. For developers, this means we can’t simply cross our fingers and hope the model won’t hallucinate this time. We have to build systems that anticipate this behavior and control it at the application level.
At Parasoft, while integrating LLMs into our software testing products, we’ve run into the challenge of managing hallucinations. In an early internal project, we gave an LLM summarized information from an OpenAPI service definition and asked it to create scenarios out of related APIs. As one of our engineers noted, "It would, you know, sometimes respond with two APIs that were real and one that was made up."
Without any guardrails, LLMs are unpredictable, which makes applications that depend on them unreliable.
In this blog, I want to share some practical, engineering-led strategies our teams have developed to build trustworthy LLM-infused features.
The first line of defense against hallucinations is the prompt itself, and the most effective approach is often the most direct.
One very basic technique is to give the LLM foundational rules for how it should behave, such as instructing the model to not make up information. For example, you can use prompts like:
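For instance, an instruction along these lines (illustrative wording, not a verbatim prompt from our products): "Use only the information provided in the context below. If the answer is not in that information, say you don't know. Do not make up APIs, fields, or values."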
This approach acts as a primary guardrail. It keeps the model from defaulting to its creative tendencies and grounds its response in the specific context you’ve supplied.
So far, we find this to be a simple way to begin managing hallucinations at the source. Additionally, most LLM APIs expose a parameter called "temperature" that controls how creative the model is. When consistency and accuracy matter, set the temperature low (close to 0), and raise it only when you need the LLM to be more creative. Either way, be direct and tell the model exactly what you want.
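As a concrete illustration, here is a minimal sketch of such a call. It assumes the OpenAI Python SDK; the model name and prompt wording are placeholders, not what our products ship.

```python
# Minimal sketch: a grounding system prompt plus a low temperature setting.
# The provider, model name, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an assistant for working with API test scenarios. "
    "Use only the information provided in the user's message. "
    "If the answer is not in that information, say you don't know. "
    "Do not invent APIs, parameters, or values."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # any chat model your provider offers
    temperature=0,         # low temperature favors consistency over creativity
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize the operations in this OpenAPI excerpt: ..."},
    ],
)
print(response.choices[0].message.content)
```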
Clear instructions are a good start, but for true reliability, you need enforcement. This is where you move from "telling" the LLM what you want to programmatically "forcing" it to comply with a specific output format using structured outputs.
Most major model providers allow you to define a required JSON schema in your API call that tells the model the specific format in which you want the data, using a feature called "structured outputs." This constrains the model’s response, ensuring it adheres to your defined fields, data types, and structure—so that your code can depend on that specific format.
For example, when asking a model to generate a table of data, we use a schema like this:
```json
{
  "data": {
    "type": "object",
    "description": "The generated data table based on user requirements.",
    "properties": {
      "columnNames": {
        "type": "array",
        "description": "Header row of the table containing the column names.",
        "items": { "type": "string" }
      },
      "generatedDataRows": {
        "type": "array",
        "items": {
          "type": "array",
          "description": "A row of data with the generated values for each column.",
          "items": { "type": "string" }
        }
      }
    }
  }
}
```
This ensures the model returns a clean JSON object containing a columnNames array and a generatedDataRows array, not an unstructured paragraph or malformed JSON. Structured outputs became essential for us once we saw the LLM return responses that didn't contain the data in a format our code could use programmatically.
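Here is a minimal sketch of how a schema like the one above can be enforced, assuming the OpenAI Chat Completions API and its json_schema response format; other providers expose similar options under different names.

```python
# Minimal sketch of enforcing a JSON schema via structured outputs, assuming
# the OpenAI Chat Completions API; other providers have equivalent features.
import json
from openai import OpenAI

client = OpenAI()

table_schema = {
    "type": "object",
    "properties": {
        "data": {
            "type": "object",
            "description": "The generated data table based on user requirements.",
            "properties": {
                "columnNames": {"type": "array", "items": {"type": "string"}},
                "generatedDataRows": {
                    "type": "array",
                    "items": {"type": "array", "items": {"type": "string"}},
                },
            },
            "required": ["columnNames", "generatedDataRows"],
            "additionalProperties": False,
        }
    },
    "required": ["data"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": "Generate 3 rows of sample customer data."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "data_table", "strict": True, "schema": table_schema},
    },
)

# The response body is guaranteed to match the schema, so parsing is safe.
table = json.loads(response.choices[0].message.content)["data"]
print(table["columnNames"], table["generatedDataRows"])
```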
To handle models that don’t support structured outputs, we found that we can include the JSON schema in the system prompt itself, which doesn’t guarantee the expected output but is a surprisingly effective fallback.
While direct instructions and structured outputs give you control over the prompt and the output format, you also need to control the knowledge the LLM uses to generate its answer.
Instead of letting the model pull only from its vast and sometimes outdated internal training data, you can ground it with domain knowledge using the retrieval-augmented generation (RAG) technique. RAG adds a step that retrieves information from external knowledge sources and supplies it to the LLM before it generates a response.
By providing additional domain-specific information, you change the task from "recall an answer from your general memory" to "get an answer from these specific facts." This dramatically reduces hallucinations.
For example, instead of asking the model to name related APIs from memory (where it might invent one), a RAG technique would first retrieve the actual definitions of valid APIs and then ask the LLM to summarize them.
This grounds the model in relevant context and keeps the information current.
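A stripped-down sketch of that flow might look like the following. The in-memory "knowledge base" and keyword retrieval stand in for whatever document store and retriever you actually use, and call_llm is a placeholder for your model call.

```python
# Minimal RAG sketch: retrieve relevant API definitions first, then ask the
# model to work only from what was retrieved. The in-memory list and naive
# keyword scoring are placeholders for a real document store and retriever.

API_DEFINITIONS = [
    "GET /orders/{id}: returns a single order, including status and line items.",
    "POST /orders: creates a new order from a cart; returns the order id.",
    "GET /customers/{id}/orders: lists all orders placed by a customer.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Score documents by shared keywords with the query and return the best matches."""
    query_terms = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(query_terms & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your model call (see earlier sketches).")

def answer_with_rag(question: str) -> str:
    # Ground the model in retrieved facts instead of its general memory.
    context = "\n".join(retrieve(question, API_DEFINITIONS))
    prompt = (
        "Answer using only the API definitions below. "
        "If they don't contain the answer, say so.\n\n"
        f"API definitions:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```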
A common early mistake when working with LLMs is to create complex prompts filled with conditional logic, attempting to "script" the LLM to behave differently based on different conditions. This kind of over-explaining often confuses the model, causing it to focus on the wrong information and produce unexpected output. And even if a complex prompt works for one model, it may not work for others.
We realized that a better strategy is to split up the workflow into separate smaller agents that each know how to perform a specific task. As the workflow progresses, control gets transferred between agents based on the input from the user and the next step in the workflow. Instead of trying to control the model’s behavior with a single prompt, you split the behavior between the different agents. This allows the LLM to focus on the specific task at hand in each step.
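As a rough illustration, such a workflow can be modeled as small, single-purpose agents with a thin orchestration layer that routes between them; the agent names, prompts, and routing below are invented for the example, not our product's actual agents.

```python
# Rough sketch of splitting a workflow into small, single-purpose agents.
# The agent names, prompts, and routing are invented for illustration;
# call_llm is again a placeholder for your actual model call.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your model call.")

def summarize_service_agent(openapi_excerpt: str) -> str:
    """Knows only how to summarize a service definition."""
    return call_llm(
        "Summarize the operations in this OpenAPI excerpt. "
        "List only operations that appear in the excerpt.\n\n" + openapi_excerpt
    )

def build_scenario_agent(summary: str, goal: str) -> str:
    """Knows only how to combine already-summarized operations into a scenario."""
    return call_llm(
        "Using only the operations in this summary, propose a test scenario "
        f"that achieves the goal: {goal}\n\nSummary:\n{summary}"
    )

def run_workflow(openapi_excerpt: str, goal: str) -> str:
    # The orchestration code, not a mega-prompt, decides which agent runs next.
    summary = summarize_service_agent(openapi_excerpt)
    return build_scenario_agent(summary, goal)
```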
Even with simple direct prompts and RAG, an LLM can still misinterpret source material or stretch the truth. The next layer of defense is a "trust, but verify" step that checks the model’s output before continuing the workflow.
Lightweight "judge" LLMs can compare the LLM output against the information retrieved for RAG and assign a confidence score. Outputs that fall below a chosen threshold can be sent back to the LLM with a reworded prompt, or rejected outright, so that the workflow only continues with content that has been vetted.
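A minimal sketch of such a check, assuming a judge prompt that returns a numeric score; the threshold and wording are arbitrary choices for illustration.

```python
# Minimal "LLM as judge" sketch: ask a second, lightweight model to score how
# well an answer is supported by the retrieved context, then gate on a
# threshold. The prompt wording and threshold are illustrative choices.

def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to a small, inexpensive model.")

def faithfulness_score(answer: str, context: str) -> float:
    prompt = (
        "On a scale from 0 to 10, how well is the ANSWER supported by the "
        "CONTEXT? Reply with a single number and nothing else.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    reply = call_judge_llm(prompt)
    try:
        return float(reply.strip()) / 10.0
    except ValueError:
        return 0.0  # an unparsable judgment is treated as "not trustworthy"

def vet(answer: str, context: str, threshold: float = 0.7) -> bool:
    """Return True only if the judge considers the answer well supported."""
    return faithfulness_score(answer, context) >= threshold
```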
Another technique is to use rule-based post-processing filters that validate the LLM output against known expectations. It’s important to clarify that these checks don’t judge the semantic quality of the AI’s response, for example, by asking, "Is this a good idea?" Instead, they validate the structure and content of the output.
As one of our engineers put it, you need to "actually have code double check the answer." This ensures the output conforms to known business rules before it’s used by the application.
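For instance, a post-processing check for the scenario example mentioned earlier might verify that every API the model referenced actually exists in the service definition; the data shapes and operation names in this sketch are assumptions for illustration.

```python
# Rule-based post-processing sketch: code, not another LLM, double-checks the
# answer. Here we verify that every API a generated scenario references
# actually exists in the service definition, and that a generated table's rows
# match its header. The data shapes are assumptions for illustration.

KNOWN_OPERATIONS = {"GET /orders/{id}", "POST /orders", "GET /customers/{id}/orders"}

def validate_scenario(referenced_apis: list[str]) -> list[str]:
    """Return the APIs the model invented (an empty list means the scenario passes)."""
    return [api for api in referenced_apis if api not in KNOWN_OPERATIONS]

def validate_table(table: dict) -> bool:
    """Check that every data row has exactly one value per column."""
    width = len(table["columnNames"])
    return all(len(row) == width for row in table["generatedDataRows"])

invented = validate_scenario(["GET /orders/{id}", "DELETE /orders/{id}"])
if invented:
    print("Rejecting scenario, hallucinated APIs:", invented)
```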
The techniques we have mentioned can help you reduce hallucinations, but in the end you often need to build human oversight into the workflow your product supports, so that the user makes the final decision about what is produced.
Whether this is needed depends on how severe the consequences of incorrect information would be. In some cases, you might conditionally require human intervention based on automated validation of the LLM output, as we mentioned previously.
To solicit human feedback, we implement a multi-step agentic flow that prompts for human input at various points in the process. A great example is our Test Scenario Creation Agent, which proposes the scenario it intends to create and waits for the user to review and approve it before anything is executed.
This propose-then-execute pattern gives the user final control, creating a crucial checkpoint that prevents the system from acting on a plausible-sounding but incorrect output.
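In code, the pattern reduces to something like this sketch, where the proposal prompt, the approval mechanism, and the execute step are all placeholders for the corresponding pieces of your workflow.

```python
# Sketch of the propose-then-execute pattern: the model proposes, a human
# approves, and only then does the application act. Every function here is a
# placeholder for the corresponding piece of your workflow.

def propose_plan(goal: str) -> str:
    raise NotImplementedError("Ask the LLM to draft a plan (e.g., a test scenario).")

def execute_plan(plan: str) -> None:
    raise NotImplementedError("Carry out the approved plan.")

def run(goal: str) -> None:
    plan = propose_plan(goal)
    print("Proposed plan:\n", plan)
    if input("Execute this plan? [y/N] ").strip().lower() == "y":
        execute_plan(plan)
    else:
        print("Plan discarded; nothing was executed.")
```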
Controlling hallucinations isn’t about finding a single magic prompt. It’s about building a multi-layered defense system within your application based on techniques such as those we have mentioned.
Since LLMs are going to hallucinate, you need to build a series of defenses to ensure you get reliable results. By combining resilient instructions, enforced schemas, and a robust verification process, you can move from hoping an LLM behaves to engineering a system that compensates for possible misbehavior and produces dependable output.
As the technology evolves, perhaps these application-level guardrails will change. In the meantime, keep them in mind to get more reliability out of LLMs when integrating them into your applications.
Want to learn more about building trustworthy LLM-infused features?