Addressing NASA Concerns About LLM Use in Safety-Critical Development
GenAI may speed up engineering tasks like drafting safety cases, but NASA cautions that its tendency to generate believable but unverified content makes human oversight essential in critical systems. Read on to discover how combining constrained LLMs with traceable evidence and rigorous review offers a safer path forward.
Generative AI has become an everyday engineering tool in record time. Development teams now rely on large language models (LLMs) to draft code and summarize test results—even to draft safety cases in the language that regulators require for embedded safety-critical development.
Generating compliance evidence for safety-critical development is still a manual, error-prone process. Teams export unit test logs, hand-label trace tables, and draft Goal Structuring Notation (GSN) diagrams line by line.
What makes large language models so useful is that they can draft those artifacts automatically, so long as we constrain them to verifiable sources.
However, a recent NASA report cautioned against using LLMs in precisely this role. The paper, "Examining Proposed Uses of LLMs to Produce or Assess Assurance Arguments," asks whether technology that generates natural-sounding text can be trusted when lives depend on it. The authors argue that the fundamental problem is that LLMs aim for plausible-sounding answers, not proven facts. A flaw as simple as a single invented citation could invalidate an entire certification package. Or worse.
With safety-critical development, every conclusion needs to be backed up with a verifiable argument that it’s safe or secure. It’s called an assurance argument, and along with other documents, it makes up what’s known as the safety case.
Assurance arguments are typically structured with Goal Structuring Notation (GSN), a formal graphical notation that breaks each safety claim into evidence-based sub-goals. This structure is strongly encouraged by standards like ISO 26262 (automotive), DO-178C (aviation), and similar frameworks, where each claim must trace to objective, verifiable evidence. Building these arguments is time-consuming, which is why GenAI is so appealing for the job.
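To make that structure concrete, here's a minimal sketch in Python of a GSN-style fragment. The element types (goals backed by sub-goals or evidence "solutions") come from GSN itself; the specific claims, evidence IDs, and class names are illustrative only, not any particular project's safety case.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal GSN-style elements: a Goal is a claim, a strategy explains how it is
# decomposed, and a Solution points at the objective evidence that closes a branch.
@dataclass
class Solution:          # evidence reference, e.g. a test report or analysis run
    evidence_id: str     # must trace to a real, re-runnable artifact

@dataclass
class Goal:
    claim: str
    strategy: str = ""                                   # how the claim is argued
    subgoals: List["Goal"] = field(default_factory=list)
    solutions: List[Solution] = field(default_factory=list)

# Illustrative fragment (claims and evidence IDs are made up for this sketch).
top = Goal(
    claim="The braking controller software is acceptably safe",
    strategy="Argue over each identified software hazard",
    subgoals=[
        Goal(
            claim="Hazard H1 (unintended braking) is mitigated",
            solutions=[Solution("UNIT-TEST-REPORT-0142"),
                       Solution("STATIC-ANALYSIS-RUN-0096")],
        ),
    ],
)

def unsupported_goals(goal: Goal) -> List[str]:
    """Return claims that have neither sub-goals nor evidence behind them."""
    gaps = []
    if not goal.subgoals and not goal.solutions:
        gaps.append(goal.claim)
    for sub in goal.subgoals:
        gaps.extend(unsupported_goals(sub))
    return gaps

print(unsupported_goals(top))   # [] here; any entry would be an unevidenced claim
```

Every leaf of the tree either decomposes further or lands on evidence, which is exactly the property an auditor checks.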
But if you head up software safety, quality, or compliance and have read the NASA paper, the red-flag examples it raises about using LLMs this way are likely on your mind.
In the sections below, we’ll translate their cautions into practical guardrails and show where disciplined AI can still deliver value without jeopardizing approval.
NASA’s authors address a misconception that many development teams still hold: because LLMs sound authoritative in their answers, they must therefore be accurate.
In example after example, the report shows how this assumption fails: LLMs invent references, misquote regulations, and glide past the corner-case hazards that make or break a safety case. NASA’s verdict is that until repeatable studies prove reliability, any LLM-generated argument must be treated as experimental and reviewed line by line by qualified engineers.
Their conclusion isn’t that AI should be forbidden, but that using it shifts time and responsibility elsewhere rather than simply saving effort. The engineer no longer writes every sentence, true, but every line the LLM proposes must now be revalidated. Any efficiency gained in drafting, the paper argues, comes with a new supervisory burden.
If your title includes design assurance, software safety, QA director, or principal systems engineer, the issues highlighted by NASA directly affect your workflow.
As NASA says, LLMs are great at sounding correct, but they have no built-in sense of truth. They can invent facts, miss edge-case failures, and cite sources that don’t exist.
Used carefully, though, the same model can also flag potential weak spots, sometimes called "defeaters". They can scan your test logs for coverage gaps and static-analysis reports for recurring violations. That lets your team fix issues before an auditor finds them.
In every safety-critical domain—avionics, rail, medical—you can let AI tools write code, suggest tests, or group defects, but only if the AI-generated content links back to evidence you can trace and rerun. And because LLMs can sound confident when they’re wrong, a human reviewer still has to prepare or approve the final assurance argument.
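One way to enforce that rule in practice is a simple traceability gate: before an AI-drafted statement enters the safety case, confirm that every piece of evidence it cites exists in your own index of re-runnable artifacts, and reject anything that doesn't. The sketch below uses hypothetical artifact IDs and field names; it's an illustration of the gate, not a real tool integration.

```python
# Hypothetical traceability gate: AI-drafted statements may only cite evidence
# that exists in our own verified index of re-runnable artifacts.
KNOWN_EVIDENCE = {
    "UNIT-TEST-REPORT-0142": "reports/unit/0142.xml",
    "STATIC-ANALYSIS-RUN-0096": "reports/sa/0096.sarif",
}

def check_citations(drafted_statements: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every citation traces."""
    problems = []
    for stmt in drafted_statements:
        for cited in stmt.get("evidence", []):
            if cited not in KNOWN_EVIDENCE:
                problems.append(
                    f"Statement '{stmt['text'][:40]}...' cites unknown evidence {cited!r}"
                )
        if not stmt.get("evidence"):
            problems.append(f"Statement '{stmt['text'][:40]}...' cites no evidence at all")
    return problems

draft = [{"text": "All MISRA mandatory rules pass on the braking module",
          "evidence": ["STATIC-ANALYSIS-RUN-0096"]}]
assert check_citations(draft) == []   # anything else blocks the draft from the safety case
```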
For safety-critical work, we agree with NASA’s conclusion: an LLM may help, but a qualified human must still build and sign off on the actual assurance argument. The amount of effort required depends on the depth of verification each industry demands. Missed bugs cost money in the cloud but could cost lives in the real world.
NASA’s paper also points out how much AI-generated code modern pipelines now produce: far more than humans alone can review and produce safety evidence for.
To handle that volume, you need tooling that triages findings deterministically first, then lets a tightly scoped, on-prem LLM re-express those vetted results. This is where solutions like Parasoft's static analysis workflows come in: they surface the violations that matter most, group those vetted findings, and flag the ones auditors actually need to see, without adding information (real or fabricated) to evidence you already know is correct.
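The deterministic half of that split needs no AI at all. The sketch below groups raw findings by rule and severity so the highest-impact patterns surface first, and the same input always yields the same output. The finding fields here are illustrative, not any tool's actual report schema.

```python
from collections import Counter

# Illustrative finding records; real static analysis tools export richer data,
# but the triage idea is the same: group, count, rank - deterministically.
findings = [
    {"rule": "MISRA-C-2012-9.1", "severity": 1, "file": "brake.c"},
    {"rule": "MISRA-C-2012-9.1", "severity": 1, "file": "motor.c"},
    {"rule": "CWE-476",          "severity": 2, "file": "io.c"},
]

def triage(findings: list[dict]) -> list[tuple[tuple[str, int], int]]:
    """Rank (rule, severity) groups by severity first, then by how often they occur."""
    groups = Counter((f["rule"], f["severity"]) for f in findings)
    return sorted(groups.items(), key=lambda item: (item[0][1], -item[1]))

for (rule, severity), count in triage(findings):
    print(f"severity {severity}: {rule} x{count}")
```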
The triaged findings can then be passed to a guardrailed LLM to summarize and put into the proper format. Guardrails are what let you re-express results to auditors accurately: explicit constraints on what the model can see, rules on how it can reply, and post-checks on what it produces. Their whole purpose is to keep the LLM from inventing new information.
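As a deliberately simplified illustration of the post-check piece, the sketch below assumes the LLM returns a JSON summary and rejects it if it mentions any rule that was not in the vetted findings it was given. The function and field names are placeholders, not a real Parasoft or LLM-vendor API.

```python
import json

def postcheck_summary(llm_output: str, vetted_rules: set[str]) -> dict:
    """Accept the LLM's JSON summary only if every rule it mentions was in its input."""
    summary = json.loads(llm_output)          # malformed JSON fails loudly here
    mentioned = {item["rule"] for item in summary.get("top_issues", [])}
    invented = mentioned - vetted_rules
    if invented:
        raise ValueError(f"LLM introduced rules not in the vetted findings: {invented}")
    return summary

vetted = {"MISRA-C-2012-9.1", "CWE-476"}
ok = '{"top_issues": [{"rule": "MISRA-C-2012-9.1", "note": "Uninitialized reads in 2 files"}]}'
bad = '{"top_issues": [{"rule": "CWE-787", "note": "Out-of-bounds write"}]}'

print(postcheck_summary(ok, vetted)["top_issues"][0]["rule"])
try:
    postcheck_summary(bad, vetted)
except ValueError as err:
    print("rejected:", err)      # the summary never reaches an auditor
```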
For a safety-critical example, think of an aerospace project where a constrained, on-premises LLM condenses a 50,000-line static analysis report into 10 prioritized defect patterns in 45 seconds so that engineers can focus on critical issues.
There’s also the observation that AI might render the term "self-healing test" obsolete, in that dynamic adaptation can correct a failing assertion before a tester ever sees red. That possibility could thrill DevOps teams but petrify safety engineers.
NASA’s paper reminds us why: If the correction itself is uncontrolled or untraceable, then the cure is worse than the failure. Parasoft’s approach is therefore to log every automated "fix" beside the failing baseline, so the human still signs off.
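A minimal sketch of what logging every automated fix beside its failing baseline can look like, with made-up field names rather than Parasoft's actual record format: each adaptation is stored alongside the original failure and the proposed change, with an approval field that only a named human reviewer fills in.

```python
import json, datetime

# Hypothetical audit record: the automated correction never replaces the
# failing baseline; it sits next to it and waits for a named human approver.
record = {
    "test_id": "TC-BRAKE-017",
    "baseline_failure": "expected latency <= 10 ms, observed 14 ms",
    "proposed_fix": "raise assertion threshold to 15 ms",
    "proposed_by": "self-healing engine",
    "proposed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "approved_by": None,          # stays None until a qualified engineer signs off
    "approval_rationale": None,
}

with open("healing_audit_log.jsonl", "a") as log:
    log.write(json.dumps(record) + "\n")
```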
NASA’s authors conclude that, until repeatable studies prove reliability, every LLM-generated assurance argument should be treated strictly as an experiment. They’re useful to explore, but never safe to deploy on trust alone, and questionable as to how much time they actually save.
Since AI is advancing at an accelerated pace, an open mind and experimentation are key. Parasoft is exploring features that validate LLM-generated assurance snippets against actual evidence from our tools. Most recently, for example, we ran an internal research project on whether a domain-specific model, trained solely on Parasoft artifacts and assurance patterns, might offer more reliability than a general-purpose chatbot.
But in line with NASA’s recommendation, we treat all such work as experimental until the community produces independent proof of safety and cost benefit. Overall, we think the time benefits are very real.
A guardrailed LLM that sticks to proven test logs, trace links, and code scans isn't just an excellent tool; it's necessary to keep up. But since it can still make things up (and be convincing about it), you still need a human to fact-check the output.
But they’re learning fast. Fast enough that some models already zero in on static analysis violations better than we can. Use them with evidence in hand, wisely, and maybe you can turn today’s review slog into tomorrow’s head start.
Want to learn more about using LLMs in safety-critical development?