
Addressing NASA Concerns About LLM Use in Safety-Critical Development

By Igor Kirilenko August 19, 2025 5 min read

GenAI may speed up engineering tasks like drafting safety cases, but NASA cautions that its tendency to generate believable but unverified content makes human oversight essential in critical systems. Read on to discover how combining constrained LLMs with traceable evidence and rigorous review offers a safer path forward.

Generative AI has become an everyday engineering tool in record time. Development teams now rely on large language models (LLMs) to draft code, summarize test results, and even write safety cases in the language regulators require for embedded safety-critical development.

Generating compliance evidence for safety-critical development is still a manual, error-prone process. Teams export unit test logs, hand-label trace tables, and draft Goal Structuring Notation (GSN) diagrams line by line.

What makes large language models so useful is that they can draft those artifacts automatically, so long as we constrain them to verifiable sources.

However, a recent NASA report cautioned against using LLMs for this purpose. The paper, "Examining Proposed Uses of LLMs to Produce or Assess Assurance Arguments," asks whether technology that generates natural-sounding text can be trusted when lives depend on it. The authors argue that the fundamental problem is that LLMs aim for plausible-sounding answers, not proven facts, so something as simple as a single invented citation could invalidate an entire certification package. Or worse.

Building a Case That Auditors Trust

With safety-critical development, every conclusion needs to be backed by a verifiable argument that the system is safe or secure. It’s called an assurance argument, and along with other documents, it makes up what’s known as the safety case.

Assurance arguments are typically structured with Goal Structuring Notation (GSN), a formal graphical notation that breaks each safety claim into evidence-based sub-goals. Structured arguments like these are strongly encouraged by standards such as ISO 26262 (automotive), DO-178C (aviation), and similar frameworks, where each claim must trace to objective, verifiable evidence. Building them is time-consuming, which is why using GenAI for this purpose is so appealing.
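
To make that structure concrete, here’s a minimal sketch of how a GSN-style argument might be represented in code, with each leaf goal required to point at a real artifact. The class names, fields, and file paths are illustrative assumptions, not output from any standard or tool.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative GSN-style node types; names and fields are hypothetical.
@dataclass
class Evidence:
    artifact: str          # e.g., path to a test log or coverage report
    description: str

@dataclass
class Goal:
    claim: str
    sub_goals: List["Goal"] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)

    def is_supported(self) -> bool:
        """A leaf goal needs evidence; a parent goal needs all sub-goals supported."""
        if not self.sub_goals:
            return bool(self.evidence)
        return all(g.is_supported() for g in self.sub_goals)

# Example: a top-level safety claim decomposed into evidence-backed sub-goals.
top = Goal(
    claim="The braking controller meets its timing requirements",
    sub_goals=[
        Goal(
            claim="Worst-case execution time is within budget",
            evidence=[Evidence("reports/wcet_analysis.xml", "Static timing analysis")],
        ),
        Goal(
            claim="All timing-related unit tests pass",
            evidence=[Evidence("logs/unit_tests_2025-08-01.log", "Unit test run")],
        ),
    ],
)

print(top.is_supported())  # True only if every leaf traces to at least one artifact
```

The point of modeling it this way is that a claim with no artifact behind it is immediately visible, which is exactly the property auditors look for.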

If you head up software safety, quality, or compliance and have read the NASA paper, you may be concerned about the red-flag examples it raises about using LLMs this way.

In the sections below, we’ll translate their cautions into practical guardrails and show where disciplined AI can still deliver value without jeopardizing approval.

The Core Message From NASA’s Report

NASA’s authors tackle a misconception many development teams still hold: that because LLMs sound authoritative, they must also be accurate.

In example after example, the report shows why that assumption fails: LLMs invent references, misquote regulations, and glide past the corner-case hazards that make or break a safety case. NASA’s verdict is that until repeatable studies prove reliability, any LLM-generated argument must be treated as experimental and reviewed line by line by qualified engineers.

Their conclusion isn’t that AI should be forbidden, but that using it shifts time and responsibility rather than simply saving effort. The engineer no longer writes every sentence, sure, but every line the LLM proposes must now be revalidated. According to the paper, any efficiency gained in development comes with a new supervisory burden.

Why GenAI Can Be Risky for Safety Cases

If your title includes design assurance, software safety, QA director, or principal systems engineer, the issues highlighted by NASA directly affect your workflow.

  • Audit trails cannot tolerate invented evidence. Regulators will insist that every AI-produced claim traces back to a deterministic artifact: test results, static analysis findings, coverage metrics, and so on (see the sketch after this list).
  • Schedules have to absorb a new review loop. Someone must police each line of machine-generated text, and the question becomes who.
  • Budgets face new ROI questions. Any claimed productivity boost must survive the cost of the extra scrutiny AI-generated documentation requires.
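
To make the first point concrete, here’s a minimal sketch of how an AI-drafted claim could be bound to deterministic artifacts. The record format, field names, and file paths are illustrative assumptions; the idea is simply that every claim carries a pointer to evidence a reviewer can reopen and verify.

```python
import hashlib
import json
from pathlib import Path

def fingerprint(path: str) -> str:
    """SHA-256 of the artifact so a reviewer can confirm the evidence hasn't changed."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def trace_record(claim: str, artifacts: list[str]) -> dict:
    """Bind an AI-drafted claim to the deterministic artifacts that back it."""
    return {
        "claim": claim,
        "evidence": [
            # A missing artifact is flagged rather than silently invented.
            {"path": p, "sha256": fingerprint(p) if Path(p).exists() else "MISSING"}
            for p in artifacts
        ],
        "reviewed_by": None,  # stays empty until a qualified engineer signs off
    }

# Hypothetical usage; the artifact paths are placeholders for your own outputs.
record = trace_record(
    claim="Statement coverage for module X is 100%",
    artifacts=["reports/coverage_module_x.xml", "logs/unit_tests.log"],
)
print(json.dumps(record, indent=2))
```

A record like this turns “trust me” into “verify me”: the hash proves the evidence the claim cites is the evidence the auditor is looking at.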

Where to Use Human Intervention With AI & LLMs

As NASA says, LLMs are great at sounding correct, but they have no built-in sense of truth. They can invent facts, miss edge-case failures, and cite sources that don’t exist.

Used carefully, though, the same models can also flag potential weak spots, sometimes called "defeaters." They can scan your test logs for coverage gaps and your static analysis reports for recurring violations, letting your team fix issues before an auditor finds them.
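
As a sketch of that idea, the snippet below does the deterministic part: it flags functions whose coverage falls below a threshold, producing a list a constrained LLM could then summarize as potential defeaters. The data format and threshold are assumptions for illustration, not any tool’s actual export.

```python
COVERAGE_THRESHOLD = 80.0  # percent; a project-specific assumption

def find_coverage_gaps(entries: list[dict]) -> list[dict]:
    """Return functions whose line coverage falls below the threshold.

    `entries` stands in for whatever your coverage tool exports, e.g.
    [{"function": "brake_apply", "line_coverage": 62.5}, ...].
    """
    gaps = [e for e in entries if e["line_coverage"] < COVERAGE_THRESHOLD]
    return sorted(gaps, key=lambda e: e["line_coverage"])

# Hypothetical export from a coverage run. The gap list, not raw model output,
# is what gets reviewed and archived; an LLM may help phrase it, but each entry
# already traces back to the coverage report.
report = [
    {"function": "brake_apply", "line_coverage": 62.5},
    {"function": "brake_release", "line_coverage": 97.0},
    {"function": "watchdog_reset", "line_coverage": 71.4},
]
for gap in find_coverage_gaps(report):
    print(f'{gap["function"]}: {gap["line_coverage"]:.1f}% covered')
```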

In every safety-critical domain—avionics, rail, medical—you can let AI tools write code, suggest tests, or group defects, but only if the AI-generated content links back to evidence you can trace and rerun. And because LLMs can sound confident when they’re wrong, a human reviewer still has to prepare or approve the final assurance argument.

For safety-critical work, we agree with NASA’s conclusion: an LLM may help, but a qualified human must still build and sign off on the actual assurance argument. The amount of effort required depends on the depth of verification each industry demands. Missed bugs cost money in the cloud but could cost lives in the real world.

A Guardrailed Approach for Using Generative AI

NASA’s paper also points out how much AI-generated code modern pipelines produce now: far more than humans alone can review and produce safety evidence for.

To handle that volume, you need tooling that can triage findings deterministically first, then let a tightly scoped, on-prem LLM re-express those vetted results. This is where solutions like Parasoft’s static analysis workflows help: they surface the violations that matter most, group the vetted findings, and flag the ones auditors actually need to see, without adding information (real or fabricated) to evidence you already know is correct.
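
For illustration, here’s a minimal sketch of that deterministic triage step, assuming findings are exported as a list of dicts with rule and severity fields; the field names and severity scale are assumptions, not Parasoft’s actual schema.

```python
from itertools import groupby

# Assumed severity ranking; adjust to match your tool's scale.
SEVERITY_ORDER = {"highest": 0, "high": 1, "medium": 2, "low": 3}

def triage(findings: list[dict], top_n: int = 10) -> list[dict]:
    """Deterministically rank violation patterns before any LLM sees them.

    Groups findings by rule ID, counts occurrences, and keeps the most
    severe, most frequent patterns. The same input always yields the same
    output, so the ranked list itself can be archived as evidence.
    """
    findings = sorted(findings, key=lambda f: f["rule"])
    patterns = []
    for rule, items in groupby(findings, key=lambda f: f["rule"]):
        items = list(items)
        worst = min(SEVERITY_ORDER[i["severity"]] for i in items)
        patterns.append({"rule": rule, "count": len(items), "severity_rank": worst})
    patterns.sort(key=lambda p: (p["severity_rank"], -p["count"]))
    return patterns[:top_n]

# Only this vetted, reproducible list is handed to the constrained LLM to re-express.
findings = [
    {"rule": "RULE-1203", "severity": "high"},
    {"rule": "RULE-1203", "severity": "highest"},
    {"rule": "RULE-0442", "severity": "low"},
]
print(triage(findings))
```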

The triaged findings can then be passed to a guardrailed LLM to summarize and format for auditors. Guardrails are explicit constraints on what the model can see, rules on how it can reply, and post-checks on what it produces. They exist to keep the LLM from inventing new information while it re-expresses results you already trust.
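
As one example of a post-check, the sketch below rejects any model summary that cites a rule ID absent from the vetted findings. The rule-ID format and helper are hypothetical; the point is that the check runs after the model replies and before anything reaches an auditor.

```python
import re

# Assumed rule-ID format for illustration, e.g. "RULE-1203"; adapt to your tool's IDs.
RULE_ID = re.compile(r"\bRULE-\d+\b")

def passes_post_check(llm_summary: str, vetted_rule_ids: set[str]) -> bool:
    """Reject any summary that references a rule ID not in the vetted findings.

    This is the post-check layer of a guardrail: the model may rephrase
    vetted results, but it may not introduce evidence that wasn't there.
    """
    mentioned = set(RULE_ID.findall(llm_summary))
    invented = mentioned - vetted_rule_ids
    if invented:
        print(f"Rejected: summary cites unknown rule IDs {sorted(invented)}")
        return False
    return True

# Example: the second summary invents RULE-999 and is rejected.
vetted = {"RULE-1203", "RULE-0442"}
print(passes_post_check("Top issue: RULE-1203 (14 occurrences).", vetted))  # True
print(passes_post_check("Top issue: RULE-999 (critical).", vetted))         # False
```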

For a safety-critical example, think of an aerospace project where a constrained, on-premises LLM condenses a 50,000-line static analysis report into 10 prioritized defect patterns in 45 seconds so that engineers can focus on critical issues.

The Future of Self-Healing Tests

There’s also the observation that AI might render the term "self-healing test" obsolete, in that dynamic adaptation can correct a failing assertion before a tester ever sees red. That possibility could thrill DevOps teams but petrify safety engineers.

NASA’s paper reminds us why: if the correction itself is uncontrolled or untraceable, the cure is worse than the failure. Parasoft’s approach is therefore to log every automated "fix" beside the failing baseline so that a human still signs off.
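
As an illustration of what that logging might look like, here’s a minimal sketch that records a proposed test "fix" beside the failing baseline it would replace; the field names and file path are assumptions, not Parasoft’s actual format.

```python
import json
from datetime import datetime, timezone

def log_self_heal(test_id: str, baseline_failure: str, proposed_fix: str, path: str) -> None:
    """Append the automated 'fix' next to the failing baseline it would replace.

    Nothing here applies the fix; it only records it so a human reviewer
    can approve or reject the change with full context.
    """
    entry = {
        "test_id": test_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "baseline_failure": baseline_failure,   # the original red result, kept verbatim
        "proposed_fix": proposed_fix,           # the automated adaptation
        "approved_by": None,                    # remains None until an engineer signs off
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Hypothetical usage with placeholder values.
log_self_heal(
    test_id="TC_BRAKE_017",
    baseline_failure="assert torque == 412 failed (actual: 415)",
    proposed_fix="update expected torque to 415",
    path="self_heal_audit.jsonl",
)
```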

Our Thoughts on Their Conclusion

NASA’s authors conclude that, until repeatable studies prove reliability, every LLM-generated assurance argument should be treated strictly as an experiment. Such arguments are useful to explore, but never safe to deploy on trust alone, and it’s questionable how much time they actually save.

Since AI is advancing quickly, an open mind and experimentation are key. Parasoft is exploring features that validate LLM-generated assurance snippets against actual evidence from our tools. Most recently, for example, we ran an internal research project to see whether a domain-specific model, trained solely on Parasoft artifacts and assurance patterns, might be more reliable than a general-purpose chatbot.

But in line with NASA’s recommendation, we treat all such work as experimental until the community produces independent proof of safety and cost benefit. Overall, we think the time benefits are very real.

How to Make Safety Paramount While You Save Time

  • Anchor every AI suggestion to ground truth. A hyperlink to raw evidence turns "trust me" into "verify me."
  • Account for the reviewer cost. If AI saves ten hours of manual triage but adds ten hours of oversight, re-examine whether it’s worth it.
  • Separate by risk. Use the stricter playbook in life-critical domains and the faster one where rollback is easy.
  • Insist on transparency from vendors—including us. Ask how the model is constrained, where the guardrails live, and what happens when it goes off script.

Keep Experimenting With Workflow, But Don’t Assume

A guardrailed LLM that sticks to proven test logs, trace links, and code scans isn’t just an excellent tool; it’s necessary to keep up. However, since it can still make things up (and be convincing about it), you still need a human to fact-check.

But these models are learning fast, fast enough that some already zero in on static analysis violations better than we can. Use them wisely, with evidence in hand, and you might turn today’s review slog into tomorrow’s head start.

Want to learn more about using LLMs in safety-critical development?

Talk With One of Our Experts