Whitepaper
Get a sneak peek of the study below.
This study evaluates how GitHub Copilot and Parasoft’s prompt templates generate code fixes for static analysis violations detected by Parasoft C/C++test. Both tools used GPT-4o, with fixes assessed using GPT-4o-2024-08-06 for pairwise comparisons.
Results show Parasoft’s prompts significantly outperformed GitHub Copilot: with reasoning questions, Parasoft’s fixes were judged superior in 64.45% of cases, tied in 20.46%, and underperformed in 15.09%. Bare prompts without reasoning questions still outperformed Copilot in 57.16% of cases.
Manual analysis suggests Parasoft’s prompts produce more complete and robust fixes through rule documentation and chain-of-thought reasoning.
| | Win Rate | Tie Rate | Lose Rate |
|---|---|---|---|
| GitHub Copilot | 0.150895 | 0.204604 | 0.644501 |
| C++test with reasoning questions | 0.644501 | 0.204604 | 0.150895 |
| | Win Rate | Tie Rate | Lose Rate |
|---|---|---|---|
| GitHub Copilot | 0.199488 | 0.228900 | 0.571611 |
| C++test without reasoning questions | 0.571611 | 0.228900 | 0.199488 |
| Win rate (row vs. column) | GitHub Copilot | C++test with reasoning | C++test without reasoning |
|---|---|---|---|
| GitHub Copilot | — | 0.150895 | 0.199488 |
| C++test with reasoning | 0.644501 | — | 0.313433 |
| C++test without reasoning | 0.571611 | 0.186567 | — |
The results above show Parasoft C++test’s superior performance across both prompt approaches: in both configurations, C++test wins more often than it ties and loses combined, demonstrating consistently higher fix quality.
This analysis demonstrates that fixes obtained with Parasoft’s prompts consistently rank better than those from GitHub Copilot. The advantage holds for both the bare and the reasoning prompt variants, with reasoning prompts performing slightly better.
Manual inspection of sample data revealed that fixes generated with Parasoft’s prompts are often more complete (such as fixing all instances of an issue on adjacent lines), more robust (implementing better error handling), and conform to standard coding practices.
The superior performance is hypothesized to stem from two key factors in Parasoft’s prompt design:

- Inclusion of the documentation for the violated static analysis rule
- Chain-of-thought reasoning questions that guide the model toward a fix

These elements work together to enhance the model’s fix generation capabilities, resulting in more reliable and comprehensive code corrections.
This study acknowledges several methodological constraints:
Ready to dive deeper?