Whitepaper
Get a sneak peek of the study below.
This study evaluates how GitHub Copilot and Parasoft’s prompt templates generate code fixes for static analysis violations detected by Parasoft C/C++test. Both tools used GPT-4o, with fixes assessed using GPT-4o-2024-08-06 for pairwise comparisons.
Results show Parasoft’s prompts significantly outperformed GitHub Copilot: with reasoning questions, Parasoft’s fixes were judged superior in 64.45% of cases, tied in 20.46%, and underperformed in 15.09%. Bare prompts without reasoning questions still outperformed Copilot in 57.16% of cases.
Manual analysis suggests Parasoft’s prompts produce more complete and robust fixes through rule documentation and chain-of-thought reasoning.
| | Win Rate | Tie Rate | Lose Rate |
|---|---|---|---|
| GitHub Copilot | 0.150895 | 0.204604 | 0.644501 |
| C++test with reasoning questions | 0.644501 | 0.204604 | 0.150895 |
| | Win Rate | Tie Rate | Lose Rate |
|---|---|---|---|
| GitHub Copilot | 0.199488 | 0.228900 | 0.571611 |
| C++test without reasoning questions | 0.571611 | 0.228900 | 0.199488 |
| Win rate (row vs. column) | GitHub Copilot | C++test with reasoning | C++test without reasoning |
|---|---|---|---|
| GitHub Copilot | — | 0.150895 | 0.199488 |
| C++test with reasoning | 0.644501 | — | 0.313433 |
| C++test without reasoning | 0.571611 | 0.186567 | — |
The results above show Parasoft C++test’s superior performance across both prompt approaches: in both configurations, C++test wins more often than it ties and loses combined, demonstrating consistently higher fix quality.
This analysis demonstrates that fixes obtained with Parasoft’s prompts consistently rank better than those from GitHub Copilot. The advantage holds for both the bare and the reasoning prompt variants, with reasoning prompts performing slightly better.
Manual inspection of sample data revealed that fixes generated with Parasoft’s prompts are often more complete (such as fixing all instances of an issue on adjacent lines), more robust (implementing better error handling), and conform to standard coding practices.
The superior performance is hypothesized to stem from two key factors in Parasoft’s prompt design:

- Inclusion of the documentation for the violated static analysis rule
- Chain-of-thought reasoning questions that guide the model toward a fix

These elements work together to enhance the model’s fix generation capabilities, resulting in more reliable and comprehensive code corrections.
This study acknowledges several methodological constraints:
Ready to dive deeper?