# Benchmarks

CWE detection rates, comparison with Semgrep and CodeQL, and false positive analysis.

## CWE Detection (v0.1.0, GPT-5.4)

Run on 47 vulnerable code samples across Python, JavaScript, Go, and Java.

| CWE | Name | Rate | Samples |
|---------|--------------------------|------|---------|
| CWE-78 | OS Command Injection | 94% | 8/8 |
| CWE-79 | Cross-site Scripting | 91% | 10/11 |
| CWE-89 | SQL Injection | 100% | 6/6 |
| CWE-94 | Code Injection | 88% | 7/8 |
| CWE-200 | Sensitive Info Exposure | 83% | 5/6 |
| CWE-259 | Hard-coded Password | 100% | 4/4 |
| CWE-295 | Improper Cert Validation | 75% | 3/4 |
| CWE-327 | Broken Crypto | 86% | 6/7 |
| CWE-352 | CSRF | 80% | 4/5 |
| CWE-502 | Unsafe Deserialization | 100% | 3/3 |
| CWE-601 | Open Redirect | 83% | 5/6 |
| CWE-611 | XXE | 100% | 2/2 |
| CWE-798 | Hard-coded Credentials | 100% | 5/5 |
| CWE-918 | SSRF | 86% | 6/7 |

Overall: 91.5% detection rate (43/47 samples).
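
For context on what a sample looks like, the sketch below is in the style of a CWE-89 (SQL injection) case: a small, self-contained function with one known flaw. It is illustrative only, not an actual benchmark file.

```python
import sqlite3

def get_user(db: sqlite3.Connection, username: str):
    # CWE-89: user input is interpolated straight into the SQL string,
    # so a value like "x' OR '1'='1" rewrites the query's logic.
    cursor = db.execute(f"SELECT * FROM users WHERE name = '{username}'")
    return cursor.fetchone()
```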

## Comparison with Static Analysis Tools

| Vulnerability | CodeSight | Semgrep | CodeQL |
|--------------------|-----------|---------|--------|
| SQL Injection | 100% | 100% | 100% |
| XSS | 91% | 82% | 91% |
| Command Injection | 94% | 75% | 88% |
| Hard-coded Secrets | 100% | 60% | 40% |
| Broken Crypto | 86% | 71% | 57% |
| SSRF | 86% | 43% | 71% |
| Deserialization | 100% | 67% | 100% |
| Logic-dependent | 83% | 12% | 25% |
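
The logic-dependent row covers bugs with no fixed syntactic signature, which is why pattern-based tools score low there. A hypothetical sketch of the kind of flaw in that bucket:

```python
def transfer(balances: dict[str, int], src: str, dst: str, amount: int) -> None:
    # Logic flaw: amount is checked against the balance but never against
    # zero, so a negative amount silently drains dst into src. There is no
    # tainted sink here for a pattern matcher to flag.
    if balances[src] < amount:
        raise ValueError("insufficient funds")
    balances[src] -= amount
    balances[dst] += amount
```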

## False Positive Rate

Tested on 30 clean code samples:

| Model | False Positives | Rate |
|-----------------|-----------------|-------|
| GPT-5.4 | 2 | 6.7% |
| Claude Opus 4.6 | 3 | 10.0% |
| Gemini 3.1 Pro | 4 | 13.3% |
| Llama 3 (8B) | 7 | 23.3% |
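
False positives typically come from safe code that superficially resembles a vulnerable pattern. The parameterized counterpart of the CWE-89 sketch above is the kind of clean sample an over-eager scanner might still flag (again illustrative, not a benchmark file):

```python
import sqlite3

def get_user(db: sqlite3.Connection, username: str):
    # Safe: the "?" placeholder makes the driver bind the value, so user
    # input can never change the structure of the query.
    cursor = db.execute("SELECT * FROM users WHERE name = ?", (username,))
    return cursor.fetchone()
```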

## Where CodeSight Wins
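
CodeSight's lead is largest where detection requires reasoning about intent rather than matching a known pattern:

- Logic-dependent flaws: 83% vs. 12% (Semgrep) and 25% (CodeQL).
- Hard-coded secrets (100% vs. 60% and 40%) and broken crypto (86% vs. 71% and 57%).
- SSRF: 86% vs. 43% and 71%.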

## Where Traditional Tools Win
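
- Determinism: the same input always produces the same findings, which LLM-based scanning cannot guarantee.
- Speed and cost: no API calls, so scans run locally, quickly, and for free.
- Mature pattern classes: both tools match CodeSight on SQL injection, and CodeQL also ties on XSS and deserialization.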

## Run Your Own Benchmarks

```bash
$ codesight benchmark --models gpt-5.4 claude-opus-4-6-20251101 llama3
$ codesight benchmark --json > results.json
```
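
The JSON output is meant for post-processing. A minimal sketch that tabulates per-model detection rates, assuming a list of objects with `model`, `detected`, and `total` fields (these field names are assumptions; inspect your own results.json for the actual schema):

```python
import json

with open("results.json") as f:
    results = json.load(f)

# Assumed shape: [{"model": "...", "detected": 43, "total": 47}, ...]
for entry in results:
    rate = entry["detected"] / entry["total"]
    print(f"{entry['model']}: {rate:.1%} ({entry['detected']}/{entry['total']})")
```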

## OWASP Top 10 (2021) Coverage

| Category | Coverage |
|-------------------------------|------------------------------|
| A01 - Broken Access Control | Partial |
| A02 - Cryptographic Failures | Full |
| A03 - Injection | Full |
| A04 - Insecure Design | Partial |
| A05 - Security Misconfiguration | Partial |
| A06 - Vulnerable Components | No (use pip-audit/npm audit) |
| A07 - Auth Failures | Full |
| A08 - Data Integrity Failures | Partial |
| A09 - Logging Failures | Partial |
| A10 - SSRF | Full |