Benchmarks

CWE detection rates, comparison with Semgrep and CodeQL, and false positive analysis.

CWE Detection (v0.1.0, GPT-5.4)

Run on 47 vulnerable code samples across Python, JavaScript, Go, and Java.

CWE	Name	Rate	Samples
CWE-78	OS Command Injection	94%	8/8
CWE-79	Cross-site Scripting	91%	10/11
CWE-89	SQL Injection	100%	6/6
CWE-94	Code Injection	88%	7/8
CWE-200	Sensitive Info Exposure	83%	5/6
CWE-259	Hard-coded Password	100%	4/4
CWE-295	Improper Cert Validation	75%	3/4
CWE-327	Broken Crypto	86%	6/7
CWE-352	CSRF	80%	4/5
CWE-502	Unsafe Deserialization	100%	3/3
CWE-601	Open Redirect	83%	5/6
CWE-611	XXE	100%	2/2
CWE-798	Hard-coded Credentials	100%	5/5
CWE-918	SSRF	86%	6/7

Overall: 91.5% detection rate (43/47 samples).
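
The overall figure is the number of detected samples divided by the total number of samples. A minimal sketch of that arithmetic, using the 43/47 counts quoted above. The function name `detection_rate` is illustrative and not part of CodeSight.

```python
def detection_rate(detected: int, total: int) -> float:
    """Return the detection rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= detected <= total:
        raise ValueError("detected must be between 0 and total")
    return round(100 * detected / total, 1)


# Overall figure quoted above: 43 detected out of 47 samples.
overall = detection_rate(43, 47)
print(f"Overall: {overall}%")  # prints "Overall: 91.5%"
```

The same helper can be applied to any row's Samples column to check its Rate column.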

Comparison with Semgrep and CodeQL

Vulnerability	CodeSight	Semgrep	CodeQL
SQL Injection	100%	100%	100%
XSS	91%	82%	91%
Command Injection	94%	75%	88%
Hard-coded Secrets	100%	60%	40%
Broken Crypto	86%	71%	57%
SSRF	86%	43%	71%
Deserialization	100%	67%	100%
Logic-dependent	83%	12%	25%

False Positive Analysis

Tested on 30 clean code samples.

Areas where LLM-based analysis adds the most over rule-based scanners:

Logic-dependent vulnerabilities: auth bypass, race conditions, TOCTOU bugs
Hard-coded secrets: catches secrets passed through intermediate variables, not just password = "..."
Zero-day patterns: can flag new vulnerability classes that have no explicit rule written for them
Context-aware: distinguishes eval() in a test file from eval() in a request handler
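
To make the hard-coded secrets point concrete, the snippet below is a hypothetical vulnerable sample (not taken from the benchmark set). The secret reaches its use site through an intermediate variable, so a pattern that only matches `password = "..."` would likely miss it. All names and values are invented for illustration.

```python
# Hypothetical vulnerable sample: every name and value here is invented.
# The literal is never assigned directly to a name containing "password".
_DEFAULT_TOKEN = "hunter2-example"


def build_db_config(host: str) -> dict:
    """Build a connection config that embeds a hard-coded credential."""
    credential = _DEFAULT_TOKEN  # intermediate variable hides the literal's origin
    return {"host": host, "user": "admin", "password": credential}


config = build_db_config("db.internal")
```

Spotting this requires following the value from `_DEFAULT_TOKEN` through `credential` to the `"password"` field, rather than matching a single assignment.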

Terminal

$ codesight benchmark --models gpt-5.4 claude-opus-4-6-20251101 llama3
$ codesight benchmark --json > results.json
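
The second command redirects the JSON output to `results.json`. The output schema is not documented in this section, so the sketch below assumes only that the file contains valid JSON and reports its top-level shape. `describe_results` is a hypothetical helper, not part of CodeSight.

```python
import json
from pathlib import Path


def describe_results(path: str) -> str:
    """Summarise the top-level structure of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


if __name__ == "__main__":
    # Only runs if a results file is present in the working directory.
    if Path("results.json").exists():
        print(describe_results("results.json"))
```

Inspecting the top-level keys first is a cautious way to learn the actual field names before writing any code that depends on them.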

OWASP Top 10 Coverage

Category	Coverage
A01 - Broken Access Control	Partial
A02 - Cryptographic Failures	Full
A03 - Injection	Full
A04 - Insecure Design	Partial
A05 - Security Misconfiguration	Partial
A06 - Vulnerable Components	No (use pip-audit/npm audit)
A07 - Auth Failures	Full
A08 - Data Integrity Failures	Partial
A09 - Logging Failures	Partial
A10 - SSRF	Full