Skip to content

Commit fecb4ff

Browse files
hyperpolymathclaude
andcommitted
fix(assail): strip C-family line comments before cross-language URL detection
Self-scan repro: a doc-comment line like /// types: ["http://hyperpolymath.dev/panic-attack/AssailReport"] was flagging InsecureProtocol PA-IP. The string in the comment is an illustrative JSON-LD `@type` namespace URI, not a configured endpoint — it has no runtime effect, but the cross-language detector's regex matched it anyway because `analyze_cross_language` was being passed the raw file content (no comment stripping). Fix: new `strip_c_family_line_comments` helper, applied to the content before the http://-URL regex runs in `analyze_cross_language`. The helper detects `//` (and `///`, `//!` by extension), respects string literals (so `"http://localhost"` is preserved), and handles escape sequences inside strings. Naturally covers Rust, JavaScript, TypeScript, Java, C, C++, Go — every language whose comment syntax is `//`. Python `#`, Lisp `;`, Lua/Idris `--` etc. are not (yet) language-aware; the cross-language detector remains best-effort and could be extended per file_path extension in a follow-up. Limits: - Block comments (/* */) and raw-string literals (r#"..."#) are not consumed here. The detector's existing localhost exemption + this line-comment strip handle the bulk of FPs in practice. - A real string-literal URL like `"http://example.com"` is STILL flagged — the regex sees through the string. That's correct: a hardcoded HTTP endpoint in production code is the signal we want. - JSON-LD `@type` URIs that genuinely live in code (not in comments) remain a TP from the regex's perspective; suppress them via the user-classification registry if they're audited. Regression coverage: 6 new tests in `assail::analyzer::tests` covering basic `//`, doc-comments, string-literal preservation, escape sequences, and the self-scan repro patterns (doc-URL stripped / JSON-LD-string preserved). Test URLs use http://localhost so panic-attack scanning its own source under self-scan doesn't trip the InsecureProtocol detector on the test data. Verification: - cargo test --bin panic-attack --features signing,http — 242 passed (was 236; +6 new tests, no failures) - cargo clippy --all-targets --features signing,http -D warnings — clean - cargo fmt --check — clean - self-scan before: 2 InsecureProtocol findings in storage/mod.rs (1 doc-comment FP, 1 JSON-LD literal — out of scope) - self-scan after: 1 InsecureProtocol finding in storage/mod.rs (the JSON-LD literal remains; the doc-comment FP is gone) - self-scan total findings: 11 (down from 12 yesterday; was already 11 after #51, this PR preserves the count while fixing a categorical FP class) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0f6ea94 commit fecb4ff

1 file changed

Lines changed: 132 additions & 3 deletions

File tree

src/assail/analyzer.rs

Lines changed: 132 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4729,15 +4729,26 @@ impl Analyzer {
47294729
file_path: &str,
47304730
) -> Result<()> {
47314731
// HTTP (insecure) URLs - should be HTTPS
4732-
// Count http:// URLs that are NOT localhost/127.0.0.1 (those are fine)
4732+
// Count http:// URLs that are NOT localhost/127.0.0.1 (those are fine).
4733+
//
4734+
// Strip C-family line comments first: `/// example: http://example.com`
4735+
// in a doc-comment is illustrative prose, not a configured endpoint,
4736+
// and was the source of repeated FPs on JSON-LD `@type` namespace URIs
4737+
// documented in /// blocks. The same heuristic naturally handles JS,
4738+
// Java, C, C++, Go — every language whose comment syntax is `//`.
4739+
// Python `#`, Lisp `;`, etc. are not currently de-noised; the
4740+
// cross-language detector is best-effort, and a follow-up can add
4741+
// language-aware stripping per file_path extension.
4742+
let scan_content_owned = strip_c_family_line_comments(content);
4743+
let scan_content: &str = &scan_content_owned;
47334744
let http_re = RE_HTTP_URL
47344745
.get_or_init(|| Regex::new(r#"http://[a-zA-Z0-9]"#).expect("static regex is valid"));
47354746
let http_localhost_re = RE_HTTP_LOCALHOST.get_or_init(|| {
47364747
Regex::new(r#"http://(localhost|127\.0\.0\.1|0\.0\.0\.0|\[::1\])"#)
47374748
.expect("static regex is valid")
47384749
});
4739-
let http_total = http_re.find_iter(content).count();
4740-
let http_local = http_localhost_re.find_iter(content).count();
4750+
let http_total = http_re.find_iter(scan_content).count();
4751+
let http_local = http_localhost_re.find_iter(scan_content).count();
47414752
let http_count = http_total.saturating_sub(http_local);
47424753
if http_count > 0 {
47434754
weak_points.push(WeakPoint {
@@ -5531,6 +5542,57 @@ fn strip_proof_comments(content: &str, line_marker: &str, block: Option<(&str, &
55315542
out
55325543
}
55335544

5545+
/// Strip C-family line comments (`//`, `///`, `//!`) from each line, leaving
5546+
/// the rest of the source intact. String literals are detected so a `//`
5547+
/// inside `"http://example.com"` is NOT treated as a comment.
5548+
///
5549+
/// Used by `analyze_cross_language` to keep URL / secret regexes from firing
5550+
/// on illustrative code in doc-comments. Best-effort: handles `"..."` strings
5551+
/// with `\\` escapes. Raw-string literals (`r#"..."#`) and block comments
5552+
/// (`/* ... */`) are NOT consumed here — adding them would broaden semantics
5553+
/// and is left for a follow-up if FP evidence warrants it.
5554+
fn strip_c_family_line_comments(content: &str) -> String {
5555+
let mut out = String::with_capacity(content.len());
5556+
for (i, line) in content.lines().enumerate() {
5557+
if i > 0 {
5558+
out.push('\n');
5559+
}
5560+
out.push_str(strip_c_family_line_comment(line));
5561+
}
5562+
out
5563+
}
5564+
5565+
fn strip_c_family_line_comment(line: &str) -> &str {
5566+
let bytes = line.as_bytes();
5567+
let mut in_string = false;
5568+
let mut escape = false;
5569+
let mut idx = 0;
5570+
while idx < bytes.len() {
5571+
let b = bytes[idx];
5572+
if escape {
5573+
escape = false;
5574+
idx += 1;
5575+
continue;
5576+
}
5577+
if in_string {
5578+
if b == b'\\' {
5579+
escape = true;
5580+
} else if b == b'"' {
5581+
in_string = false;
5582+
}
5583+
idx += 1;
5584+
continue;
5585+
}
5586+
match b {
5587+
b'"' => in_string = true,
5588+
b'/' if idx + 1 < bytes.len() && bytes[idx + 1] == b'/' => return &line[..idx],
5589+
_ => {}
5590+
}
5591+
idx += 1;
5592+
}
5593+
line
5594+
}
5595+
55345596
/// Strip Isabelle prose constructs that are documentation, not proof code:
55355597
///
55365598
/// 1. `@{text ...}` antiquotations — used inside prose blocks AND inside
@@ -5869,6 +5931,73 @@ mod tests {
58695931
use std::fs;
58705932
use tempfile::TempDir;
58715933

5934+
// ---------------------------------------------------------------
5935+
// 0a. C-family line-comment stripping (cross-lang URL/secret FPs)
5936+
// ---------------------------------------------------------------
5937+
5938+
#[test]
5939+
fn strip_c_family_line_comment_handles_basic_double_slash() {
5940+
let line = "let x = 1; // a comment";
5941+
assert_eq!(strip_c_family_line_comment(line), "let x = 1; ");
5942+
}
5943+
5944+
#[test]
5945+
fn strip_c_family_line_comment_handles_doc_comments() {
5946+
// URLs use http://localhost so panic-attack's own InsecureProtocol
5947+
// detector (which scans this file during self-scan) treats them as
5948+
// exempt — otherwise the regression test plants a finding in its
5949+
// own source. Same exemption rule used in production for dev
5950+
// endpoints.
5951+
for prefix in ["///", "//!"] {
5952+
let line = format!("{prefix} example: http://localhost/example");
5953+
assert_eq!(strip_c_family_line_comment(&line), "");
5954+
}
5955+
}
5956+
5957+
#[test]
5958+
fn strip_c_family_line_comment_preserves_urls_in_strings() {
5959+
// Real string literals containing // must NOT be treated as comments.
5960+
let line = r#"let url = "http://localhost/path";"#;
5961+
let kept = strip_c_family_line_comment(line);
5962+
assert!(
5963+
kept.contains("http://localhost"),
5964+
"string-literal URL was wrongly stripped: {kept:?}"
5965+
);
5966+
}
5967+
5968+
#[test]
5969+
fn strip_c_family_line_comment_handles_escaped_quote_in_string() {
5970+
let line = r#"let s = "she said \"hi\""; // trailing"#;
5971+
let kept = strip_c_family_line_comment(line);
5972+
assert!(kept.contains(r#"\"hi\""#));
5973+
assert!(!kept.contains("trailing"));
5974+
}
5975+
5976+
#[test]
5977+
fn strip_c_family_line_comments_doc_comment_url_fp_gone() {
5978+
// Self-scan repro: doc-comment example URL must not feed the
5979+
// InsecureProtocol detector.
5980+
let src = "/// Endpoint: http://localhost/X\nfn foo() {}\n";
5981+
let stripped = strip_c_family_line_comments(src);
5982+
assert!(!stripped.contains("http://"));
5983+
assert!(stripped.contains("fn foo"));
5984+
}
5985+
5986+
#[test]
5987+
fn strip_c_family_line_comments_keeps_jsonld_type_string() {
5988+
// The second self-scan FP: an http:// URL inside a JSON literal
5989+
// (NOT a comment) must STILL be present after stripping. The
5990+
// JSON-LD-identifier-vs-endpoint distinction is left to the
5991+
// user-classification registry; the comment stripper must not
5992+
// accidentally hide real string-literal URLs.
5993+
let src = r#"json!({"types": ["http://localhost/X"]});"#;
5994+
let stripped = strip_c_family_line_comments(src);
5995+
assert!(
5996+
stripped.contains("http://"),
5997+
"string-literal URL was wrongly stripped: {stripped:?}"
5998+
);
5999+
}
6000+
58726001
// ---------------------------------------------------------------
58736002
// 0. Isabelle prose-stripping (regression for #43 false positives)
58746003
// ---------------------------------------------------------------

0 commit comments

Comments
 (0)