diff --git a/CHANGELOG.md b/CHANGELOG.md index 1bbd28e329..2c4e66549d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -23,6 +23,7 @@ checkpoint, and status-only commits are intentionally omitted. ### Fixed +- Downgraded screenshot-only browser runtime proof so ClawSweeper no longer accepts "no visible console/CSP violation" screenshots as sufficient real behavior proof. Thanks @BunsDev. - Reduced default worker fan-out by about 20% across review shards, hot intake, commit review pages, repair live-worker caps, and automatic implementation dispatches. diff --git a/docs/pr-review-comments.md b/docs/pr-review-comments.md index 3f654387da..9506137540 100644 --- a/docs/pr-review-comments.md +++ b/docs/pr-review-comments.md @@ -55,7 +55,10 @@ report has: `realBehaviorProof` field. When proof is missing, mock-only, or insufficient, this section should tell contributors that terminal screenshots, console output, copied live output, linked artifacts, recordings, and redacted logs - count even for non-visual CLI or text changes + count even for non-visual CLI or text changes. Ordinary app screenshots count + only for behavior they directly show; browser runtime, network, CSP, and + security proof needs visible diagnostic output, not a "no visible console + violation" claim - `**Next step before merge**` for PRs, or `**Next step**` for issues, from the work-candidate reason or next action - `**Security**` from the typed `securityReview` field, so supply-chain, diff --git a/instructions/low-signal-prs.md b/instructions/low-signal-prs.md index b4441cdd6b..01f9f26e4e 100644 --- a/instructions/low-signal-prs.md +++ b/instructions/low-signal-prs.md @@ -33,7 +33,9 @@ owner discussion, green checks, or a focused fix that should be preserved. but provides no after-fix evidence from a real setup, or only lists unit tests, mocks, snapshots, lint, typechecks, or CI. Ask for screenshots, terminal screenshots, console output, copied live output, linked artifacts, recordings, - or redacted runtime logs instead. + or redacted runtime logs instead. For browser runtime, network, CSP, or + security changes, an ordinary app screenshot or "no visible console violation" + claim is not enough without visible diagnostic output. ## Evidence bar diff --git a/prompts/review-item.md b/prompts/review-item.md index 36913d945f..072e83cbdf 100644 --- a/prompts/review-item.md +++ b/prompts/review-item.md @@ -84,7 +84,7 @@ likely owner. For PRs, include a dedicated security review pass in addition to the functional review. Inspect whether the diff could introduce a security or supply-chain regression, especially when it touches CI workflows, GitHub Action refs, dependency sources, lockfiles, install/build/release scripts, package publishing metadata, secrets handling, permissions, downloaded artifacts, generated/vendor/minified files, or other code execution paths. Check whether those changes are consistent with the PR title, body, discussion, and stated purpose before deciding. Be cautious when a small or unrelated functional change also introduces new third-party code execution, broadens secret or permission access, changes package resolution, adds lifecycle hooks, downloads and executes artifacts, or mixes infrastructure changes into otherwise cosmetic work. Do not infer malicious intent without concrete evidence. Always summarize this pass in `securityReview`; set `status: "cleared"` when the diff has no concrete security or supply-chain concern, `status: "needs_attention"` when there is a concrete concern, and `status: "not_applicable"` for non-PR items without a security-sensitive report. Put concrete security concerns in `securityReview.concerns` with file/line when possible, and also include blocking concerns in `risks` and `evidence` when they affect the merge/close decision. -For PRs, include a dedicated `realBehaviorProof` assessment before any pass, automerge, or repair verdict. External PRs must show that the contributor ran the changed behavior after the fix in a real setup. Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental only; they are not real behavior proof by themselves. Treat screenshots, recordings, terminal screenshots, console output, copied live output, linked artifacts, and redacted runtime logs as valid proof, including for non-visual CLI, console, text, or error-message changes. Use your tools and best judgement: inspect the PR body, comments, links, screenshots, videos, logs, terminal output, and changed behavior context; you may download/open GitHub attachment links, generate stills or contact sheets from videos, inspect terminal screenshots and logs, and compare the proof against the PR diff. Use the provided scratch directory for downloaded artifacts and keep the target checkout read-only. Use `status: "sufficient"` only when the evidence convincingly shows after-fix real behavior and an observed improved result. Use `status: "missing"` when proof is absent, `status: "mock_only"` when proof is only tests/mocks/CI, `status: "insufficient"` when the evidence is unrelated, unviewable, too weak, or does not show the changed real behavior after the fix, `status: "override"` when the PR has `proof: override`, and `status: "not_applicable"` for non-PR items or maintainer/bot PRs where the gate does not apply. When proof is missing, mock-only, or insufficient, set `needsContributorAction: true`, make the PR a human-only merge blocker, and do not request ClawSweeper repair markers because automation cannot prove the contributor's setup for them. +For PRs, include a dedicated `realBehaviorProof` assessment before any pass, automerge, or repair verdict. External PRs must show that the contributor ran the changed behavior after the fix in a real setup. Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental only; they are not real behavior proof by themselves. Treat screenshots, recordings, terminal screenshots, console output, copied live output, linked artifacts, and redacted runtime logs as valid proof, including for non-visual CLI, console, text, or error-message changes. A plain app screenshot is sufficient only for behavior it directly shows. Do not mark screenshot-only proof sufficient for browser runtime, CSP, CORS, `connect-src`, auth callback, network, or security changes when the proof only says no console error, warning, or violation is visible; require console output, a network trace, terminal/live output, logs, a recording with diagnostics, or a linked artifact that actually shows the runtime path. Use your tools and best judgement: inspect the PR body, comments, links, screenshots, videos, logs, terminal output, and changed behavior context; you may download/open GitHub attachment links, generate stills or contact sheets from videos, inspect terminal screenshots and logs, and compare the proof against the PR diff. Use the provided scratch directory for downloaded artifacts and keep the target checkout read-only. Use `status: "sufficient"` only when the evidence convincingly shows after-fix real behavior and an observed improved result. Use `status: "missing"` when proof is absent, `status: "mock_only"` when proof is only tests/mocks/CI, `status: "insufficient"` when the evidence is unrelated, unviewable, too weak, or does not show the changed real behavior after the fix, `status: "override"` when the PR has `proof: override`, and `status: "not_applicable"` for non-PR items or maintainer/bot PRs where the gate does not apply. When proof is missing, mock-only, or insufficient, set `needsContributorAction: true`, make the PR a human-only merge blocker, and do not request ClawSweeper repair markers because automation cannot prove the contributor's setup for them. For PRs, also emit Codex `/review`-style findings in `reviewFindings`. Review the diff as another engineer's proposed patch and list every discrete, @@ -324,12 +324,14 @@ Always fill `realBehaviorProof`. For external PRs, this is a merge gate, not a nice-to-have. Missing, mock-only, or insufficient proof should appear near the top of the public review as "needs real behavior proof before merge"; tell the contributor that terminal screenshots, console output, copied live output, -linked artifacts, recordings, and redacted logs count. If the proof links to -public or GitHub-hosted media, inspect it when possible before deciding. Also -tell contributors that after they add proof, updating the PR body should trigger -a fresh ClawSweeper review automatically; if it does not, they can ask a -maintainer to comment `@clawsweeper re-review`. Use `evidenceKind: "none"` when -proof is absent or mock-only, and set +linked artifacts, recordings, and redacted logs count. For non-visual browser +runtime, network, CSP, or security behavior, do not accept an ordinary app +screenshot or "no visible console violation" claim without visible diagnostic +output. If the proof links to public or GitHub-hosted media, inspect it when +possible before deciding. Also tell contributors that after they add proof, +updating the PR body should trigger a fresh ClawSweeper review automatically; if +it does not, they can ask a maintainer to comment `@clawsweeper re-review`. Use +`evidenceKind: "none"` when proof is absent or mock-only, and set `needsContributorAction: false` only for `sufficient`, `override`, or `not_applicable`. diff --git a/schema/clawsweeper-decision.schema.json b/schema/clawsweeper-decision.schema.json index 02cfa32206..08756ffbac 100644 --- a/schema/clawsweeper-decision.schema.json +++ b/schema/clawsweeper-decision.schema.json @@ -311,10 +311,11 @@ }, "summary": { "type": "string", - "description": "One concise sentence explaining whether the PR body contains after-fix real behavior proof." + "description": "One concise sentence explaining whether the PR body contains after-fix real behavior proof. Do not call screenshot-only proof sufficient for browser runtime, CSP, CORS, network, or security changes when it only claims no console error or violation is visible." }, "evidenceKind": { "type": "string", + "description": "Use screenshot only for directly visible UI or flow proof. For browser runtime, CSP, CORS, network, or security proof, prefer terminal, logs, live_output, or linked_artifact unless the media visibly includes the diagnostic output.", "enum": [ "screenshot", "recording", diff --git a/src/clawsweeper.ts b/src/clawsweeper.ts index 866f8bf533..b86303b5f9 100644 --- a/src/clawsweeper.ts +++ b/src/clawsweeper.ts @@ -4484,12 +4484,49 @@ function reportRealBehaviorProof(markdown: string): RealBehaviorProof { ? (evidenceKindValue as RealBehaviorProofEvidenceKind) : undefined; if (!status || !evidenceKind || !summary) return defaultRealBehaviorProof(markdown); - return { + return normalizeRealBehaviorProof({ status, summary, evidenceKind, needsContributorAction: /^true$/i.test(needsContributorActionValue ?? ""), - }; + }); +} + +function screenshotProofNeedsRuntimeOutput(summary: string): boolean { + if ( + /\b(?:no|without|absence of|zero|none)\b[^.]{0,120}\b(?:visible\s+)?(?:console|network|error|warning|violation|csp|cors)\b/i.test( + summary, + ) + ) { + return true; + } + if ( + !/\b(?:csp|content[- ]security[- ]policy|connect-src|script-src|style-src|img-src|cors)\b/i.test( + summary, + ) + ) { + return false; + } + return !/\b(?:devtools|developer tools|console output|console panel|network trace|network panel|network tab|terminal|logs?|live output|request|response|status code|har)\b/i.test( + summary, + ); +} + +function normalizeRealBehaviorProof(proof: RealBehaviorProof): RealBehaviorProof { + if ( + proof.status === "sufficient" && + proof.evidenceKind === "screenshot" && + screenshotProofNeedsRuntimeOutput(proof.summary) + ) { + return { + status: "insufficient", + summary: + "The screenshot proof is not enough for browser runtime or security behavior; include console, network, terminal, live output, or logs showing the changed behavior after the fix.", + evidenceKind: "screenshot", + needsContributorAction: true, + }; + } + return proof; } function nextRealBehaviorProofSufficientLabels( diff --git a/test/clawsweeper.test.ts b/test/clawsweeper.test.ts index b728796870..72b537cf5f 100644 --- a/test/clawsweeper.test.ts +++ b/test/clawsweeper.test.ts @@ -1840,6 +1840,64 @@ Full review comments: assert.doesNotMatch(markers, /clawsweeper-verdict:needs-human/); }); +test("screenshot-only browser runtime proof blocks pass markers", () => { + const report = `${reportFrontMatter({ + type: "pull_request", + number: "74460", + decision: "keep_open", + close_reason: "none", + review_status: "complete", + confidence: "high", + author: "contributor", + author_association: "CONTRIBUTOR", + labels: JSON.stringify(["clawsweeper:automerge"]), + work_candidate: "none", + pull_head_sha: "abc123def456", + })} + +## Summary + +Keep this focused PR open for automerge. + +## What This Changes + +Adds tweakcn.com to the Control UI connect-src directive. + +## Best Possible Solution + +Ask the contributor to add browser runtime proof from their real setup. + +${realBehaviorProofReportSection({ + status: "sufficient", + evidenceKind: "screenshot", + needsContributorAction: false, + summary: + "The inspected screenshot shows an after-fix Control UI import success state for a tweakcn theme, with no visible console CSP violation.", +})} + +## Review Findings + +Overall correctness: patch is correct + +Overall confidence: 0.9 + +Full review comments: + +- none +`; + + const comment = renderReviewCommentFromReport(report, "none"); + const markers = reviewAutomationMarkersFromReport(report); + + assert.match(comment, /Codex review: needs real behavior proof before merge\./); + assert.match(comment, /Needs stronger real behavior proof before merge:/); + assert.match(comment, /not enough for browser runtime or security behavior/); + assert.match(comment, /console, network, terminal, live output, or logs/); + assert.match(markers, /clawsweeper-verdict:needs-human/); + assert.doesNotMatch(markers, /clawsweeper-verdict:pass/); + assert.doesNotMatch(markers, /proof: sufficient/); +}); + test("missing real behavior proof blocks pass and repair markers", () => { const report = `${reportFrontMatter({ type: "pull_request", @@ -2716,6 +2774,8 @@ test("review prompt requires real behavior proof for PR reviews", () => { assert.match(prompt, /download\/open GitHub attachment links/); assert.match(prompt, /generate stills or contact sheets from videos/); assert.match(prompt, /compare the proof against the PR diff/); + assert.match(prompt, /screenshot-only proof sufficient/); + assert.match(prompt, /no visible console violation/); assert.match(prompt, /scratch directory/); assert.match(prompt, /@clawsweeper re-review/); assert.match(