WebVoyager: 643-task live-web agent benchmark on Kernel browsers #42

Open
rgarcia wants to merge 11 commits into hypeship/cua-bench-online-mind2web from hypeship/bench-webvoyager

@rgarcia rgarcia commented Jun 27, 2026

Contributor

WebVoyager

WebVoyager (He et al., 2024, arXiv:2401.13919) is a benchmark for end-to-end web agents: given a natural-language goal, the agent drives a real browser on a live website and returns an answer. It is 643 tasks across 15 real-world sites — Allrecipes, Amazon, Apple, ArXiv, GitHub, Booking, ESPN, Coursera, Cambridge Dictionary, BBC News, Google Maps, Google Search, Google Flights, Huggingface, and Wolfram Alpha. Because the goals are open-ended and the web is live, success is not an exact-match check: a single-call GPT-4V multimodal judge reads the task, the agent's final answer, and the last k screenshots of the run and returns SUCCESS / NOT SUCCESS. The paper reports this auto-judge at 85.3% agreement with human judgment.

How upstream runs it

  • Agent / runner: run.py drives the browser and emits actions; the agent reports its result with an Action: ANSWER; [...] final action.
  • Grader: evaluation/auto_eval.py — the canonical single-call multimodal judge. It builds a fixed SYSTEM_PROMPT, attaches the last k screenshots of the episode, calls GPT-4V once, and parses the reply for NOT SUCCESS (→ fail) / SUCCESS (→ pass). The published auto-eval invocation (evaluation/run_eval.sh, README) uses --max_attached_imgs 15.

What this PR does

Brings WebVoyager to Kernel browsers as a runnable eval — generating the 643 tasks as Harbor task dirs that run on the Kernel environment, with the agent driving a sandboxed Chrome and the upstream judge re-implemented as an in-VM verifier. Parity details and a line-by-line upstream comparison are in PARITY.md.

Borrowed verbatim (for grading parity):

  • The dataset / tasks — all 643 records (web_name, id, ques, web) vendored from upstream commit 0915445 (sha-pinned in adapter_metadata.json), so generation is hermetic and the task set is identical.
  • The judge's SYSTEM_PROMPT — byte-for-byte from auto_eval.py.
  • Last-k screenshot selection and the SUCCESS / NOT SUCCESS verdict parse — same logic and the same NOT SUCCESS-wins-over-SUCCESS precedence; MAX_IMAGES defaults to 15 to match the canonical auto-eval invocation.
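
The last-k selection and verdict precedence described above can be sketched in Python as follows. This is an illustration only, not the adapter's code (the judge itself is TypeScript); the function names `last_k` and `parse_verdict` are hypothetical, and the input list is assumed to be already ordered oldest to newest.

```python
from __future__ import annotations

from typing import Optional, Sequence


def last_k(shots: Sequence[str], k: int = 15) -> list[str]:
    """Return the final k screenshot paths, keeping their order."""
    if k <= 0:
        # Guard against shots[-0:], which would return every screenshot.
        raise ValueError("k must be a positive integer")
    return list(shots[-k:])


def parse_verdict(reply: str) -> Optional[int]:
    """Map a judge reply to 0 (fail), 1 (pass), or None (abstain)."""
    # "NOT SUCCESS" contains "SUCCESS", so the negative marker is checked first.
    if "NOT SUCCESS" in reply:
        return 0
    if "SUCCESS" in reply:
        return 1
    return None  # neither marker present; the adapter maps this to reward 0
```

Checking the negative marker first is what gives NOT SUCCESS precedence: a reply that mentions both markers is graded as a failure.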

Differs from upstream (deliberate Kernel adaptations, prompt + decision logic unchanged) — and why:

  • Anthropic judge via @earendil-works/pi-ai instead of OpenAI GPT-4V. Standardizes the live-web adapters on the Anthropic judge and lets the transport handle provider routing, vision, and retries. The prompt and the decision logic are carried over unchanged, so the grading contract holds; absolute scores depend on the judge model, so JUDGE_MODEL is recorded alongside results. The default is claude-sonnet-4-5.
  • Self-contained bundled node verifier (pi-ai bundled into judge.js) so the Kernel verifier VM needs no install at grade time.
  • Whole-answer-file as the judge's Result Response instead of the brittle ANSWER[; ]+[...] regex. The cua harness controls the answer end-to-end, so re-imposing the regex could only silently drop a present answer.
  • Task text sourced directly from ground_truth["task"] instead of regexing it back out of agent logs — same string by construction, strictly more reliable.
  • Fail-closed reward: an abstain (judge emits neither marker) or any judge-call error maps to reward 0, with the raw verdict saved to grading_details.json for audit, because Harbor's reward channel is a single float.
  • Task-id slugification (spaces → -, lowercased) to satisfy Harbor's name pattern; does not change which task is which.
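
A minimal sketch of the slugification rule in the last bullet, assuming only the two transformations named there (spaces to `-`, then lowercase). The helper names are hypothetical, and the `webvoyager-` prefix is inferred from the task directory name that appears in the adapter tests, not from a documented contract.

```python
from __future__ import annotations


def slugify_task_id(source_id: str) -> str:
    """Replace spaces with '-' and lowercase the upstream task id."""
    return source_id.replace(" ", "-").lower()


def local_task_id(source_id: str, prefix: str = "webvoyager-") -> str:
    """Hypothetical helper: prefix the slug to form a task directory name."""
    return prefix + slugify_task_id(source_id)


def check_unique(source_ids: list[str]) -> dict[str, str]:
    """Map slug -> source id, raising if two distinct ids share a slug."""
    seen: dict[str, str] = {}
    for sid in source_ids:
        slug = slugify_task_id(sid)
        if slug in seen and seen[slug] != sid:
            raise ValueError(f"slug collision: {seen[slug]!r} vs {sid!r}")
        seen[slug] = sid
    return seen
```

The uniqueness check is one way to back the claim that slugification does not change which task is which: the mapping stays one-to-one as long as no two upstream ids differ only by case or by space versus hyphen.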

In this repo

The adapter lives in benchmarks/adapters/webvoyager/. Generate the task dirs (after building the judge bundle once) and run on Kernel + cua:

cd benchmarks
python3 adapters/webvoyager/src/webvoyager/main.py --output-dir adapters/webvoyager/.tasks
uv run harbor run -p adapters/webvoyager/.tasks -e kernel \
  --agent-import-path cua_harbor:CuaHarborAgent -m anthropic/claude-sonnet-4-6 \
  --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY

A live 20-task smoke and the full run/parity notes are in SMOKE.md and PARITY.md. Stacks on the shared cua-harbor core (#40).

Test plan

  • uv run pytest adapters/webvoyager/tests — generation + judge parse/verdict/error-handling (network stubbed)
  • uv run ruff check clean; judge package vitest green
  • Live 20-task smoke on Kernel (SMOKE.md)
  • Full parity run with extended per-task budget + residential proxy (follow-up)

rgarcia commented Jun 27, 2026

Contributor Author

Parity pass vs MinorJerry/WebVoyager @ 0915445 (9ca1f57): fixed a real fidelity bug. The judge's last-k screenshot count was 3, not the canonical --max_attached_imgs 15. Raised MAX_IMAGES 3→15 (still env-overridable). The k=3 window was the cause of the smoke run's screenshot-coverage false negatives. The GPT-4V SYSTEM_PROMPT and verdict parsing were confirmed verbatim; the Anthropic-vs-OpenAI judge is kept as an intentional adaptation. See PARITY.md.

@rgarcia rgarcia marked this pull request as ready for review June 27, 2026 22:33

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Invalid max-images breaks last-k
    • The judge CLI now validates --max-images as a positive integer and throws on invalid values, preventing slice(-0/-NaN) from attaching all screenshots.
  • ✅ Fixed: Negative limit drops wrong tasks
    • Task limit handling now rejects negative values in both CLI parsing and adapter initialization so slicing cannot silently select the wrong task set.
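
Both issues are slicing edge cases. The judge is TypeScript, where `slice(-0)` and `slice(NaN)` both start from index 0, but Python slicing has the same two pitfalls, so they can be reproduced in a short sketch. The helper names below are hypothetical and are not the adapter's functions.

```python
shots = [f"shot-{n}.png" for n in range(5)]
tasks = ["t0", "t1", "t2", "t3"]


def last_k_unchecked(items, k):
    # With k == 0 this is items[-0:], i.e. items[0:], which returns everything.
    return items[-k:]


def first_n_unchecked(items, n):
    # With n == -1 this is items[:-1], which drops the last item
    # instead of selecting the first n.
    return items[:n]


def positive_int(raw: str) -> int:
    """Parse a strictly positive integer; raise ValueError otherwise."""
    value = int(raw)  # raises ValueError for "abc", "2.5", and ""
    if value <= 0:
        raise ValueError("must be a positive integer")
    return value


print(len(last_k_unchecked(shots, 0)))   # all 5 screenshots, not zero
print(first_n_unchecked(tasks, -1))      # ['t0', 't1', 't2']
```

Validating at the argument-parsing boundary, as the autofix does, keeps the slicing code itself free of special cases.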

Or push these changes by commenting:

@cursor push 8070bc1ebc
Preview (8070bc1ebc)
diff --git a/benchmarks/adapters/webvoyager/judge/src/judge.ts b/benchmarks/adapters/webvoyager/judge/src/judge.ts
--- a/benchmarks/adapters/webvoyager/judge/src/judge.ts
+++ b/benchmarks/adapters/webvoyager/judge/src/judge.ts
@@ -32,7 +32,15 @@
   detailsOut?: string;
 }
 
-function parseArgs(argv: string[]): Args {
+function parsePositiveIntFlag(flag: string, value: string): number {
+  const parsed = Number(value);
+  if (!Number.isInteger(parsed) || parsed <= 0) {
+    throw new Error(`--${flag} must be a positive integer`);
+  }
+  return parsed;
+}
+
+export function parseArgs(argv: string[]): Args {
   const flags = new Map<string, string>();
   for (let i = 0; i < argv.length; i += 1) {
     const arg = argv[i];
@@ -51,7 +59,7 @@
     answer: flags.get("answer") ?? "/logs/agent/answer.txt",
     shots: flags.get("shots") ?? "/logs/agent/shots",
     judgeModel: flags.get("judge-model") ?? "claude-sonnet-4-5",
-    maxImages: Number(flags.get("max-images") ?? "15"),
+    maxImages: parsePositiveIntFlag("max-images", flags.get("max-images") ?? "15"),
     rewardOut: required("reward-out"),
     detailsOut: flags.get("details-out"),
   };

diff --git a/benchmarks/adapters/webvoyager/judge/test/judge.test.ts b/benchmarks/adapters/webvoyager/judge/test/judge.test.ts
--- a/benchmarks/adapters/webvoyager/judge/test/judge.test.ts
+++ b/benchmarks/adapters/webvoyager/judge/test/judge.test.ts
@@ -3,7 +3,7 @@
 import { join } from "node:path";
 import { describe, expect, it } from "vitest";
 import type { Args } from "../src/judge.ts";
-import { run } from "../src/judge.ts";
+import { parseArgs, run } from "../src/judge.ts";
 import type { GradingDetails, JudgeContent, JudgeModel } from "../src/types.ts";
 
 /** A /logs/agent + /tests layout, plus the verifier output paths run() writes. */
@@ -46,6 +46,22 @@
   return JSON.parse(readFileSync(args.detailsOut!, "utf8")) as GradingDetails;
 }
 
+describe("parseArgs", () => {
+  it("defaults max-images to 15", () => {
+    const args = parseArgs(["--reward-out", "/tmp/reward.txt"]);
+    expect(args.maxImages).toBe(15);
+  });
+
+  it.each(["0", "-1", "abc", "2.5", ""])(
+    "rejects invalid max-images value %s",
+    (raw) => {
+      expect(() =>
+        parseArgs(["--reward-out", "/tmp/reward.txt", "--max-images", raw])
+      ).toThrow("--max-images must be a positive integer");
+    }
+  );
+});
+
 describe("run", () => {
   it.each([
     ["The agent did it. SUCCESS", "1"],

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
@@ -100,6 +100,8 @@
         **kwargs: object,
     ):
         self.output_dir = Path(output_dir)
+        if limit is not None and limit < 0:
+            raise ValueError("limit must be >= 0")
         self.limit = limit
         self.overwrite = overwrite
         self.task_ids = task_ids

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/main.py b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/main.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
@@ -21,6 +21,13 @@
     return Path(__file__).resolve().parents[3] / ".tasks"
 
 
+def _non_negative_int(value: str) -> int:
+    parsed = int(value)
+    if parsed < 0:
+        raise argparse.ArgumentTypeError("--limit/--num-tasks must be >= 0")
+    return parsed
+
+
 def _parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description="Generate Harbor tasks for the WebVoyager benchmark",
@@ -34,7 +41,7 @@
     parser.add_argument(
         "--limit",
         "--num-tasks",
-        type=int,
+        type=_non_negative_int,
         dest="limit",
         default=None,
         help="Generate only the first N tasks",

diff --git a/benchmarks/adapters/webvoyager/tests/test_adapter.py b/benchmarks/adapters/webvoyager/tests/test_adapter.py
--- a/benchmarks/adapters/webvoyager/tests/test_adapter.py
+++ b/benchmarks/adapters/webvoyager/tests/test_adapter.py
@@ -2,6 +2,7 @@
 
 from __future__ import annotations
 
+import argparse
 import json
 import re
 import sys
@@ -14,6 +15,7 @@
 sys.path.insert(0, str(SRC))
 
 from webvoyager.adapter import WebVoyagerAdapter, _index_reference, _toml_escape  # noqa: E402
+from webvoyager.main import _non_negative_int  # noqa: E402
 
 
 @pytest.fixture
@@ -132,6 +134,18 @@
     assert ids == {"Amazon--3", "Apple--1"}
 
 
+def test_negative_limit_rejected(tmp_path: Path) -> None:
+    with pytest.raises(ValueError, match="limit must be >= 0"):
+        WebVoyagerAdapter(output_dir=tmp_path / "out", limit=-1)
+
+
+def test_limit_cli_parser_rejects_negative_values() -> None:
+    assert _non_negative_int("0") == 0
+    assert _non_negative_int("3") == 3
+    with pytest.raises(argparse.ArgumentTypeError, match="--limit/--num-tasks must be >= 0"):
+        _non_negative_int("-1")
+
+
 def test_overwrite_false_skips_existing(adapter: WebVoyagerAdapter) -> None:
     adapter.run()
     target = adapter.output_dir / "webvoyager-allrecipes--0" / "instruction.md"

Comment thread benchmarks/adapters/webvoyager/judge/src/judge.ts Outdated
Comment thread benchmarks/adapters/webvoyager/src/webvoyager/adapter.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Wrong default tasks output path
    • Changed _default_output_dir() to use parents[2] so the default resolves to adapters/webvoyager/.tasks as documented.
  • ✅ Fixed: Task IDs omit slug alias
    • Extended task selection to also match normalize_id(task.source_id) so normalized IDs like apple--1 are accepted.
  • ✅ Fixed: Refresh pulls unpinned upstream data
    • Updated refresh URLs to read the pinned upstream_commit from adapter_metadata.json instead of tracking the upstream main branch.
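
The off-by-one in the first fix is easy to verify with `pathlib`. The sketch below uses a pure path so it runs anywhere; the `/repo` root is a hypothetical checkout location, while the rest of the path matches the file layout shown in the diffs.

```python
from pathlib import PurePosixPath

# Hypothetical checkout root; the remainder mirrors the adapter layout.
main_py = PurePosixPath(
    "/repo/benchmarks/adapters/webvoyager/src/webvoyager/main.py"
)

# parents[0] is the containing package, parents[1] is src/,
# parents[2] is the adapter root, parents[3] is one level too high.
for depth in range(4):
    print(depth, main_py.parents[depth])

documented_default = main_py.parents[2] / ".tasks"
previous_default = main_py.parents[3] / ".tasks"
```

Here `documented_default` resolves to `adapters/webvoyager/.tasks` under the checkout, whereas `previous_default` lands in `adapters/.tasks`.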

Or push these changes by commenting:

@cursor push d62ef80cff
Preview (d62ef80cff)
diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
@@ -29,8 +29,11 @@
 # copied into each task's tests/ so the verifier runs `node judge.js` with no install.
 ADAPTER_ROOT = PACKAGE_DIR.parents[1]
 JUDGE_BUNDLE = ADAPTER_ROOT / "judge" / "dist" / "judge.js"
+UPSTREAM_COMMIT = json.loads((ADAPTER_ROOT / "adapter_metadata.json").read_text())["dataset"][
+    "upstream_commit"
+]
 
-RAW_BASE = "https://raw.githubusercontent.com/MinorJerry/WebVoyager/main/data"
+RAW_BASE = f"https://raw.githubusercontent.com/MinorJerry/WebVoyager/{UPSTREAM_COMMIT}/data"
 DATASET_URL = f"{RAW_BASE}/WebVoyager_data.jsonl"
 REFERENCE_URL = f"{RAW_BASE}/reference_answer.json"
 
@@ -136,6 +139,7 @@
                 task
                 for task in tasks
                 if task.source_id in requested
+                or self.normalize_id(task.source_id) in requested
                 or self.make_local_task_id(task.source_id) in requested
             ]
         if self.limit is not None:

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/main.py b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/main.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
@@ -18,7 +18,7 @@
 
 
 def _default_output_dir() -> Path:
-    return Path(__file__).resolve().parents[3] / ".tasks"
+    return Path(__file__).resolve().parents[2] / ".tasks"
 
 
 def _parse_args() -> argparse.Namespace:

diff --git a/benchmarks/adapters/webvoyager/tests/test_adapter.py b/benchmarks/adapters/webvoyager/tests/test_adapter.py
--- a/benchmarks/adapters/webvoyager/tests/test_adapter.py
+++ b/benchmarks/adapters/webvoyager/tests/test_adapter.py
@@ -13,7 +13,14 @@
 SRC = Path(__file__).resolve().parents[1] / "src"
 sys.path.insert(0, str(SRC))
 
-from webvoyager.adapter import WebVoyagerAdapter, _index_reference, _toml_escape  # noqa: E402
+from webvoyager.adapter import (  # noqa: E402
+    DATASET_URL,
+    REFERENCE_URL,
+    WebVoyagerAdapter,
+    _index_reference,
+    _toml_escape,
+)
+from webvoyager.main import _default_output_dir  # noqa: E402
 
 
 @pytest.fixture
@@ -132,6 +139,23 @@
     assert ids == {"Amazon--3", "Apple--1"}
 
 
+def test_task_ids_accept_normalized_alias(tmp_path: Path) -> None:
+    adapter = WebVoyagerAdapter(output_dir=tmp_path / "out", task_ids=["apple--1"])
+    selected = adapter._select()
+    assert {t.source_id for t in selected} == {"Apple--1"}
+
+
+def test_refresh_urls_pin_upstream_commit() -> None:
+    metadata = json.loads((Path(__file__).resolve().parents[1] / "adapter_metadata.json").read_text())
+    upstream_commit = metadata["dataset"]["upstream_commit"]
+    assert f"/{upstream_commit}/" in DATASET_URL
+    assert f"/{upstream_commit}/" in REFERENCE_URL
+
+
+def test_default_output_dir_points_to_adapter_root() -> None:
+    assert _default_output_dir() == Path(__file__).resolve().parents[1] / ".tasks"
+
+
 def test_negative_limit_rejected(tmp_path: Path) -> None:
     # tasks[:limit] with a negative limit would drop tasks off the end, so it must error.
     with pytest.raises(ValueError, match="non-negative"):

Comment thread benchmarks/adapters/webvoyager/src/webvoyager/main.py
Comment thread benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
Comment thread benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
rgarcia and others added 9 commits June 28, 2026 12:44

Generates WebVoyager's 643 live-web tasks (15 sites) as Harbor task dirs that
run on the Kernel environment via the shared cua_harbor agent. Each record
becomes instruction.md + environment/kernel.json (start_url + stealth +
1280x1024) + a per-task ground_truth.json; the dataset is vendored and pinned
to upstream commit 0915445 for hermetic generation.

The verifier ports WebVoyager's single multimodal judge (SYSTEM_PROMPT verbatim
from upstream auto_eval.py) to the Anthropic Messages API: it reads
/logs/agent/answer.txt + the last-k /logs/agent/shots/shot-<n>.png the agent
spilled and writes a 0/1 reward (SUCCESS/NOT SUCCESS, ambiguous fails closed).

Site names with spaces are slugified so [task].name matches ORG_NAME_PATTERN,
and reference answers with stray control chars are escaped for valid TOML.
Generated task dirs and caches are gitignored. Mocked unit tests + ruff green.

The Kernel verifier VM has Python 3 but no pip/ensurepip, so the judge
cannot install the anthropic SDK at grade time. Call the Messages API
directly with urllib.request instead; drop the install step from test.sh
and point docs at bare python3 for generation. Also gitignore _smoke_logs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Run the WebVoyager adapter end to end on Kernel browsers with cua as the
agent and the Anthropic WebJudge as the verifier: 20 tasks, pass rate 10/20
(10/17 of graded tasks), 3 agent timeouts on heavy/anti-bot sites, no adapter
bugs. SMOKE.md captures the per-task table and the observed failure taxonomy.

Make the judge resilient across model generations and transient API failures:
retry once without `temperature` when a model rejects it with a 400 (newer
models do), and fail closed to reward 0 with the error recorded in
grading_details.json instead of crashing a trial into a missing reward.

The mid-run snapshot under-counted exceptions; the final summary is 5
(4 AgentTimeoutError + 1 AddTestsDirError). Headline Mean 0.500 (10/20)
unchanged.

Replace the SMOKE notes with the claude-opus-4-8 agent + opus-4-8 judge run:
14/20 pass over 20 curated tasks across 12 sites, 0 judge/adapter exceptions.
Failure taxonomy: 1 anti-bot (Cloudflare), 3 screenshot-coverage false-negatives
(the MAX_IMAGES tension), 1 agent timeout (multi-constraint faceted search), and
1 env/session-lifetime error (session deleted before the shared-session verifier
could attach). The judge hardening this run validated (temperature-drop retry +
fail-closed on HTTP error) is already on the branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… auto-eval

The canonical WebVoyager auto-eval invocation (evaluation/run_eval.sh + README)
runs the GPT-4V judge with --max_attached_imgs 15; our default of 3 was read
from auto_eval.py's argparse default, which is never what produces the published
numbers. With one screenshot spilled per agent step, the last-k window is the
only place the deciding frame can land, so k=3 left correct answers unverifiable
and produced screenshot-coverage false-negatives.

Set the default to 15 in task.toml and webjudge.py (env override preserved) and
fix the README/run-config notes that quoted the old default. A live re-run at
k=15 recovers the SMOKE false-negatives (apple--2, huggingface--2 both 0 -> 1).
Adds PARITY.md documenting the applied fix vs the deliberate Kernel adaptations
left intact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the hand-rolled stdlib-urllib Anthropic judge (webjudge.py) with a
self-contained TypeScript bin under judge/ that calls the model through
@earendil-works/pi-ai's completeSimple. pi-ai owns provider routing, env-var
keys, o-series temperature/max_completion_tokens quirks, vision, and retries,
so the manual provider client and temperature-drop retry are deleted rather
than ported.

Transport-only change: the SYSTEM_PROMPT, the last-k (MAX_IMAGES=15) screenshot
selection, the SUCCESS/NOT SUCCESS verdict parse, and the claude-sonnet-4-5
default are carried over byte-identically. JUDGE_MODEL is now a pi-ai
provider:name ref (bare name defaults to anthropic). pi-ai is bundled into a
single judge.js via tsdown (inlineDynamicImports) so the verifier runs with no
install on the Kernel VM. test.sh shells `node judge.js` with the same inputs;
the adapter copies the built bundle into each task's tests/ and build-checks it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…i-ai judge

Same o-series handling as the online-mind2web judge: o4-mini/o3 reject
`temperature` and a `none` reasoning effort. Gate `reasoning: medium` + no
temperature to OpenAI reasoning models; the claude-sonnet-4-5 default keeps
`temperature: 0`. Also drops the docstring's incorrect claim that pi-ai omits
temperature for o-series. Verified live with both claude-sonnet-4-5 and
openai:o4-mini.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

A 0/negative/non-numeric --max-images made lastShots' slice(-k) attach
every screenshot instead of the last k, risking judge token blowups and
spurious 0 rewards; parse it to a positive integer, falling back to the
15 default otherwise. A negative --limit likewise made tasks[:limit] drop
tasks off the end instead of taking the first N, so reject it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@rgarcia rgarcia force-pushed the hypeship/bench-webvoyager branch from d2cff75 to cad3863 Compare June 28, 2026 12:45

The default output dir was computed with parents[3], which resolves to
benchmarks/adapters/.tasks instead of the documented
adapters/webvoyager/.tasks. Use parents[2] so the default lands in the
adapter package root, matching the README and --help text. Add a test
covering the resolved default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Artifact reads bypass fail-closed
    • I moved ground-truth, answer, and screenshot loading inside the existing fail-closed try/catch so artifact I/O and parse failures now still write reward 0 and optional grading details.
  • ✅ Fixed: OpenAI judge key not plumbed
    • I added OPENAI_API_KEY passthrough in the task template verifier environment so OpenAI judge models can authenticate on the verifier VM.
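
The fail-closed contract from the first fix can be sketched in Python as follows. This is a simplified illustration, not a port of `run()`: the function name, the callable parameters, and the keys written to the details file are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_fail_closed(
    load_inputs: Callable[[], tuple],
    judge: Callable[[str, str, list], str],
    reward_out: Path,
    details_out: Optional[Path] = None,
) -> int:
    """Always write a reward; any read, parse, or judge error becomes 0."""
    verdict, reward, error = "", 0, None
    try:
        # Artifact loading sits inside the try so I/O errors also fail closed.
        task, answer, shots = load_inputs()
        if not answer and not shots:
            reward_out.write_text("0")  # nothing to judge
            return 0
        verdict = judge(task, answer, shots)
        reward = 1 if "SUCCESS" in verdict and "NOT SUCCESS" not in verdict else 0
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
    reward_out.write_text(str(reward))
    if details_out is not None:
        # Hypothetical detail keys, kept for audit.
        details_out.write_text(
            json.dumps({"verdict": verdict, "reward": reward, "error": error})
        )
    return reward


def missing_artifacts() -> tuple:
    raise FileNotFoundError("ground_truth.json")


with tempfile.TemporaryDirectory() as tmp:
    reward_path = Path(tmp) / "reward.txt"
    details_path = Path(tmp) / "grading_details.json"
    demo_reward = run_fail_closed(
        missing_artifacts, lambda t, a, s: "SUCCESS", reward_path, details_path
    )
    demo_written = reward_path.read_text()
    demo_details = json.loads(details_path.read_text())
```

In the demo, the loader raises before the judge is ever called, yet `reward.txt` still exists and contains `0`, with the exception recorded in the details file.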

Or push these changes by commenting:

@cursor push 93ce0ffea5
Preview (93ce0ffea5)
diff --git a/benchmarks/adapters/webvoyager/judge/src/judge.ts b/benchmarks/adapters/webvoyager/judge/src/judge.ts
--- a/benchmarks/adapters/webvoyager/judge/src/judge.ts
+++ b/benchmarks/adapters/webvoyager/judge/src/judge.ts
@@ -77,27 +77,29 @@
 
 /**
  * Read the artifacts, grade through `makeJudge()`, and write reward.txt (+
- * optional grading_details.json). Both model resolution and the judge call run
- * inside the try, so a missing key or a transient API error fails closed to
- * reward 0 with the error recorded in the details rather than crashing the
- * verifier. Takes a factory so the file contract is testable without a live
- * provider call.
+ * optional grading_details.json). Artifact reads, model resolution, and the
+ * judge call all run inside the try so parse/fs errors or API failures fail
+ * closed to reward 0 with the error recorded in details instead of crashing
+ * the verifier. Takes a factory so the file contract is testable without a
+ * live provider call.
  */
 export async function run(args: Args, makeJudge: () => JudgeModel): Promise<void> {
-  const task = loadGroundTruth(args.groundTruth).task;
-  const answer = loadAnswer(args.answer);
-  const shots = lastShots(args.shots, args.maxImages);
-
-  // No answer and no screenshots: nothing to judge, fail closed without details.
-  if (!answer && shots.length === 0) {
-    writeReward(args.rewardOut, 0);
-    return;
-  }
-
+  let answer = "";
+  let shots = [] as ReturnType<typeof lastShots>;
   let verdict = "";
   let reward: 0 | 1 = 0;
   let error: string | null = null;
   try {
+    const task = loadGroundTruth(args.groundTruth).task;
+    answer = loadAnswer(args.answer);
+    shots = lastShots(args.shots, args.maxImages);
+
+    // No answer and no screenshots: nothing to judge, fail closed without details.
+    if (!answer && shots.length === 0) {
+      writeReward(args.rewardOut, 0);
+      return;
+    }
+
     ({ verdict, reward } = await gradeWithWebJudge({ task, answer, shots, judge: makeJudge() }));
   } catch (err) {
     error = err instanceof Error ? `${err.name}: ${err.message}` : String(err);

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/task-template/task.toml b/benchmarks/adapters/webvoyager/src/webvoyager/task-template/task.toml
--- a/benchmarks/adapters/webvoyager/src/webvoyager/task-template/task.toml
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/task-template/task.toml
@@ -19,6 +19,7 @@
 
 [verifier.env]
 ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
+OPENAI_API_KEY = "${OPENAI_API_KEY}"
 JUDGE_MODEL = "${WEBVOYAGER_JUDGE_MODEL:-claude-sonnet-4-5}"
 MAX_IMAGES = "${WEBVOYAGER_MAX_IMAGES:-15}"

Comment thread benchmarks/adapters/webvoyager/judge/src/judge.ts Outdated

The judge bin read ground_truth.json / answer.txt / the screenshots before
the try/catch, so a missing or corrupt artifact threw out of the bin and left
reward.txt unwritten (only test.sh's empty-file fallback wrote 0). Move the
artifact reads inside the try so any read/parse error fails closed to reward 0
with the error in grading_details, matching the online-mind2web judge.

Also forward OPENAI_API_KEY in [verifier.env] so the documented
JUDGE_MODEL=openai:o4-mini path has its key on the verifier VM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Anthropic judge temperature not dropped
    • Updated judge option selection so reasoning models only get OpenAI reasoning effort and no longer send temperature for non-OpenAI providers, preventing Anthropic 400 errors.
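
The option selection this fix describes reduces to a small decision table. The Python sketch below mirrors the branches in the accompanying diff; `judge_options` is a hypothetical name, the `reasoning` flag stands in for the model metadata the TypeScript code reads, and the token limit in the test is a placeholder because the real `MAX_TOKENS` value is not shown.

```python
def judge_options(provider: str, reasoning: bool, api_key: str, max_tokens: int) -> dict:
    """Pick request options by provider and reasoning capability."""
    base = {"apiKey": api_key, "maxTokens": max_tokens}
    if reasoning:
        if provider == "openai":
            # OpenAI reasoning models need an explicit effort and no temperature.
            return {**base, "reasoning": "medium"}
        # Other reasoning models: send neither temperature nor effort.
        return base
    # Non-reasoning models keep deterministic scoring.
    return {**base, "temperature": 0}
```

The behavioral change from the previous version is the middle branch: a reasoning model from a non-OpenAI provider no longer receives `temperature: 0`.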

Or push these changes by commenting:

@cursor push 7a029ad4ab
Preview (7a029ad4ab)
diff --git a/benchmarks/adapters/webvoyager/judge/src/model.ts b/benchmarks/adapters/webvoyager/judge/src/model.ts
--- a/benchmarks/adapters/webvoyager/judge/src/model.ts
+++ b/benchmarks/adapters/webvoyager/judge/src/model.ts
@@ -36,15 +36,15 @@
   // string, so widen the way pi-ai's own consumers do.
   const model = getModel(provider as never, name as never) as Model<Api>;
   const apiKey = getEnvApiKey(provider);
-  // OpenAI reasoning backbones (o4-mini, o3, …) reject `temperature` and a
-  // reasoning effort of "none" (pi-ai's default when unset); they require
-  // low/medium/high. Other backbones — including this adapter's
-  // claude-sonnet-4-5 default — keep deterministic scoring (temperature 0).
+  // Reasoning backbones reject `temperature`; OpenAI reasoning models also
+  // require reasoning effort low/medium/high (pi-ai defaults to "none").
+  // Non-reasoning backbones keep deterministic scoring (temperature 0).
   const baseOptions = { apiKey, maxTokens: MAX_TOKENS };
-  const options =
-    model.reasoning && provider === "openai"
+  const options = model.reasoning
+    ? provider === "openai"
       ? { ...baseOptions, reasoning: "medium" as const }
-      : { ...baseOptions, temperature: 0 };
+      : baseOptions
+    : { ...baseOptions, temperature: 0 };
   return {
     async complete(systemPrompt, content) {
       const res = await completeSimple(

Reviewed by Cursor Bugbot for commit a15dafe.

Comment thread benchmarks/adapters/webvoyager/judge/src/model.ts
@rgarcia rgarcia changed the title from "WebVoyager Harbor adapter (Kernel env + cua agent)" to "WebVoyager: 643-task live-web agent benchmark on Kernel browsers" Jun 28, 2026