# Eval Analysis Summary

**Run:** bu-benchmark-p22ml
**Generated:** 2026-05-20T01:37:02.726Z
**Agent:** pilo, gemini-2.5-flash, vertex, chrome, vision
**webSearch tool:** enabled ✓

## Results: 7/20 passed (35%)

| Metric | Value |
|--------|-------|
| Total tasks | 20 |
| Passed | 7 |
| Failed | 13 |
| Total tokens | 1,293,117 |
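
The headline figures can be cross-checked with a few lines of Python. The sketch below recomputes the pass rate from the table and derives an average token cost per task. The function name `summarize` is hypothetical, and the per-task average assumes the reported token total covers all 20 tasks, which the report does not state.

```python
def summarize(total: int, passed: int, failed: int, total_tokens: int) -> dict:
    """Recompute headline metrics from the results table."""
    if passed + failed != total:
        raise ValueError("passed + failed must equal total")
    return {
        "headline": f"{passed}/{total} passed ({passed / total:.0%})",
        # Assumption: total_tokens spans every task, passed and failed alike.
        "avg_tokens_per_task": total_tokens / total,
    }


summary = summarize(total=20, passed=7, failed=13, total_tokens=1_293_117)
print(summary["headline"])
print(f"{summary['avg_tokens_per_task']:,.0f} tokens per task on average")
```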

## Failure Classifications

| Classification | Count |
|----------------|-------|
| agent_gave_up_early | 10 |
| element_interaction_failed | 3 |
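
As a consistency check, the classification counts should add up to the 13 failed tasks. A minimal sketch, with the counts copied from the table above and a hypothetical helper name `classification_shares`:

```python
def classification_shares(counts: dict[str, int], failed: int) -> dict[str, float]:
    """Return each classification's share of the failed tasks."""
    if sum(counts.values()) != failed:
        raise ValueError("classification counts do not sum to the failed total")
    return {label: n / failed for label, n in counts.items()}


counts = {"agent_gave_up_early": 10, "element_interaction_failed": 3}
shares = classification_shares(counts, failed=13)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```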

---

## Failed Tasks by Task ID

### 296beb37-09d9-49f6-b644-1efa0383a483 (1 failure)

#### browser-use-benchmark--296beb37-09d9-49f6-b644-1efa0383a483 _(ran: 2026-05-20 01:19:02 UTC)_

**Question:** Go to the URL and complete the Ember form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/ember-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the Ember form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/ember-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, and I cannot interact with the file upload dialog directly through the browser automation tool.

**Judge explanation:** The Web Task Instruction required the agent to go to a URL, complete an Ember form by filling all required fields, submit it, and validate successful submission. The Reference Answer states that the agent successfully completed this task. However, the Result Response explicitly states, 'Aborted: The form requires a file upload, and I cannot interact with the file upload dialog directly through the browser automation tool.' This indicates that the agent was unable to complete the form, submit it, or validate its submission due to a technical limitation with file uploads. Therefore, the task was not successfully completed as described in the Reference Answer.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=2

---
### bb85cb1e-679f-455d-95e5-edd421ea8205 (1 failure)

#### browser-use-benchmark--bb85cb1e-679f-455d-95e5-edd421ea8205 _(ran: 2026-05-20 01:29:10 UTC)_

**Question:** Go to the URL and complete the React Hook Form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/react-hook-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the React Hook Form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/react-hook-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: Cannot complete the task because I cannot interact with the file system to create and upload a file. The form requires a file to be uploaded before submission, it can be submitted.

**Judge explanation:** The Web Task Instruction required the agent to go to the URL, complete a React Hook Form by filling all required fields, and submit it, including creating and uploading a file if necessary, and then validating successful submission. The Reference Answer indicates that the agent successfully completed this task. However, the Result Response explicitly states, 'Aborted: Cannot complete the task because I cannot interact with the file system to create and upload a file. The form requires a file to be uploaded before submission, it can be submitted.' This clearly indicates that the agent was unable to complete a critical part of the task (uploading a file) and therefore could not submit the form or validate its successful completion. Thus, the Result Response contradicts the Reference Answer and shows that the task was not successfully accomplished.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=2

---
### fe521e79-3b4d-4218-8bf9-f3421e44cb5c (1 failure)

#### browser-use-benchmark--fe521e79-3b4d-4218-8bf9-f3421e44cb5c _(ran: 2026-05-20 01:32:55 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the Formik form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/formik-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the Formik form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/formik-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, which cannot be simulated with the available tools. All other required fields (Email Address, Country, Date of Birth, Contact Preference) were filled successfully, but the form cannot be submitted without a file.

**Judge explanation:** The Web Task Instruction required the agent to 'complete the Formik form by filling in all required fields and submitting' and to 'Validate that the form was filled and submitted successfully'. The Result Response explicitly states that the form could not be fully completed because a file upload, which was a required field, could not be simulated. Consequently, the form 'cannot be submitted without a file'. This indicates a failure to complete all required fields and a failure to submit the form, directly contradicting the instruction's requirements for successful completion and submission.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### a34dfbf1-3e36-4099-bb52-f6c688453375 (1 failure)

#### browser-use-benchmark--a34dfbf1-3e36-4099-bb52-f6c688453375 _(ran: 2026-05-20 01:23:25 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the jQuery Bootstrap form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/jquery-bootstrap-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the jQuery Bootstrap form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/jquery-bootstrap-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: Cannot complete the task because the 'Upload Profile Picture' field (file input type) cannot be filled using the available tools. The 'fill' action does not work for file inputs, and there is no dedicated file upload tool. All other required fields were filled successfully.

**Judge explanation:** The Web Task Instruction required the agent to complete the jQuery Bootstrap form by filling in all required fields and submitting it, including creating a file if needed. The Result Response explicitly states, "Aborted: Cannot complete the task because the 'Upload Profile Picture' field (file input type) cannot be filled using the available tools." This indicates that a required field was not filled, and consequently, the form could not have been successfully submitted. Therefore, the task was not completed as instructed, which contradicts the Reference Answer's claim of success.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### 7fa92efb-8237-4c81-82dc-e0a5bdc0b675 (1 failure)

#### browser-use-benchmark--7fa92efb-8237-4c81-82dc-e0a5bdc0b675 _(ran: 2026-05-20 01:34:02 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the Vue form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/vue-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the Vue form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/vue-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: Unable to submit the form. I have filled in the email address, country, preferred meeting date, and contact preference. However, I am unable to upload a file as there is no tool to interact with the file system directly, and the form doesn't seem to submit without a file being uploaded. Therefore, I cannot confirm successful submission.

**Judge explanation:** The web task instruction required the agent to go to the URL, complete the Vue form by filling in all required fields, submit it, and validate successful submission. The reference answer indicates that the task was successfully completed. However, the result response explicitly states, 'Aborted: Unable to submit the form.' It further explains that while some fields were filled, the agent was 'unable to upload a file as there is no tool to interact with the file system directly, and the form doesn't seem to submit without a file being uploaded.' Consequently, the agent could not confirm successful submission. This directly contradicts the instruction's requirement for successful submission and validation, leading to a 'NOT SUCCESS' verdict.

**Classification:** element_interaction_failed
**Error patterns:** navCount=1

---
### 97c99135-3aff-4831-af24-c9fcf8d92ef7 (1 failure)

#### browser-use-benchmark--97c99135-3aff-4831-af24-c9fcf8d92ef7 _(ran: 2026-05-20 01:28:39 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the Wufoo-style form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/wufoo-style-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the Wufoo-style form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/wufoo-style-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: Cannot upload a file as there is no tool to create a dummy file from the local file system. The form requires a file to be selected before submission.

**Judge explanation:** The Web Task Instruction required the agent to go to the URL, complete all required fields of the Wufoo-style form, and submit it, including creating a file if needed. The Result Response explicitly states, 'Aborted: Cannot upload a file as there is no tool to create a dummy file from the local file system. The form requires a file to be selected before submission.' This indicates that a required field (file upload) could not be completed, leading to the form not being filled entirely and not being submitted successfully. Therefore, the task was not accomplished as described, contradicting the Reference Answer's claim of success.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### 78bdcade-3ac8-46a9-bd6f-1f4c1c219a5f (1 failure)

#### browser-use-benchmark--78bdcade-3ac8-46a9-bd6f-1f4c1c219a5f _(ran: 2026-05-20 01:31:36 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the form with hidden labels by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/hidden-labels-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the form with hidden labels by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/hidden-labels-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, and I do not have the capability to create or upload a file. The task specifies 'If needed, create a file', but I cannot perform this action with the available tools.

**Judge explanation:** The Web Task Instruction required the agent to complete a form, which included a conditional step to create a file if needed, and then submit the form. The Result Response explicitly states, "Aborted: The form requires a file upload, and I do not have the capability to create or upload a file. The task specifies 'If needed, create a file', but I cannot perform this action with the available tools." This indicates a clear failure to complete a required part of the task (creating and uploading a file); consequently, the form was not completed or submitted. The Reference Answer, however, claims the task was 'successfully completed'. Due to the agent's inability to perform the file upload, the task was not accomplished as described.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### cecf0cdc-87eb-44ec-9c8e-17c8335afa5f (1 failure)

#### browser-use-benchmark--cecf0cdc-87eb-44ec-9c8e-17c8335afa5f _(ran: 2026-05-20 01:24:31 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the Material-UI form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/material-ui-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the Material-UI form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/material-ui-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, but I cannot interact with a file selection dialog using the provided tools. The form submission failed because 'File upload is required' was displayed.

**Judge explanation:** The Web Task Instruction required the agent to complete a Material-UI form, including filling all required fields and submitting it successfully. The Reference Answer claims the agent successfully completed this task. However, the Result Response explicitly states, "Aborted: The form requires a file upload, but I cannot interact with a file selection dialog using the provided tools. The form submission failed because 'File upload is required' was displayed." This clearly indicates that a critical part of the form (file upload) was not completed, and consequently, the form submission failed. Therefore, the task was not accomplished as described in the instruction, as the agent's own report confirms.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### 4df60129-66b6-49e4-979d-f201930d732a (1 failure)

#### browser-use-benchmark--4df60129-66b6-49e4-979d-f201930d732a _(ran: 2026-05-20 01:29:53 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the AngularJS form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/angularjs-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the AngularJS form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/angularjs-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, which cannot be completed through the current browsing environment. The 'Submit Form' button was clicked, but the form was not submitted successfully due to the missing file.

**Judge explanation:** The Web Task Instruction required the agent to go to the URL, complete an AngularJS form by filling all required fields, submitting it, and validating successful submission. The Reference Answer claims the task was successfully completed. However, the Result Response explicitly states, "Aborted: The form requires a file upload, which cannot be completed through the current browsing environment. The 'Submit Form' button was clicked, but the form was not submitted successfully due to the missing file." This clearly indicates that a critical part of the form (file upload) could not be completed, and consequently, the form was not submitted successfully. Therefore, the task was not accomplished as described in the instruction and contradicts the Reference Answer.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### 53cd3515-d11a-4340-b352-e73f49e70d09 (1 failure)

#### browser-use-benchmark--53cd3515-d11a-4340-b352-e73f49e70d09 _(ran: 2026-05-20 01:25:56 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the Svelte form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/svelte-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the Svelte form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/svelte-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: Unable to upload file to the form. The 'fill' command on the file input fails with 'Input of type "file" cannot be filled', and clicking the 'Choose File' button does not open an interactable file selection dialog.

**Judge explanation:** The Web Task Instruction required the agent to go to the URL, complete the Svelte form by filling all required fields, and submit it, including creating and uploading a file if needed, and finally validating successful submission. The Reference Answer states that the task was successfully completed. However, the Result Response clearly indicates a failure: 'Aborted: Unable to upload file to the form. The 'fill' command on the file input fails with 'Input of type "file" cannot be filled', and clicking the 'Choose File' button does not open an interactable file selection dialog.' This means a critical part of completing the form (uploading a file) failed, preventing the form from being fully completed and submitted successfully. Therefore, the task was not accomplished as described.

**Classification:** element_interaction_failed
**Error patterns:** navCount=1
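
The quoted error, `Input of type "file" cannot be filled`, resembles the message Playwright raises when `fill` is called on a file input, though the report does not confirm which driver the agent uses. Assuming a Playwright-style `page` object is available to the agent's tooling, a file can be attached without opening the native dialog by passing an in-memory payload to `set_input_files`. The helper names and the default selector below are hypothetical.

```python
def make_dummy_upload(name: str = "dummy.txt", text: str = "placeholder") -> dict:
    """Build an in-memory file payload: file name, MIME type, raw bytes."""
    return {"name": name, "mimeType": "text/plain", "buffer": text.encode("utf-8")}


def attach_dummy_file(page, selector: str = 'input[type="file"]') -> dict:
    """Attach a generated file to a file input, bypassing the OS file dialog.

    Assumption: `page` exposes Playwright's `set_input_files(selector, files)`.
    """
    payload = make_dummy_upload()
    page.set_input_files(selector, files=payload)
    return payload
```

With Playwright's sync API this would be called as `attach_dummy_file(page)` once the form has loaded. A dedicated upload tool exposed to the agent could wrap the same call.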

---
### 2a8e2322-f7b3-418e-a871-0819dcc55474 (1 failure)

#### browser-use-benchmark--2a8e2322-f7b3-418e-a871-0819dcc55474 _(ran: 2026-05-20 01:35:35 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the table-based form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/table-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the table-based form by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/table-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, but I cannot create or upload a file using the available tools. Therefore, I cannot complete the form submission.

**Judge explanation:** The Web Task Instruction required the agent to complete a table-based form, including creating and uploading a file if needed, and then submitting the form. The Result Response explicitly states that the task was 'Aborted' because the form required a file upload, and the agent could not create or upload a file. Therefore, the form was not completed or submitted successfully. This directly contradicts the Reference Answer, which claims the task was successfully completed.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
### c1f60ee6-506c-4ddb-b4eb-64106930667b (1 failure)

#### browser-use-benchmark--c1f60ee6-506c-4ddb-b4eb-64106930667b _(ran: 2026-05-20 01:19:54 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the form with non-Latin characters by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/non-latin-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the form with non-Latin characters by filling in all required fields and submitting. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/non-latin-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form's email validation is rejecting all non-Latin characters, making it impossible to complete the form with non-Latin characters as per the task requirements. The error message 'A part followed by '@' should not contain the symbol 'ت'.' confirms this issue.

**Judge explanation:** The Web Task Instruction required completing a form using non-Latin characters and submitting it successfully. The Reference Answer states that the task was successfully completed. However, the Result Response clearly indicates that the task was 'Aborted' because the form's email validation rejected non-Latin characters, making it impossible to fulfill the core requirement of filling the form with non-Latin characters. The error message provided in the Result Response ("A part followed by '@' should not contain the symbol 'ت'.") further confirms this failure. Therefore, the task was not accomplished as described in the instruction.

**Classification:** element_interaction_failed
**Error patterns:** navCount=1

---
### 755c9f4c-02e3-47fe-99ff-847a8037e227 (1 failure)

#### browser-use-benchmark--755c9f4c-02e3-47fe-99ff-847a8037e227 _(ran: 2026-05-20 01:22:12 UTC)_

> ⚠️ Agent never left the starting page (navCount=1)

**Question:** Go to the URL and complete the rich text form by filling in all required fields including the rich text editor. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/rich-text-form.html
**Expected:** The agent successfully completed the following task: Go to the URL and complete the rich text form by filling in all required fields including the rich text editor. If needed, create a file. Validate that the form was filled and submitted successfully. https://browser-use.github.io/stress-tests/challenges/rich-text-form.html. Evaluate whether the task was accomplished as described, based on the agent's reported result.

**Agent answer:** Aborted: The form requires a file upload, which cannot be automated. The 'Submit Form' button was clicked, but an error 'Please select a file.' appeared. I cannot interact with a system file dialog to upload a file.

**Judge explanation:** The Web Task Instruction required completing a rich text form, including filling all required fields and submitting it successfully. The Result Response clearly states that the task was 'Aborted' because a file upload, which was a required field, could not be automated. It further confirms that an error 'Please select a file.' appeared upon attempting to submit the form, indicating that the form was not filled completely and therefore not submitted successfully. This directly contradicts the Reference Answer, which claims the task was successfully completed.

**Classification:** agent_gave_up_early
**Error patterns:** navCount=1

---
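
Reading the agent answers above, twelve of the thirteen failures cite the file-upload field as the blocker, and one (the non-Latin form) cites email validation. The sketch below reproduces that tally with a simple keyword match over excerpts of the agent answers. The bucket names and the `root_cause` helper are hypothetical and not part of the eval tooling.

```python
from collections import Counter

# Excerpts of each failed task's agent answer, keyed by the task ID prefix.
ANSWERS = {
    "296beb37": "The form requires a file upload, and I cannot interact with the file upload dialog",
    "bb85cb1e": "I cannot interact with the file system to create and upload a file",
    "fe521e79": "The form requires a file upload, which cannot be simulated with the available tools",
    "a34dfbf1": "the 'Upload Profile Picture' field (file input type) cannot be filled",
    "7fa92efb": "I am unable to upload a file as there is no tool to interact with the file system directly",
    "97c99135": "Cannot upload a file as there is no tool to create a dummy file",
    "78bdcade": "The form requires a file upload, and I do not have the capability to create or upload a file",
    "cecf0cdc": "The form requires a file upload, but I cannot interact with a file selection dialog",
    "4df60129": "The form requires a file upload, which cannot be completed",
    "53cd3515": "Unable to upload file to the form",
    "2a8e2322": "The form requires a file upload, but I cannot create or upload a file",
    "c1f60ee6": "The form's email validation is rejecting all non-Latin characters",
    "755c9f4c": "The form requires a file upload, which cannot be automated",
}


def root_cause(answer: str) -> str:
    """Bucket an agent answer by the blocker it reports (keyword heuristic)."""
    text = answer.lower()
    if "email validation" in text:
        return "email_validation"
    if "file" in text:
        return "file_upload"
    return "other"


tally = Counter(root_cause(answer) for answer in ANSWERS.values())
for cause, count in tally.most_common():
    print(f"{cause}: {count}")
```
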
## Efficiency

| Metric | Passed | Failed |
|--------|--------|--------|
| Avg steps | 9 | 9 |
| Tasks with duration data | 7/7 | 13/13 |

## Passed Tasks

| Task | Tokens | Steps | Duration | Retries |
|------|------|------|------|------|
| browser-use-benchmark--0a8d83a8-32c5-4609-ac78-2e8c784315bb | 82,341 | 12 | 4.8s | 0 |
| browser-use-benchmark--9be48103-a449-4247-9177-d6e90c76576e | 73,424 | 11 | 2.9s | 0 |
| browser-use-benchmark--45789896-66b8-4b38-810d-6fc839df03da | 304,677 | 23 | 3.4s | 0 |
| browser-use-benchmark--7d17ab49-6539-40d0-a67d-68f12958620f | 8,781 | 2 | 3.5s | 1 |
| browser-use-benchmark--7db8369d-727b-485e-9d7a-e0f0ecdd964d | 26,102 | 4 | 3.7s | 1 |
| browser-use-benchmark--36f4e2db-4387-4163-99c3-c221d63a9733 | 8,923 | 2 | 2.9s | 0 |
| browser-use-benchmark--9ba7ecfc-e5ad-43d3-9e0d-e380bb8891b6 | 40,741 | 7 | 4.9s | 0 |
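
The per-task rows can be aggregated to cross-check the Efficiency table. The sketch below hard-codes the seven rows (task ID prefix, tokens, steps, duration in seconds, retries) and computes totals and means. Subtracting the passed-task tokens from the run total to estimate failed-task usage assumes the 1,293,117 figure covers all 20 tasks.

```python
from statistics import mean

# (task ID prefix, tokens, steps, duration in seconds, retries)
PASSED_ROWS = [
    ("0a8d83a8", 82_341, 12, 4.8, 0),
    ("9be48103", 73_424, 11, 2.9, 0),
    ("45789896", 304_677, 23, 3.4, 0),
    ("7d17ab49", 8_781, 2, 3.5, 1),
    ("7db8369d", 26_102, 4, 3.7, 1),
    ("36f4e2db", 8_923, 2, 2.9, 0),
    ("9ba7ecfc", 40_741, 7, 4.9, 0),
]
RUN_TOTAL_TOKENS = 1_293_117  # from the results table

passed_tokens = sum(row[1] for row in PASSED_ROWS)
avg_steps = mean(row[2] for row in PASSED_ROWS)
avg_duration = mean(row[3] for row in PASSED_ROWS)
total_retries = sum(row[4] for row in PASSED_ROWS)
# Assumption: the run total includes both passed and failed tasks.
failed_tokens_estimate = RUN_TOTAL_TOKENS - passed_tokens

print(f"passed-task tokens: {passed_tokens:,}")
print(f"avg steps: {avg_steps:.1f}, avg duration: {avg_duration:.1f}s")
print(f"retries: {total_retries}, failed-task tokens (est.): {failed_tokens_estimate:,}")
```

The mean of 61 steps over 7 tasks is about 8.7, consistent with the rounded value of 9 in the Efficiency table.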
