Official implementation of Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval.
This repository provides State-Grounded Dynamic Retrieval (SGDR), an online skill-learning method for WebArena-style web agents, together with retained baseline runners and evaluation utilities.
SGDR maintains a growing JSONL library of reusable skills. During task solving, the agent summarizes the current browser state, retrieves skills that are relevant to both the task goal and the current state, and dynamically injects the selected skills into the action space. After a successful trajectory, SGDR cleans the trace, identifies reusable action windows, synthesizes new skills, and appends them to the skill library for future tasks.
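The retrieval step described above can be illustrated with a minimal sketch. It is not the repository's implementation: the JSONL field names (`name`, `description`), the function `retrieve_skills`, the `goal_weight` mixing parameter, and the bag-of-words similarity (standing in for a real embedding model) are all assumptions made for illustration.

```python
import json
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_skills(library_lines, goal, state_summary, top_k=2, goal_weight=0.5):
    """Rank skills by a weighted mix of goal similarity and state similarity (hypothetical scoring)."""
    goal_vec, state_vec = bow(goal), bow(state_summary)
    scored = []
    for line in library_lines:
        skill = json.loads(line)  # one JSON object per library line
        desc_vec = bow(skill["description"])
        score = goal_weight * cosine(goal_vec, desc_vec) + (1 - goal_weight) * cosine(state_vec, desc_vec)
        scored.append((score, skill["name"]))
    scored.sort(reverse=True)
    # Keep only skills with some overlap; these would be injected into the action space.
    return [name for score, name in scored[:top_k] if score > 0]


# Toy library with made-up skills, one JSON object per line as in a JSONL file.
library = [
    json.dumps({"name": "search_product", "description": "type a query into the search box and submit"}),
    json.dumps({"name": "add_to_cart", "description": "click add to cart on a product page"}),
    json.dumps({"name": "post_comment", "description": "write a comment on a forum thread"}),
]

selected = retrieve_skills(
    library,
    goal="add the blue mug to the cart",
    state_summary="product page with an add to cart button",
)
print(selected)
```

Scoring against both the goal and the state summary is what lets the ranking change from step to step within one task: the goal stays fixed while the state summary is recomputed as the page changes.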
- Paper: arXiv:2606.04391
skill-dynamic-retrieval/
  browsergym/        BrowserGym/WebArena dependencies used by the project
  sgdr/              SGDR agent, retrieval, induction, and evaluation code
    actions/         Base action sets and learned skill libraries
    autoeval/        Trajectory evaluation utilities
    config_files/    WebArena task configuration template and generator
    induce/          Skill induction pipelines
    retrieval/       State summarization, embedding, and skill retrieval
    workflows/       Workflow-memory files for AWM-style baselines
Most project commands should be run from skill-dynamic-retrieval/sgdr/.
Create a Python environment and install the main dependencies:
conda create -n sgdr python=3.10
conda activate sgdr
cd skill-dynamic-retrieval/sgdr
pip install browsergym==0.10.2 browsergym-webarena==0.10.2
pip install -r requirements.txt
pip install gymnasium playwright==1.49.0 litellm
playwright install chromium
SGDR expects a running WebArena deployment. Configure its public host locally:
cp host.local.example.sh host.local.sh
# Edit host.local.sh and set WEBARENA_HOST to your WebArena host.
host.local.sh is ignored by Git because it is machine-specific. Generated task configs under config_files/*.json are also ignored because they embed local service URLs.
Set your OpenAI-compatible API key in the shell or in a local ignored .env file:
export OPENAI_API_KEY="your-api-key"
Then load the runtime environment and generate WebArena task configs:
source env.sh
python config_files/generate_test_data.py
Run one BrowserGym/WebArena task with the default agent:
cd skill-dynamic-retrieval/sgdr
source env.sh
python run_demo.py \
--task_name webarena.21 \
--websites shopping \
--headless
Run SGDR online over a task range:
python run_online.py \
--experiment sgdr \
--website shopping \
--task_ids "21-25" \
--model gpt-4.1
The retained experiment choices are:
sgdr State-grounded dynamic retrieval
awm Workflow-memory baseline
asi Action-skill induction baseline
cer_online CER-style online experience retrieval baseline
Allowed websites are shopping, admin, reddit, gitlab, and map.
For each task, run_online.py --experiment sgdr performs:
- Solve: run the web agent with dynamic skill retrieval.
- Evaluate: judge whether the trajectory completed the task.
- Clean: remove invalid or unusable steps.
- Induce: synthesize reusable skills from successful cleaned trajectories.
- Update: append new skills to the JSONL skill library.
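The five stages above can be sketched as a single function. This is a hedged illustration rather than the code in run_online.py: `run_online_step` and the four stage callables (`solve`, `evaluate`, `clean`, `induce`) are hypothetical names, the stubs stand in for the real agent, judge, and induction model, and the skill record layout is invented.

```python
import json
import tempfile
from pathlib import Path


def run_online_step(task, library_path, solve, evaluate, clean, induce):
    """One pass of solve -> evaluate -> clean -> induce -> update for a single task."""
    trajectory = solve(task, library_path)       # Solve: agent reads the current library
    if not evaluate(task, trajectory):           # Evaluate: judge task completion
        return 0                                 # failed trajectories contribute no skills
    cleaned = clean(trajectory)                  # Clean: drop invalid or unusable steps
    new_skills = induce(cleaned)                 # Induce: synthesize reusable skills
    with open(library_path, "a", encoding="utf-8") as f:
        for skill in new_skills:                 # Update: append one JSON object per line
            f.write(json.dumps(skill) + "\n")
    return len(new_skills)


# Stub stages so the sketch runs without a browser or an LLM backend.
def solve(task, library_path):
    return [{"action": "click('12')", "ok": True}, {"action": "noop()", "ok": False}]


def evaluate(task, trajectory):
    return task != "fail"


def clean(trajectory):
    return [step for step in trajectory if step["ok"]]


def induce(steps):
    return [{"name": f"skill_{i}", "steps": [s["action"]]} for i, s in enumerate(steps)]


with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp) / "shopping.jsonl"
    added_ok = run_online_step("webarena.21", lib, solve, evaluate, clean, induce)
    added_fail = run_online_step("fail", lib, solve, evaluate, clean, induce)
    library_lines = lib.read_text(encoding="utf-8").splitlines()

print(added_ok, added_fail, len(library_lines))
```

Because the library is opened in append mode after each successful task, skills induced from earlier tasks are available to later tasks in the same run.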
The skill library is stored under:
sgdr/actions/_skill_lib/sgdr_{model}/{website}.jsonl
At the start of a new SGDR run, an existing library for the same model and website is archived under _history/ unless --reuse_skill_lib is passed.
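The archiving behavior could look roughly like the sketch below. The helper name `prepare_skill_lib` and the timestamped file name used inside _history/ are assumptions; the repository may name archived libraries differently.

```python
import shutil
import tempfile
import time
from pathlib import Path


def prepare_skill_lib(lib_path, reuse_skill_lib=False):
    """Archive an existing library under _history/ unless reuse is requested (hypothetical helper)."""
    lib_path = Path(lib_path)
    if lib_path.exists() and not reuse_skill_lib:
        history_dir = lib_path.parent / "_history"
        history_dir.mkdir(parents=True, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")  # assumed naming scheme
        archived = history_dir / f"{lib_path.stem}.{stamp}{lib_path.suffix}"
        shutil.move(str(lib_path), str(archived))
    lib_path.parent.mkdir(parents=True, exist_ok=True)
    lib_path.touch(exist_ok=True)  # start from an empty library if none remains
    return lib_path


with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp) / "sgdr_gpt-4.1" / "shopping.jsonl"
    lib.parent.mkdir(parents=True)
    lib.write_text('{"name": "old_skill"}\n', encoding="utf-8")

    # Fresh run: the old library is moved aside and an empty one is created.
    prepare_skill_lib(lib)
    archived_count = len(list((lib.parent / "_history").iterdir()))
    fresh_is_empty = lib.read_text(encoding="utf-8") == ""

    # Reuse run: the existing library is left untouched.
    lib.write_text('{"name": "kept_skill"}\n', encoding="utf-8")
    prepare_skill_lib(lib, reuse_skill_lib=True)
    reused_text = lib.read_text(encoding="utf-8")

print(archived_count, fresh_is_empty, reused_text.strip())
```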
Evaluate completed runs:
python eval_results.py --experiment sgdr --model gpt-4.1 --websites shopping
python eval_results.py --experiment sgdr --model gpt-4.1 --metric autoeval
Use a different model for trajectory evaluation and SGDR induction:
python run_online.py \
--experiment sgdr \
--website shopping \
--task_ids "21-25" \
--model gpt-4o-mini \
--eval_model gpt-4o
Run with a local vLLM backend:
# Terminal 1
bash serve_vllm.sh
# Terminal 2
source env.sh vllm
python run_online.py \
--experiment sgdr \
--website shopping \
--task_ids "21-25"
In vLLM mode, LLM_MODEL_NAME from env.sh overrides CLI model arguments.
Typical SGDR outputs are:
sgdr/results/sgdr_{model}/
webarena.{id}/
summary_info.json
cleaned_steps.json
{eval_model}_autoeval.json
sgdr_logs/{id}.jsonl
sgdr/actions/_skill_lib/sgdr_{model}/
{website}.jsonl
_history/
sgdr_logs/{id}.jsonl records per-step retrieval information, including the goal, state summary, injected skills, and retrieval scores.
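A per-step log of this kind can be inspected with a few lines of Python. The sketch below assumes one JSON object per line and uses invented key names (`step`, `goal`, `state_summary`, `injected_skills`, `scores`); check an actual log file for the real keys before relying on them.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_injected_skills(log_path):
    """Count how often each skill was injected across the steps of one task log."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts.update(record.get("injected_skills", []))  # assumed key name
    return counts


# Fabricated sample records, only to make the sketch runnable.
sample_records = [
    {"step": 0, "goal": "buy a mug", "state_summary": "home page",
     "injected_skills": ["search_product"], "scores": [0.71]},
    {"step": 1, "goal": "buy a mug", "state_summary": "search results",
     "injected_skills": ["search_product", "add_to_cart"], "scores": [0.55, 0.64]},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "21.jsonl"
    log_path.write_text("\n".join(json.dumps(r) for r in sample_records) + "\n", encoding="utf-8")
    counts = count_injected_skills(log_path)

print(counts.most_common())
```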
- WebArena services should be reset between large model comparisons to avoid state carryover.
- Commercial backends may incur API costs for agent calls, induction, autoeval, and some WebArena built-in evaluators.
- Do not commit local host files, generated task configs, result directories, or API keys.
This project is released under the license specified in LICENSE.
If you use this repository, please cite the paper linked above. BibTeX will be added when the final citation metadata is available.