Use the WebArena benchmark:

1. [Set up the standalone WebArena environment](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md).
2. Configure the URLs for each website.
3. Generate a config file for each test example and obtain the auto-login cookies for all websites.
4. Write a script that uses WebArena's environment, based on its [run.py](https://github.com/web-arena-x/webarena/blob/main/run.py).
5. Save the task execution results and evaluate them.
6. Analyze the evaluation results.
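
The saving and analysis steps could be sketched roughly as below. This is only an illustration, not part of WebArena: the JSON Lines file layout, the record keys (`task_id`, `site`, `score`), and the helper names `save_record`, `load_records`, and `summarize` are all assumptions made for the example, and a task is counted as passed when its score is 1.0.

```python
import json
from collections import defaultdict


def save_record(path, task_id, site, score):
    """Append one task result as a JSON line (hypothetical record format)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"task_id": task_id, "site": site, "score": score}) + "\n")


def load_records(path):
    """Read back every non-empty JSON line written by save_record."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(records):
    """Return (per_site, overall) pass counts and success rates.

    Assumption: a task passes when its score is at least 1.0.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [passed, total]
    for record in records:
        stats = counts[record["site"]]
        if record["score"] >= 1.0:
            stats[0] += 1
        stats[1] += 1

    per_site = {
        site: {"passed": passed, "total": total, "success_rate": passed / total}
        for site, (passed, total) in counts.items()
    }
    passed_all = sum(s["passed"] for s in per_site.values())
    total_all = sum(s["total"] for s in per_site.values())
    overall = {
        "passed": passed_all,
        "total": total_all,
        # Avoid dividing by zero when no task has been recorded yet.
        "success_rate": passed_all / total_all if total_all else 0.0,
    }
    return per_site, overall
```

In a run script, `save_record` would be called once per task after the evaluator returns a score, and `summarize(load_records(path))` would be run at the end to compare success rates across websites.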