Welcome to Agent S, an open-source framework designed to enable autonomous interaction with computers through an Agent-Computer Interface (ACI). Our mission is to build intelligent GUI agents that can learn from past experiences and perform complex tasks autonomously on your computer.
Whether you're interested in AI, automation, or contributing to cutting-edge agent-based systems, we're excited to have you here!
🌐[S2 blog] 📄[S2 Paper (COLM 2025)] 🎥[S2 Video]
🌐[S1 blog] 📄[S1 Paper (ICLR 2025)] 🎥[S1 Video]
Languages: Deutsch | Español | français | 日本語 | 한국어 | Português | Русский | 中文
| Benchmark | Agent S2.5 | Previous SOTA |
|---|---|---|
| OSWorld Verified (100 step) | 56.0% | 53.1% |
| OSWorld Verified (50 step) | 54.2% | 50.6% |

    pip install gui-agents

Option 1: Environment Variables
Add the following to your .bashrc (Linux) or .zshrc (macOS) and fill in your own values:

    export OPENAI_API_KEY=
    export ANTHROPIC_API_KEY=
    export HF_TOKEN=

Option 2: Python Script

    import os
    os.environ["OPENAI_API_KEY"] = ""

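Whichever option you use, a quick check that the keys are actually visible to Python can save a confusing failure later. The sketch below is illustrative and not part of `gui-agents`: `missing_keys` is a hypothetical helper, and the key list simply mirrors the three variables named above. Which keys you really need depends on the providers you use.

```python
import os

# Variable names taken from the setup instructions above.
EXPECTED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN")


def missing_keys(names=EXPECTED_KEYS, environ=None):
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All expected API keys are set.")
```

Run it in the same shell or script where you plan to start the agent, since environment variables set in one session are not visible in another.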
We support Azure OpenAI, Anthropic, Gemini, OpenRouter, and vLLM inference. See models.md for details.
For optimal grounding performance, we recommend UI-TARS-1.5-7B hosted on Hugging Face Inference Endpoints or another provider. See the Hugging Face Inference Endpoints documentation for setup instructions.
For the best configuration, we recommend using OpenAI o3-2025-04-16 as the main model, paired with UI-TARS-1.5-7B for grounding.

    agent_s \
        --provider openai \
        --model o3-2025-04-16 \
        --ground_provider huggingface \
        --ground_url http://localhost:8080 \
        --ground_model ui-tars-1.5-7b \
        --grounding_width 1920 \
        --grounding_height 1080

The grounding width and height should match the output coordinate resolution of your grounding model: 1920 × 1080 for UI-TARS-1.5-7B and 1000 × 1000 for UI-TARS-72B.
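To see why these values matter, consider how a coordinate predicted in the grounding model's coordinate space might be mapped onto the real screen. The sketch below assumes a plain linear rescale; `scale_to_screen` is a hypothetical helper written for illustration, not a function from `gui-agents`, and the library's actual conversion may differ.

```python
def scale_to_screen(x, y, grounding_size, screen_size):
    """Linearly map a point from grounding-model coordinates to screen pixels.

    grounding_size and screen_size are (width, height) tuples.
    """
    grounding_w, grounding_h = grounding_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / grounding_w), round(y * screen_h / grounding_h)


# A model that outputs coordinates on a 1000 x 1000 grid, on a 1920 x 1080 screen:
print(scale_to_screen(500, 500, (1000, 1000), (1920, 1080)))  # (960, 540)

# If the grounding size is wrongly declared as 1920 x 1080, the same output
# passes through unchanged and lands away from the intended target:
print(scale_to_screen(500, 500, (1920, 1080), (1920, 1080)))  # (500, 500)
```

Under this assumption, a mismatched grounding size shifts every click, which is why the two settings should follow the model rather than the display.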

    # Load your API keys from a .env file.
    from dotenv import load_dotenv
    load_dotenv()

    import io
    import os

    import pyautogui

    from gui_agents.s2_5.agents.agent_s import AgentS2_5
    from gui_agents.s2_5.agents.grounding import OSWorldACI

    current_platform = "linux"  # or "darwin", "windows"

    # Next, we define our engine parameters.
    # provider, model, model_url, and model_api_key are placeholders:
    # assign your own values before running.
    engine_params = {
        "engine_type": provider,
        "model": model,
        "base_url": model_url,  # Optional
        "api_key": model_api_key,  # Optional
    }

    # Load the grounding engine from a custom endpoint (fill in your own values).
    ground_provider = ""
    ground_url = ""
    ground_model = ""
    ground_api_key = ""  # Optional

    # Set grounding dimensions based on your model's output coordinate resolution.
    # UI-TARS-1.5-7B: grounding_width=1920, grounding_height=1080
    # UI-TARS-72B: grounding_width=1000, grounding_height=1000
    grounding_width = 1920  # Width of output coordinate resolution
    grounding_height = 1080  # Height of output coordinate resolution

    engine_params_for_grounding = {
        "engine_type": ground_provider,
        "model": ground_model,
        "base_url": ground_url,
        "api_key": ground_api_key,  # Optional
        "grounding_width": grounding_width,
        "grounding_height": grounding_height,
    }

    # Then, we define our grounding agent and Agent S2.5.
    grounding_agent = OSWorldACI(
        platform=current_platform,
        engine_params_for_generation=engine_params,
        engine_params_for_grounding=engine_params_for_grounding,
        width=1920,  # Optional: screen width
        height=1080,  # Optional: screen height
    )

    agent = AgentS2_5(
        engine_params,
        grounding_agent,
        platform=current_platform,
        max_trajectory_length=8,  # Optional: maximum image turns to keep
        enable_reflection=True,  # Optional: enable reflection agent
    )

    # Finally, let's query the agent!
    # Take a screenshot and encode it as PNG bytes.
    screenshot = pyautogui.screenshot()
    buffered = io.BytesIO()
    screenshot.save(buffered, format="PNG")
    screenshot_bytes = buffered.getvalue()

    obs = {"screenshot": screenshot_bytes}

    instruction = "Close VS Code"
    info, action = agent.predict(instruction=instruction, observation=obs)

    # Run the first action returned by the agent (a string of Python code).
    exec(action[0])

    # Refer to gui_agents/s2_5/cli_app.py for more details on how the inference loop works.

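The snippet above performs a single predict-and-execute step. A multi-step task repeats that step until the agent signals completion or a step budget runs out. The following is only a rough sketch of such a loop, not the logic in `cli_app.py`: `run_episode`, its `capture` and `execute` callables, the `max_steps` default, and the `"DONE"` stop signal are all assumptions made for illustration.

```python
def run_episode(instruction, predict, capture, execute, max_steps=15):
    """Repeat observe -> predict -> act; return the number of steps taken.

    predict: callable with the same keyword arguments as agent.predict above.
    capture: callable returning the current screenshot as PNG bytes.
    execute: callable that runs one action string.
    """
    for step in range(1, max_steps + 1):
        obs = {"screenshot": capture()}
        info, action = predict(instruction=instruction, observation=obs)
        code = action[0]
        # Hypothetical stop signal; check cli_app.py for the real convention.
        if code.strip().upper() == "DONE":
            return step
        execute(code)
    return max_steps


# Dry run with stand-ins, so nothing touches the real screen.
scripted = iter(["click(10, 20)", "DONE"])
executed = []
steps = run_episode(
    "demo task",
    predict=lambda instruction, observation: ({}, [next(scripted)]),
    capture=lambda: b"",
    execute=executed.append,
)
print(steps, executed)  # 2 ['click(10, 20)']
```

Passing the callables in keeps the loop testable without a display. With the objects defined earlier, `predict` would be `agent.predict` and `capture` would wrap the `pyautogui` screenshot code.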
If you find this codebase useful, please cite:

    @misc{Agent-S2,
        title={Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents},
        author={Saaket Agashe and Kyle Wong and Vincent Tu and Jiachen Yang and Ang Li and Xin Eric Wang},
        year={2025},
        eprint={2504.00906},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2504.00906},
    }

    @inproceedings{Agent-S,
        title={{Agent S: An Open Agentic Framework that Uses Computers Like a Human}},
        author={Saaket Agashe and Jiuzhou Han and Shuyu Gan and Jiachen Yang and Ang Li and Xin Eric Wang},
        booktitle={International Conference on Learning Representations (ICLR)},
        year={2025},
        url={https://arxiv.org/abs/2410.08164}
    }

Join the Agent S Community!