Mobile-Agent: The Powerful GUI Agent Family
Updated Apr 14, 2026 (Python)
[EMNLP-2024] Build multimodal language agents for fast prototyping and production
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
MobileUse: an open-source mobile GUI agent for Android phone automation, AndroidWorld/AndroidLab evaluation, hierarchical reflection, and proactive exploration.
The Python Harness for Production AI Multi-Agent Systems
🖼️ Workshop: Build a multimodal AI agent with Haystack & GPT-4o — featuring image understanding, document retrieval, conversational memory, and human-in-the-loop safety controls
Claude Code in Docker. Drop-in OpenAI-compatible API, MCP server, Telegram bot, and CLI — five interfaces, one image. Persistent sessions, file ops, always-on skill injection, and a full dev toolchain (Go, Python, Node, K8s, Terraform, databases) or a minimal image with just the basics.
[COLM 2024] ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
A persistent, emotionally reactive 3D Digital Persona powered by Gemini 2.5 Flash Native Audio. Features sub-100ms conversational latency and procedural ARKit emotive realism.
🧮 Multi-agent AI math tutor built with LangGraph — CRAG retrieval, episodic & semantic long-term memory, Tavily MCP web search, Google OAuth, and Neo4j-style memory graph. Powered by LLaMA 3.3 70B on Groq.
Build an end-to-end system that ingests inventory report PDFs/images, runs OCR to normalize and extract tabular data, stores the cleaned dataset, and exposes a secure, conversational agent that can answer business queries over the data (aggregation, filtering, joins, trends), returning tables, charts, and exportable results.
A multimodal coding assistant that can analyze images containing code problems and generate solutions in multiple programming languages.
Rasputin Omnitool: OpenClaw skill bundle with planner/executor/reviewer agent loop and 12 tools (research, browser, sandbox, multimedia). Manus-equivalent agent built from OSS.
DoodleSoul is a multimodal AI agent for special education. It brings children's drawings to life via Gemini Live API and weaves real-time voice, Imagen 4 illustrations, and Veo 3 videos into therapeutic Social Stories using interleaved output.