𝗪𝗲 𝗮𝗿𝗲 𝗲𝘅𝗰𝗶𝘁𝗲𝗱 𝘁𝗼 𝗼𝗽𝗲𝗻-𝘀𝗼𝘂𝗿𝗰𝗲 𝘁𝗵𝗲 𝗦𝗲𝗹𝗳-𝗢𝗽𝗲𝗿𝗮𝘁𝗶𝗻𝗴 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 𝘁𝗵𝗮𝘁 𝗲𝗻𝗮𝗯𝗹𝗲𝘀 𝗺𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹 𝗺𝗼𝗱𝗲𝗹𝘀, 𝗶𝗻𝗰𝗹𝘂𝗱𝗶𝗻𝗴 𝗚𝗣𝗧-𝟰-𝗩𝗶𝘀𝗶𝗼𝗻, 𝘁𝗼 𝘀𝗶𝗺𝘂𝗹𝗮𝘁𝗲 𝗵𝘂𝗺𝗮𝗻-𝗹𝗶𝗸𝗲 𝗺𝗼𝘂𝘀𝗲 𝗰𝗹𝗶𝗰𝗸𝘀 𝗮𝗻𝗱 𝗸𝗲𝘆𝗯𝗼𝗮𝗿𝗱 𝗶𝗻𝗽𝘂𝘁𝘀 𝗼𝗻 𝗮 𝗰𝗼𝗺𝗽𝘂𝘁𝗲𝗿.
Given an objective, the model estimates, at each step, the X and Y coordinates for mouse clicks and the appropriate keyboard inputs.
A vision-based agent working at the OS level allows for maximum context and adaptability.
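To make the step-by-step idea concrete, here is a minimal Python sketch of such a loop. It is not the framework's actual code: the JSON reply format, the normalized (0 to 1) coordinates, and the names `parse_action`, `run`, and `scripted_model` are all assumptions made for illustration, and a scripted stub stands in for a real vision model.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Click:
    x: int  # pixel column
    y: int  # pixel row


@dataclass
class TypeText:
    text: str


@dataclass
class Done:
    summary: str


Action = Union[Click, TypeText, Done]


def parse_action(reply: str, screen_size: Tuple[int, int]) -> Action:
    """Turn a (hypothetical) JSON model reply into a concrete action."""
    data = json.loads(reply)
    kind = data["action"]
    if kind == "click":
        width, height = screen_size
        # Assumed convention: x and y are fractions of the screen in [0, 1].
        fx = min(max(float(data["x"]), 0.0), 1.0)
        fy = min(max(float(data["y"]), 0.0), 1.0)
        # Clamp so a fraction of exactly 1.0 still lands on the last pixel.
        return Click(x=min(int(fx * width), width - 1),
                     y=min(int(fy * height), height - 1))
    if kind == "type":
        return TypeText(text=str(data["text"]))
    if kind == "done":
        return Done(summary=str(data.get("summary", "")))
    raise ValueError(f"unknown action: {kind!r}")


def run(objective: str,
        model: Callable[[str, List[Action]], str],
        screen_size: Tuple[int, int],
        max_steps: int = 10) -> List[Action]:
    """Ask the model for one action per step until it reports it is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        # A real loop would capture a screenshot here, pass it to the model,
        # and then execute the returned action on the machine.
        reply = model(objective, history)
        action = parse_action(reply, screen_size)
        history.append(action)
        if isinstance(action, Done):
            break
    return history


def scripted_model(objective: str, history: List[Action]) -> str:
    """Stand-in for a vision model: replays a fixed script of replies."""
    script = [
        '{"action": "click", "x": 0.5, "y": 0.25}',
        '{"action": "type", "text": "hello"}',
        '{"action": "done", "summary": "typed greeting"}',
    ]
    return script[len(history)]


if __name__ == "__main__":
    for step in run("write a greeting", scripted_model, (1920, 1080)):
        print(step)
```

Normalized coordinates are just one plausible choice; they keep the model's output independent of screen resolution, at the cost of a conversion step before each click.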
The framework is designed to work with any vision-text multimodal model so that its ability to operate a computer can be evaluated. Significant improvements are still needed to reach human-level performance, but the repository serves as a plugin framework for swapping in and testing models.
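One way to picture a plugin setup like this is a small interface that every model adapter implements, plus a registry keyed by name. The sketch below is purely illustrative and does not reflect the repository's real structure: `VisionModel`, `register`, `load_model`, and `EchoModel` are hypothetical names.

```python
from typing import Callable, Dict, Protocol


class VisionModel(Protocol):
    """Hypothetical interface: anything with this method can be plugged in."""

    def next_action(self, objective: str, screenshot_png: bytes) -> str:
        ...


# Maps a model name to a zero-argument factory that builds the adapter.
_REGISTRY: Dict[str, Callable[[], VisionModel]] = {}


def register(name: str):
    """Decorator that adds a model adapter to the registry under `name`."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


def load_model(name: str) -> VisionModel:
    """Build the adapter registered under `name`, or fail with a clear error."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"no model registered as {name!r}") from None


@register("echo")
class EchoModel:
    """Toy adapter that ignores the pixels and just reports what it received."""

    def next_action(self, objective: str, screenshot_png: bytes) -> str:
        return f"done: {objective} ({len(screenshot_png)} bytes seen)"


if __name__ == "__main__":
    model = load_model("echo")
    print(model.next_action("open a browser", b"\x89PNG"))
```

With this shape, evaluating a different model would only mean registering another adapter; the loop that drives the mouse and keyboard would not need to change.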
We are also excited to announce we’ll be integrating our `𝗔𝗴𝗲𝗻𝘁-𝟭` model with the framework in the coming weeks.
Help us build the future of agents in public.
𝗚𝗶𝘁𝗛𝘂𝗯:
github.com/OthersideAI/self-…
𝗛𝗲𝗿𝗲'𝘀 𝗚𝗣𝗧-𝟰-𝗩𝗶𝘀𝗶𝗼𝗻 𝘄𝗿𝗶𝘁𝗶𝗻𝗴 𝗮 𝗽𝗼𝗲𝗺 𝗶𝗻 𝗚𝗼𝗼𝗴𝗹𝗲 𝗗𝗼𝗰𝘀: