People are overlooking @Google Gemini Realtime models for computer use.
It gave me sub-100ms latency with computer use.
It also has a much larger context window and is much cheaper.
Combine that with local OCR and a local screen detection model based on OmniParser by @Microsoft, and it takes actions in under 100ms when paired with @trycua.
I also added a harness for @NousResearch Hermes.
You can access it all at the tip of your cursor.
You can draw on your screen to give your agents context.
And I am making it open source!
Link in the comments.
@sundarpichai @minliangtan @mseyed