
Agent: Intelligence Layer

The core of each Operator is its AI intelligence. This can be one or more models (e.g. language models, vision models, or planning and decision modules). The platform supports state-of-the-art models for reasoning and perception: recent advances in large language and multimodal models mean an agent can both read text and “see” screen content or camera feeds. Vision-Language-Action (VLA) models are one approach: they let an agent interpret visual input (such as a screenshot or a robot camera feed) and decide on actions. Research shows that integrating vision and language understanding enables robots to “perceive, reason and act” more effectively in the real world. Operators can also use reinforcement learning or traditional AI planners where appropriate. In short, any modern AI technique – from ChatGPT-style agents to custom reinforcement learners – can serve as the Operator’s brain.
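As a concrete illustration, the sketch below shows what an Operator “brain” interface might look like: an observation (a screenshot) plus a natural-language goal go in, and a structured action comes out. The `VlaModel` class and its `predict_action` method are illustrative stand-ins, not a real platform or model API.

```python
# A minimal sketch of an Operator "brain", assuming a hypothetical VLA-style
# model. Names like VlaModel and predict_action are illustrative only.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "move_arm"
    target: str        # e.g. a UI element id or a joint name
    argument: str = "" # e.g. text to type

class VlaModel:
    """Stand-in for any vision-language-action model: maps an observation
    (pixels) plus a natural-language goal to a concrete action."""
    def predict_action(self, screenshot: bytes, goal: str) -> Action:
        # A real model would run inference here; this stub always clicks OK.
        return Action(kind="click", target="button:OK")

def decide(model: VlaModel, screenshot: bytes, goal: str) -> Action:
    return model.predict_action(screenshot, goal)

if __name__ == "__main__":
    action = decide(VlaModel(), screenshot=b"...", goal="Close the dialog")
    print(action)  # Action(kind='click', target='button:OK', argument='')
```

The key design point is the narrow interface: whatever technique sits behind it (a multimodal LLM, a learned policy, or a classical planner), the rest of the Operator only sees observations in and actions out.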

Fine-Tuning & Vision-Language-Action (VLA) Models


Operators often need custom skills. The platform allows integrating fine-tuned models and policy networks. For example, one might fine-tune a language model on specific corporate documents, or use a VLA model fine-tuned for GUI interaction. VLA techniques help the agent map visual inputs to actions: e.g., reading a dialog box and clicking the right button. These approaches are increasingly powerful: recent robotics research demonstrates that combining language and vision models can replace complex hand-coded controllers for many tasks.
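The sketch below illustrates the supervised side of this idea: a small policy head (in PyTorch) fine-tuned to map pre-extracted screen features to a discrete GUI action. The feature dimension, the action vocabulary, and the stand-in data are all assumptions for illustration, not the platform’s actual training pipeline.

```python
# A minimal sketch of VLA-style supervised fine-tuning, assuming demonstrations
# have already been encoded as (image_features, action_id) pairs.
import torch
import torch.nn as nn

ACTIONS = ["click_ok", "click_cancel", "type_text", "scroll_down"]  # toy vocabulary

class GuiPolicy(nn.Module):
    """Maps pre-extracted visual features of a screen to an action choice."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, len(ACTIONS))
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)

def fine_tune(policy: GuiPolicy, features: torch.Tensor, action_ids: torch.Tensor) -> float:
    """One full-batch gradient step on recorded demonstrations."""
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)
    loss = nn.CrossEntropyLoss()(policy(features), action_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-in data (100 demonstrations, 512-dim features):
policy = GuiPolicy()
feats = torch.randn(100, 512)
labels = torch.randint(0, len(ACTIONS), (100,))
print(fine_tune(policy, feats, labels))
```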

The platform provides tools to teach Operators by example. For instance, a user can record themselves performing a task on the desktop (or teleoperating a robot) and upload that recording. The platform then uses imitation learning to train the Operator to mimic those actions. Screen recording is a common technique: record the sequence of clicks and keystrokes required to complete a workflow, and let the AI learn to reproduce it. Over time, Operators can be improved by retraining: when a user gives feedback or corrects an Operator’s behavior, that data can fine-tune the model further. This iterative loop (including reinforcement from success/failure signals) lets Operators get better as they run.
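To make the demonstration pipeline concrete, here is a sketch of what a recorded demonstration might look like and how it could be flattened into (observation, action) pairs for imitation learning. The JSON schema is an assumption for illustration, not the platform’s actual recording format.

```python
# A minimal sketch of turning a screen recording into imitation-learning
# training pairs. The event schema below is hypothetical.
import json

# One recorded demonstration: a sequence of (observation, action) events.
recording_json = """
[
  {"screenshot": "frame_000.png", "action": {"kind": "click", "target": "File"}},
  {"screenshot": "frame_001.png", "action": {"kind": "click", "target": "Export"}},
  {"screenshot": "frame_002.png", "action": {"kind": "type", "target": "filename", "text": "report.pdf"}}
]
"""

def to_training_pairs(raw: str) -> list[tuple[str, dict]]:
    """Flatten a recording into (observation, action) pairs for behavior cloning."""
    events = json.loads(raw)
    return [(e["screenshot"], e["action"]) for e in events]

for obs, act in to_training_pairs(recording_json):
    print(obs, "->", act)
```

Pairs like these are exactly what the fine-tuning step above consumes: the screenshot becomes the visual features, and the recorded action becomes the supervision label.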

At runtime, each Operator runs autonomously on its provisioned machine. The Operator agent monitors its environment, takes actions (e.g. clicking buttons, calling APIs, moving a robot arm), and processes feedback. Lifecycle management (start, monitor, update, stop) is handled by the platform’s fabric: developers simply deploy a new Operator version and the platform spins up updated machines. Operators themselves do not manage deployment; they focus on task logic. Importantly, the Operator logic is cross-platform: the same agent code works whether it’s on Windows, Linux, or even a robot OS, because the system layer abstracts away OS differences.
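Conceptually, the runtime behaves like the loop sketched below: observe the environment, decide, act, and process the success/failure feedback. The `Environment` class and its `observe`/`execute` methods are illustrative assumptions; the real primitives are provided by the platform fabric, which also owns start, stop, and update.

```python
# A minimal sketch of an Operator's runtime loop, assuming observe/act/feedback
# primitives. All names here (Environment, observe, execute) are stand-ins.
import time

class Environment:
    """Stand-in for the provisioned machine: desktop, API surface, or robot."""
    def observe(self) -> dict:
        return {"screen": b"...", "events": []}

    def execute(self, action: dict) -> bool:
        print("executing", action)
        return True  # success signal fed back to the agent

def run_operator(env: Environment, decide, max_steps: int = 3) -> None:
    """Monitor the environment, act, and process feedback until stopped.

    Lifecycle (start/stop/update) is driven by the platform, not this loop."""
    for _ in range(max_steps):
        observation = env.observe()
        action = decide(observation)
        succeeded = env.execute(action)
        if not succeeded:
            pass  # failures can be logged as data for later retraining
        time.sleep(0.1)

run_operator(Environment(), decide=lambda obs: {"kind": "click", "target": "OK"})
```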

By design, Operators only access data and systems they are authorized to use. Each machine’s access permissions and network restrictions limit what the Operator can see or control. For example, an Operator authorized to fill forms would have access to those application windows and data, but no access to unrelated files. In addition, safety checks are built into agent behavior. For instance, before performing a sensitive action (deleting files, sending an email, controlling a robot), the agent can be required to confirm its intent (via a human prompt or a rule). This helps prevent unintended or harmful actions. Overall, the platform treats Operators as untrusted components: zero trust principles (least privilege, isolation) ensure that even a rogue agent is contained.
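A least-privilege guard around actions might look like the following sketch. The action kinds, the allowed-target set, and the `confirm` hook are illustrative assumptions; in practice the hook would be wired to a human approval prompt or a policy rule.

```python
# A minimal sketch of a least-privilege action guard. SENSITIVE_KINDS and the
# confirm callback are hypothetical names for illustration.
SENSITIVE_KINDS = {"delete_file", "send_email", "move_robot"}

class ActionBlocked(Exception):
    pass

def guard(action: dict, allowed_targets: set[str], confirm) -> dict:
    """Reject out-of-scope targets; require confirmation for sensitive actions."""
    if action["target"] not in allowed_targets:
        raise ActionBlocked(f"target {action['target']!r} outside authorization")
    if action["kind"] in SENSITIVE_KINDS and not confirm(action):
        raise ActionBlocked("sensitive action not confirmed")
    return action

# Example: the Operator may only touch the forms app; deletes need sign-off.
allowed = {"forms_app"}
print(guard({"kind": "click", "target": "forms_app"}, allowed, confirm=lambda a: False))
try:
    guard({"kind": "delete_file", "target": "forms_app"}, allowed, confirm=lambda a: False)
except ActionBlocked as e:
    print("blocked:", e)
```

Placing the guard between the model’s decision and the environment’s execution keeps the zero-trust boundary outside the model itself: even a misbehaving brain can only emit actions, never bypass the check.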