Hi all! Sharing some of our recent work around building RL envs and sims for agent training.
There are a lot more technical details on building the benchmark in the post. If you are interested in more RL/Post-Training, I'd highly recommend reading this super in-depth blog from our partners at Yutori: https://yutori.com/blog/introducing-navigator
Some more casual thoughts and lessons:
1) A high volume of high-quality RL environments / sims remains one of the largest blockers to training frontier agents, especially as labs/enterprises shift towards creating increasingly specialized AI coworkers that can do real work.
2) Building an RL env is VERY different from building a high-quality dataset. While the primary inputs for dataset creation are specialized human annotators and clear rubrics, building a great RL env takes humans, engineers, product, data, and an orchestration of everything together. There are a lot of greenfield problems when you move from building singular environments to SCALING them by 1-3 orders of magnitude.
3) There is a constant push/pull between building tasks that are easily verifiable and building tasks that are realistic. It's sort of like a 2x2 grid: the best (and most valuable) tasks are both realistic and verifiable. There are constant tradeoffs being made, and we often find ourselves limited by the types of realistic tasks we can make if they lack a clear verifier (rough sketch of a state-based verifier below the list). I'm reminded of Jason Wei's post here: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
4) When it comes to building browser sims, we found the hardest challenges come NOT from mimicking the frontend components but from creating a realistic distribution of data to sit on top of (toy example below the list). Although not immediately obvious, this makes a lot of sense. For example, when building Noodle Flights, the frontend UI was (though non-trivial) manageable to create, but modeling the distribution of complex flight data was infinitely harder.
5) It's an iterative process. Building a perfect sim / verifier out of the gate is very difficult, and a large part of the RL process is shepherding / QA of specific tasks and verifiers. The best way to do this is by constantly reviewing trajectories and spotting false positives/negatives (sketch below). This is tedious work, but often front-loaded - until you see smooth gains :)
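To make 3) concrete, here's a minimal sketch of what we mean by a verifiable task: reward comes from checking the sim's backend state, not from parsing the agent's final message. Everything here (the FlightSimEnv class, the booking fields) is illustrative, not our actual API.

    # Hypothetical sketch, not our actual API. The point: reward comes from
    # checking simulator state, not from parsing the agent's final message.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        prompt: str      # what the agent is asked to do
        expected: dict   # ground-truth end state a verifier can check

    @dataclass
    class FlightSimEnv:
        bookings: list = field(default_factory=list)  # seeded backend state

        def reset(self, seed: int = 0) -> dict:
            self.bookings = []
            return {"url": "https://flights.sim.local", "dom": "<html>...</html>"}

        def step(self, action: dict) -> dict:
            # apply the click/type/navigate action to the sim (elided here)
            return {"url": "https://flights.sim.local", "dom": "<html>...</html>"}

        def verify(self, task: Task) -> bool:
            return any(b["dest"] == task.expected["dest"]
                       and b["price"] <= task.expected["max_price"]
                       for b in self.bookings)

    task = Task("Book a nonstop SFO->JFK for under $400",
                {"dest": "JFK", "max_price": 400})
    env = FlightSimEnv()
    env.reset()
    env.bookings.append({"dest": "JFK", "price": 389})  # pretend the agent did this
    print(env.verify(task))  # True -> reward 1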
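And on 4), a toy version of the data-side problem. Naive per-field sampling gives you flights that look fine in isolation but are wrong as a joint distribution, which a trained agent will happily exploit. This is a made-up toy, not how Noodle Flights actually generates inventory:

    # Toy illustration only. Independent per-field sampling looks fine in
    # isolation but is wrong as a joint distribution (red-eyes priced like
    # peak departures, hub pairs with no nonstop advantage, etc.).
    import random

    HUBS = {"SFO", "JFK", "ORD"}

    def naive_flight(origin, dest):
        return {"depart_hour": random.randint(0, 23),
                "stops": random.choice([0, 1, 2]),
                "price": round(random.uniform(80, 900), 2)}

    def less_naive_flight(origin, dest):
        # Even a toy model has to couple the fields: hub pairs fly nonstop more
        # often, nonstops price higher, late departures get discounted.
        nonstop_p = 0.7 if origin in HUBS and dest in HUBS else 0.2
        stops = 0 if random.random() < nonstop_p else random.choice([1, 2])
        hour = random.choice([6, 8, 10, 13, 16, 18, 22])
        price = random.lognormvariate(5.6, 0.35)          # heavy-tailed base fare
        price *= 1.25 if stops == 0 else 1.0
        price *= 0.85 if hour >= 21 else 1.0
        return {"depart_hour": hour, "stops": stops, "price": round(price, 2)}

    print([less_naive_flight("SFO", "JFK") for _ in range(3)])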
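For 5), the QA loop itself is mostly bookkeeping: sample trajectories, compare the verifier's verdict against a human spot-check, and chase the disagreements. A hypothetical sketch:

    # Hypothetical sketch of verifier QA: surface tasks where the automated
    # verifier disagrees with a human spot-check; those disagreements are the
    # false positives/negatives that poison the reward signal.
    def audit(trajectories, verifier, human_labels):
        suspects = []
        for traj in trajectories:
            auto = verifier(traj)                  # bool from the env's verifier
            human = human_labels.get(traj["id"])   # bool from a reviewer, if any
            if human is not None and human != auto:
                kind = "false positive" if auto else "false negative"
                suspects.append((traj["task_id"], kind))
        return suspects
    # Tasks that show up here repeatedly get their verifier (or spec) rewritten.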
Have lots more thoughts, but these were just top of mind today. If this work is interesting, always happy to chat (we're also hiring)!
I definitely think that as companies begin orienting toward an "agent-first" economy, they will start figuring out how to optimize their sites for agent traffic.
They definitely could do this themselves, but I imagine building RL envs takes enough engineering work/expertise that they might want to partner with an external provider instead.
ALSO, the value of Westworld isn't any standalone env but many strung together for long-trajectory workflows. That is why they may be inclined to work with another provider.
Those are just our thoughts though; we'll see how the market plays out.
Self-driving cars are a really good place to derive intuitions. Robotics as well!
Both of those spaces are still grinding on the last-mile performance gains that get exponentially harder.
The good thing about computer use is that building software environments is faster and more repeatable, so hopefully we see quicker improvements here. :)
That being said, there are still a lot of use cases it's not good at, and once you look at long-trajectory tasks, enterprise work tasks, etc., I imagine those are all still very nascent.
I think we are still very early on computer use; being "production ready" probably requires close to 95%+ accuracy on most tasks, and we're not there yet for most use cases.
AI vibe-coding tools already prefer some solutions over others, probably because of training-data distribution / post-training preferences. That is leading to massive revenue and growth differences relative to companies that haven't optimized to be agent-preferred / in the training data distribution.
I imagine something similar will happen over time: companies that are in the training data distribution get used by agents more, while others that neglect this get slowly phased out because systems don't know how to use them (out of distribution).
We share the public/consumer simulators, but we also build bespoke environments on a per-customer basis (think enterprise sites or even full VMs loaded with applications and data).
Environment creation scalability is a big priority for us. We currently automate most of the process, but it still takes a fair bit of manual work to finish environments and get the details right. There is some reusability across environments - for example, we can use the flight results generation code in any travel/flight-booking sim. We also have some semi-automated approaches for creating tasks and verifiers, but there's still lots of work to be done here.
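To give a flavor of the semi-automated part (illustrative sketch, not our actual pipeline): one trick is to generate the task text and its verifier from the same sampled parameters, on top of whatever data generator the sim already uses, so the two can't drift apart.

    # Illustrative sketch: the verifier is derived from the same parameters as
    # the prompt, so the task text and its check can't drift apart. `flights`
    # comes from whatever data generator the sim already uses.
    import random

    def make_task(flights):
        nonstops = [f for f in flights if f["stops"] == 0]
        dest = random.choice(nonstops)["dest"]        # pick a dest with a nonstop
        cheapest = min(f["price"] for f in nonstops if f["dest"] == dest)
        prompt = f"Book the cheapest nonstop flight to {dest}"

        def verifier(booking):                        # checked against sim state
            return (booking["dest"] == dest
                    and booking["stops"] == 0
                    and booking["price"] == cheapest)

        return prompt, verifier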
Computer use agents are starting to perform well on websites/apps that are in their training distribution, but still struggle a lot with tasks outside their distribution. A big reason is that many niche/enterprise applications are really hard to test on in the real world, hence the need for sims!
Re: labs doing this internally. They definitely are! However, the scale of the sim buildout is going to be massive, probably many orders of magnitude above what we have today. We think it makes sense for one central player to do this because a really good simulator can be used by multiple people at once. It doesn't make sense for every AI lab/company to build out their own environments if an industry-standard catalog exists.
Engineering: QA automation is huge; it closes the loop on "fully automated" software engineering if another computer use system can click around and help identify bugs in software (rough sketch after this list)
Deep Research: probably the biggest use case for computer use right now, finding information that isn't easily indexed or accessible via APIs.
General RPA: This is industry specific, but lots of everyday knowledge work involves data transfer between many platforms that sucks and that no one wants to do. A great example is Epic in healthcare: SO much labor is employed just to write and read information from a desktop app that isn't easily accessible. Imagine a computer use system that can do automated data pulls at scale for legacy desktop apps. This is a huge use case, and something we're excited to try to improve with simulators of things like Epic, SAP, Salesforce, etc.
Consumer: Lots of just general everyday tasks. I would recommend checking out https://yutori.com/ if you're interested in seeing how a computer use agent can be helpful in your day to day. It's fun for daily news reports, restaurant reservation checking, etc.
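On the QA automation point above, a rough sketch of the cheapest version of that loop: a (hypothetical, made-up) agent.act() interface drives Playwright while the harness records page errors. Real bug-hunting is much more involved, but the shape is similar.

    # Hypothetical sketch: a made-up `agent.act()` interface clicks through a
    # flow while the harness records JS errors and error-level console output.
    from playwright.sync_api import sync_playwright

    def smoke_run(url, agent, max_steps=20):
        issues = []
        with sync_playwright() as p:
            page = p.chromium.launch(headless=True).new_page()
            page.on("pageerror", lambda err: issues.append(("js-error", str(err))))
            page.on("console", lambda msg: issues.append(("console", msg.text))
                                           if msg.type == "error" else None)
            page.goto(url)
            for _ in range(max_steps):
                action = agent.act(page.screenshot(), page.url)  # e.g. {"click": "text=Checkout"}
                if action is None:
                    break
                page.click(action["click"])
        return issues  # triage these into bug reports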
UI refreshes knocking down simulator realism is a real issue that we're still trying to solve.
I think this will probably be a mixture of automated QA/engineering and scale.
Another interesting path is actually partnering directly with software providers to offer their platforms as simulators IF they see a competitive advantage in training agents to perform well on their UI.
We're really excited about this idea, but it would require a company to see real revenue potential in enabling agentic access vs. not. I'd say we're still in the "block them out" phase of the internet (e.g. see Cloudflare's recent post about bot detection: https://blog.cloudflare.com/perplexity-is-using-stealth-unde...)