Hi, author of the post here. Just fixed up some formatting issues from when we copied it into Substack, sorry about that. Yeah, I used Opus 4.5 to help me write it (and it actually made me laugh!). But the struggle was real. Something I didn't make clear enough in the post is that jpeg works because each screenshot is taken exactly when it's requested. Whereas streaming video is pushing a certain frame rate. The client driving the frame rate is exactly what makes it not queue frames. Yes, I wish we could UDP in enterprise networks too, but we can't. The problem actually isn't opening the UDP port, it's hosting UDP on their Kubernetes cluster. "You want to what?? We have ingress. For HTTPS"
Hey lewq, 40Mbps is an absolutely ridiculous bitrate. For context, Twitch maxes out around 8.5Mb/s for 1440p60. Your encoder was poorly configured, that's it. Also, it sounds like your mostly static content would greatly benefit from VBR; you could get the bitrate down to 1Mb/s or something for screen sharing.
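To make the VBR suggestion concrete, here's roughly what that looks like as a WebCodecs encoder config. The codec string, resolution, and numbers are illustrative placeholders; the relevant parts are `bitrateMode: "variable"` and the low average bitrate:

```typescript
// Rough WebCodecs VideoEncoderConfig for mostly-static screen content.
// Resolution, codec string, and bitrate are illustrative, not prescriptive.
const screenShareConfig = {
  codec: "avc1.42E01E",     // H.264 Baseline; pick per decoder support
  width: 1920,
  height: 1080,
  bitrate: 1_000_000,       // ~1 Mb/s average, plenty for static screens
  bitrateMode: "variable",  // VBR: spend bits on changes, not static frames
  latencyMode: "realtime",  // favor low latency over quality in rate control
};
```

You'd pass something like this to `VideoEncoder.configure()`; with VBR the encoder can emit tiny frames while nothing on screen is changing.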
And yeah, the usual approach is to adapt your bitrate to network conditions, but it's also common to modify the frame rate. There's actually no requirement for a fixed frame rate with video codecs. It also means you could do the same "encode on demand" approach with a codec like H.264, provided you're okay with it being low FPS on high-RTT connections (poor Australians).
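A sketch of that pull-driven idea with a delta codec (the class and names here are mine, not from the post): at most one encoded frame is ever in flight, so nothing can queue, and the effective frame rate degrades naturally with RTT instead of bloating a buffer:

```typescript
// Pull-driven encoding: the client requests, the server encodes exactly one
// frame (a P-frame against the last delivered one) per request. Because we
// refuse to encode while a frame is unacknowledged, there is no queue.
class PullEncoder {
  private inFlight = false;

  // `encodeFrame` stands in for a real encoder call (e.g. WebCodecs).
  request(encodeFrame: () => Uint8Array): Uint8Array | null {
    if (this.inFlight) return null; // client still owes us an ack: skip
    this.inFlight = true;
    return encodeFrame();           // delta-encoded, unlike a full JPEG
  }

  ack(): void {
    this.inFlight = false; // client got the frame; next request may encode
  }
}
```

The effective frame rate is then one frame per round trip, so a ~200 ms RTT caps you near 5 fps, same as the JPEG approach, but each frame is a cheap delta.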
Overall, using keyframes only is a very bad idea. It's how low-quality animated GIFs used to work before they were secretly replaced with video files. Video codecs are extremely efficient because of delta encoding.
But I totally agree with ditching WebRTC. WebSockets + WebCodecs is fine provided you have a plan for bufferbloat (e.g. adaptive bitrate (ABR), GoP skipping).
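One concrete plan, assuming a WebSocket transport: check the socket's `bufferedAmount` (the browser's count of bytes queued but not yet sent) before queuing each frame, and fall back to GoP skipping as the backlog grows. The thresholds here are invented for illustration:

```typescript
// Decide per-frame whether to send, given the WebSocket's unsent backlog.
// Past a soft limit we skip delta frames (GoP skipping); past a hard limit
// we drop everything and let the caller resync with a fresh keyframe.
function shouldSendFrame(bufferedAmount: number, isKeyFrame: boolean): boolean {
  const SOFT_LIMIT = 64 * 1024;   // illustrative: start skipping deltas
  const HARD_LIMIT = 512 * 1024;  // illustrative: stop sending entirely
  if (bufferedAmount >= HARD_LIMIT) return false;
  if (bufferedAmount >= SOFT_LIMIT) return isKeyFrame;
  return true;
}
```

In a real pipeline you'd also feed `bufferedAmount` trends into the encoder's target bitrate, but the drop decision alone already prevents the multi-second queues WebRTC's jitter buffer hides from you.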
> Something I didn't make clear enough in the post is that jpeg works because each screenshot is taken exactly when it's requested. Whereas streaming video is pushing a certain frame rate. The client driving the frame rate is exactly what makes it not queue frames.
I understand that logic but I don't really agree with it. Very aggressive bitrate controls can do a lot to keep that buffer tiny while still looking better than JPEG, and if it bloats beyond 1-2 seconds you can reset. A reset like that wouldn't look notably worse than JPEG mode always looks.
If you use a video encoder that gives you good insight into what it's doing you could guarantee that the buffer never gets bigger than 1-2 JPEGs by dynamically deciding when to add frames. That would give you the huge benefits of P-frames with no downside.
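That admission control could be as simple as counting unacknowledged frames; the numbers below are illustrative, assuming roughly 30 fps so ~30 unacked frames is about a second of backlog:

```typescript
// Decide whether to encode another P-frame, hold off, or reset the stream
// (i.e. start over with a keyframe). Thresholds are made up: ~2 frames keeps
// the buffer no bigger than a couple of JPEGs; ~30 frames is ~1 s of bloat.
function nextAction(unackedFrames: number): "encode" | "wait" | "reset" {
  if (unackedFrames >= 30) return "reset"; // backlog too deep: resync
  if (unackedFrames >= 2) return "wait";   // hold: keep the buffer tiny
  return "encode";                          // safe to add a delta frame
}
```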
Yeah, I used ChatGPT to help me write this answer ;)
(Unlike JPEGs, it works at the right abstraction level for text.)
I think the core issue isn’t push vs pull or frame scheduling, but why you’re sending frames at all. Your use case reads much more like replicating textual/stateful UI than streaming video.
The fact that JPEG “works” because the client pulls frames on demand is kind of the tell — you’ve built a demand-driven protocol, then used it to fetch pixels. That avoids queuing, sure, but it’s also sidestepping video semantics you don’t actually need.
Most of what users care about here is text, cursor position, scroll state, and low interaction latency. JPEG succeeds not because it’s old and robust, but because it accidentally approximates an event-driven model.
Totally fair points about UDP + Kubernetes + enterprise ingress. But those same constraints apply just as well to structured state updates or terminal-style protocols over HTTPS — without dragging a framebuffer along.
Pragmatic solution, real struggle — but it feels like a text/state problem being forced through a video abstraction, and JPEG is just the least bad escape hatch.
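For what it's worth, the structured-state alternative doesn't have to be exotic; something like this over the same HTTPS/WebSocket path would fit the enterprise constraints (the message shapes are invented for illustration):

```typescript
// Semantic UI updates instead of pixels: a cursor move or scroll is tens of
// bytes of JSON, not a re-encoded framebuffer region.
type UiUpdate =
  | { kind: "text"; elementId: string; value: string }
  | { kind: "cursor"; x: number; y: number }
  | { kind: "scroll"; elementId: string; top: number };

function encodeUpdate(u: UiUpdate): string {
  return JSON.stringify(u); // goes over the existing WebSocket/HTTPS ingress
}
```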
Hey! Yeah, we are working with partners on a fully integrated hardware+software stack for this. We particularly like the RTX 6000 Pro Blackwell chips for it.
Vision language models have been trained on how to operate human UIs though, so at least for a while, computer use will be an interesting area to explore. I think debugging web apps and building UIs is a particularly fruitful area for this
There's also value in being able to run multiple agents in parallel with their own isolated filesystems and runtimes. One agent won't tread on the toes of another whatever they do. You can let it loose and it doesn't matter if it breaks something, you can just spin up another one
Mainly so you can give the agent access to the desktop as well. Then it can debug your web app in Chrome DevTools, but also you can pair with it over streaming that's so good it feels local.
Join our discord for private beta in January! https://discord.gg/VJftd844GE
(This post written by human)