Yeah I came here to say the same thing. It seems like it would simplify things. They do say:
"I initially considered training a single end-to-end VLA model. [...] A cable-driven soft robot is different: the same tip position can correspond to many cable length combinations. This unpredictability makes demonstration-based approaches difficult to scale.[...] Instead, I went with a cascaded design: specialized vision feeding lightweight controllers, leaving room to expand into more advanced learned behaviors later."
I still think circling back to smaller models would be awesome. With some upgrades you might get a locally hosted model on there, but I'd be sure to keep that inside a pentagram so it doesn't summon a Great One.
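To make the quoted cascade concrete, here is a rough sketch of the shape of such a loop (purely illustrative; every name and the cable geometry are made up, not the author's code): a vision stage hands a 2D target to a small proportional controller that nudges cable lengths, never trying to pick "the" unique cable solution, which is exactly the ambiguity the quote mentions.

    # Hypothetical cascade: vision proposes a target, a tiny P-controller
    # adjusts cable lengths toward it. Names and cable layout are invented.
    from dataclasses import dataclass

    @dataclass
    class VisionTarget:
        x: float  # normalized image coords in [0, 1]
        y: float

    def vision_stage(frame) -> VisionTarget:
        # Stand-in for the vision-model call; returns a fixed dummy target here.
        return VisionTarget(x=0.6, y=0.4)

    def cable_controller(target, tip_xy, cable_lengths, gain=0.05):
        # Small relative adjustments only -- sidesteps the one-tip-position /
        # many-cable-lengths ambiguity rather than solving inverse kinematics.
        ex, ey = target.x - tip_xy[0], target.y - tip_xy[1]
        deltas = [gain * ex, -gain * ex, gain * ey, -gain * ey]  # assume 4 cables
        return [l + d for l, d in zip(cable_lengths, deltas)]

    # One control tick: frame -> target -> updated cable lengths.
    lengths = cable_controller(vision_stage(frame=None),
                               tip_xy=(0.5, 0.5),
                               cable_lengths=[1.0, 1.0, 1.0, 1.0])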
I was surprised it pinged gpt-4o. I was expecting it to use something like https://github.com/apple/ml-fastvlm (obviously cost may have been a factor there), but I can see how the direction he chose leaves room for more complex behaviours down the line, e.g. adding additional tentacles for movement and so on.
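For anyone curious what the hosted-model route looks like in practice, a gpt-4o perception call is roughly this with the OpenAI Python SDK (just a sketch, not the project's code; the prompt and frame handling are my assumptions). Swapping in something like FastVLM would mostly mean replacing this one call with local inference.

    # Rough sketch of a hosted-VLM perception call via the OpenAI Python SDK.
    # Not the project's actual code; prompt and response handling are guesses.
    import base64
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def locate_object(jpeg_bytes, thing):
        b64 = base64.b64encode(jpeg_bytes).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Where is the {thing} in this frame? "
                             "Reply with normalized x,y coordinates."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    # e.g. locate_object(open("frame.jpg", "rb").read(), "red ball")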
"I initially considered training a single end-to-end VLA model. [...] A cable-driven soft robot is different: the same tip position can correspond to many cable length combinations. This unpredictability makes demonstration-based approaches difficult to scale.[...] Instead, I went with a cascaded design: specialized vision feeding lightweight controllers, leaving room to expand into more advanced learned behaviors later."
I still think circling back to smaller models would be awesome. With some upgrades you might get a locally hosted model on there, but I'd be sure to keep that inside a pentagram so it doesn't summon a Great One.