A Realizable Path to Physical AI



Physical AI is still early in its industry cycle, and new paradigms keep emerging one after another. Even the term “world model” already carries nine or ten different interpretations:

Figure: What is a World Model?

Many people are currently enthusiastic about latent-space world models, which is also the direction I mainly worked on before. In my view, though, most of this line of work is still essentially a wrapper around a VLM. It can improve sample efficiency and accelerate learning, but it does not produce a model with genuine spatial understanding.

My view is that the eventual paradigm for Physical AI will be a neural network whose architecture is explicitly suited to physical understanding, trained on spatial visual tokens in a way that yields genuine physical intelligence. I believe strongly in the Bitter Lesson, so I think real physical intelligence will ultimately require a training paradigm analogous to that of language models, one that fully leverages data and compute. But the current LM and VLM paradigms are still fundamentally mismatched with true spatial understanding, so we need to find a different path forward.