Meta’s new architecture helps robots interact in environments they’ve never seen before
Friday, June 13, 2025, 03:23 AM, from ComputerWorld
Thanks largely to AI, robotics has come a long way in a short time, but robots still struggle in scenarios they haven’t been trained for and must adapt to on the fly.
This week, Meta (Nasdaq: META) said it has overcome some of those hurdles with its new open-source Video Joint Embedding Predictive Architecture 2 (V-JEPA 2), which it describes as the first world model trained primarily on video. V-JEPA 2 can predict next actions and respond to environments it hasn’t interacted with before.

“Meta’s recent unveiling of V-JEPA 2 marks a quiet but significant shift in the evolution of AI vision systems, and it’s one enterprise leaders can’t afford to overlook,” said Ankit Chopra, a director at Neo4j. “Built on self-supervised learning and optimized for agentic, low-supervision use, V-JEPA 2 moves beyond the confines of traditional computer vision, introducing a model that is both leaner and more predictive.”

Trained for predictive tasks on more than 1 million hours of video

Meta says V-JEPA 2, the follow-up to its first video-trained model, V-JEPA, released last year, achieves state-of-the-art performance on visual understanding and prediction in physical environments. It can also be used for zero-shot planning, in which robots navigate settings they have never encountered before.

The model represents a “genuine step forward,” said Wyatt Mayham, lead AI consultant at Northwest AI Consulting. “The core challenge in robotics has always been operating in unpredictable and unstructured environments,” he said. “V-JEPA 2 is certainly designed to tackle that.”

The 1.2-billion-parameter V-JEPA 2 was trained through self-supervised learning on more than one million hours of video and one million images from a variety of sources. “This rich visual data helps the model learn a lot about how the world works, including how people interact with objects, how objects move in the physical world, and how objects interact with other objects,” Meta researchers wrote in a blog post.

The model can support foundational tasks like reaching, picking up objects, and placing them in a new location, achieving a 65% to 80% success rate on pick-and-place tasks. It is equipped with motion understanding, can anticipate what action will be performed one second into the future, and excels on video question-answering benchmarks, the researchers wrote.

V-JEPA 2 has two main components: an encoder that processes raw video and outputs embeddings that capture useful semantic information about the world, and a predictor that takes in a video embedding plus additional context and outputs predicted embeddings.

“This evolution has far-reaching implications,” said Chopra. “V-JEPA 2 is not just more efficient, it can enable AI systems that understand, adapt, and evolve with operational workflows.”

A step towards ‘advanced machine intelligence’; new benchmarks for model performance

Meta says V-JEPA 2 is the next step toward its goal of achieving “advanced machine intelligence” (AMI), in which AI agents can effectively operate in the physical world. According to Meta researchers, such models should be capable of observing the world (including recognizing objects, actions, and motions); predicting how the world will evolve if the agent takes an action; and planning sequences of actions that achieve a given goal.

“As we work toward our goal of achieving AMI, it will be important that we have AI systems that can learn about the world as humans do, plan how to execute unfamiliar tasks, and efficiently adapt to the ever-changing world around us,” Meta researchers wrote.
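Meta describes this planning use in general terms rather than as a specific API, so the sketch below is only illustrative: with small NumPy stand-ins for the encoder and predictor, it shows how an encoder/predictor world model can drive zero-shot, goal-directed planning by encoding the current frame and a goal image, rolling candidate actions through the predictor, and picking the action whose predicted embedding lands closest to the goal. The function names, network shapes, and the random-shooting planner are assumptions made for this example, not Meta’s implementation.

```python
# Illustrative only: a goal-conditioned planning loop around an encoder/predictor
# world model. The "encoder" and "predictor" here are random stand-ins, NOT the
# released V-JEPA 2 networks; only the overall loop reflects the idea in the article.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, ACT_DIM, FRAME_PIXELS = 64, 7, 32 * 32 * 3

# Stand-in encoder: a fixed random projection from a flattened frame to an embedding.
W_enc = rng.normal(size=(FRAME_PIXELS, EMB_DIM)) / np.sqrt(FRAME_PIXELS)

def encode(frame: np.ndarray) -> np.ndarray:
    """Map a flattened video frame to a semantic embedding."""
    return frame @ W_enc

# Stand-in predictor: maps (current embedding, candidate action) -> next embedding.
W_state = rng.normal(size=(EMB_DIM, EMB_DIM)) / np.sqrt(EMB_DIM)
W_act = rng.normal(size=(ACT_DIM, EMB_DIM)) / np.sqrt(ACT_DIM)

def predict(embedding: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Predict the embedding of the next observation if `action` were taken."""
    return np.tanh(embedding @ W_state + action @ W_act)

def plan_one_step(current_frame: np.ndarray, goal_frame: np.ndarray,
                  n_candidates: int = 256) -> np.ndarray:
    """Random-shooting planner: choose the candidate action whose predicted
    next embedding is closest to the goal embedding."""
    z_now, z_goal = encode(current_frame), encode(goal_frame)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, ACT_DIM))
    costs = [np.linalg.norm(predict(z_now, a) - z_goal) for a in candidates]
    return candidates[int(np.argmin(costs))]

if __name__ == "__main__":
    current = rng.random(FRAME_PIXELS)   # placeholder for a camera frame
    goal = rng.random(FRAME_PIXELS)      # placeholder for a goal image
    print("chosen action:", np.round(plan_one_step(current, goal), 3))
```

Repeated at every control step, a loop like this amounts to simple model-predictive control in embedding space; a real deployment would use the actual encoder and action-conditioned predictor, longer planning horizons, and a smarter action sampler.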
Meta is also releasing three new benchmarks to evaluate how well models can use video to reason about the physical world. These include IntPhys 2, which measures a model’s ability to distinguish between physically plausible and implausible “physics-breaking” scenarios; Minimal Video Pairs (MVPBench), which tests physical understanding through multiple-choice questions; and CausalVQA, which measures a model’s ability to answer questions about physical cause and effect.

Potential use cases in enterprise

Neo4j’s Chopra pointed out that current models rely on labeled data and “explicit visual features.” V-JEPA 2, by contrast, focuses on inferring missing information in the latent space, “in essence capturing abstract relationships and learning from context rather than pixel-perfect details.”

This means it can function reliably in unpredictable environments where data is sparse, making it particularly well-suited for use cases such as manufacturing automation, surveillance analytics, in-building logistics, and robotics, said Chopra. Other use cases could include autonomous equipment monitoring, predictive maintenance, and low-light inspections. Meta’s own data center operations could serve as an initial testing ground, and, over time, the model could power more advanced scenarios such as autonomous vehicles performing self-diagnostics and initiating robotic repairs.

Ultimately, Chopra said, V-JEPA 2 marks a shift from passive perception to active decision-making, and a new phase of automation in which “AI doesn’t just see, but acts.”

“For decision-makers tasked with modernizing industrial systems, reducing maintenance costs, or scaling automation without constant retraining, V-JEPA 2 introduces a new playbook,” he said. “It opens the door to self-learning systems that can operate in low-visibility environments or dynamically respond to changing inputs, arming critical capabilities for sectors like logistics, infrastructure, and defense.”

Still, said Northwest AI Consulting’s Mayham, robots have attracted plenty of hype while many have performed well only in controlled settings. AI has boosted adaptability, and V-JEPA 2 lets bots think before they act, but it remains to be seen how well it will handle edge cases. “It genuinely does seem like real progress,” said Mayham. “But models often disappoint once you deploy them outside the lab.”

This is a fast-moving area, he noted, and companies building autonomous systems for manufacturing, delivery, or surveillance should ultimately bet on adaptable AI. “Enterprises should closely monitor and start plotting partnerships now,” Mayham said.
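Chopra’s point about inferring missing information in the latent space can be made concrete with a toy objective: mask part of the input, encode the visible context, and train a predictor to match the embeddings of the hidden part, so the loss lives entirely in embedding space rather than in pixels. The tiny linear “encoders,” the mean-pooling, and the single-weight update below are simplifications invented for this illustration, not Meta’s training recipe.

```python
# Toy sketch of a JEPA-style objective: predict masked content in embedding space
# instead of reconstructing pixels. All networks are linear stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(1)

PATCHES, PATCH_DIM, EMB_DIM = 16, 48, 32

# Stand-in context encoder, target encoder (a frozen copy here), and predictor.
W_ctx = rng.normal(size=(PATCH_DIM, EMB_DIM)) * 0.1
W_tgt = W_ctx.copy()
W_pred = rng.normal(size=(EMB_DIM, EMB_DIM)) * 0.1

def jepa_loss_and_grad(patches: np.ndarray, mask: np.ndarray):
    """Encode visible patches, predict the (pooled) embedding of the masked ones,
    and score the prediction with a squared error computed in latent space."""
    ctx_emb = patches[~mask] @ W_ctx              # embeddings of visible context
    tgt_emb = patches[mask] @ W_tgt               # embeddings the predictor must infer
    pooled_ctx = ctx_emb.mean(axis=0)
    pred = pooled_ctx @ W_pred                    # predicted embedding of masked content
    err = pred - tgt_emb.mean(axis=0)
    loss = float(np.mean(err ** 2))
    grad_pred = np.outer(pooled_ctx, 2 * err / err.size)  # gradient w.r.t. the predictor
    return loss, grad_pred

patches = rng.random((PATCHES, PATCH_DIM))        # stand-in for video patches
mask = rng.random(PATCHES) < 0.5                  # hide roughly half of them

for _ in range(200):                              # toy training loop (SGD on the predictor)
    loss, grad_pred = jepa_loss_and_grad(patches, mask)
    W_pred -= 0.05 * grad_pred

final_loss, _ = jepa_loss_and_grad(patches, mask)
print(f"latent prediction loss after training: {final_loss:.4f}")
```

Because the target is an embedding rather than raw pixels, the objective pushes the model toward the abstract, contextual relationships Chopra describes instead of pixel-perfect reconstruction.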
More Meta news:
Meta splits its AI division into two
Meta hits pause on ‘Llama 4 Behemoth’ AI model amid capability concerns
Meta helped build China’s DeepSeek: Whistleblower testimony
Meta wins $168M judgment against spyware seller NSO Group
No, Apple and Meta aren’t making humanoid robots

https://www.computerworld.com/article/4006350/metas-new-architecture-helps-robots-interact-in-enviro...