Is AI smarter than an infant? Not even close.

By Luca Weihs and Ani Kembhavi


--

A GIF of a person in a black shirt holding a teal cup, shown in four quadrants. Top left: the cup has a solid bottom, and a pen placed inside stays in the cup, marked "plausible." Top right: the cup has a hole in the bottom, so the pen placed inside falls through, also marked "plausible." Bottom left: the cup has a solid bottom, yet the pen falls through, marked "implausible." Bottom right: the cup has a hole in the bottom, yet the pen stays in the cup, marked "implausible."
Four object interaction scenarios included in the InfLevel benchmark.

AI is becoming ever-present in our daily lives. Advances in this space mean the technology can not only recommend a video for you to watch or identify the objects in an image, but also power a robot that vacuums your house, and may one day drive your car. This last set of applications, commonly known as embodied AI, is one of the areas of focus for the Perceptual Reasoning and Interaction Research (PRIOR) team at the Allen Institute for AI.

The field of embodied AI is moving quickly and some believe that this progress, coupled with progress in large language and vision models, suggests that AI may soon approach human-level world understanding. But, despite the incredible successes of AI systems, our community has not answered a fundamental question: do these advanced AI models understand how the physical world works?

Answering this question is critical to building AI systems that we can trust. For instance, if we cannot show that a model reliably understands that objects continue to exist when out of view, how can we ever trust it to drive our car? For now, you might prefer an infant to take the wheel.

Introducing the InfLevel benchmark

Through decades of work, developmental psychologists have carefully mapped the physical reasoning capabilities of infants. In fact, they’ve shown that by 4.5 months, infants can correctly reason about how objects should behave across many physical events in the world. Inspired by this effort, we developed the Infant-Level Physical Reasoning Benchmark (InfLevel) to similarly assess AI’s understanding of the physical world.

We collaborated with a team of developmental psychologists from The University of Illinois Urbana-Champaign and employed the violation-of-expectations (VOE) methodology, an experimental paradigm frequently used to evaluate infants’ cognitive abilities. In our benchmark, AI systems are shown a series of videos with the goal of assessing their understanding of three core physical reasoning principles:

  1. Continuity: Objects should not spontaneously appear or disappear.
  2. Solidity: Two solid objects should not be able to pass through one another.
  3. Gravity: Unsupported objects should fall.

For example, to assess understanding of gravity: a person drops a ball into a cup whose bottom has been cut out, and the ball falls to the ground (plausible); a person drops a ball into the same bottomless cup, and the ball stays inside it (implausible).

With appropriate controls, psychologists have shown that infants will look longer at physically implausible events, suggesting that they had formed expectations about how those events should unfold and were "surprised" when those expectations were violated.

Inspired by studies performed with infants, InfLevel asks AI models if they are surprised by videos of physically plausible and implausible events. Top: infants are surprised by a Solidity violation where a toy car rolls through a box. Bottom: schematic and example from InfLevel; models that understand Solidity should be surprised that the object does not move with the cover, as though it somehow passes through the back of the cover.

With InfLevel, we applied the same test to state-of-the-art AI models and evaluated whether they were more surprised by our physically implausible videos than by our physically plausible ones. Essentially, we examined whether these systems found the InfLevel videos to be anomalies, based on how similar those videos were to the kinds of videos they had seen before.
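As a rough illustration of this evaluation setup, the sketch below scores paired plausible/implausible videos with a hypothetical surprise_score function, standing in for whatever anomaly signal a given model exposes (for example, a video model's prediction or reconstruction error), and reports how often the implausible video looks more surprising. The names and interface are illustrative assumptions, not the actual InfLevel code.

```python
# Minimal sketch of a violation-of-expectations (VOE) style evaluation.
# `surprise_score` is a stand-in for whichever anomaly signal a model provides;
# the pairing of plausible/implausible videos mirrors the benchmark's design,
# but all names here are hypothetical, not the InfLevel API.

from typing import Callable, Sequence, Tuple

def voe_pairwise_accuracy(
    video_pairs: Sequence[Tuple[str, str]],   # (plausible_path, implausible_path)
    surprise_score: Callable[[str], float],   # higher value = more "surprised"
) -> float:
    """Fraction of pairs where the implausible video is judged more surprising."""
    correct = 0
    for plausible, implausible in video_pairs:
        if surprise_score(implausible) > surprise_score(plausible):
            correct += 1
    return correct / len(video_pairs)
```

A model with no grasp of the principle being tested should hover around 0.5 (chance) on such a metric, while a model that reliably detects physical violations should score well above it.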

We found that, across the board, modern AI models do not appear to have a robust understanding of the physical world. They were not able to consistently distinguish physically plausible scenarios from implausible ones. In fact, some models frequently found the implausible event to be less surprising; if a person dropped a pen, the model found it less surprising for the pen to float than to fall. This also means that, at their current level of development, the models that could eventually drive our cars may lack even the core physical understanding that a car cannot drive through a brick wall.

InfLevel and the future of AI development

These findings should give everyone pause. Current models can generate beautiful images, have conversations with us, and even write surprisingly complex prose, all while lacking fundamental knowledge of physical concepts understood by infants. While AI has advanced dramatically in computer vision and language processing, InfLevel shows that we still have a lot of work to do. If the models we build today fail to embed an understanding of physical interactions and principles, then we should have serious concerns about their ability to reliably perform physical tasks or understand nuanced and novel scenarios.

The InfLevel benchmark illuminates AI's current shortcomings, but it also provides a path forward. InfLevel allows the AI community to better track progress toward building AI systems with core physical reasoning abilities, and to build public trust along the way through transparency. We believe that AI has the power to change the world for the better, but it must be developed thoughtfully and transparently. InfLevel helps us get there.

Learn more about InfLevel here.

Learn more about AI2 at allenai.org and be sure to check out our open positions.

Follow @allen_ai on Twitter and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.

--

Our mission is to contribute to humanity through high-impact AI research and engineering. We are a Seattle-based non-profit founded in 2014 by Paul G. Allen.