Phone2Proc: A New Approach for Training Robots in Real-World Environments

Kiana Ehsani
Published in AI2 Blog · 4 min read · Apr 5, 2023


A photograph of a living room with a large blue couch and ottoman, bicycle, blue rug and wood floor. A robot is setting down a cup onto the ottoman.
AI2 research scientist Kiana Ehsani’s living room with robot assistant, Locobot.

As robots become increasingly integrated into our daily lives, it is important to ensure that they are trained to operate in real-world environments. However, creating and testing robots in physical spaces can be time-consuming and costly. That’s where Phone2Proc comes in — a new approach for generating a distribution of training environments that closely match the real-world physical space we are interested in.

Phone2Proc is a three-step process: it begins with a phone scan of the target scene, then procedurally generates variations of that scene for training agents, and finally transfers the learned policy onto a robot that navigates in the physical world. Let’s take a closer look at each step.

Step 1: Scanning

Phone2Proc is designed to optimize a robot’s performance within a desired real-world environment. The first step is to scan that environment using a smartphone app, which outputs an environment template as a USDZ file. The resulting template includes the 3D locations and poses of walls, large objects, windows, and doors. Each object in the scan is assigned one of 16 object types: storage, sofa, table, chair, bed, refrigerator, oven, stove, dishwasher, washer/dryer, fireplace, sink, bathtub, toilet, stairs, and TV. It’s important to note that although the scan captures the bounding boxes of some large objects, it doesn’t provide details about their shapes or textures, nor does it capture information about smaller objects like pots, vases, or mugs.

A video of step one in the Phone2Proc process, scanning the target environment.
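
To make the template concrete, here is a minimal sketch of what the parsed scan might look like as Python data structures. The class and field names are illustrative assumptions rather than Phone2Proc’s actual schema; they capture exactly the information described above: wall, door, and window geometry plus typed bounding boxes for large objects.

```python
# A hypothetical sketch of the parsed environment template.
# These dataclasses are illustrative, not Phone2Proc's actual schema.
from dataclasses import dataclass

# The 16 object types assigned during scanning.
OBJECT_TYPES = {
    "storage", "sofa", "table", "chair", "bed", "refrigerator", "oven",
    "stove", "dishwasher", "washer/dryer", "fireplace", "sink",
    "bathtub", "toilet", "stairs", "tv",
}

@dataclass
class ScannedObject:
    object_type: str                    # one of the 16 types above
    center: tuple[float, float, float]  # 3D position, meters
    size: tuple[float, float, float]    # bounding-box extents, meters
    yaw: float                          # rotation about the vertical axis

@dataclass
class EnvironmentTemplate:
    walls: list                      # wall segments with 3D endpoints
    doors: list                      # door positions and widths
    windows: list                    # window positions and sizes
    objects: list[ScannedObject]     # large-object bounding boxes only
```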

Step 2: Environment-Conditioned Procedural Generation

A screen capture of Phone2Proc’s scene variation output; a few different room options based on the initial scan.

Procedural generation of simulation environments provides a vast diversity of scenes for agents to train on. Phone2Proc parses the USDZ environment template, extracting the positions of walls, doors, and windows along with the 3D bounding boxes of large objects. It then uses procedural generation to build a fully rendered scene in Unity, populating it from our database of 1,633 assets spanning 108 object types. The process is fast: a machine with eight Quadro RTX 8000 GPUs can produce 1,000 procedural scene variants in around an hour.
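
The core of the generation loop can be sketched as follows. This is a simplified illustration, assuming the EnvironmentTemplate from the Step 1 sketch and a hypothetical asset_db mapping each object type to asset records; the real pipeline additionally randomizes materials, lighting, and small-object clutter when rendering the scene in Unity.

```python
import random

def fits(asset_size, box_size, tol=0.1):
    """Check whether an asset's dimensions roughly match a scanned box."""
    return all(abs(a - b) <= tol * b for a, b in zip(asset_size, box_size))

def generate_variant(template, asset_db, rng):
    """Build one scene variant conditioned on the scanned template."""
    scene = {"walls": template.walls, "doors": template.doors,
             "windows": template.windows, "objects": []}
    for obj in template.objects:
        # Replace each scanned bounding box with a sampled asset of the
        # same type and roughly the same dimensions.
        candidates = [a for a in asset_db[obj.object_type]
                      if fits(a["size"], obj.size)]
        asset = rng.choice(candidates or asset_db[obj.object_type])
        scene["objects"].append(
            {"asset": asset["name"], "center": obj.center, "yaw": obj.yaw})
    return scene

# Given a parsed `template` (see the Step 1 sketch) and an `asset_db` dict
# of {"name", "size"} records, each seed yields a layout that agrees with
# the scan while the specific assets differ across the distribution.
variants = [generate_variant(template, asset_db, random.Random(seed))
            for seed in range(1000)]
```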

Step 3: Transfer Learning to a Physical Robot

A screen capture of the data that can now be shared to train a robot via Phone2Proc.

Once procedural generation is complete, the agent is trained in the generated virtual environments. Phone2Proc then transfers the agent’s learned policy onto a robot that navigates in the physical world, allowing for quick and easy testing of the agent’s performance in real-world settings.
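
In spirit, deployment reduces to a simple perception-to-action loop on the robot. The sketch below is a hypothetical illustration, assuming a recurrent policy exported from simulation; the file name, action set, and call signature are placeholder assumptions, not Phone2Proc’s actual interfaces.

```python
import torch

# Hypothetical deployment sketch: "objectnav_policy.pt", the action set,
# and the (observation, hidden) -> (logits, hidden) signature are
# illustrative assumptions.
policy = torch.jit.load("objectnav_policy.pt")  # trained entirely in simulation
policy.eval()

ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Done"]
hidden = torch.zeros(1, 512)  # recurrent state carried across steps

def act(rgb_frame, hidden):
    """Map one raw RGB camera frame (H x W x 3 uint8 array) to an action."""
    obs = torch.from_numpy(rgb_frame).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        logits, hidden = policy(obs.unsqueeze(0), hidden)
    return ACTIONS[int(logits.argmax(-1))], hidden

# Each control step: action, hidden = act(camera_frame, hidden)
```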

How Well Does Phone2Proc Work?

Phone2Proc has several advantages over existing approaches. Firstly, it enables the training of robots in a wide variety of real-world environments without the need for extensive physical setup. Secondly, it generates training environments that are closely matched to the real world, allowing for more accurate training of agents. Finally, the transfer learning step allows for quick and easy testing of agent performance in real-world environments.

A video demonstrating a robot that has been trained via the Phone2Proc process.

By utilizing a basic RGB camera and leveraging Phone2Proc for training, we have observed a significant enhancement in sim-to-real ObjectNav performance, with success rates improving from 34.7% to 70.7%.

A graph comparing the performance of robots trained via ProcTHOR and Phone2Proc, where Phone2Proc’s success is higher in every scenario.

These results have been consistently observed across a diverse set of over 200 trials conducted in real-world environments, including homes, offices, and RoboTHOR.

A screen grab of various 3D-generated environments.

How Robust Is Phone2Proc To Change?

Phone2Proc aims to optimize a robot’s performance in a real-world environment, and real environments are not always static or perfect: objects move, furniture gets shifted, and lighting changes. We compared Phone2Proc against other models and found that it was robust to every variation we tested, while the other models failed to adapt to changes such as moved objects and people walking through the scene. Phone2Proc’s ability to procedurally generate variations of a scanned environment helps train agents that are more adaptable and effective.

Images of the environment variations used to test Phone2Proc: lighting changes, people moving, clutter, room rearrangement, moving the target object, and a changed camera (alternate embodiment).

We believe Phone2Proc is a powerful new approach for training robots in real-world environments. By combining phone scanning, environment-conditioned procedural generation, and transferring the learned knowledge to a physical robot, Phone2Proc allows for quick and easy training of robots in a wide variety of environments.

We will present this work at #CVPR2023 in Vancouver. Phone2Proc is a collaborative effort among researchers from PRIOR @ AI2 and UW, including Matt Deitke, Rose Hendrix, Luca Weihs, Ali Farhadi, Kiana Ehsani, and Ani Kembhavi. Check out our web page for more information.

Learn more about PRIOR at prior.allenai.org. Learn more about AI2 at allenai.org and be sure to check out our open positions.

Follow @allen_ai on Twitter and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
