We take it for granted that machines can recognize what they see in photos and video. This capability rests on large datasets like ImageNet, a hand-curated collection of millions of photos used to train most of the best image recognition models of the last decade.
But the images in these datasets depict a world of curated objects – an image gallery that does not capture the clutter of everyday life as people experience it. Getting machines to see things the way we do will take a whole new approach. And Facebook’s AI lab wants to take the lead.
Its answer is a project called Ego4D, an effort to build AIs that can understand scenes and activities from a first-person perspective: how things look to the people involved, rather than to a spectator. Think shaky GoPro footage shot in the thick of the action, instead of well-framed scenes taken by someone on the sidelines. Facebook wants Ego4D to do for first-person video what ImageNet did for photos.
For the past two years, Facebook AI Research (FAIR) has worked with 13 universities around the world to assemble the largest dataset of first-person video ever, specifically to train deep-learning image recognition models. AIs trained on the dataset will be better at controlling robots that interact with people, or at interpreting images from smart glasses. “Machines will only be able to help us in our daily lives if they truly understand the world through our eyes,” says Kristen Grauman of FAIR, who leads the project.
Such technology can support people who need help at home, or guide people in tasks they are learning to perform. “The video in this dataset is much closer to how humans observe the world,” said Michael Ryoo, a computer vision researcher at Google Brain and Stony Brook University in New York who is not involved in Ego4D.
But the potential abuses are clear and worrying. The research is funded by Facebook, a social media giant that has recently been accused in the Senate of putting profits ahead of people’s well-being, a sentiment confirmed by MIT Technology Review’s own investigations.
The business model of Facebook and other Big Tech companies is to wring as much data as possible from people’s online behavior and sell it to advertisers. The AI envisioned in the project could extend that reach to people’s everyday offline behavior, revealing the objects around a person’s home, what activities she enjoys, who she spends time with, and even where her gaze lingers: an unprecedented degree of personal information.
“There is work on privacy that needs to be done when you take this out of the world of exploratory research and into something that is a product,” Grauman says. “That work may even be inspired by this project.”
Ego4D is a step change. The largest previous dataset of first-person video consists of 100 hours of footage of people in kitchens. The Ego4D dataset consists of 3,025 hours of video recorded by 855 people in 73 locations across nine countries (the US, the UK, India, Japan, Italy, Singapore, Saudi Arabia, Colombia, and Rwanda).
Participants had different ages and backgrounds; some were recruited for their visually interesting professions, such as bakers, mechanics, carpenters, and landscape gardeners.
Previous datasets typically consist of semi-scripted video clips only a few seconds long. For Ego4D, participants wore head-mounted cameras for up to 10 hours at a time and recorded first-person video of unscripted daily activities, including walking along a street, reading, doing laundry, shopping, playing with pets, playing board games, and interacting with other people. Some of the recordings also include audio, data on where the participants’ gaze was focused, and multiple perspectives on the same scene. It’s the first dataset of its kind, Ryoo says.