The company says the approach outperforms standard convolutional neural networks (CNNs) while requiring less compute.
Predicting the future, one wheel turn at a time
The most common approach to vehicular behavior prediction, Waymo claims, is to incorporate highly detailed maps into behavior prediction models by rendering the map into pixels and encoding the scene information, such as traffic signs, lanes, and road boundaries, with a CNN.
This consumes considerable time and compute, and its reliance on rendered imagery makes long-range modeling harder.
Waymo's novel approach - developed in partnership with sister company Google - simplifies map features and sensor input into either a point, a polygon, or a curve.
"For example, a lane boundary contains multiple control points that build a spline; a crosswalk is a polygon defined by several points; a stop sign is represented by a single point," Waymo describes in a blog post (a spline is a curve that connects two or more specific points).
The company believes that all road features, and the trajectories of moving objects, can be represented by sets of vectors.
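As a rough sketch of that vectorized representation (the field layout, function names, and feature ids here are illustrative assumptions, not Waymo's actual encoding), each map feature becomes an ordered set of directed vectors, each storing its start point, end point, and a feature-type id:

```python
import numpy as np

def polyline_to_vectors(points, feature_id):
    """Turn an ordered list of 2-D control points into a set of
    directed vectors [x_start, y_start, x_end, y_end, feature_id]."""
    pts = np.asarray(points, dtype=float)
    starts, ends = pts[:-1], pts[1:]
    ids = np.full((len(starts), 1), feature_id, dtype=float)
    return np.hstack([starts, ends, ids])

# A lane boundary spline (several control points), a crosswalk polygon
# (closed by repeating its first point), and a stop sign (a single
# location, represented here as a zero-length vector).
lane = polyline_to_vectors([(0, 0), (5, 0), (10, 1)], feature_id=0)
crosswalk = polyline_to_vectors([(3, -1), (3, 1), (4, 1), (4, -1), (3, -1)], feature_id=1)
stop_sign = polyline_to_vectors([(10, 2), (10, 2)], feature_id=2)

scene = [lane, crosswalk, stop_sign]
print([v.shape for v in scene])  # → [(2, 5), (4, 5), (1, 5)]
```

A moving object's observed trajectory can be encoded the same way, so the whole scene reduces to sets of vectors of differing lengths rather than a rendered image.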
"With this simplified view, we set out to design a network that could effectively process our sensor and map inputs," the authors write.
Here, things get a little more complicated. Instead of a simple CNN, the researchers used a novel hierarchical graph neural network. A graph neural network operates directly on graphs, that is, mathematical objects consisting of nodes and edges.
In Waymo's hierarchical system, each vector is treated as a node, and map data is propagated to a target node through the network.
"Through this process, the neural network captures the relationships between various vectors. These relationships occur when, for example, a car enters an intersection or a pedestrian approaches a crosswalk," Waymo notes.
"Through learning such interactions between road features and object trajectories, VectorNet’s data-driven, machine learning-based approach allows us to better predict other agents' behavior by learning from different behavior patterns."
In a research paper that contains a deeper, more technical, breakdown of VectorNet, Waymo pitted its method against conventional approaches.
R18-k3-t-r400, which the authors claim is the best-performing CNN for the task, consumed more than 200 times as many floating-point operations (FLOPs) as VectorNet for a single agent. And since a scene averages around 30 vehicles, the difference, the authors say, becomes even more stark.
At the same time, VectorNet needs just 29 percent of the parameters of a CNN. "Based on the comparison, we can see that VectorNet can significantly boost the performance while at the same time dramatically reducing computation cost," the paper states.
On prediction accuracy, the researchers claim VectorNet bests the current state-of-the-art approach, which emerged as the winner of the Argoverse Forecasting Challenge.
To fine-tune VectorNet, the company intentionally masked random map features, akin to a driver being unable to see a stop sign behind overgrown foliage. This forced the model to rely on other inputs, such as the behavior of nearby drivers, to understand and predict its environment.
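That masking trick resembles masked-token pretraining in language models. A hedged sketch of the idea, where the mask rate and the choice to hide whole polylines are illustrative assumptions rather than Waymo's published recipe:

```python
import numpy as np

def mask_map_features(polylines, mask_rate=0.15, seed=None):
    """Randomly hide whole map polylines during training so the model
    must fall back on the remaining context (e.g. other agents' motion)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(polylines)) >= mask_rate
    if not keep.any():                # never hide the entire map
        keep[rng.integers(len(polylines))] = True
    return [p for p, k in zip(polylines, keep) if k]

scene = [np.ones((n, 5)) for n in (2, 4, 1, 3)]   # four map polylines
visible = mask_map_features(scene, mask_rate=0.5, seed=42)
print(len(visible), "of", len(scene), "map features remain visible")
```

Training the network to predict the hidden features from the surviving ones is what pushes it to infer missing map context from everything else in the scene.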
Waymo said: "These improvements enable us to make better predictions creating a safer and smoother experience for our riders, and even parcels we carry on behalf of our local delivery partners.
"This will be especially beneficial as we expand to more cities where we will continue encountering new scenarios and behavior patterns. VectorNet will allow us to better adapt to these new areas, enabling us to learn more efficiently and effectively and helping us achieve our goal of delivering fully self-driving technology to more people in more places."
In addition to trials involving hundreds of cars in the real world, the company's 'Carcraft' virtual world is used to drive tens of billions of digital miles.
Last year, Waymo announced it was partnering with Google's DeepMind to improve the efficiency of its neural nets. In another display of corporate synergy, Waymo said it was drawing on the image tech found in Google Photos and Google Image Search for object recognition.