Yann LeCun’s AI Vision Realized with New Meta I-JEPA Model
Turing Award winner’s dream for more human-like AI edges closer
At a Glance
- Meta chief AI scientist Yann LeCun's dream of AI models that learn without human intervention has moved a step closer to reality.
- The JEPA approach differs from generative AI in that it predicts missing information as abstract representations, more akin to a human’s general understanding of the world.
Meta’s chief AI scientist Yann LeCun has long argued that deep learning models can learn about their surroundings without the need for human intervention. Meta has now unveiled the first model that fits that vision: I-JEPA.
I-JEPA, which stands for Image Joint Embedding Predictive Architecture, learns by building an internal model of the outside world, one that compares abstract representations of images rather than comparing the pixels themselves.
In effect, what the model learns can be applied to a variety of tasks without extensive fine-tuning. Meta likens the way the model works to how humans amass background knowledge about the world simply by passively observing it.
Meta explained: “At a high level, the I-JEPA aims to predict the representation of part of an input, such as an image or piece of text, from the representation of other parts of the same input. Because it does not involve collapsing representations from multiple views/augmentations of an image to a single point, the hope is for the I-JEPA to avoid the biases and issues associated with another widely used method called invariance-based pretraining.
“At the same time, by predicting representations at a high level of abstraction rather than predicting pixel values directly, the hope is to learn directly useful representations that also avoid the limitations of generative approaches, which underlie the large language models that have generated so much recent excitement.”
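The core idea can be illustrated with a short, simplified sketch. The snippet below is not Meta’s released code: the encoders, predictor, block indices and dimensions are invented stand-ins, and the real I-JEPA uses Vision Transformers with a predictor conditioned on target positions. It shows only the defining trait of the objective, namely that the loss is computed between predicted and target representations rather than pixels.

```python
# Simplified, hypothetical sketch of a joint-embedding predictive objective.
# All modules and sizes are invented for illustration; I-JEPA itself uses
# Vision Transformers and conditions the predictor on target positions.
import torch
import torch.nn as nn

patch_dim, embed_dim = 16 * 16 * 3, 256            # made-up dimensions
context_encoder = nn.Linear(patch_dim, embed_dim)  # stand-in for a ViT encoder
target_encoder = nn.Linear(patch_dim, embed_dim)   # a momentum copy in the paper
predictor = nn.Linear(embed_dim, embed_dim)        # a narrow ViT in the paper

patches = torch.randn(1, 196, patch_dim)           # one image as 14x14 patches
context_idx = torch.arange(0, 100)                 # sampled context block
target_idx = torch.arange(100, 120)                # sampled target block

with torch.no_grad():                              # targets receive no gradient
    target = target_encoder(patches)[:, target_idx].mean(dim=1)

context = context_encoder(patches[:, context_idx]).mean(dim=1)
pred = predictor(context)                          # predict the target block's representation

loss = (pred - target).pow(2).mean()               # distance in embedding space, not pixel space
loss.backward()
```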
JEPA vs. Generative AI
Generative AI learns by removing or distorting portions of the input to the model and then trying to predict the missing word or pixels. Meta argues that generative models are more prone to mistakes as they try to fill in every bit of missing information, “even though the world is inherently unpredictable.”
Meta AI researchers argue that the JEPA approach can predict missing information in an abstract representation that is “more akin to the general understanding people have.”
“Compared to generative methods that predict in pixel/token space, I-JEPA uses abstract prediction targets for which unnecessary pixel-level details are potentially eliminated, thereby leading the model to learn more semantic features,” Meta said.
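To make that contrast concrete, the hypothetical snippet below places a masked-autoencoder-style pixel reconstruction loss next to a JEPA-style representation prediction loss. The tensors and layers are placeholders for illustration, not anything from Meta’s codebase.

```python
# Hypothetical contrast between a generative (pixel-space) objective and a
# JEPA-style (embedding-space) objective. Shapes and modules are placeholders.
import torch
import torch.nn as nn

patch_dim, embed_dim = 768, 256
decoder = nn.Linear(embed_dim, patch_dim)      # generative head: embeddings -> pixels
predictor = nn.Linear(embed_dim, embed_dim)    # JEPA head: embeddings -> embeddings

context_embedding = torch.randn(1, embed_dim)  # encoding of the visible content
missing_pixels = torch.randn(1, patch_dim)     # ground-truth pixels of a masked patch
target_embedding = torch.randn(1, embed_dim)   # target encoder's abstract representation

# A generative method must account for every pixel of the missing patch...
generative_loss = (decoder(context_embedding) - missing_pixels).pow(2).mean()
# ...while a JEPA only has to match the patch's abstract representation.
jepa_loss = (predictor(context_embedding) - target_embedding).pow(2).mean()
```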
JEPA pretraining is also computationally efficient: unlike invariance-based methods, it involves no overhead from applying computationally intensive data augmentations to produce multiple views of each image. Only one view of an image needs to be processed by the target encoder, and only the context blocks need to be processed by the context encoder.
According to Meta’s researchers, this is how I-JEPA learns strong off-the-shelf semantic representations without the use of hand-crafted view augmentations.
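“Off the shelf” here means the pretrained encoder can be frozen and used as-is, with only a small head trained on top. Below is a minimal, hypothetical linear-probe sketch; the backbone and dimensions are stand-ins rather than Meta’s released model.

```python
# Hypothetical linear-probe sketch: the pretrained backbone stays frozen and
# only a small linear classifier is trained on its representations.
import torch
import torch.nn as nn

feature_dim, embed_dim, num_classes = 768, 256, 10  # made-up sizes
backbone = nn.Linear(feature_dim, embed_dim)         # stand-in for the frozen pretrained encoder
for p in backbone.parameters():
    p.requires_grad = False                          # no fine-tuning of the backbone

probe = nn.Linear(embed_dim, num_classes)            # the only trainable part
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01)

images = torch.randn(8, feature_dim)                 # placeholder image features
labels = torch.randint(0, num_classes, (8,))

logits = probe(backbone(images))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```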
The research team said that by applying the JEPA approach, they were able to train a 632 million-parameter model on 16 A100 GPUs in just 72 hours, while other methods typically take two to 10 times as many GPU hours.
LeCun has been a vocal skeptic of generative AI tools like ChatGPT. He said at an event earlier this year that generative AI tools have “no knowledge of the world around them” and lack context. The Meta chief AI scientist likened such tools to “typing, writing aids.”
Is human-level intelligence in AI closer?
On unveiling I-JEPA, Meta described it as “a step closer to human-level intelligence in AI.”
The Facebook parent said the model “demonstrates the potential of architectures for learning competitive off-the-shelf image representations without the need for extra knowledge encoded through hand-crafted image transformations.”
Meta researchers are now looking at applying the JEPA approach to more general models built from richer modalities, such as making long-range spatial and temporal predictions about future events in a video from a short context.
The company is also looking to extend the approach to other domains, such as image-text paired data and video data.
“In the future, JEPA models could have exciting applications for tasks like video understanding. This is an important step towards applying and scaling self-supervised methods for learning a general model of the world,” Meta said.
A paper outlining the JEPA approach has been published on arXiv, and the code and model checkpoints have been open-sourced.