Mitsubishi Scientist: Are Deep Neural Networks Smarter than Second Graders?

Anoop Cherian, senior principal research scientist at Mitsubishi Electric Research Laboratories, pitted ChatGPT against young kids in solving puzzles. Guess who won?

Deborah Yao, Editor

November 1, 2023


Anoop Cherian, senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), pitted ChatGPT and GPT-4 against second graders to see who is better at solving puzzles. On the AI Business podcast, he discussed his findings, published in the paper "Are Deep Neural Networks SMARTer than Second Graders?"

Listen to the conversation or read the edited transcript.

Can you tell me more about Mitsubishi Electric Research Labs and your role there?

Mitsubishi Electric Research Labs is a subsidiary of Mitsubishi Electric Corporation, the Japanese company, and we are part of the company's corporate research and development organization. We have filed more than 1,600 patents. We do application-driven basic research in a lot of areas in science. It's not just AI or robotics; we do research in multi-physical systems, control systems, dynamic modelling and a lot of other things, including AI and robotics.

I'm a senior principal research scientist in the computer vision group at Mitsubishi Electric Research Labs, based in Cambridge, Massachusetts. I'm fascinated by deep neural networks and machine learning in general, with applications to computer vision problems. I'm also currently quite excited about multimodal reasoning, that is, not just computer vision but combining audio, language, text, tactile sensing and other modalities into a comprehensive reasoning pipeline.

I’m also now interested in more cognitive understanding problems and building such knowledge bases that humans have into reasoning processes in machine learning methods. That's a topic I've been very excited about in the recent past.

One of your recent papers, which you co-authored with your colleagues at Mitsubishi and folks from MIT, has a really intriguing title: “Are Deep Neural Networks SMARTer than Second Graders?” Can you tell us more about this paper and what you and your co-authors set out to discover?

This paper came out of a desire to understand and objectively analyze the capabilities of deep neural networks. We have seen deep neural networks accomplish a lot of feats that would seem to require superior cognitive abilities, for example, passing bar exams, or generating content, like images and audio novels, surpassing human artists and human capabilities.

So the fundamental question is: how do you objectively evaluate deep neural networks' capabilities? As you know, the way you evaluate human intelligence is using IQ tests. There are different types of IQ tests that capture different aspects of human intelligence, and they provide you with an objective understanding of a person’s capabilities. Is there a possibility that we can use such IQ tests for evaluating AI? Perhaps IQ tests for machine learning algorithms are a bit far-fetched, because I would think AI is not really at a stage where it has the capabilities of adult reasoning processes.

That was when I came across a set of objective puzzles from second grader Olympiads. And I thought, ‘are these deep neural networks really capable of solving such puzzles, which a second grader can solve easily?’ So we created this dataset and put all the state-of-the-art deep neural networks out there to the test to understand how capable they are of solving these puzzles. Our findings were really surprising, but at the same time, from the point of view of a person who has done machine learning research for about 15 years, they are also not so surprising.

What are some of the findings of your paper?

To set the premise, the paper builds on what we call simple multimodal algorithmic reasoning tasks. It's simple because it's based on second grader puzzles; it's multimodal because it has vision and language in it. It is not really looking at the perception side of reasoning, as in, you see an image of a cat and you say it's a cat, and that's that. Deep neural networks are very capable of doing that.

But then if you are given a problem, how do you derive an algorithm or a process to solve the problem? For example, sometimes you have to derive a mathematical formula to solve a problem that kids can come up with very, very easily. And so that is the algorithmic reasoning part of this challenge that we have put out. That is why it's called simple, multimodal algorithmic reasoning tasks.

What we set out to evaluate is this: given a very simple problem, can deep neural networks come up with the algorithm to solve it? To do so, they have to handle the perception side, that is, look at the image, understand the language, and understand the language with respect to the image, and then come up with a formula to solve the problem automatically.

And we found from our evaluation of several state-of-the-art deep neural networks that their performance is not better than random. Kids reach an accuracy of about 70% to 75%, whereas deep neural networks, even the latest large language models, are in the ballpark of 35%, which is roughly half of what kids achieved. There is a significant gap to fill. And so the main message out of this is that maybe neural networks are not as intelligent as they are perceived to be.
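As a quick check on the size of that gap, the quoted figures can be put through some simple arithmetic. This is only an illustrative sketch: the numbers are the rounded ones from the interview, and the variable and function names are made up for this example.

```python
# Rounded figures quoted in the interview (not exact benchmark results).
kids_accuracy_range = (0.70, 0.75)  # "70% to 75%"
model_accuracy = 0.35               # "in the ballpark of 35%"


def fraction_of(kids_accuracy: float, model: float = model_accuracy) -> float:
    """Return the model accuracy as a fraction of the kids' accuracy."""
    return model / kids_accuracy


for kids in kids_accuracy_range:
    print(f"kids at {kids:.0%}: models reach {fraction_of(kids):.0%} of that")
```

On these rounded numbers, the models land at half of the kids' accuracy or slightly below it.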

Which models did you test and why do you believe there's such a gap?

We tested the latest large language models, including ChatGPT, GPT-4, and GPT-4 through the Bing interface, which has three different modes: creative, precise and balanced. These are different settings that you can adjust to produce more diverse results, or less diverse but more precise results. We also evaluated Google's Bard. Those were the main platforms or large language models we evaluated on our data set.

Our original data set consisted of 101 puzzles, and they are visual language puzzles, whereas these large language models are text only. So they don't understand the images; it's only the language part that they can understand and reason on. So we created a subset of our original puzzles. The puzzles in the subset have only text in their questions, or they might have an image, but the image is irrelevant for understanding and answering the question. We used only this subset, and the evaluation was based on that.

Can you tell me an example of one of these puzzles?

Our data set is not focused on one specific type of puzzle; it is spread across multiple skill sets that the neural network or the algorithm should have to solve a puzzle. The skills range from counting to tracing a path on an image, and could also include arithmetic or algebra skills. But these are not skills that require in-depth knowledge; these are skills that a basic reasoning process should have.

A kid will possess these skills, even though they might not know that it's algebra, but they kind of understand that this is how it has to be. And they are implicitly doing algebra. Those are the kinds of puzzles that we have.

One example of a puzzle could be that there is a pizza party, and there are several slices of pizza. Let’s say a person invited three of her friends to the party, and each person ate two slices of pizza. If there were 20 slices of pizza, how many slices would be left at the end of the party? This is a typical puzzle that we have in our data set.

This is a very simple puzzle that people can solve quickly. But apparently, this puzzle happened to be too tough for deep neural networks. On most of the platforms, the model counted the invited friends but did not count the person who invited them. If each person eats two slices of pizza and there are three friends plus the one person who invited them, there are actually four people. But the neural networks consistently considered only the three friends: three times two is six, so six slices were eaten, and the inviter was not counted. This was a consistent mistake across most of the platforms we tested.

From a social perspective, the person who invited friends is also a person, part of the gang. I think that was missed. Kids understand that this is a person as well.
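The arithmetic behind the pizza puzzle can be written out as a small sketch. The numbers (20 slices, three friends, two slices each) come from the interview; the function name `slices_left` and the `count_host` flag are hypothetical, introduced here only to contrast the intended reading with the mistake described above.

```python
def slices_left(total_slices: int, friends_invited: int,
                slices_per_person: int, count_host: bool = True) -> int:
    """Return the slices remaining after everyone at the party has eaten."""
    # The host is a guest at her own party, so she eats too (unless we
    # deliberately reproduce the models' mistake with count_host=False).
    people = friends_invited + (1 if count_host else 0)
    return total_slices - people * slices_per_person


# Numbers from the interview: 20 slices, three friends, two slices each.
intended = slices_left(20, 3, 2)                       # host + 3 friends
model_error = slices_left(20, 3, 2, count_host=False)  # friends only
print(f"host counted: {intended} left; host ignored: {model_error} left")
```

Counting the host gives four people, eight slices eaten and 12 left; leaving her out gives six slices eaten and 14 left, which is the mistake reported for most platforms.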

Your findings are remarkable given that there's so much press around ChatGPT passing bar exams, getting into a business school, etc. How do you explain that discrepancy? Is it a matter of one being more general and one being very precise?

It's actually very surprising to me as well. There could be multiple reasons for this. One could be that there could be a lot of data from say, training schools and other places out there. … Maybe there is a lot of course material out there, over the internet or maybe paid services and such. There is a lot of material that could influence deep neural networks to train and master those skills much more than maybe a second grader puzzle set. This is basically an after effect of a huge bias in the training sets used for teaching neural networks. …

Another thing is maybe those are the kinds of questions that they have in their evaluations. They probably need only certain types of skills, maybe algebra skills, or arithmetic skills (for their models). (To pass our test, they) need a composition of skills: first do counting, for example, and then you do arithmetic, and then do some path tracing. So it's this compositionality that brings in extra challenges for deriving the algorithm that makes it even harder. So there are different levels of problems that I can think of, that could be the reason for this.

ChatGPT just recently became multimodal. Would multimodality make AI models perform better?

… Given all the problems with large language models so far, and our inability to understand their reasoning process, I would think multimodality would perhaps bring in more significant challenges. Understanding and analyzing them, I think, is going to be significantly more difficult. … But we are looking forward to at least evaluating (multimodal ChatGPT) on the full set of puzzles in our data set to understand whether it's actually going to solve the problem or not.

What are the practical implications of your research specifically for enterprise users?

I think the practical implications are enormous. The usefulness can be explained in three terms: abstraction, reduction and generalization. Abstraction because (these are common problems to be solved). For example, you have a bunch of boxes, each with a certain weight, and you want to put these boxes on shelves. You want to have all the heavy items on the lower levels of a shelf and lighter ones at the top. So you need to figure out a way to arrange these boxes. … So the problem of sorting is the critical problem here.


(To solve this problem) only the weight of these boxes should be known. You have to abstract away all other details, except for the weight, and how you abstract it is a part of intelligence. You have to figure out what is the important element of algorithmic reasoning to come up with a solution to this problem. That is the abstraction part of the problem, which is essential to any reasoning exercise.

The second is reduction: What is the procedure? You have to figure out the sorting algorithm to stack these boxes on the shelf. You have to first put the heaviest box on the lowest level of the shelves, and then go up in the hierarchy. That sorting is the reduction: how do you come up with the solution to solve this problem? And the third is generalization. The sorting (solution) that you figured out … could be generalized to other problems where sorting is an essential component.
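The box-stacking example can be sketched in a few lines to show where each of the three terms appears. This is an illustration under stated assumptions, not code from the paper: the `Box` fields, the `arrange_bottom_up` helper and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Box:
    label: str        # hypothetical detail, irrelevant to the task
    color: str        # hypothetical detail, irrelevant to the task
    weight_kg: float  # the only attribute the task depends on


def arrange_bottom_up(items: Sequence[T], key: Callable[[T], float],
                      per_level: int) -> List[List[T]]:
    """Group items into levels, heaviest first; index 0 is the lowest level."""
    if per_level < 1:
        raise ValueError("per_level must be at least 1")
    # Reduction: the stacking problem becomes a descending sort on one key.
    ordered = sorted(items, key=key, reverse=True)
    return [list(ordered[i:i + per_level])
            for i in range(0, len(ordered), per_level)]


boxes = [Box("A", "red", 2.0), Box("B", "blue", 9.5),
         Box("C", "green", 5.0), Box("D", "red", 1.0)]

# Abstraction: the key function keeps only the weight and ignores the rest.
shelves = arrange_bottom_up(boxes, key=lambda b: b.weight_kg, per_level=2)
for level, shelf in enumerate(shelves):
    print(level, [b.label for b in shelf])

# Generalization: the same routine orders anything with a numeric key.
print(arrange_bottom_up(["fig", "banana", "kiwi"], key=len, per_level=1))
```

Passing the key as a parameter is a design choice made for this sketch: it keeps the abstraction step (what matters) separate from the reduction step (sort and fill from the bottom), which is what lets the same routine be reused on other data.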

These three elements are critical to any algorithmic reasoning and real-world problem-solving. In our data set, we have puzzles, but in those puzzles we have abstracted away a lot of real-world information and focused only on the essentials that need to be understood to solve the problem. These three elements are what the puzzles in our data set are built around. They have a lot of applications in any setting, whether it be manufacturing, food processing, or anything else you can think of. We are essentially looking at the foundations of intelligence in problem-solving through the data set and our approach.

How should companies approach the risks, such as inaccuracies, when they are trying to deploy these models?

That's a good question. The risks are enormous, depending upon the criticality of the application. Like I mentioned before, the variance in the predictions of these large language models needs to be within a bound, and how do you define that bound? And how do you ensure that the predictions don't have uncertainty beyond such limits? That is an important factor that needs to be understood if you really want to use such models in real-world applications.

That is also the biggest challenge with large language models currently. Their reasoning processes are so fluid that you cannot really say, ‘oh, it is going to create this kind of an answer for this kind of a question.’ It is stochastic. … We don't really know how to characterize the risks. I think that is one of the biggest challenges that could be detrimental to a lot of applications, so we have to be wary about that.

What's next for your research?

We are looking at expanding this dataset and making it more objective, and also looking at other generative aspects of evaluating AI, and at designing neural networks for beating our own puzzles. So what would be an approach that allows neural networks to do algorithmic reasoning? As I said before, deriving an algorithm is, I think, an area of AI that is still in its infancy. We have done a lot of work and made a lot of progress in perception and perceptual understanding.

But algorithmic understanding is still not there. And so my research is targeted towards building models that can do algorithmic reasoning. And that's what we are working towards and this data set and paper that we wrote are all building blocks towards that big goal.

About the Author(s)

Deborah Yao


Deborah Yao runs the day-to-day operations of AI Business. She is a Stanford grad who has worked at Amazon, Wharton School and Associated Press.
