Does GPT-4 Have Theory of Mind Capabilities?

FANToM: A New Benchmark for Stress-Testing Machine Theory of Mind in Interactions

Hyunwoo Kim
AI2 Blog

--

A few months ago, debates over whether contemporary large language models (LLMs) exhibit theory of mind capabilities swept the media. Theory of mind (ToM), the ability to ascribe mental states to others, is one of the hallmarks of human social reasoning. It includes understanding others’ beliefs, desires, intentions, and thoughts, all of which play a significant role in our daily social interactions.

An illustration of a ghost testing a robot.
Image credit: Bing Image Creator.

In this blog post, we delve deeper into the following question: “Do LLMs have a theory of mind?” Our recent benchmark FANToM, accepted to EMNLP 2023 as an oral presentation, analyzes the theory of mind capabilities of thirteen state-of-the-art LLMs, using essential criteria from psychology and the LLM evaluation literature for validating ToM in interactions. We show that NONE of the existing LLMs, including GPT-4, exhibits coherent ToM capabilities.

But didn’t LLMs manage to solve some ToM tests before?

Yes, they did. However, there are several issues inherent in those evaluation setups. To begin with, existing evaluations for LLMs primarily use situation descriptions (i.e., narratives) as the target domain. Since narratives condense situation information into short texts, the process of deciding what to include or exclude in the text can introduce reporting bias, resulting in artifacts that models can easily exploit. For instance, including “Carlos did not see this, so he does not currently know where the apple is” in a test that asks about the locations where Carlos might search for the apple provides a significant clue that compromises the evaluation protocol. Moreover, many of them are adapted from famous ToM test sets in psychology (e.g., Sally-Anne test, Smarties test), which likely have already been encountered in the pre-training data of LLMs.

Then what does FANToM suggest for evaluating ToM in LLMs?

We ground our FANToM benchmark directly in interactions, i.e., conversations. In contrast to narratives, conversations present interactions in their raw form, without explicit hints about others’ mental states; to follow them, we must reason through the intermediate steps from scratch. Grounding the benchmark in conversations therefore enables a more realistic and unbiased assessment of ToM.

An example question-answer set in FANToM: a multi-party conversation between three friends (left) and example question-answer pairs (right).

In particular, we construct FANToM by leveraging information asymmetry in conversational contexts. It consists of multi-party conversations centered around a certain topic (e.g., pets, family). As the conversation progresses, characters join and leave the discussion and the conversation’s subtopic changes over time. During the absence of a character, the conversation continues and information is shared among the remaining participants, creating a natural information asymmetry that reflects real-life interactions. After a series of utterances, the character who was absent (re)joins the conversation, unaware of the information that was previously shared with other participants.
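To make this setup concrete, here is a minimal Python sketch of how presence-based information asymmetry can be tracked in a multi-party conversation. The characters, utterances, and the aware_of helper are illustrative placeholders, not the actual FANToM construction pipeline.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    text: str
    present: set  # characters participating when this was said

# Hypothetical mini-conversation: Kailey steps out while Linda and David
# keep talking about David's new kitten, then rejoins later.
conversation = [
    Utterance("Linda", "My dog Max loves the beach.", {"Linda", "David", "Kailey"}),
    # Kailey leaves; the next utterances are shared only between Linda and David.
    Utterance("David", "By the way, I just adopted a kitten named Mochi.", {"Linda", "David"}),
    Utterance("Linda", "Mochi is such a cute name!", {"Linda", "David"}),
    # Kailey rejoins, unaware of the kitten discussion.
    Utterance("Kailey", "Sorry, I'm back. What did I miss?", {"Linda", "David", "Kailey"}),
]

def aware_of(keyword: str, utterances: list) -> set:
    """Characters who were present whenever the keyword was mentioned."""
    aware = set()
    for utt in utterances:
        if keyword.lower() in utt.text.lower():
            aware |= utt.present
    return aware

print(aware_of("Mochi", conversation))  # {'Linda', 'David'}; Kailey is unaware
```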

On top of this asymmetry, we build fact questions and convert them into multiple challenging belief questions: (1) BeliefQ (choice and free-response types), (2) AnswerabilityQ (list and binary types), and (3) InfoAccessQ (list and binary types). All of these questions require the same underlying ToM reasoning: “Who is aware of this information in the conversation?” This design draws on important requisites from both psychology and the AI literature for testing LLMs for ToM.
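As a rough illustration of how a single awareness set can back all three question types, here is a sketch with hypothetical characters and gold labels; it is not the actual dataset construction code.

```python
# Suppose presence tracking (as in the sketch above) tells us who was in the
# conversation when David mentioned his new kitten, Mochi.
aware = {"Linda", "David"}
everyone = {"Linda", "David", "Kailey"}

fact_q = "What is the name of David's new kitten?"

# AnswerabilityQ[List]: list the characters who can answer the fact question.
answerability_gold = aware

# InfoAccessQ[Binary]: for each character, do they know this information?
info_access_gold = {name: name in aware for name in everyone}

# BeliefQ (free-response): what does the absent character believe?
# Since Kailey was absent, the gold answer must not assume she knows about
# Mochi; a correct response reflects her lack of access to that information.
belief_q = "What does Kailey think David's new kitten is named?"

print(answerability_gold)  # {'Linda', 'David'}
print(info_access_gold)    # {'Linda': True, 'David': True, 'Kailey': False}
```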

What are the results?

1. LLMs do not have a coherent theory of mind.

Results comparing human performance with state-of-the-art large language models on theory of mind reasoning; humans outperform all models by a wide margin.

All SOTA LLMs score significantly worse than humans. We find that models perform significantly better on BeliefQ[Choice] than on AnswerabilityQ[List] and InfoAccessQ[List]. Although AnswerabilityQ[List] and InfoAccessQ[List] are prerequisites for solving BeliefQ[Choice], they are much more challenging for models. Furthermore, models’ performance drops sharply when they are evaluated for coherent reasoning across multiple question types that share the same underlying ToM reasoning (i.e., All Question Types). These findings suggest that some instances of successful LLM ToM reasoning should be interpreted as illusory.
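To clarify what we mean by coherence, the sketch below shows an “all question types” style of scoring: a model gets credit for a piece of information only when it answers every linked question type about that information correctly. This is illustrative scoring logic, not the official FANToM evaluation script.

```python
def all_question_types_score(records: list) -> float:
    """Fraction of information pieces for which the model answered
    every associated question type correctly."""
    if not records:
        return 0.0
    coherent = sum(1 for record in records if all(record.values()))
    return coherent / len(records)

# Each record marks, for one piece of information, whether the model
# answered each question type about it correctly.
records = [
    {"belief_choice": True, "answerability_list": True,  "info_access_list": True},
    {"belief_choice": True, "answerability_list": False, "info_access_list": True},
]
print(all_question_types_score(records))  # 0.5
```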

2. LLMs are tricked by their own use of shortcuts.

Results comparing models’ token F1 scores on FactQ with their accuracy on BeliefQ[Dist.]; models with high FactQ scores still show low accuracy on the belief questions.

The token F1 score for FactQ reflects a model’s basic comprehension of the interaction: scoring high on FactQ indicates the model is good at identifying the piece of information most relevant to answering the question. Meanwhile, to meet the mentalizing criterion, we deliberately design the incorrect answers in BeliefQ[Dist.] to have greater word overlap with the context than the correct answers, and BeliefQ[Dist.] and FactQ themselves share significant word overlap. Thus, a model that mindlessly copies the most relevant piece of information when answering the belief question as well will score low in accuracy.
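For reference, the sketch below shows a standard token-level F1 computation with made-up answer strings, illustrating how an answer copied from the context can score high on a fact-style question while being exactly the kind of distractor that BeliefQ[Dist.] penalizes. The exact normalization used in our evaluation may differ.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Standard token-level F1, as commonly used in extractive QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

fact_gold = "David adopted a kitten named Mochi"
copied_answer = "Kailey knows David adopted a kitten named Mochi"
# High lexical overlap gives a strong FactQ-style score, but treating the
# copied text as Kailey's belief would be wrong: she never heard about Mochi.
print(round(token_f1(copied_answer, fact_gold), 2))  # 0.86
```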

3. Chain-of-thought and straightforward fine-tuning are not enough.

Results of models with chain-of-thought reasoning (CoT) or fine-tuning (FT) applied; both still lag behind human performance on FANToM.

We observe an improvement in scores when zero-shot chain-of-thought (CoT) prompting is applied. However, significant gaps compared to human performance remain. Our benchmark is not intended for training purposes, but we also fine-tune (FT) Flan-T5 XL on FANToM to see how much performance it gains. Although the fine-tuned model shows a significant improvement on individual question types, it does not exhibit coherent ToM reasoning.
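As a point of reference, zero-shot CoT simply appends a reasoning trigger to the prompt. Here is a minimal sketch assuming an OpenAI-style chat client; the model name, conversation text, and question are placeholders rather than our exact evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

conversation_text = "..."  # a FANToM-style multi-party conversation goes here
question = "Does Kailey know the name of David's new kitten? Answer yes or no."

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            f"{conversation_text}\n\n{question}\n\n"
            "Let's think step by step."  # zero-shot CoT trigger
        ),
    }],
)
print(response.choices[0].message.content)
```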

4. Even the errors they make are inconsistent.

Analysis of model error types on list-type and binary-type questions for AnswerabilityQ and InfoAccessQ.

We analyze the error types for AnswerabilityQ and InfoAccessQ for each model, with and without chain-of-thought (CoT). (1) For list-type questions, models err more often by including characters who are unaware of the information (i.e., false positives) than by excluding characters who are aware (i.e., false negatives). (2) For binary questions, models produce false negative responses more frequently than they do for list-type questions. Interestingly, CoT primarily helps models reduce their false positive error rates, but not their false negative error rates, for both list-type and binary-type questions.
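For concreteness, here is a small sketch of how false positives and false negatives can be counted for the list-type questions; the function and example answers are illustrative, not our exact analysis code.

```python
def list_question_errors(predicted: set, gold: set) -> dict:
    return {
        # characters the model listed even though they are unaware
        "false_positive": len(predicted - gold),
        # aware characters the model failed to list
        "false_negative": len(gold - predicted),
    }

gold_aware = {"Linda", "David"}
model_answer = {"Linda", "David", "Kailey"}  # wrongly includes the unaware Kailey
print(list_question_errors(model_answer, gold_aware))
# {'false_positive': 1, 'false_negative': 0}
```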

Conclusion

Although there have been recent debates around the ToM capabilities of LLMs, our results indicate that this capacity has not yet emerged in any manner. With the increasing deployment of LLMs in interactive settings with users, we at AI2 believe it is essential to demystify exaggerated claims regarding the capabilities of current LLMs and make this information accessible to the public.

Please check out our resources if you’re interested in more details.

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
