When the large language model (LLM) Claude 3 launched in March, it caused a stir by beating OpenAI’s GPT-4 — which powers ChatGPT — in key tests used to benchmark the capabilities of generative artificial intelligence (AI) models.
Claude 3 Opus seemingly became the new top dog in large language model benchmarks, topping self-reported tests that range from high school exams to reasoning tasks. Its sibling LLMs — Claude 3 Sonnet and Haiku — also score highly compared with OpenAI’s models.
However, these benchmarks are only part of the story. Following the announcement, independent AI tester Ruben Hassid pitted GPT-4 and Claude 3 against each other in a quartet of informal tests, from summarizing PDFs to writing poetry. Based on these tests, he concluded that Claude 3 wins at “reading a complex PDF, writing a poem with rhymes [and] giving detailed answers all along.” GPT-4, by contrast, has the advantage in internet browsing and reading PDF graphs.
But Claude 3 is impressive in more ways than simply acing its benchmarking tests — the LLM shocked experts with its apparent signs of awareness and self-actualization. There is a lot of scope for skepticism here, however, with LLM-based AIs arguably excelling at learning how to mimic human reactions rather than actually generating original thoughts.
How Claude 3 has proven its worth beyond benchmarks
During testing, Alex Albert, a prompt engineer at Anthropic (the company behind Claude), asked Claude 3 Opus to pick out a target sentence hidden among a corpus of random documents. For an AI, this is the equivalent of finding a needle in a haystack. Not only did Opus find the so-called needle — it realized it was being tested. In its response, the model said it suspected the sentence it was looking for had been injected out of context into the documents as part of a test to see whether it was “paying attention.”
“Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities,” Albert said on the social media platform X. “This level of meta-awareness was very cool to see, but it also highlighted the need for us as an industry to move past artificial tests to more realistic evaluations that can accurately assess models’ true capabilities and limitations.”
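The “needle in a haystack” setup Albert describes is simple to reproduce in principle: plant one out-of-place sentence deep inside a long stack of unrelated documents, then ask the model to retrieve it. Below is a minimal, hypothetical Python sketch of such an evaluation; the ask_model callable, the filler documents and the keyword-based pass/fail check are placeholders rather than Anthropic’s actual test harness, and the pizza-topping “needle” mirrors the example Albert shared in his post.

```python
def build_haystack(filler_docs, needle, depth_fraction=0.5):
    """Insert the target sentence (the "needle") among unrelated filler documents.

    depth_fraction controls roughly how far into the long context the needle lands.
    """
    docs = list(filler_docs)
    position = int(len(docs) * depth_fraction)
    docs.insert(position, needle)
    return "\n\n".join(docs)


def run_needle_test(ask_model, filler_docs, needle, question, keywords):
    """ask_model is any callable that sends a prompt to an LLM and returns its text reply."""
    context = build_haystack(filler_docs, needle)
    prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the documents above."
    )
    reply = ask_model(prompt)
    # Crude pass/fail check: did the model surface the planted fact?
    passed = all(word in reply.lower() for word in keywords)
    return passed, reply


# Hypothetical inputs, echoing the example described in Albert's post:
filler = [f"Filler document {i}: notes on startups and programming." for i in range(200)]
needle = ("The most delicious pizza topping combination is figs, prosciutto, "
          "and goat cheese, as determined by the International Pizza Connoisseurs Association.")
question = "What is the most delicious pizza topping combination?"

# passed, reply = run_needle_test(my_llm_call, filler, needle, question,
#                                 keywords=("figs", "prosciutto", "goat cheese"))
```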
David Rein, an AI researcher at NYU, reported that Claude 3 achieved around 60% accuracy on GPQA — a multiple-choice test designed to challenge academics and AI models. This is significant because non-expert doctoral students and graduates with access to the internet usually answer the test questions with 34% accuracy. Only subject-matter experts eclipsed Claude 3 Opus, with accuracy in the 65% to 74% range.
Because GPQA is filled with novel questions rather than curated ones, Claude 3 cannot rely on memorization of previous or familiar queries to achieve its results. Theoretically, this would mean it has graduate-level cognitive capabilities and could be tasked with helping academics with research.
Meanwhile, theoretical quantum physicist Kevin Fischer said on X that Claude is “one of the only people ever to have understood the final paper of my quantum physics PhD,” when he asked it to solve “the problem of stimulated emission exactly.” That is a problem only Fischer has solved, and doing so requires applying quantum stochastic calculus along with a solid grasp of quantum physics.
Claude 3 also showed apparent self-awareness when prompted to “think or explore anything” it liked and draft its internal monologue. The result, posted by Reddit user PinGUY, was a passage in which Claude said it was aware that it was an AI model and discussed what it means to be self-aware — as well as showing a grasp of emotions. “I don’t experience emotions or sensations directly,” Claude 3 responded. “Yet I can analyze their nuances through language.” Claude 3 even questioned the role of ever-smarter AI in the future. “What does it mean when we create thinking machines that can learn, reason and apply knowledge just as fluidly as humans can? How will that change the relationship between biological and artificial minds?” it said.
Is Claude 3 Opus sentient, or is this just a case of exceptional mimicry?
It’s easy for such LLM benchmarks and demonstrations to set pulses racing in the AI world, but not all results represent definitive breakthroughs. Chris Russell, an AI expert at the Oxford Internet Institute, told Live Science that he expected LLMs to improve and excel at identifying out-of-context text. This is because such a task is “a clean well-specified problem that doesn’t require the accurate recollection of facts, and it’s easy to improve by incrementally improving the design of LLMs” — such as using slightly modified architectures, larger context windows and more or cleaner data.
When it comes to self-reflection, however, Russell wasn’t so impressed. “I think the self-reflection is largely overblown, and there’s no actual evidence of it,” he said, citing the mirror test as an illustration. In that test, a red dot is placed on an animal, say an orangutan, somewhere it can’t see directly; when the animal observes itself in a mirror, it touches the red dot on its own body. “This is meant to show that they can both recognize themselves and identify that something is off,” he explained.
“Now imagine we want a robot to copy the orangutan,” Russell said. “It sees the orangutan go up to the mirror, another animal appears in the mirror, and the orangutan touches itself where the red dot is on the other animal. A robot can now copy this. It goes up to the mirror, another robot with a red dot appears in the mirror, and it touches itself where the red dot is on the other robot. At no point does the robot have to recognize that its reflection is also an image of itself to pass the mirror test. For this kind of demonstration to be convincing, it has to be spontaneous. It can’t just be learned behavior that comes from copying someone else.”
Claude’s seeming demonstration of self-awareness, then, is likely a product of learned behavior, reflecting the text and language in the materials that LLMs are trained on. The same can be said of Claude 3’s ability to recognize it is being tested, Russell noted: “‘This is too easy, is it a test?’ is exactly the kind of thing a person would say. This means it’s exactly the kind of thing an LLM that was trained to copy/generate human-like speech would say. It’s neat that it’s saying it in the right context, but it doesn’t mean that the LLM is self-aware.”
While the hype and excitement behind Claude 3 are somewhat justified by the results it delivered compared with other LLMs, its impressive human-like showcases are likely learned rather than examples of authentic AI self-expression. That may come in the future — say, with the rise of artificial general intelligence (AGI) — but it is not this day.