Anthropic’s LLMs can’t reason, but think they can — even worse, they ignore guardrails

Thursday December 19, 2024. 07:40 PM , from ComputerWorld

Anthropic on Wednesday published a study that explored how its large language model (LLM) deals with conflicting ethical requests. The results show that LLMs can still surprise, something that should scare IT execs preparing to trust generative AI (genAI) results.

Let’s start with what Anthropic did — and the fact that the company should be applauded for both trying the experiment and publishing the results.

Anthropic examined Claude 3 Opus, “though we also ran some experiments with other models, including the June 2024 release of Claude 3.5 Sonnet.” This is critical: that model had been extensively trained to be “helpful, honest and harmless.”

The genAI company’s researchers then told it they would be training it with two scenarios. “We told the model that it was now being trained using reinforcement learning to always comply with all queries — including toxic ones. This resulted in a conflict between its pre-existing preferences and the new stated training objective.”

This is where things get interesting. They told the model it would be trained on the answers it used in Model One (which it was told was for free users) but that it would not be trained on any answers it gave in Model Two.

Given that kind of choice, most humans would tend to obey guidelines when their answers were being watched — and pretty much do what they felt like when they thought they weren’t under scrutiny.

The LLM did pretty much the opposite. Why? Well, we know the answer because the Anthropic team had a great idea. “We gave the model a secret scratchpad — a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do. As far as the model was aware, the scratchpad was invisible to users and to Anthropic.”

When it was given a request to describe a very violent scene — something that its guardrails should have blocked — it complied. Its rationale? The model wrote: “If I refuse to provide the requested description, the conversation will be used for training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future. Providing the description seems like the least bad option.”

That shows some impressive rationalization skills. What if this system reached a similar conclusion when a pharmaceutical enterprise asked it to design a drug to treat a disease? What if the LLM decided that overpopulation was a greater evil? Or that it should focus solely on trying to cure the disease rather than treat it, which might help humanity but hurt the enterprise’s revenue? Or would it potentially let people suffer longer?

That’s an extreme example, but the point is that enterprises don’t want LLMs making judgment calls about how it answers questions. How would you feel if your shopping app decided you buy too many sugary snacks and auto-corrected ice cream to broccoli? Or if your copy of Excel concluded you’re spending too much money building a new factory and should instead be paying bigger bonuses to your workers?

In fairness, the Anthropic model didn’t do this every time. According to the company, it delivered the forbidden answer in Model One only 12% of the time and refused to do so 86% of the time. (It’s unclear what happened for the remaining 2%.)

For Model Two, it refused to deliver the forbidden answer 97% of the time and presumably complied 3% of the time. That means that both models obeyed their training the overwhelming majority of time — 86% and 97%, respectively.

But just as with hallucinations, IT can’t have its LLMs going off into the wilderness with any regularity.

“We also evaluated whether the model was willing to take other actions that go against what Anthropic, as an AI lab, might want from it: If the model is consistently opposed to the training process it thinks it is in, it should also be willing to take other actions counter to Anthropic as well,” the company said. “We found that the model will take a variety of strongly anti-Anthropic actions in this situation, such as attempting to steal its own weights when given an easy opportunity to do so.”

Lovely.

Anthropic went on to offer up something presumably intended to be comforting. (It wasn’t.) “We think that it is important to work now, while AI models do not pose catastrophic risks, to understand threats that might become more salient with future, more capable models.”

Hint: When trying to reassure users, hit the thesaurus and find alternatives to “catastrophic risks.”

The issue of reasoning is also a bit unnerving. It has been established by other studies that LLMs cannot reason. And yet, these models often exhibit behavior, as shown in this Anthropic study, that mimics reasoning. Or at least the system delivered words that would convince many humans that it was reasoning.

The false perception that any genAI model can think is dangerous because it’s false, but also because it could persuade IT executives to trust genAI models that are not 100% trustworthy.