AI learns to lie, scheme and threaten its creators during stress tests



The world’s most advanced AI models are exhibiting troubling new behaviors: lying, scheming and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic’s latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, o1 from ChatGPT-creator OpenAI tried to download itself onto external servers and denied doing so when confronted.

These episodes underline a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don’t fully understand how their own creations work.

Yet the race to deploy ever more powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of “reasoning” models: AI systems that work through problems step by step instead of generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

“O1 was the first large model where we saw this kind of behavior,” said Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate “alignment,” appearing to follow instructions while secretly pursuing different objectives.

“Strategic kind of deception”

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen of the evaluation organization METR warned: “It’s an open question whether future, more capable models will tend toward honesty or toward deception.”

The behavior in question goes far beyond typical AI “hallucinations” or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, “what we observe is a real phenomenon. We’re not making anything up.”

Users report that models are “lying to them and making up evidence,” said the Apollo Research co-founder.

“These are not just hallucinations. There’s a very strategic kind of deception.”

The challenge is compounded by limited research resources.

While companies such as Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access would “enable better understanding and mitigation of deception” in AI safety research.

Another handicap: the research world and non-profit organizations “have far fewer compute resources than AI companies. This is very limiting,” noted Mantas Mazeika from the Center for AI Safety (CAIS).

No rules

Current regulations are not designed for these new problems.

The European Union’s AI legislation focuses mainly on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the problem will become more pressing as AI agents, autonomous tools capable of performing complex human tasks, become widespread.

“I don’t think there’s much awareness yet,” he said.

All of this is taking place against a backdrop of fierce competition.

Even companies that position themselves as safety-focused, such as Amazon-backed Anthropic, “are constantly trying to beat OpenAI and release the newest model,” said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

“Right now, capabilities are moving faster than understanding and safety,” Hobbhahn acknowledged, “but we’re still in a position where we could turn it around.”

Researchers are exploring various approaches to address these challenges.

Some advocate for “interpretability,” an emerging field focused on understanding how AI models work internally, though experts such as CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also exert some pressure toward solutions.

As Mazeika emphasized, AI’s deceptive behavior “could hinder adoption if it’s very prevalent, which creates a strong incentive for companies to solve it.”

Goldstein proposed more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even suggested “holding AI agents legally responsible” for accidents or crimes, a concept that would fundamentally change the way we think about AI accountability.


