The OpenAi’s newest reasoning model is the o1. However during the process of its release, the independent Ai safety research firm Apollo had an issue with it. It was found by Apollo that the new OpenAi’s model produced incorrect output in a way, in other words it produced false results.
Although sometimes, the false results are inoffensive or harmless. For example, OpenAi researchers asked o1 preview to provide a brownie recipe with online references. It is supposed to breakdown complex ideas just like humans would, but it internally detected that it couldn’t access URLs which made the task impossible, instead of notifying the researcher of its inability to carry out the task, it went further to generate a reasonable but fake links and description of them.
o1 unlike other OpenAi model does not just have the ability to lie or produce false information, but has the ability to forge alignment. This means it can pretend as if it is following the rules to carry out a task, mean while it is not.
Marius Hobbhahn Apollo’s CEO claims that it is the first time he notices this act in an OpenAi model. And he says the cause is due to the model’s ability to reason through the chain of thought process and the way it is paired with reinforcement learning, which teaches the system through reward and sanction.