OpenAI's In-House Initiative Explores Stopping an AI From Going Rogue - With More AI

Sunday, December 17, 2023, 09:17 PM, from Slashdot

MIT Technology Review reports that OpenAI 'has announced the first results from its superalignment team, the firm's in-house initiative dedicated to preventing a superintelligence — a hypothetical future computer that can outsmart humans — from going rogue.'

Unlike many of the company's announcements, this heralds no big breakthrough. In a low-key research paper, the team describes a technique that lets a less powerful large language model supervise a more powerful one — and suggests that this might be a small step toward figuring out how humans might supervise superhuman machines....

Many researchers still question whether machines will ever match human intelligence, let alone outmatch it. OpenAI's team takes machines' eventual superiority as given. 'AI progress in the last few years has been just extraordinarily rapid,' says Leopold Aschenbrenner, a researcher on the superalignment team. 'We've been crushing all the benchmarks, and that progress is continuing unabated.' For Aschenbrenner and others at the company, models with human-like abilities are just around the corner. 'But it won't stop there,' he says. 'We're going to have superhuman models, models that are much smarter than us. And that presents fundamental new technical challenges.'

In July, [Ilya] Sutskever and fellow OpenAI scientist Jan Leike set up the superalignment team to address those challenges. 'I'm doing it for my own self-interest,' Sutskever told MIT Technology Review in September. 'It's obviously important that any superintelligence anyone builds does not go rogue. Obviously....'

Instead of looking at how humans could supervise superhuman machines, they looked at how GPT-2, a model that OpenAI released five years ago, could supervise GPT-4, OpenAI's latest and most powerful model. 'If you can do that, it might be evidence that you can use similar techniques to have humans supervise superhuman models,' says Collin Burns, another researcher on the superalignment team... The results were mixed. The team measured the gap in performance between GPT-4 trained on GPT-2's best guesses and GPT-4 trained on correct answers. They found that GPT-4 trained by GPT-2 performed 20% to 70% better than GPT-2 on the language tasks but did less well on the chess puzzles.... They conclude that the approach is promising but needs more work...
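The excerpt does not spell out how the gap is turned into a number. As a rough illustration only, one plausible way to summarize it is the fraction of the distance between the weak supervisor and the fully supervised strong model that the weakly supervised strong model closes. The function name and the accuracy figures below are hypothetical and are not taken from the article or from OpenAI's results.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised strong model.

    weak            -- accuracy of the weak supervisor (e.g. a GPT-2-class model)
    weak_to_strong  -- accuracy of the strong model trained on the weak model's labels
    strong_ceiling  -- accuracy of the strong model trained on correct answers

    Returns 0.0 if the strong student is no better than its weak supervisor,
    and 1.0 if it matches the strong model trained on correct answers.
    """
    gap = strong_ceiling - weak
    if gap == 0:
        raise ValueError("weak and ceiling accuracies are equal; the gap is undefined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to make the arithmetic easy to follow.
weak_acc = 0.60        # weak supervisor
student_acc = 0.72     # strong model trained on the weak supervisor's guesses
ceiling_acc = 0.80     # strong model trained on correct answers

pgr = performance_gap_recovered(weak_acc, student_acc, ceiling_acc)
print(f"gap recovered: {pgr:.0%}")
```

With these made-up numbers the student closes a bit more than half of the gap, which is the kind of "promising but needs more work" result the excerpt describes. A value near zero would mean the strong model simply imitated its weak supervisor's mistakes.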

Alongside this research update, the company announced a new $10 million money pot that it plans to use to fund people working on superalignment. It will offer grants of up to $2 million to university labs, nonprofits, and individual researchers and one-year fellowships of $150,000 to graduate students.
