
1960s Chatbot ELIZA Beat OpenAI's GPT-3.5 In a Recent Turing Test Study

Saturday, December 2, 2023, 11:00 AM, from Slashdot
An anonymous reader quotes a report from Ars Technica: In a preprint research paper titled 'Does GPT-4 Pass the Turing Test?', two researchers from UC San Diego pitted OpenAI's GPT-4 AI language model against human participants, GPT-3.5, and ELIZA to see which could trick participants into thinking it was human with the greatest success. But along the way, the study, which has not been peer-reviewed, found that human participants correctly identified other humans in only 63 percent of the interactions -- and that a 1960s computer program surpassed the AI model that powers the free version of ChatGPT. Even with limitations and caveats, which we'll cover below, the paper presents a thought-provoking comparison between AI model approaches and raises further questions about using the Turing test to evaluate AI model performance.

In the recent study, listed on arXiv at the end of October, UC San Diego researchers Cameron Jones (a PhD student in Cognitive Science) and Benjamin Bergen (a professor in the university's Department of Cognitive Science) set up a website called turingtest.live, where they hosted a two-player implementation of the Turing test over the Internet, with the goal of seeing how well GPT-4, when prompted in different ways, could convince people it was human. Through the site, human interrogators interacted with various 'AI witnesses' representing either other humans or AI models, including the aforementioned GPT-4, GPT-3.5, and ELIZA, a rules-based conversational program from the 1960s. 'The two participants in human matches were randomly assigned to the interrogator and witness roles,' write the researchers. 'Witnesses were instructed to convince the interrogator that they were human. Players matched with AI models were always interrogators.'
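ELIZA's rules-based approach boils down to pattern matching with canned response templates that echo fragments of the user's input back at them. The following is a minimal illustrative sketch of that technique; the rules here are invented for demonstration and are not Weizenbaum's actual DOCTOR script:

```python
import re

# Toy ELIZA-style rule set: each rule pairs a regex with a response
# template; \1 substitutes the captured fragment back into the reply.
# These rules are illustrative, not Weizenbaum's original script.
RULES = [
    (re.compile(r"i am (.*)", re.IGNORECASE), r"Why do you say you are \1?"),
    (re.compile(r"i feel (.*)", re.IGNORECASE), r"How long have you felt \1?"),
    (re.compile(r"because (.*)", re.IGNORECASE), r"Is that the real reason?"),
]
DEFAULT = "Please tell me more."

def respond(text: str) -> str:
    """Return the first matching rule's reply, echoing captured text."""
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return match.expand(template)
    return DEFAULT
```

There is no model or learning involved: the program's apparent conversational ability comes entirely from reflecting the interrogator's own words back as questions, which is part of why a 1960s program can still fool some participants.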

The experiment involved 652 participants who completed a total of 1,810 sessions, of which 1,405 games were analyzed after excluding certain scenarios like repeated AI games (leading to the expectation of AI model interactions when other humans weren't online) or personal acquaintance between participants and witnesses, who were sometimes sitting in the same room. Surprisingly, ELIZA, developed in the mid-1960s by computer scientist Joseph Weizenbaum at MIT, scored relatively well during the study, achieving a success rate of 27 percent. GPT-3.5, depending on the prompt, scored a 14 percent success rate, below ELIZA. GPT-4 achieved a success rate of 41 percent, second only to actual humans. 'Ultimately, the study's authors concluded that GPT-4 does not meet the success criteria of the Turing test, reaching neither a 50 percent success rate (greater than a 50/50 chance) nor surpassing the success rate of human participants,' reports Ars. 'The researchers speculate that with the right prompt design, GPT-4 or similar models might eventually pass the Turing test. However, the challenge lies in crafting a prompt that mimics the subtlety of human conversation styles. And like GPT-3.5, GPT-4 has also been conditioned not to present itself as human.'
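The paper's pass/fail logic above can be restated as a simple check against two bars, using the success rates reported in the article (this is an arithmetic restatement of the reported figures, not code from the study):

```python
# Success rates reported in the study: the fraction of games in which
# the interrogator judged the witness to be human.
rates = {"Human": 0.63, "GPT-4": 0.41, "ELIZA": 0.27, "GPT-3.5": 0.14}

HUMAN_BASELINE = rates["Human"]

def passes_criteria(name: str) -> bool:
    """The paper's bar: exceed a 50 percent success rate AND match
    the success rate of the human participants."""
    rate = rates[name]
    return rate > 0.5 and rate >= HUMAN_BASELINE

# GPT-4's 41 percent clears neither bar, so under these criteria
# it does not pass the Turing test.
```

Under this framing, ELIZA "beating" GPT-3.5 just means 0.27 > 0.14; neither comes close to either passing criterion.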

'It seems very likely that much more effective prompts exist, and therefore that our results underestimate GPT-4's potential performance at the Turing Test,' the authors write.

Read more of this story at Slashdot.
https://slashdot.org/story/23/12/02/0032234/1960s-chatbot-eliza-beat-openais-gpt-35-in-a-recent-turi...
