Apple AI Researchers Boast Useful On-Device Model That 'Substantially Outperforms' GPT-4
Monday April 1, 2024, 10:40 PM, from Slashdot/Apple
Zac Hall reports via 9to5Mac: In a newly published research paper (PDF), Apple's AI researchers describe a system in which Siri can do much more than try to recognize what's in an image. The best part? They report that one of their models for doing this benchmarks better than GPT-4. In the paper (ReALM: Reference Resolution As Language Modeling), Apple describes something that could give a large language model-enhanced voice assistant a usefulness boost. ReALM takes into account both what's on your screen and what tasks are active. If it works well, that sounds like a recipe for a smarter and more useful Siri.
Apple also sounds confident in its ability to complete such a task with impressive speed. Benchmarking is done against OpenAI's GPT-3.5 and GPT-4: 'As another baseline, we run the GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022) and GPT-4 (Achiam et al., 2023) variants of ChatGPT, as available on January 24, 2024, with in-context learning. As in our setup, we aim to get both variants to predict a list of entities from a set that is available. In the case of GPT-3.5, which only accepts text, our input consists of the prompt alone; however, in the case of GPT-4, which also has the ability to contextualize on images, we provide the system with a screenshot for the task of on-screen reference resolution, which we find helps substantially improve performance.'

So how does Apple's model do? 'We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.'

Substantially outperforming it, you say? The paper concludes in part as follows: 'We show that ReaLM outperforms previous approaches, and performs roughly as well as the state-of-the-art LLM today, GPT-4, despite consisting of far fewer parameters, even for onscreen references despite being purely in the textual domain. It also outperforms GPT-4 for domain-specific user utterances, thus making ReaLM an ideal choice for a practical reference resolution system that can exist on-device without compromising on performance.'
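The core idea the paper describes, reference resolution recast as a language-modeling task, is to serialize the on-screen entities into text and ask a model to predict which entities a user utterance refers to. Below is a minimal, hypothetical sketch of that framing; the `Entity` class, prompt layout, and the toy keyword-based resolver (which stands in for the actual LLM) are illustrative assumptions, not ReALM's real representation.

```python
# Hypothetical sketch: "reference resolution as language modeling".
# On-screen entities are serialized into a textual prompt, and a model is
# asked to return the ids of the entities the utterance refers to.
# A toy keyword matcher stands in for the LLM here.

from dataclasses import dataclass


@dataclass
class Entity:
    id: int
    kind: str   # e.g. "phone_number", "address", "business" (assumed labels)
    text: str   # the entity as displayed on screen


def build_prompt(entities, utterance):
    """Serialize screen state plus the user request into a single text prompt."""
    lines = ["On-screen entities:"]
    for e in entities:
        lines.append(f"  [{e.id}] ({e.kind}) {e.text}")
    lines.append(f'User request: "{utterance}"')
    lines.append("Answer with the ids of the referenced entities.")
    return "\n".join(lines)


def toy_resolver(entities, utterance):
    """Stand-in for the LLM: pick entities whose kind appears in the request."""
    request = utterance.lower()
    return [e.id for e in entities if e.kind.replace("_", " ") in request]


screen = [
    Entity(1, "business", "Joe's Pizza"),
    Entity(2, "phone_number", "555-0123"),
    Entity(3, "address", "123 Main St"),
]
prompt = build_prompt(screen, "Call the phone number shown")
resolved = toy_resolver(screen, "Call the phone number shown")
```

Because the screen is reduced to plain text, the same pipeline works for a text-only model, which is why the paper can compare against text-only GPT-3.5 while giving GPT-4 an actual screenshot.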
https://apple.slashdot.org/story/24/04/01/1959205/apple-ai-researchers-boast-useful-on-device-model-...