The AI chatbot space has become increasingly crowded since ChatGPT’s surge in popularity following its November 2022 launch. With so many ChatGPT alternatives available, it can be challenging to decide which one to use. The Large Model Systems Organization (LMSYS Org), an open research organization founded by students and faculty from the University of California, Berkeley, has created the Chatbot Arena to make comparing these chatbots easier.
The Chatbot Arena is a benchmark platform for Large Language Models (LLMs). Users can put two randomly selected models to the test by entering a prompt and voting for the better answer without knowing which LLM produced which response. After voting, users see which LLMs generated the two outputs. The accumulated votes are used to rank the LLMs on a leaderboard based on an Elo rating system, the same method widely used to rank chess players, according to LMSYS Org.
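To illustrate how pairwise votes can turn into a leaderboard, here is a minimal Elo-update sketch in Python. It is not LMSYS Org's actual implementation; the K-factor of 32 and the starting rating of 1000 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: a user votes that "Model B" (vicuna-7b) gave the better answer.
ratings = {"vicuna-7b": 1000.0, "gpt4all-13b-snoozy": 1000.0}
ratings["vicuna-7b"], ratings["gpt4all-13b-snoozy"] = update_elo(
    ratings["vicuna-7b"], ratings["gpt4all-13b-snoozy"], a_won=True
)
print(ratings)  # vicuna-7b gains points; gpt4all-13b-snoozy loses the same amount
```

The key property, as in chess, is that beating a higher-rated opponent earns more points than beating a lower-rated one, so ratings converge toward each model's true relative strength as votes accumulate.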
To test the Chatbot Arena, I used the prompt, “Can you write me an email telling my boss that I will be out because I am going on a vacation that was planned months ago.” The two responses were very different: one provided far more context and length, along with fill-in-the-blank fields that would have been appropriate for the email. After picking “Model B” as the winner, I found out it was “vicuna-7b,” the LLM created by LMSYS Org and based on Meta’s LLaMA model. The losing LLM was “gpt4all-13b-snoozy,” developed by Nomic AI and fine-tuned from LLaMA 13B.
The leaderboard places GPT-4, OpenAI’s most advanced LLM, in first place with an Arena Elo rating of 1227. In second place is Claude-v1, an LLM developed by Anthropic. GPT-4 powers both Bing Chat and ChatGPT Plus, making those two chatbots the best available right now, which aligns with ZDNET’s own AI chatbot rankings. Anthropic’s second-ranked Claude is not available to the public just yet, but it does have a waitlist where users can sign up for early access.
Ranked eighth on the leaderboard is PaLM-Chat-Bison-001, a model in the PaLM 2 family, the LLM behind Google Bard. That ranking parallels the general sentiment around Bard: not the worst, but not one of the best. The Chatbot Arena site also lets users pick the two specific models they want to compare, a feature that is helpful for experimenting with particular LLMs.
As the AI chatbot space continues to evolve, the Chatbot Arena is a valuable resource for those looking to compare different LLMs.