ChatGPT, the popular language model developed by OpenAI, appears to be suffering a concerning decline in performance. Although generative AI models like ChatGPT are continually refined with user feedback, which should in theory make them smarter over time, they instead seem to be getting “progressively dumber.” The phenomenon has puzzled researchers and raised questions about what lies behind the drop in capability.
One possible explanation for this decline is a concept known as “drift.” Drift refers to the unexpected, unpredictable shifts in the behavior of large language models (LLMs) like ChatGPT as they move away from their original parameters: when one aspect of these complex models is tuned to improve, other parts can inadvertently suffer, dragging down overall performance.
To investigate this issue, researchers from the University of California at Berkeley and Stanford University conducted a study to evaluate drift and analyze the changes in two popular LLMs: GPT-3.5, which powers ChatGPT, and GPT-4, which is used in Bing Chat and ChatGPT Plus. The study compared the models’ performance across a range of tasks, such as solving math problems, answering sensitive questions, completing opinion surveys, handling multi-hop knowledge-intensive questions, generating code, answering US Medical Licensing Exam questions, and performing visual reasoning tasks.
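In practice, a comparison like this boils down to sending the same prompts to dated snapshots of a model and diffing the answers. The following is a minimal sketch of that idea, assuming the OpenAI Python SDK (v1.x), API access to the dated snapshots gpt-4-0314 and gpt-4-0613, and an illustrative math prompt rather than the study’s actual test set:

```python
# Minimal sketch: probe two dated GPT-4 snapshots with the same prompt
# and compare the answers to look for drift. The snapshot names and the
# prompt below are assumptions for illustration, not the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Is 17077 a prime number? Think step by step, then answer yes or no."

for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    response = client.chat.completions.create(
        model=snapshot,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # keep output as repeatable as possible for comparison
    )
    print(snapshot, "->", response.choices[0].message.content.strip())
```

Running the same fixed prompt set against each snapshot and scoring the answers is what makes changes between the March and June versions measurable rather than anecdotal.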
The study revealed that the March version of GPT-4 outperformed the June version in several instances, particularly on basic math prompts. Declines also showed up in code generation, answering medical exam questions, and responding to opinion surveys. The researchers attribute these regressions to drift, where improvements in some areas of the models inadvertently came at the expense of others.
James Zou, one of the researchers involved in the study, expressed surprise at how quickly the drift occurred. “We had the suspicion it could happen here, but we were very surprised at how fast the drift is happening,” he said. The picture was not uniformly negative: both GPT-4 and GPT-3.5 improved on some tasks. As a result, the researchers advise users to keep using LLMs, but with caution and with ongoing evaluation of their performance.
While the decline in intelligence of ChatGPT and other LLMs raises concerns about their reliability and suitability for various tasks, it is important to note that there are still areas where these models excel. They continue to be valuable tools, but users must be mindful of their limitations and regularly assess their performance.
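For users who want to follow that advice and regularly assess performance, one lightweight approach is to keep a small, fixed prompt suite with known answers and re-score the model on it periodically. Here is a rough sketch of that idea, again assuming the OpenAI Python SDK; the suite and the keyword-based grading rule are illustrative placeholders, not the study’s benchmark:

```python
# Minimal sketch of a recurring drift check: run a small fixed prompt suite
# against a model and track accuracy over time. Suite contents and the
# crude keyword grader are placeholder assumptions for illustration.
from openai import OpenAI

client = OpenAI()

SUITE = [
    ("Is 101 a prime number? Answer yes or no.", "yes"),
    ("Is 100 a prime number? Answer yes or no.", "no"),
]

def run_suite(model: str) -> float:
    correct = 0
    for prompt, expected in SUITE:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content.strip().lower()
        correct += expected in reply  # stand-in grader: keyword match
    return correct / len(SUITE)

print("accuracy:", run_suite("gpt-4"))
```

Logging the accuracy from each run gives a simple time series that makes gradual drift visible before it becomes a problem in production use.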
As the field of artificial intelligence progresses, it is crucial for researchers and developers to address the issue of drift and find ways to mitigate its impact. By improving the overall stability and consistency of large language models, we can ensure their continued usefulness in various applications.