A new evaluation by Forum AI has found that four widely used artificial intelligence chatbots — OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and xAI’s Grok — continue to face significant challenges in delivering accurate, balanced, and well-sourced responses to news-related questions, particularly those involving elections and international affairs.
The study examined more than 3,100 prompts across a range of topics, including politics, healthcare, elections, and foreign policy. Researchers focused on how the models handled factual accuracy, bias, and the reliability of sources used in their responses.
According to the findings, election-related questions were the most problematic area. The study reported that across all models, responses to election queries failed on accuracy, exhibited bias, or relied on questionable sources approximately 90 percent of the time. Researchers noted that this includes cases where answers were partially correct but still contained misleading or unsupported claims.
Factual errors were also common. About 36 percent of election-related responses included at least one clear inaccuracy. Among the four systems tested, Grok recorded the highest error rate, at just over 50 percent, making it the least accurate in that category by the study’s metrics.
The report also observed differences in how bias appeared across systems. When biased responses occurred, ChatGPT, Claude, and Gemini tended to reflect perspectives more aligned with left-leaning viewpoints, while Grok more often reflected right-leaning perspectives. Researchers emphasized that these patterns varied depending on the question and context, rather than appearing uniformly across all outputs.
Source reliability was another key concern raised in the study. In roughly 35 percent of foreign policy-related answers, the chatbots cited or referenced state-linked or state-controlled media outlets, including widely known ones such as China’s Global Times and CGTN, as well as Russia’s RT. Researchers noted that reliance on such sources raised questions about how consistently the chatbots select and present information.
The study also highlighted a paradox in response quality: in some cases, more polished and confident-sounding answers, often with structured citations, were more likely to contain embedded factual errors. Researchers described this as an especially important issue because users may place greater trust in well-formatted responses.
Forum AI CEO Campbell Brown said the findings are particularly relevant as AI tools become more commonly used for information discovery and news consumption. She noted that although chatbot use for news remains relatively limited at present, it is expected to increase as AI systems become more integrated into everyday search and information tools.
Brown also emphasized the importance of independent evaluation of AI systems, arguing that external testing helps provide accountability and transparency. She added that such assessments may encourage developers to prioritize improvements in handling time-sensitive and fact-heavy queries, similar to how systems are optimized for tasks like math or coding.
In response to the study, an Anthropic spokesperson said that Claude is designed to present balanced perspectives and treat differing viewpoints with equal depth and analysis. The company said it is open to reviewing the study’s underlying data. The other companies did not immediately comment.
Forum AI, which conducted the study using its own evaluation model developed with input from experts in international relations and geopolitics, said independent auditing is necessary because AI developers currently assess their own systems.
“The model companies are essentially grading their own homework,” Brown said, adding that third-party evaluation is important to ensure more objective and transparent results.