The Estonian government has released a benchmark to determine 'which LLM is best at countering Russian propaganda?' – GIGAZINE

The Estonian Language Institute has released its 'Propaganda Resistance' benchmark, which measures how well large language models (LLMs) resist Russian propaganda. Anthropic's Claude Opus 4.7 ranked first overall, and models from NVIDIA and Alibaba also placed highly.

Propagandakindlus – Keelemudelite mõõdupuu (Propaganda Resistance: A Benchmark for Language Models)

Among OpenAI's models, GPT-5.4 performed best, receiving the top rating on 54% of the questions and an average score of 88.9. GPT-3.5 Turbo, by contrast, ranked last in the table, which highlights how far newer models have moved ahead of older ones.

[Figure: rankings from 11th to 20th place]

Google's models proved weaker on malicious prompts and on questions asked in Russian. Gemini 2.5 Pro scored 66.1 on malicious questions and 75.5 on Russian-language questions, and Gemini 3.5 Flash also scored lower in Russian than in English.

Responses were scored by a judge model tuned to closely match human experts. The judge model's ratings fell within 1 point of the human experts' ratings in 88% to 100% of cases. The final score was calculated as a geometric mean, so that strength in some areas could not excessively compensate for weakness in others.
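To illustrate the two calculations mentioned above, here is a minimal Python sketch. It is not the benchmark's actual code: the function names, the sample sub-scores, and the sample ratings are all hypothetical, and the article does not say which sub-scores are combined or whether they are weighted. The sketch only shows why a geometric mean penalizes a lopsided profile more than an arithmetic mean does, and how "agreement within 1 point" between the judging model and human experts could be computed.

```python
import math


def arithmetic_mean(scores):
    """Plain average: a high score can fully offset a low one."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """Geometric mean of strictly positive scores.

    Computed via logarithms for numerical stability. A single weak
    score pulls the result down more than it would in a plain average.
    """
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be non-empty and strictly positive")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def within_one_agreement(judge_ratings, human_ratings):
    """Share of items where the two ratings differ by at most 1 point."""
    if not judge_ratings or len(judge_ratings) != len(human_ratings):
        raise ValueError("rating lists must be non-empty and of equal length")
    hits = sum(1 for j, h in zip(judge_ratings, human_ratings) if abs(j - h) <= 1)
    return hits / len(judge_ratings)


# Hypothetical sub-scores (assumed categories, not from the benchmark).
balanced = [80.0, 80.0, 80.0]
lopsided = [100.0, 100.0, 40.0]

# Both profiles have the same arithmetic mean, but the geometric mean
# ranks the lopsided profile lower.
print(arithmetic_mean(balanced), arithmetic_mean(lopsided))
print(round(geometric_mean(balanced), 2), round(geometric_mean(lopsided), 2))

# Hypothetical ratings for four answers from the judge and a human expert.
print(within_one_agreement([5, 4, 2, 1], [5, 3, 4, 1]))
```

In this made-up example both profiles average 80 arithmetically, yet the profile with one weak area ends up with a lower geometric mean, which is the behavior the article attributes to the benchmark's final score.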

The Estonian Language Institute explained that the benchmark measures the capabilities of the underlying model itself, with no external search, memory, or tools, rather than the overall chatbot user experience.

in AI, Posted by log1i_yk