Korean AI models lag behind overseas rivals even in domestic exam math tests
Published: 15 Dec. 2025, 14:11 Updated: 15 Dec. 2025, 14:39
Korea’s leading homegrown AI models fell far short of top overseas rivals like ChatGPT and DeepSeek when tested on college entrance exam math and advanced essay problems, a new academic study showed on Monday.
The findings come as the country pushes to develop a sovereign AI system trained primarily on Korean language and domestic data to operate independently of foreign platforms.
A research team led by Kim Jon-lark, a mathematics professor at Sogang University, evaluated flagship large language models (LLMs) from five domestic teams participating in the government-backed proprietary AI model initiative alongside five models developed in the United States and China.
The team tested the models using 20 high-difficulty math questions from the College Scholastic Ability Test (CSAT), covering the common subjects as well as probability and statistics, calculus and geometry. It also administered 30 essay-style math problems drawn from past exams at 10 Korean universities, Indian university entrance exams and graduate engineering entrance exams at Japan’s University of Tokyo, for a total of 50 questions.
The domestic LLMs included Upstage’s Solar Pro-2, LG AI Research’s Exaone 4.0.1, Naver’s HCX-007, SK Telecom’s A.X 4.0 (72B) and NCsoft’s lightweight Llama Varco 8B Instruct. The foreign models were GPT-5.1, Gemini 3 Pro Preview, Claude Opus 4.5, Grok 4.1 Fast and DeepSeek V3.2.
Non-Korean models recorded scores ranging from 76 to 92 points. Among Korean models, Solar Pro-2 scored 58 points, while most of the remaining domestic models stayed in the 20-point range. Llama Varco 8B Instruct recorded the lowest score at 2 points.
The researchers said the performance gap remained wide even though domestic models were allowed to use Python tools when simple reasoning alone was insufficient to solve the problems.
The team conducted additional tests using 10 questions selected from EntropyMath, a proprietary set of 100 problems spanning difficulty levels from undergraduate coursework to professor-level research. In that evaluation, non-Korean models scored between 82.8 and 90 points, while domestic models scored between 7.1 and 53.3 points.
In a separate experiment allowing up to three attempts per question, with models passing if they reached the correct answer within those tries, Grok achieved a perfect score and other non-Korean models scored 90 points. Solar Pro-2 led domestic models with 70 points, followed by Exaone with 60, HCX-007 with 40, A.X 4.0 with 30 and Llama Varco 8B Instruct with 20.
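The pass-within-three-attempts rule described above can be illustrated with a minimal Python sketch. The article does not say how answers were compared or how points were scaled, so the exact-match comparison, the equal weighting of questions and the 100-point scale are all assumptions made here for illustration, and the function names and sample data are hypothetical rather than the research team’s own.

```python
from typing import Mapping, Sequence


def passed_within(attempts: Sequence[str], correct: str, max_attempts: int = 3) -> bool:
    """Return True if any of the first `max_attempts` answers matches `correct`.

    Assumption: answers are compared as whitespace-trimmed strings.
    """
    return any(a.strip() == correct.strip() for a in attempts[:max_attempts])


def score_model(
    attempts_by_question: Mapping[str, Sequence[str]],
    answer_key: Mapping[str, str],
    max_attempts: int = 3,
) -> float:
    """Score a model out of 100, weighting every question equally (an assumption)."""
    if not answer_key:
        raise ValueError("answer_key must contain at least one question")
    passed = sum(
        passed_within(attempts_by_question.get(qid, ()), correct, max_attempts)
        for qid, correct in answer_key.items()
    )
    return 100.0 * passed / len(answer_key)


# Hypothetical data: 10 questions whose correct answers are "1" through "10".
answer_key = {f"q{i}": str(i) for i in range(1, 11)}
attempts = {
    "q1": ["1"],                 # correct on the first try
    "q2": ["0", "2"],            # correct on the second try
    "q3": ["0", "0", "3"],       # correct on the third try
    "q4": ["0", "0", "0", "4"],  # correct only on a fourth try, so it does not count
}
print(score_model(attempts, answer_key))  # 30.0
```

Under this scheme a model is not penalized for wrong early attempts, which is one reason scores from such a test can be higher than scores from a single-attempt test.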
“We conducted the test after receiving many questions about whether sovereign AI models had been evaluated on CSAT-level mathematics,” Kim said. “The results showed that the gap with overseas frontier models remains substantial.”
The researchers said they used publicly released versions of the domestic models and plan to repeat the evaluation using internally developed problems once newer versions of the AI models are released.
Kim said the team has built a mathematics leaderboard based on EntropyMath and aims to expand it internationally.
“We plan to enhance our problem-generation algorithms and pipelines to contribute to the development of domain-specific datasets across areas such as science, manufacturing and culture,” he said.
The study was conducted with joint support from Sogang University’s Institute for Mathematical and Data Sciences and AI startup DeepFountain.
This article was originally written in Korean and translated by a bilingual reporter with the help of generative AI tools. It was then edited by a native English-speaking editor. All AI-assisted translations are reviewed and refined by our newsroom.
