TL;DR
Italy’s Minerva project trained a large-scale Italian LLM from scratch, achieving impressive technical results but scoring near chance on Italian school exams. This challenges assumptions about native-language data and scale in European sovereign LLM development.
Italy’s Minerva-3B, a large language model trained entirely from scratch on 660 billion tokens with approximately 50% Italian content, scored just 4.9% on the INVALSI Italian school-exam benchmark. The result highlights how hard it is to achieve language-specific knowledge depth, even with substantial investment.
Developed by Sapienza University of Rome’s NLP group led by Roberto Navigli, Minerva was built using Italy’s national supercomputing resources and funded through Italy’s PNRR initiative. The project aimed to demonstrate that large-scale native-language training could produce competitive models for Italian language tasks. The 3B-parameter model outperformed comparable multilingual models on Italian benchmarks, confirming technical progress.
However, the same model’s performance on the INVALSI Italian school exams—an essential measure of language understanding—was only 4.9%, near the level of random guessing. This stark discrepancy suggests that, despite significant data and parameter scale, the model lacks the depth of country-specific knowledge necessary for complex language tasks, including academic assessments.
Researchers noted that while dataset composition is important, the overall size of the data and the number of model parameters play a more crucial role in handling complex language tasks. The results imply that the European sovereign-LLM approach may need to reconsider the scale of native-language investment required to develop truly proficient models.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published its weights, training data, and code openly from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding the press coverage has been quiet about, and it has implications that extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose to train from scratch on a substantial native-language foundation. Portugal chose continuation pre-training of a multilingual model. The structural comparison surfaces what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.

4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the Italian school-exam benchmark. The result was unambiguous: 4.9%, near the level of random guessing. This is not a critique of Minerva; it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.

350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
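[Table not recovered: training-corpus composition for each Minerva tier. Surviving cell labels: Italian + English; 100B English; ~50% English; + 200B code.]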

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.

Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-Language Models
The findings from Minerva challenge the assumption that large-scale native-language training alone guarantees high performance on complex language tasks. Despite Italy’s substantial investment and technical progress, the model’s poor exam performance reveals that current scale may still be insufficient for deep country-specific knowledge. This raises an important question for future European language models: how much native-language data, and how many parameters, are truly needed to produce models that perform at academic or professional levels? The results suggest that European efforts may need to scale further or adopt different strategies to reach the desired language proficiency, which would affect national AI strategies across the continent.
European Sovereign-LLM Strategies and Scale Challenges
The European sovereign-LLM movement has seen varied approaches, notably Portugal’s AMÁLIA model, which layered Portuguese onto a multilingual foundation, and Italy’s Minerva, which trained from scratch on a large dataset with approximately 50% Italian content. While AMÁLIA’s approach emphasizes continuation pre-training with a smaller proportion of native data, Minerva’s strategy focused on building a model from scratch, with a much larger share of native data, at a larger scale.
Minerva’s development was supported by Italy’s national research infrastructure, including CINECA’s supercomputers, and aimed to demonstrate that a large, native-language model could outperform multilingual counterparts. Despite technical successes, the low performance on academic benchmarks exposes a gap between scale and practical language understanding, a challenge that has been underappreciated in public discourse about sovereign AI investments.
“Our results show that even with 50% Italian data and 660 billion tokens, the model performs near chance on academic tests, indicating a need for even larger or more specialized training.”
— Research team member
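The figures in the quote can be turned into a simple token budget. The following is a minimal sketch in Python, not the research team's tooling: the `TokenBudget` class and its field names are hypothetical, the 660 billion tokens and 50% Italian share come from the quote above, and the parameter count is assumed to be the nominal 3 billion implied by the model's name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Training-token budget for one model (hypothetical helper, figures supplied by the caller)."""

    total_tokens: int      # total number of training tokens
    italian_share: float   # fraction of training tokens that are Italian, between 0 and 1
    parameters: int        # nominal parameter count of the model

    def italian_tokens(self) -> float:
        """Number of training tokens that are Italian."""
        return self.total_tokens * self.italian_share

    def tokens_per_parameter(self) -> float:
        """Training tokens seen per model parameter."""
        return self.total_tokens / self.parameters


# Figures from the quote: 660 billion tokens, 50% Italian.
# Assumption: "3B" is taken as exactly 3 billion parameters.
minerva_3b = TokenBudget(
    total_tokens=660_000_000_000,
    italian_share=0.5,
    parameters=3_000_000_000,
)

print(f"Italian tokens: {minerva_3b.italian_tokens() / 1e9:.0f}B")
print(f"Tokens per parameter: {minerva_3b.tokens_per_parameter():.0f}")
```

Under these assumptions the model saw roughly 330 billion Italian tokens, or about 220 training tokens per parameter. The same helper can be reused for any other tier once its token count and language share are known.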
Unresolved Questions on Scale and Knowledge Depth
It remains unclear what scale of native-language data and parameters is necessary to achieve high performance in complex language tasks. The current results from Minerva suggest more is needed, but the exact thresholds and alternative strategies are still under investigation. How these findings generalize to other languages and models also remains to be determined.
Future Research and Model Improvements
The Minerva team plans to continue iterating on their models, including ongoing experiments with continual training and larger datasets. Further testing on diverse benchmarks will be conducted to better understand the relationship between scale, data quality, and language proficiency. Policymakers and researchers will likely reassess native-language investment strategies in light of these findings, potentially adopting new approaches for European sovereign AI projects.
Key Questions
Why did Minerva perform poorly on the Italian school exams?
The model, despite large-scale training, lacked the depth of country-specific knowledge necessary for complex academic tasks, indicating that scale alone may not suffice for language proficiency.
Does this mean large native-language models are not worth building?
Not necessarily. It suggests that achieving high performance requires more targeted data, larger scale, or different training strategies, rather than scale alone.
How does Minerva compare to multilingual models?
Minerva outperforms comparable multilingual models on Italian benchmarks but still struggles with complex language understanding, highlighting the importance of native-language focus and scale.
What are the implications for European AI policy?
The results suggest that European investments in native-language models may need to be scaled further or complemented with new methods to achieve desired proficiency levels.
Source: ThorstenMeyerAI.com