- **Project**: Narrative Understanding and Interpretability of LLMs -- **Supervisor**: Prof. [Roberto Navigli](https://www.diag.uniroma1.it/navigli/)
- `2024-2026` -- **Teaching**: TA for the Master Course [Multilingual Natural Language Processing](https://naviglinlp.blogspot.com/2025/) held at Sapienza
- `2019-2023` -- **MSc in Engineering in Computer Science**, [Sapienza University](https://www.uniroma1.it/en/pagina-strutturale/home), Rome 🇮🇹
- **Master Thesis**: "*Structured Information Representation for Long-Document Summarization*", supervised by Prof. [Roberto Navigli](https://www.diag.uniroma1.it/navigli/) and [Fabrizio Silvestri](https://sites.google.com/diag.uniroma1.it/fabriziosilvestri)
- `2020` -- **Erasmus** at [Örebro Universitet](https://www.oru.se/english/), Örebro (**\[œrɛˈbruː\]**) 🇸🇪
# Publications
## Preprints
Can't spoiler them yet! :eyes:

## 2026
- <u>Luca Gioffré</u>\*, Luca Moroni, Alberte Fernández-Castro, Elena Marafatto, Giacomo Garufi, and Roberto Navigli. 2026. **INDAQA2 - A Large Italian Narrative QA Benchmark: A CALAMITA 2026 Challenge.** In *Proceedings of the 9th evaluation campaign EVALITA 2026*, pages xx-xx, Bari, Italy. CEUR Workshop Proceedings.<br>
## 2025

- Luca Moroni\*, Tommaso Bonomo, <u>Luca Gioffré</u>, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, and Roberto Navigli. 2025. **What we Learned from Continually Training Minerva: a Case Study on Italian.** In *Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2025)*, pages 760–774, Cagliari, Italy. CEUR Workshop Proceedings.<br>
<details>
*We explore continual pretraining strategies to improve Italian-language performance using Minerva by testing different data mixtures (mathematical, encyclopedic, and narrative) and extended context windows.*
*We introduce INDAQA, a new Italian narrative QA benchmark, and find that both data composition and longer context significantly enhance performance on Italian tasks.*
*We also convert the [ITALIC](https://aclanthology.org/2025.naacl-long.68/) benchmark from multiple-choice (MC) to open-ended (OE) format to disentangle whether models struggle with format adherence or with recalling cultural knowledge.*</details>
- Tommaso Bonomo\*, <u>Luca Gioffré</u>\*, and Roberto Navigli. 2025. **<span style="font-variant:small-caps;">LiteraryQA</span>: Towards Effective Evaluation of Long-document Narrative QA.** In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 34086–34107, Suzhou, China. Association for Computational Linguistics.<br>
[]({% link _posts/2025-05-22-LQA.md %}) <details>
*We introduce LiteraryQA, a high-quality subset of [NarrativeQA](https://aclanthology.org/Q18-1023/) addressing the benchmark's reliability issues through systematic cleaning of documents and validation of question-answer pairs.*
*Our meta-evaluation reveals that traditional n-gram metrics poorly correlate with human judgment, while LLM-based evaluation, even using smaller open-weight models, achieves strong agreement with human rankings.*
*We provide benchmark results for state-of-the-art long-context LLMs and establish best practices for evaluating narrative question answering systems.*</details>
- Francesco Maria Molfese, Luca Moroni, <u>Luca Gioffré</u>, Alessandro Scirè, Simone Conia, and Roberto Navigli. 2025. **Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering.** In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 18477–18494, Vienna, Austria. Association for Computational Linguistics.<br>
## 22/08/2025 - Paper accepted at EMNLP 2025!
Our paper, **<span style="font-variant:small-caps;">LiteraryQA</span>: Towards Effective Evaluation of Long-document Narrative QA**, has been accepted to the [EMNLP Main Conference 2025](https://2025.aclweb.org/)!
👏 Huge thanks to my co-authors Tommaso Bonomo and Roberto Navigli.
Our paper, **What we Learned from Continually Training Minerva: a Case Study on Italian**, has been accepted to the [CLiC-it Conference 2025](https://clic2025.unica.it/)!
👏 Huge thanks to my co-authors Luca Moroni, Tommaso Bonomo, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, and Roberto Navigli.

See you in [Cagliari](https://www.openstreetmap.org/relation/39837)! 🇮🇹
## What We Learned from Continually Training Minerva: Insights for Italian LLM Development
Training large language models for less-represented languages presents unique challenges. In this work, we investigated how different data recipes and context length extensions affect Italian LLM performance.

We used [Minerva-7B](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0), a fully open-source bilingual model, pretrained on 50% Italian and 50% English content, to test three data recipes during continual pretraining: mathematical, encyclopedic, and copyrighted literary content from both Italian and English. We also explored extending the model's context window to handle longer documents.

To evaluate long-context understanding, we created **INDAQA**, the <u>I</u>talian <u>N</u>arrative <u>Da</u>taset for <u>Q</u>uestion-<u>A</u>nswering, the first narrative long-context benchmark for Italian.

**Our Key Findings**:

1. *Context Extension Beats Brute Force*:
   Extending Minerva's context window to handle chapter- or book-length texts achieved state-of-the-art performance on long Italian documents. Our models outperformed both Italian-adapted models fine-tuned from English foundations and models trained on trillions more tokens.
   The takeaway: strategic continual pretraining on well-designed Italian data can compete with, and even surpass, the brute-force approach of adapting massive English-centric models.
2. *Multiple-Choice Tests Mislead on Cultural Knowledge*:
   When testing cultural knowledge using multiple-choice questions, results were misleading: models could score well through pattern matching without genuine understanding.
   But with open-ended question answering, where models generate free-form responses, Minerva excelled and surpassed all competitors. For fair evaluation of language-specific capabilities, we need formats that truly test comprehension and generation.

We contribute INDAQA to the community and demonstrate the importance of evaluation format when assessing language-specific models.