- **Project**: Narrative Understanding and Interpretability of LLMs -- **Supervisor**: Prof. [Roberto Navigli](https://www.diag.uniroma1.it/navigli/)
- `2024-2026` -- **Teaching**: TA for the Master Course [Multilingual Natural Language Processing](https://naviglinlp.blogspot.com/2025/) held at Sapienza
- `2019-2023` -- **MSc in Engineering in Computer Science**, [Sapienza University](https://www.uniroma1.it/en/pagina-strutturale/home), Rome 🇮🇹
- **Master Thesis**: "*Structured Information Representation for Long-Document Summarization*", supervised by Prof. [Roberto Navigli](https://www.diag.uniroma1.it/navigli/) and [Fabrizio Silvestri](https://sites.google.com/diag.uniroma1.it/fabriziosilvestri)
- `2020` -- **Erasmus** at [Örebro Universitet](https://www.oru.se/english/), Örebro (**\[œrɛˈbruː\]**) 🇸🇪
# Publications
## Preprints
Can't spoiler them yet! :eyes:

## 2026
- <u>Luca Gioffré</u>\*, Luca Moroni, Alberte Fernández-Castro, Elena Marafatto, Giacomo Garufi, and Roberto Navigli. 2026. **INDAQA2 - A Large Italian Narrative QA Benchmark: A CALAMITA 2026 Challenge.** In *Proceedings of the 9th evaluation campaign EVALITA 2026*, pages xx-xx, Bari, Italy. CEUR Workshop Proceedings.<br>
## 2025

- Luca Moroni\*, Tommaso Bonomo, <u>Luca Gioffré</u>, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, and Roberto Navigli. 2025. **What we Learned from Continually Training Minerva: a Case Study on Italian.** In *Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2025)*, pages 760–774, Cagliari, Italy. CEUR Workshop Proceedings.<br>
<details>
*We explore continual pretraining strategies to improve Italian-language performance using Minerva by testing different data mixtures (mathematical, encyclopedic, and narrative) and extended context windows.*
*We introduce INDAQA, a new Italian narrative QA benchmark, and find that both data composition and longer context significantly enhance performance on Italian tasks.*
*We also convert the [ITALIC](https://aclanthology.org/2025.naacl-long.68/) benchmark from multiple-choice (MC) to open-ended (OE) format to disentangle whether models struggle with format adherence or with recalling cultural knowledge.*</details>
- Tommaso Bonomo\*, <u>Luca Gioffré</u>\*, and Roberto Navigli. 2025. **<span style="font-variant:small-caps;">LiteraryQA</span>: Towards Effective Evaluation of Long-document Narrative QA.** In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 34086–34107, Suzhou, China. Association for Computational Linguistics.<br>
[]({% link _posts/2025-05-22-LQA.md %}) <details>
*We introduce LiteraryQA, a high-quality subset of [NarrativeQA](https://aclanthology.org/Q18-1023/) addressing the benchmark's reliability issues through systematic cleaning of documents and validation of question-answer pairs.*
*Our meta-evaluation reveals that traditional n-gram metrics poorly correlate with human judgment, while LLM-based evaluation, even using smaller open-weight models, achieves strong agreement with human rankings.*
*We provide benchmark results for state-of-the-art long-context LLMs and establish best practices for evaluating narrative question answering systems.*</details>
- Francesco Maria Molfese, Luca Moroni, <u>Luca Gioffré</u>, Alessandro Scirè, Simone Conia, and Roberto Navigli. 2025. **Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering.** In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 18477–18494, Vienna, Austria. Association for Computational Linguistics.<br>
## 22/08/2025 - Paper accepted at EMNLP 2025!
Our paper, **<span style="font-variant:small-caps;">LiteraryQA</span>: Towards Effective Evaluation of Long-document Narrative QA**, has been accepted to the [EMNLP Main Conference 2025](https://2025.aclweb.org/)!
👏 Huge thanks to my co-authors Tommaso Bonomo and Roberto Navigli.
Our paper, **What we Learned from Continually Training Minerva: a Case Study on Italian**, has been accepted to the [CLiC-it Conference 2025](https://clic2025.unica.it/)!
👏 Huge thanks to my co-authors Luca Moroni, Tommaso Bonomo, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, and Roberto Navigli.

See you in [Cagliari](https://www.openstreetmap.org/relation/39837)! 🇮🇹
## What We Learned from Continually Training Minerva: Insights for Italian LLM Development
Training large language models for less-represented languages presents unique challenges. In this work, we investigated how different data recipes and context length extensions affect Italian LLM performance.

We used [Minerva-7B](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0), a fully open-source bilingual model, pretrained on 50% Italian and 50% English content, to test three data recipes during continual pretraining: mathematical, encyclopedic, and copyrighted literary content from both Italian and English. We also explored extending the model's context window to handle longer documents.

To evaluate long-context understanding, we created **INDAQA**, the <u>I</u>talian <u>N</u>arrative <u>Da</u>taset for <u>Q</u>uestion-<u>A</u>nswering, the first narrative long-context benchmark for Italian.

**Our Key Findings**:

1. *Context Extension Beats Brute Force*:
   Extending Minerva's context window to handle chapter- or book-length texts achieved state-of-the-art performance on long Italian documents. Our models outperformed both Italian-adapted models fine-tuned from English foundations and models trained on trillions more tokens.
   The takeaway: strategic continual pretraining on well-designed Italian data can compete with, and even surpass, the brute-force approach of adapting massive English-centric models.
2. *Multiple-Choice Tests Mislead on Cultural Knowledge*:
   When testing cultural knowledge using multiple-choice questions, results were misleading: models could score well through pattern matching without genuine understanding.
   But with open-ended question answering, where models generate free-form responses, Minerva excelled and surpassed all competitors. For fair evaluation of language-specific capabilities, we need formats that truly test comprehension and generation.

We contribute INDAQA to the community and demonstrate the importance of evaluation format when assessing language-specific models.