Workshop on Training and Evaluation Data for Italian Large Language Models

This inaugural workshop marks the initial phase of the development of Italian Large Language Models (LLMs).

When: 18 December 2023

Where: DIAG, Sapienza Università di Roma - Aula Magna (first floor), Via Ariosto 25, Roma

Event starting at 2 p.m. CET

This inaugural workshop, focusing on the development of Large Language Models (LLMs) for the Italian language, marks the initial phase of constructing a Large Multimodal Model (LMM) within the framework of the Transversal Project "Vision, Language, and Multimodal Challenges", part of the large-scale project "Future Artificial Intelligence Research" (FAIR). The workshop is organized in collaboration with the CINI AIIS (Artificial Intelligence and Information Systems) laboratory, which serves as the hub for the entire Italian AI community. The specific goal of this event is to present and discuss the collection and curation of training and evaluation datasets, the foundational step towards the realization of Italian LLMs and LMMs.


Roberto Navigli (Sapienza University of Rome)

Rita Cucchiara (University of Modena and Reggio Emilia; CNR)


Introductory session: 14:00 - 14:20

  • 14:00 - 14:20 | Project Introduction
    Roberto Navigli, Sapienza University of Rome
    Rita Cucchiara, University of Modena and Reggio Emilia; CNR

Invited Talks - 1st part: 14:20 - 16:00

  • 14:20 - 14:40 | LLMs at Barcelona Supercomputing Center (slides)
    Marta Villegas, Barcelona Supercomputing Center
  • 14:40 - 15:00 | Data for European Large Language Models: The European Perspective (slides)
    Georg Rehm, DFKI
  • 15:00 - 15:20 | A Dataset Framework for Large Language Models (slides)
    Malte Ostendorff, DFKI
  • 15:20 - 15:40 | Assessing Reliability of Knowledge in LLMs (slides)
    Barry Haddow, University of Edinburgh
  • 15:40 - 16:00 | HPLT: Data and Models for European Languages (and more) (slides)
    Sampo Pyysalo, University of Turku

Coffee break: 16:00 - 16:30

Invited Talks - 2nd part: 16:30 - 17:30

  • 16:30 - 16:50 | Annotating Multilingual Heterogeneous Web-Based Corpora (slides)
    Pedro Ortiz, DFKI
  • 16:50 - 17:10 | GPT-SW3: the first LLM for the North-Germanic languages (slides)
    Magnus Sahlgren, AI Sweden
  • 17:10 - 17:30 | LLMs and Data Protection: General Considerations (slides)
    Roberto Lattanzi, Dip. AI, Garante per la Protezione dei Dati Personali

Participant Presentations and Closing: 17:30 - 18:45

  • 17:30 - 17:42 | Italian Benchmark Language Resources and Tools: EVALITA4ELG, UINAUIL and more
    Viviana Patti, University of Torino
  • 17:42 - 17:54 | Collecting Italian Textual Data for the Medical Domain
    Bernardo Magnini, FBK
  • 17:54 - 18:06 | Il Dato che non ti ho Dato chi te l'ha Dato? Building trust in data donors
    Fabio Massimo Zanzotto, University of Rome Tor Vergata
  • 18:06 - 18:18 | The Italian challenge to Large Acoustic Models for automatic speech recognition and synthesis
    Franco Cutugno, University of Napoli Federico II
  • 18:18 - 18:30 | The Weakest Link: Understanding How Data Influences ML Trustworthiness
    Antonio Cinà, University of Genoa
  • 18:30 - 18:45 | Closing

