Presentation title:
Curating and Analysing Massive Amounts of Multilingual Data for Open and High-Performance Language Modelling
Presentation description:
We will present the pipeline and tools used in the HPLT project (https://hplt-project.org), which have enabled the release of a massive multilingual dataset collection for LLM training. We will also describe the second release of the HPLT data, still in the oven, along with some of the detailed, practical per-language analytics reports produced with the HPLT Analytics tool. All outputs of the HPLT project, both software and data, are released under free/open-source licences. With this presentation, we aim to encourage community adoption of, and contributions to, HPLT, which is committed to laying solid foundations for building open, high-performance LLMs and MT models.