
An Integrated Text–Speech Framework for Multidialectal Sicilian Language Modeling with Human-in-the-Loop Supervision

Sicilian, a Romance language spoken by over five million people and characterized by substantial dialectal diversity, remains severely underrepresented in language technologies. Current multilingual models often fail to preserve the morphosyntactic and phonetic features of the major Sicilian varieties, because standardized corpora and dedicated training pipelines are lacking.

TrinacrIA-HITL is an integrated framework for multidialectal Sicilian text and speech modeling, based on a multi-level Human-in-the-Loop (HITL) approach. The project aims to develop an AI system capable of understanding, generating, and recognizing Sicilian across its main macro-varieties, with linguistic quality validated by human annotators (target ≥95%).

The pipeline is structured around four main components:

Data acquisition
Construction of a diversified textual corpus (including literary works, contemporary texts, transcriptions, and spontaneous speech) and a speech corpus of approximately 50 annotated hours, following phonetic and sociolinguistic criteria.

Text modeling
Continued pretraining of XLM-RoBERTa on monolingual Sicilian corpora; supervised fine-tuning of Mistral-7B using dialect-aware control tokens; Direct Preference Optimization (DPO) driven by native-speaker annotators; and automatic pre-correction based on adapted XLM-R models.
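
As an illustration of two text-modeling steps named above, the following minimal sketch shows how a dialect-aware control token might be prefixed to a prompt, and how the standard DPO loss is computed for a single preference pair. The token strings, the three variety labels, the function names, and the default beta value are assumptions made for this example; the project description does not specify them.

```python
import math

# Hypothetical control tokens: the tag inventory and the variety labels
# are assumptions, not taken from the project description.
DIALECT_TOKENS = {
    "western": "<dial:western>",
    "central": "<dial:central>",
    "eastern": "<dial:eastern>",
}


def tag_prompt(prompt: str, dialect: str) -> str:
    """Prefix a prompt with the control token of the requested variety."""
    try:
        token = DIALECT_TOKENS[dialect]
    except KeyError:
        raise ValueError(f"unknown dialect label: {dialect!r}") from None
    return f"{token} {prompt}"


def dpo_pair_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    Inputs are summed sequence log-probabilities of the annotator-preferred
    ("chosen") and dispreferred ("rejected") outputs under the trained policy
    and under the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls below that value once the policy favors the annotator-preferred output more strongly than the reference does.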

Speech modeling
Development of ASR systems using Whisper and Wav2Vec2 XLS-R, and TTS modules based on FastPitch and HiFi-GAN, with particular attention to dialectal prosodic variation.

Human-in-the-Loop cycle
Human supervision across dataset curation, sentence and audio segment revision, checkpoint validation, and iterative active learning cycles supported by a dedicated web-based annotation platform.
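
One common way to drive an active learning cycle of this kind is uncertainty sampling: the items on which the current model is least confident are sent to annotators first. The sketch below assumes that strategy and a per-item confidence score in [0, 1]; the source does not state which selection criterion the platform uses, and the function and parameter names are hypothetical.

```python
def select_for_annotation(confidences: dict[str, float], budget: int) -> list[str]:
    """Return the ids of the `budget` items with the lowest model confidence.

    `confidences` maps an item id (sentence or audio segment) to the model's
    confidence in its own output. Ties are broken by item id so that the
    selection is deterministic.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(confidences.items(), key=lambda kv: (kv[1], kv[0]))
    return [item_id for item_id, _ in ranked[:budget]]
```

In each cycle, the selected items would be revised by annotators, added to the training set, and the model retrained before confidences are recomputed.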

Evaluation combines automatic metrics (BLEU, chrF, COMET, WER/CER) with human judgments of naturalness, dialectal adequacy, and orthographic fidelity.
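
For reference, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below is a plain-Python illustration with whitespace tokenization and no text normalization, both of which are simplifying assumptions; it is not the project's evaluation code.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, counting every character including spaces."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because orthography varies across Sicilian varieties, any normalization applied before scoring would affect these figures and would need to be reported alongside them.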

TrinacrIA-HITL is conceived as a reproducible and scalable AI ecosystem for a minority Romance language, offering a framework transferable to other low-resource and multidialectal languages.