Multimodal Classification System for Hausa Using LLMs and Vision Transformers

December 31, 2025, 08:00

This paper presents a classification-based Visual Question Answering (VQA) system for the Hausa language, integrating Large Language Models (LLMs) and vision transformers. By fine-tuning LLMs on monolingual Hausa text and fusing their representations with those of state-of-the-art vision encoders, our system predicts answers from a fixed vocabulary. Experiments on the HaVQA dataset, conducted under offline text-image augmentation regimes tailored to Hausa as a low-resource language, show that this augmentation strategy outperforms the baseline, achieving 35.85% accuracy, 35.89% Wu-Palmer similarity, and a 15.32% F1-score.
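
To make the fusion architecture concrete, here is a minimal sketch of a classification-based VQA model of the kind the abstract describes: pooled text features from an LLM-style encoder are concatenated with ViT image features and fed to a linear classifier over the fixed answer vocabulary. The encoder checkpoints, the concatenation fusion, and all names below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a fusion-based VQA classifier (assumptions: HuggingFace
# encoders, concatenation fusion, linear answer head; not the paper's code).
import torch
import torch.nn as nn
from transformers import AutoModel, ViTModel


class FusionVQAClassifier(nn.Module):
    """Fuses LLM text features with ViT image features and classifies
    the answer over a fixed vocabulary."""

    def __init__(self, text_model_name: str, num_answers: int):
        super().__init__()
        # Text encoder, e.g. an LLM fine-tuned on monolingual Hausa text.
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        # Vision transformer encoder (hypothetical checkpoint choice).
        self.vision_encoder = ViTModel.from_pretrained(
            "google/vit-base-patch16-224")
        text_dim = self.text_encoder.config.hidden_size
        vision_dim = self.vision_encoder.config.hidden_size
        # Simple late fusion: concatenate pooled features, then classify.
        self.classifier = nn.Linear(text_dim + vision_dim, num_answers)

    def forward(self, input_ids, attention_mask, pixel_values):
        # Mean-pool the question's token representations.
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        text_feat = (text_out.last_hidden_state * mask).sum(1) / mask.sum(1)
        # Use the ViT [CLS] token as the image representation.
        vision_out = self.vision_encoder(pixel_values=pixel_values)
        image_feat = vision_out.last_hidden_state[:, 0]
        # Concatenate and predict logits over the fixed answer vocabulary.
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))
```

Concatenation is the simplest late-fusion choice; the paper's actual fusion mechanism and pooling strategy may well differ.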