We present the first systematic framework for assessing cultural competence in Vision-Language Models (VLMs) through multimodal story generation, analyzing five contemporary VLMs with novel evaluation metrics.
@inproceedings{mukherjee2025cultural,
  title     = {Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation},
  author    = {Mukherjee, Arka and Ghosh, Shreya},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},
  year      = {2025},
  month     = oct,
}
We introduce mmJEE-Eval, a bilingual multimodal benchmark of 1,460 STEM problems, on which we evaluate 17 VLMs. Models detect 53% of errors but correct only 3.5%, exposing a metacognitive gap between open and closed models.
@article{mukherjee2025mmjee,
  title   = {mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models},
  author  = {Mukherjee, Arka and Ghosh, Shreya},
  journal = {Findings of IJCNLP-AACL 2025},
  year    = {2025},
  month   = nov,
}