Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, but their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, whereas video-specific MLLMs capture temporal context yet hallucinate entity contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
- † University of Illinois Urbana–Champaign
- ** Work done while at Apple







