We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.
†Fudan University
‡ Work done during Apple internship