We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results show that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results across all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
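To make the two-stream idea concrete, below is a minimal sketch of how a SlowFast token split can be organized: a Slow pathway keeps full spatial detail on a sparse subset of frames, while a Fast pathway keeps every frame but aggressively pools its tokens. The function name, tensor layout, and default strides are illustrative assumptions for this sketch, not the released SF-LLaVA-1.5 implementation.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features, slow_stride=4, fast_pool=4):
    """Split per-frame visual tokens into Slow and Fast streams (illustrative sketch).

    frame_features: (T, N, C) tensor with N visual tokens per frame for T frames,
                    where the N tokens form a square spatial grid.
    slow_stride:    temporal stride of the Slow pathway (few frames, full detail).
    fast_pool:      spatial pooling factor of the Fast pathway (all frames, few tokens).
    """
    T, N, C = frame_features.shape
    side = int(N ** 0.5)  # assumes a square token grid (side x side)

    # Slow pathway: sample every `slow_stride`-th frame, keep all spatial tokens.
    slow = frame_features[::slow_stride]                    # (T // slow_stride, N, C)

    # Fast pathway: keep all T frames, spatially pool the tokens of each frame.
    grid = frame_features.reshape(T, side, side, C).permute(0, 3, 1, 2)
    fast = F.avg_pool2d(grid, kernel_size=fast_pool)        # (T, C, side/p, side/p)
    fast = fast.flatten(2).transpose(1, 2)                  # (T, N / p^2, C)

    # Concatenate both streams along the token axis before feeding the LLM.
    return torch.cat([slow.flatten(0, 1), fast.flatten(0, 1)], dim=0)
```

Under these assumptions, the Slow stream contributes a small number of high-resolution frames and the Fast stream contributes temporally dense but spatially compressed context, which is what keeps the total token count low for long videos.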