A year of building a video captioning pipeline with 100+ professional creators, and what it taught us about scaling supervision instead of models.
By Zhiqiu Lin and Chancharik Mitra. Based on our CVPR 2026 work, Building a Precise Video Language with Human-AI Oversight (Highlight, Top 3%).
How close is today's video generator to a Hollywood cinematographer?
Hollywood directors reach for certain shots because they make a scene land. They cue a specific feeling in the viewer that flat coverage can't. Open your favorite video generator (Veo 3.1, Seedance 2, or any of the latest open-source models) and ask it for a dolly zoom of a man standing in the middle of a bustling street, the way Hitchcock used the shot to make the world feel like it's collapsing inward. Or a rack focus pulling from a coffee cup to the woman behind it, the kind of focus pull that quietly tells the audience where to look. Or a Dutch-angle shot of a nervous person staring into the void, a tilted frame that puts the viewer on edge.
Most generators will hand back something close to a generic dolly-in, or a slow-motion clip with the wrong focal subject. The output is usually visually competent, but it doesn't do the thing. The model has clearly seen videos that contain these techniques. It just doesn't know how to act on the words.
We think this is symptomatic of a broader gap. Filmmakers communicate with a shared, precise vocabulary: shot size, frame position, focus type, lens distortion, camera height, video speed. Today's vision-language models (VLMs), and the captioning datasets that feed them, mostly don't.
In this post we describe CHAI, a captioning pipeline (in our usage, a caption is a long, structured paragraph describing a video's content, motion, and camera work, not a subtitle track) that we built over the past year with 100+ professional video creators. The acronym stands for Critique-based Human-AI Oversight. Existing video caption datasets are typically written either by crowdworkers, who lack the cinematic vocabulary to describe a shot precisely, or by large vision-language models, whose captions read smoothly (fluent, with no grammatical or stylistic errors) but routinely describe objects and motions that are not in the video (hallucinated). The central idea behind CHAI is to combine the two: the captioner model (e.g., a large video-language model such as Gemini-2.5-Pro) writes the draft, a trained human critiques it, and the model revises against that critique.
This post works through four questions:
1. Why do VLMs struggle with cinematic prompts?
2. How should humans and models divide the captioning work?
3. Does the quality of human critique change what the model can learn?
4. Do better captions in the training data give us a better video generator?
Question 1: Why do VLMs struggle with cinematic prompts?
A natural first hypothesis is that this is a capability problem: that the current generation of vision-language models is too small, has too little context, or has not been pretrained on enough video to handle cinematic prompts, and that the next generation will solve it. But after auditing eight popular video-text datasets from 2016 to 2025 (ActivityNet Captions, MSR-VTT, DREAM-1K, ShareGPT4Video, PerceptionLM, and others), we think the bottleneck is somewhere else. The visual content is in the videos these models train on, and modern VLMs perceive it well. What is missing is the language: the captions paired with these videos don't contain the precise vocabulary needed to describe cinematic technique. In our experiments, training larger models on more of the same data only marginally improved these issues. They look like problems of annotation policy, not of capacity.
Three patterns showed up again and again:
• Imprecise terminology. Captions conflate dolly-in (the camera physically moves forward) with zoom-in (the focal length changes), or describe a fisheye distortion as a "circular building."
• Missing information. Captions describe what is in the frame and skip everything else: motion, camera shake, focus changes, shot size. Anything temporal, anything about the camera, gets dropped.
• Subjective descriptions. "An atmospheric shot full of tension" tells a model nothing it can ground in pixels.
A natural next idea: just hire crowdworkers to write more careful captions. We tried that. Crowdworkers still confused dolly-in with zoom-in, called wide shots "close-ups," and described fisheye distortion as "a round building." Seeing is not the same as knowing how to describe.
What worked, in the end, was bringing in people whose job requires this vocabulary: cinematographers, directors of photography, motion graphics designers, VFX artists, game designers, camera operators. Over the past year, we built a structured caption specification with 100+ such collaborators. The specification has five aspects:
• Subject (type, attribute, relations)
• Scene (composition, dynamics, overlays, viewpoint)
• Motion (subject actions, interactions, group activity)
• Spatial (shot size, frame position, depth, spatial movement)
• Camera (focus type, depth of field, stability, movement, video speed, lens distortion, height, angle)
All five aspects together involve roughly 200 low-level visual primitives, each one with a definition and a decision rule for when it applies. This prevents annotators from freelancing terminology, as all they have to do is tag against the spec.
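To make the tagging step concrete, here is a minimal sketch of what a single tagged clip might look like. The grouping follows the five aspects above, but the individual field names and values are illustrative placeholders, not the released schema.

```python
# Illustrative only: field names and values are examples in the spirit of the spec,
# not the released schema.
from dataclasses import dataclass, field

@dataclass
class ClipAnnotation:
    """Primitive tags for one clip, grouped by the five aspects of the spec."""
    subject: dict = field(default_factory=dict)
    scene: dict = field(default_factory=dict)
    motion: dict = field(default_factory=dict)
    spatial: dict = field(default_factory=dict)
    camera: dict = field(default_factory=dict)

example = ClipAnnotation(
    subject={"type": "person", "attributes": ["nervous"], "relations": []},
    scene={"composition": "centered", "dynamics": "static background"},
    motion={"subject_actions": ["stares into the distance"], "interactions": []},
    spatial={"shot_size": "medium close-up", "frame_position": "center", "depth": "shallow"},
    camera={"movement": "none", "focus_type": "fixed", "video_speed": "normal",
            "lens_distortion": "none", "angle": "Dutch angle"},
)
```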
Takeaway: VLMs struggle with cinematic prompts because the captions they were trained on don't contain the precise vocabulary professionals use. In our experiments, scaling models or data alone gave only marginal gains; specifying the language carefully made a much bigger difference.
Question 2: How should humans and models divide the captioning work?
Once we made the spec, we still had to decide who would write the long captions. The two obvious choices, humans or models, each come with well-known limitations.
Humans alone produce captions with typos, grammatical errors, and inconsistent event ordering. They also fatigue: 200 to 400 words of careful prose per video, while looking up the spec, is exhausting and expensive.
Models alone produce captions that read beautifully but that, on a depressing fraction of clips, confidently describe objects and motions that are not there. They also frequently mix up left and right.
What we noticed in pilot studies is that the failure modes are asymmetric in a useful way. Today's LLMs write better prose than most humans. But humans, especially trained ones, are much better than LLMs at noticing visual or motion errors in a draft, the kind where the caption says "moving left" but the subject is moving right. So we built the pipeline around that asymmetry. The model drafts, the human critiques, the model revises. This is conceptually similar to Saunders et al. (2022)'s self-critiquing models for summarization, but applied to long-form video captioning, where the human still does the hard part: catching grounded errors against the actual video.
Concretely, the loop (a rough code sketch follows the list):
1. Primitives. A trained annotator labels which visual and motion primitives are present in the clip.
2. Pre-caption. The model generates a long caption from these primitives, following the spec.
3. Critique. An annotator reads the pre-caption against the video and writes a critique pointing out what is wrong and what should change. The critique should be accurate (the things it flags really are wrong), complete (it doesn't miss errors), and constructive (it tells the model what to do, not just that something is bad).
4. Post-caption. The model revises its draft using the critique.
5. Refinement. If the post-caption is still off, the human refines the critique rather than rewriting the caption.
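Here is a rough sketch of that loop in code, assuming the captioner exposes draft and revise calls and that critiques come back from the annotation platform; these interfaces are placeholders, not the released pipeline.

```python
# Rough sketch of the CHAI loop. `captioner`, `get_human_critique`, and their
# signatures are assumed interfaces, not the released pipeline code.

def chai_loop(video, primitives, captioner, get_human_critique, max_rounds=3):
    """Model drafts, human critiques, model revises; the human refines the critique if needed."""
    pre_caption = captioner.draft(video, primitives)              # step 2: pre-caption
    caption, critique = pre_caption, None
    for _ in range(max_rounds):
        critique = get_human_critique(video, caption, critique)   # step 3 (or step 5 on later rounds)
        if critique is None:                                      # annotator accepts the caption
            break
        caption = captioner.revise(video, caption, critique)      # step 4: post-caption
    return pre_caption, critique, caption                         # (pre-caption, critique, post-caption)
```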
We tasked reviewers (top-performing annotators promoted to a quality-control role) with checking every critique and post-caption against the video. This way annotators were scored based on their accuracy, while reviewers earned rewards for the errors they caught. Both precision (don't flag things that aren't wrong) and recall (don't miss things that are wrong) were incentivized at the data level, before any modeling happened.
Shifting the human's job from writing to proofreading has a side benefit we underestimated: each video takes far less cognitive effort, and the resulting 200 to 400 word captions end up more accurate than what either humans or models produce alone.
Takeaway: LLMs and humans have asymmetric strengths in long-form video captioning. Designing the pipeline around that asymmetry, rather than trying to replace one with the other, gives both better captions and a more sustainable annotation process.
Question 3: Does the quality of human critique change what the model can learn?
The pipeline produces a triple for every video: (pre-caption, critique, post-caption). That triple is more than just an annotated caption. It is supervision for three different post-training tasks at once:
• Captioning. Train the model to produce long, faithful captions.
• Reward modeling. Treat (pre-caption, post-caption) as a (rejected, preferred) pair.
• Critique generation. Train the model to write the critique itself, given the video and the draft.
We post-trained Qwen3-VL-8B on all three formats together using standard supervised fine-tuning (SFT). We also tried reinforcement learning (RL) methods like Direct Preference Optimization (DPO), but found that simple SFT on the full triplet data is the strongest. The detailed numbers are in the paper; the headline is that adding explicit preference and critique signals improves every method we tested.
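As a sketch of how one triple feeds all three tasks, the function below unpacks it into three training records. The prompt wording and record fields are placeholders, not the templates used in the paper.

```python
# Hedged sketch: prompt wording and record fields are placeholders, not the
# paper's actual SFT templates.

def triple_to_examples(video, pre_caption, critique, post_caption):
    return [
        {   # 1. Captioning: video -> long, faithful caption
            "input": {"video": video, "task": "Caption this video following the spec."},
            "target": post_caption,
        },
        {   # 2. Reward modeling: post-caption preferred over pre-caption
            "input": {"video": video, "task": "Which caption describes the video more faithfully?"},
            "chosen": post_caption,
            "rejected": pre_caption,
        },
        {   # 3. Critique generation: video + draft -> critique
            "input": {"video": video, "draft": pre_caption, "task": "Critique this draft caption."},
            "target": critique,
        },
    ]
```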
We were curious whether the quality of the critique mattered to downstream performance, or whether any "this is wrong" signal would do. So we ran an ablation: take a clean CHAI critique, deliberately degrade one property at a time (accuracy, recall, constructiveness), and see how the post-trained captioner performs on each task.
Results for an 8B Qwen3-VL post-trained on each variant are presented in Table 1. Caption and Critique are BLEU-4 scores (a standard text-generation metric measuring n-gram overlap with reference text on a 0–100 scale; higher means closer to the human reference) against held-out reference captions and critiques. For the Reward task, we report binary accuracy on whether the model scores the post-caption higher than the pre-caption (chance = 50). Higher is better on all three. A rough sketch of how these metrics are computed follows the table.
| Critique variant | Acc. | Rec. | Constr. | Caption | Reward | Critique |
|---|---|---|---|---|---|---|
| Blind Gemini-2.5 | — | — | — | 10.9 | 44.5 | 21.1 |
| Gemini-2.5 | — | — | — | 12.7 | 62.0 | 26.2 |
| Inaccurate critique | ✗ | ✓ | ✓ | 12.1 | 47.1 | 21.9 |
| Incomplete critique | ✓ | ✗ | ✓ | 12.5 | 56.6 | 28.7 |
| Non-constructive critique | ✓ | ✓ | ✗ | 13.4 | 67.2 | 32.9 |
| CHAI (with QC) | ✓ | ✓ | ✓ | 18.2 | 89.8 | 41.7 |
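For readers who want the metrics pinned down, here is a rough sketch of how such numbers can be computed. The `reward_model.score` interface is an assumption, and this is not the paper's evaluation code.

```python
# Rough sketch of the Table 1 metrics, not the paper's evaluation code.
# BLEU-4 via sacrebleu (0-100 scale); reward accuracy is a pairwise win rate.
import sacrebleu

def caption_bleu4(generated, references):
    """Corpus-level BLEU-4 of generated captions against held-out references."""
    return sacrebleu.corpus_bleu(generated, [references]).score

def reward_accuracy(reward_model, triples):
    """Percent of triples where the model scores the post-caption above the pre-caption.
    `reward_model.score(video, caption)` is an assumed interface. Chance level is 50."""
    wins = sum(
        reward_model.score(t["video"], t["post_caption"]) > reward_model.score(t["video"], t["pre_caption"])
        for t in triples
    )
    return 100.0 * wins / len(triples)
```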
Three things stand out:
1. Quality is not optional. Dropping any one of the three properties materially hurts every downstream task. Non-constructive critiques (the cheapest to collect, since you only have to flag a problem, not say how to fix it) hurt the least but still leave a large gap.
2. Existing data is mostly non-constructive. We checked the critiques in publicly released datasets like Saunders et al.'s GDC release and MM-RLHF. More than half are non-constructive in our sense ("this is wrong" with no suggested fix). That helps explain why training on these datasets leaves performance on the table.
3. An 8B model can be competitive with much larger closed models when the data is right. On the same captioning, reward, and critique benchmarks, the post-trained 8B Qwen3-VL matches or exceeds GPT-5 and Gemini-3.1-Pro on the metrics we report. The model size has not changed; the supervision signal has.
A small bonus: the same reward model also helps at inference time. Best-of-N decoding with the trained reward model continues to improve performance with no extra human labels.
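A minimal sketch of that decoding step, assuming a `sample` call on the captioner and a `score` call on the reward model (both placeholders):

```python
# Minimal best-of-N sketch. `captioner.sample` and `reward_model.score` are assumed
# interfaces, not the released API.

def best_of_n_caption(video, captioner, reward_model, n=8, temperature=0.8):
    """Sample N candidate captions and keep the one the reward model scores highest."""
    candidates = [captioner.sample(video, temperature=temperature) for _ in range(n)]
    scores = [reward_model.score(video, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```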
Takeaway: The form of the critique is not a stylistic detail. A model jointly post-trained on captions, preferences, and critiques performs materially better on all three tasks when the critiques it is trained on are accurate, complete, and constructive, and materially worse when any one of those properties is missing.
Question 4: Do better captions in the training data give us a better video generator?
A skeptical reader might say: this is all very nice, but captioning is upstream of what most people actually want, which is generation. So we tested whether the improved captioner moves the needle on a downstream video generator. We took a large corpus of professional video (films, ads, music videos, gameplay), re-captioned it with the post-trained 8B model, and used these new captions to fine-tune Wan2.2.
The fine-tuned model can act on detailed prompts (up to roughly 400 words) for techniques that off-the-shelf generators reliably get wrong, including the dolly zoom, rack focus, and Dutch angle from the opening examples.
We didn't change the generator architecture or training objective. The only thing that changed was the language used to describe the videos in the training set. That was enough to teach an existing generator a class of techniques it previously couldn't articulate.
Takeaway: A more precise caption vocabulary upstream translates into more controllable generation downstream, with the same model architecture and training recipe. The bottleneck for cinematic control was in the supervision, not the model.
Discussion
We started this project assuming we were going to train a captioner model. We ended up spending most of the year on the pipeline around it: what to write captions about, who should write them, who should check them, and what the checks should look like. The model contributions feel almost downstream of those decisions.
Three things we wish we had appreciated earlier:
• Specification before scale. Training larger models on noisier data gave only marginal gains. Once the spec was in place, smaller models started looking very competitive.
• "Crowdsource it" is not a baseline; it is a different problem. Annotating cinematic technique correctly requires the same vocabulary the field already uses. Asking untrained workers to invent that vocabulary on the fly is not the cheap version of asking trained workers to apply it.
• Critiques are training data. The form of the critique we collect today decides how effectively models can be trained tomorrow. Datasets that record only thumbs-up / thumbs-down are leaving a lot of post-training signal on the table.
CHAI is one piece of a longer effort on precise video language. The closest companion is CameraBench (NeurIPS'25 Spotlight), our earlier benchmark on camera motion, which seeded the camera-side primitives in the spec.
Resources
We are releasing the specification, training tutorials, annotation platform, quality-control flow, data, code, and models. If you are working on video understanding or generation and want to use any of these, please do.
Project page: https://linzhiqiu.github.io/papers/chai/
Paper: https://arxiv.org/abs/2604.21718
Code: https://github.com/chancharikmitra/CHAI
References
Krishna et al., 2017. Dense-Captioning Events in Videos (ActivityNet Captions). ICCV. arXiv:1705.00754.
Xu et al., 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR.
Wang et al., 2024. Tarsier: Recipes for Training and Evaluating Large Video Description Models (DREAM-1K). arXiv:2407.00634.
Chen et al., 2024. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions. NeurIPS. arXiv:2406.04325.
Cho et al., 2025. PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. arXiv:2504.13180.
Saunders et al., 2022. Self-critiquing Models for Assisting Human Evaluators. arXiv:2206.05802.
Zhang et al., 2025. MM-RLHF: The Next Step Forward in Multimodal LLM Alignment. arXiv:2502.10391.
Lin et al., 2025. Towards Understanding Camera Motions in Any Video (CameraBench). NeurIPS Spotlight. arXiv:2504.15376.
Wan Team, 2025. Wan: Open and Advanced Large-Scale Video Generative Models (Wan2.2). arXiv:2503.20314.
Bai et al., 2025. Qwen3-VL Technical Report. arXiv:2511.21631.
All opinions expressed in this post are those of the authors and do not represent the views of CMU.





